US20140032216A1 - Pronunciation Discovery for Spoken Words - Google Patents

Pronunciation Discovery for Spoken Words

Info

Publication number
US20140032216A1
Authority
US
United States
Prior art keywords
pronunciation
pronunciations
scored
lexicon
alternative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/041,857
Inventor
Daniel L. Roth
Laurence S. Gillick
Michael L. Shire
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuance Communications Inc
Original Assignee
Nuance Communications Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nuance Communications Inc
Priority to US14/041,857
Publication of US20140032216A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Definitions

  • This invention relates generally to wireless communication devices with speech recognition capabilities.
  • Wireless communications devices such as cellular telephones (cell phones) commonly employ speech recognition tools to simplify the user interface. For example, many cell phones can recognize and execute user commands to initiate an outgoing phone call, or answer an incoming phone call. Many cell phones can recognize a spoken name from a phone book, and automatically initiate a phone call to the number associated with the spoken name.
  • Handheld electronic devices (e.g., mobile phones, PDAs, etc., referred to herein as “handhelds”) typically provide for user input via a keypad or similar interface, through which the user manually enters commands and/or alphanumeric data. Manually entering information may require the user to divert his attention from other important activities such as driving.
  • One solution to this problem is to equip the handheld with an embedded speech recognizer.
  • the speech recognizer may occasionally incorrectly decode the utterance from the user. To deal with such errors, some speech recognizers generate a list of N alternatives for the recognized transcript (i.e., the word or words corresponding to what the user uttered), referred to herein as the choice list (also known in the art as an N-best list), from which the user may choose the correct version.
  • One factor contributing to incorrect recognitions that is particularly relevant in the following description is variations in user pronunciation. A user with a certain dialect or accent may utter a word that does not score well with the phonetic representation of that word stored in the lexicon of the speech recognizer.
  • the described embodiment generates an alternative phonetic representation (i.e., alternative pronunciation) of an initial pronunciation of a word (or phrase).
  • the initial pronunciation of the word is not the highest-scoring word provided by the speech recognizer, but is rather a word chosen by the user from an N-best list of alternatives or entered manually.
  • the alternative phonetic representation is then stored as either a replacement for, or in addition to, the existing phonetic representation in the phonetic lexicon.
  • a speech recognizer processes an utterance from a user and generates a recognized transcript, along with an N-best list of alternatives.
  • the user chooses one of the alternatives to the recognized transcript, or enters an alternative transcript manually (if the correct transcript is not available from the speech recognizer).
  • the speech recognizer is constrained to recognize a hypothesis that differs from the initial transcript by no more than one phoneme. The score of this hypothesis thus represents the best scoring alternate pronunciation with respect to the utterance that is different from the initial pronunciation by at most one phoneme. If the score of this alternate pronunciation is higher than that of the initial pronunciation by some threshold, the speech recognizer updates its lexicon by replacing the initial pronunciation currently in the lexicon with the alternate pronunciation. Alternatively, instead of replacing the pronunciation, the speech recognizer may add the new pronunciation, so that both pronunciations are in the lexicon.
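  • If the score of the new pronunciation is not higher than the score of the initial pronunciation by more than some threshold, the speech recognizer does not update its lexicon.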
  • a method of generating an alternative pronunciation for a word or phrase given an initial pronunciation and a spoken example of the word or phrase includes providing the initial pronunciation of the word or phrase, generating the alternative pronunciation by searching a neighborhood of pronunciations about the initial pronunciation, and selecting a highest scoring pronunciation within the neighborhood of pronunciations.
  • the neighborhood may include pronunciations that differ from the initial pronunciation by some limited number or amount of speech sub-units, such as phonemes, syllables, diphones, triphones, or other such sub-units of speech known in the art.
  • the method includes searching the neighborhood of pronunciations that differ from the initial pronunciation by at most one phoneme, for example by using a speech recognition system to perform phoneme recognition with a constraint.
  • the method further includes using a phonetic recognizer to associate a score with each of the initial and/or the alternative pronunciations, and using one or both of these scores to decide whether to add the new pronunciation to the lexicon.
  • the method includes updating the associated lexicon by replacing the initial pronunciation in the lexicon with the highest-scoring alternative pronunciation, or by augmenting the lexicon by adding the alternative pronunciation.
  • the user may have an option of allowing or disallowing the update of the lexicon.
  • a method of generating an alternative pronunciation of an initial pronunciation includes generating an initial pronunciation corresponding to a spoken utterance, generating one or more potential alternative pronunciations by changing the initial pronunciation by one phoneme, and selecting a highest scoring potential alternative pronunciation with respect to the spoken utterance as the alternative pronunciation of the initial pronunciation.
  • a computer readable medium with stored instructions adapted for generating an alternative pronunciation of an initial pronunciation includes instructions for generating an initial pronunciation corresponding to a spoken utterance.
  • the medium further includes instructions for generating one or more potential alternative pronunciations by changing the initial pronunciation by one phoneme, and instructions for selecting a highest scoring potential alternative pronunciation with respect to the spoken utterance as the alternative pronunciation of the initial pronunciation.
  • a method of updating a lexicon used by a speech recognizer includes selecting a phonetic representation of a spoken utterance, generating a set of alternate phonetic representations by changing one or more phonemes in the phonetic representation, and scoring the set of alternate phonetic representations as to how well each one matches the spoken utterance, so as to produce a highest-scoring phonetic representation. The method further includes updating the lexicon with the highest scoring phonetic representation.
  • a method of generating an alternative pronunciation for a word or phrase given an initial pronunciation and a spoken example of the word or phrase includes providing the initial pronunciation of the word or phrase.
  • the method further includes generating the alternative pronunciation by searching a neighborhood of pronunciations about the initial pronunciation via a constrained search.
  • the neighborhood includes pronunciations that differ from the initial pronunciation by at most one phoneme.
  • the method also includes selecting a highest scoring pronunciation within the neighborhood of pronunciations.
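  • As a minimal sketch of the neighborhood definition above, the following check tests whether a candidate pronunciation lies within one phoneme change (substitution, insertion, or deletion) of an initial pronunciation. The function name `within_one_edit` and the representation of a pronunciation as a sequence of phoneme labels are assumptions made for illustration; they do not come from the application.

```python
from typing import Sequence


def within_one_edit(initial: Sequence[str], candidate: Sequence[str]) -> bool:
    """Return True if `candidate` equals `initial` or differs from it by
    exactly one phoneme substitution, insertion, or deletion."""
    len_a, len_b = len(initial), len(candidate)
    if abs(len_a - len_b) > 1:
        return False

    # Skip the common prefix of the two phoneme sequences.
    i = 0
    while i < len_a and i < len_b and initial[i] == candidate[i]:
        i += 1

    if len_a == len_b:
        # Identical, or a single substitution at position i.
        return list(initial[i + 1:]) == list(candidate[i + 1:])
    if len_a < len_b:
        # The candidate has one extra phoneme at position i (insertion).
        return list(initial[i:]) == list(candidate[i + 1:])
    # The candidate is missing the phoneme at position i (deletion).
    return list(initial[i + 1:]) == list(candidate[i:])
```

  • For example, with illustrative phoneme labels, `within_one_edit(["t", "ah", "m", "ey", "t", "ow"], ["t", "ah", "m", "aa", "t", "ow"])` is true because only one phoneme is substituted.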
  • a method of generating an alternative pronunciation of an initial pronunciation includes generating an initial pronunciation corresponding to a spoken utterance, generating one or more potential alternative pronunciations by constructing one or more hypotheses constrained so as to match the initial pronunciation except for one phoneme, and selecting a highest scoring potential alternative pronunciation with respect to the spoken utterance as the alternative pronunciation of the initial pronunciation.
  • FIG. 1 shows a constraint (finite-state machine) used in phoneme recognition to find the best-scoring pronunciation that differs from the original pronunciation by at most one phoneme.
  • FIGS. 2a and 2b show, in flow diagram form, the operation of the described embodiment.
  • FIG. 3 shows a high-level block diagram of a smartphone.
  • the described embodiment is a cell phone with embedded speech recognition functionality that allows a user to bypass the manual keypad and enter commands and data via spoken words.
  • Embedded application software in the cellular telephone provides the speech recognition functionality (also referred to as a speech recognizer).
  • the speech recognizer includes a process for updating its phonetic lexicon to better match a user's pronunciation.
  • the speech recognizer searches a lexicon of phonetic representations for the highest scoring match of the acoustic utterance, and provides a recognized transcript corresponding to that highest scoring phonetic representation.
  • the speech recognizer also provides the user with a list of alternatives to the recognized transcript (i.e., the N-best list).
  • the N-best list corresponds to the next N highest scoring phonetic representations (with respect to the utterance) in the lexicon.
  • the speech recognizer may update its phonetic lexicon with an alternative pronunciation that is within a neighborhood of the alternative transcript (referred to herein as the “initial transcript”) chosen by the user.
  • the speech recognizer searches the space of all pronunciations that differ from the initial pronunciation by no more than one phoneme. If the score of the pronunciation output by the speech recognizer is greater than the score of the initial pronunciation (by a predetermined threshold), the speech recognizer updates the lexicon with the new pronunciation.
  • the particular value of the threshold is selected to result in desired performance without changing the lexicon for insignificant variations of pronunciation. The threshold thus allows for filtering small pronunciation changes that do not provide a beneficial impact. Updating the lexicon includes replacing the initial pronunciation. Updating the lexicon may alternatively include augmenting the lexicon with the new pronunciation, without removing or otherwise replacing the initial pronunciation.
  • FIGS. 1, 2a, and 2b show flow diagrams describing how the described embodiment updates its lexicon as generally set forth above. We then present a description of a typical cell phone system in which the general functionality can be implemented.
  • each of the embodiments described herein takes an utterance, i.e., a spoken example of a word or phrase, along with an initial pronunciation of that utterance (e.g., a pronunciation corresponding to a recognized transcript or an alternative to that transcript, or some other source of a pronunciation), and generates an alternative pronunciation that is within a “neighborhood” of the initial pronunciation.
  • this neighborhood is defined by a variation in the phonemes of the initial pronunciation (e.g., one phoneme different), but in general the neighborhood could be defined by any variation of the initial pronunciation that changes how well the changed pronunciation matches the utterance.
  • any pronunciation sub-unit e.g., syllables, diphones, triphones, etc., as an alternative to phonemes, may be used to define these variations. Further, the neighborhood could be defined by a combination of such variations. Also in this embodiment the initial pronunciation comes from a cell phone user's choice of an alternative recognized transcript, but in general the initial pronunciation could come from other sources. The concepts described herein merely require an initial pronunciation and a corresponding spoken example of that pronunciation. For a cell phone with a phonetic lexicon, all that is required is a spoken example of a word or phrase and a spelling of that word or phrase that can be used to find a pronunciation in the lexicon.
  • FIG. 1 shows the constraint (finite-state machine) used in the phoneme recognition including a first row 102 of states with the states constrained to phonemes p1 through p7 as shown, and an initial silence state s1 and a final silence state s2.
  • the phonemes p1 through p7 represent the initial pronunciation described above.
  • Below the first row of states 102 is a second row of states 104, which is essentially a duplicate of the initial pronunciation states in the first row 102 starting with the second phoneme.
  • the first row thus represents the sequence of phonemes in the initial pronunciation with no changes.
  • the second row 104 represents the sequence of phonemes with one phoneme different.
  • the recognizer chooses the highest scoring input, i.e., the path that best matches the spoken utterance.
  • Possible hypothesis paths into the nth node of the second row 104 include (i) the (n-1)th state of the second row 104, (ii) the (n-1)th “any phoneme” state, so that a different phoneme replaces the (n-1)th phoneme of the initial pronunciation, (iii) the (n-2)th phoneme of the initial pronunciation, effectively deleting the previous phoneme, or (iv) the nth “any phoneme” state, thereby inserting an additional phoneme into the hypothesis.
  • the recognized hypothesis will include at most one phoneme change (substitution, insertion, or deletion), and will represent the highest scoring hypothesis with at most one phoneme different.
  • the score at s2 therefore corresponds to the best scoring pronunciation with at most one phoneme different from the initial pronunciation, which is used as the alternative pronunciation.
  • States p7 and s2 are shown in broken lines, because they have no input to the second row 104 result. In the preferred embodiment, insertions are excluded at the beginning and end of the utterance.
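  • For small examples, the set of hypotheses admitted by the constraint of FIG. 1 can be approximated by explicit enumeration rather than by constrained phoneme recognition. The sketch below is illustrative only and is not the recognizer described here: the function name `one_phoneme_neighbors`, the `inventory` argument (the phonemes an “any phoneme” state may match), and the rule that a single-phoneme pronunciation is never deleted are all assumptions.

```python
from typing import Sequence, Set, Tuple


def one_phoneme_neighbors(
    initial: Sequence[str], inventory: Sequence[str]
) -> Set[Tuple[str, ...]]:
    """Enumerate pronunciations within one phoneme change of `initial`."""
    base = tuple(initial)
    n = len(base)
    # The unchanged pronunciation corresponds to the first row of states.
    neighbors: Set[Tuple[str, ...]] = {base}

    for i in range(n):
        if n > 1:
            # Deletion: skip phoneme i (assumption: never delete the only phoneme).
            neighbors.add(base[:i] + base[i + 1:])
        for phoneme in inventory:
            if phoneme != base[i]:
                # Substitution: a different phoneme replaces phoneme i.
                neighbors.add(base[:i] + (phoneme,) + base[i + 1:])

    # Insertion: only between existing phonemes, because insertions are
    # excluded at the beginning and end of the utterance.
    for i in range(1, n):
        for phoneme in inventory:
            neighbors.add(base[:i] + (phoneme,) + base[i:])

    return neighbors
```

  • Each enumerated pronunciation would then be scored against the utterance and the best one kept. Enumeration grows with the size of the phoneme inventory, which is one reason a constrained search over a finite-state machine is the more practical formulation.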
  • the process for updating the speech recognizer lexicon in the described embodiment is shown in FIGS. 2a and 2b.
  • the process begins when the user utters a word or phrase 120 (i.e., an utterance).
  • the speech recognizer evaluates 122 its phonetic lexicon of standard pronunciations with respect to the utterance using a phonetic recognizer, and selects 124 the highest-scoring member.
  • the speech recognizer presents 126 the highest scoring member to the user as the recognized transcript, and also presents 127 the next N highest scoring members as an N-best list of alternatives to the recognized transcript.
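  • The evaluate, select, and present steps (122 through 127) amount to ranking lexicon entries by score. A minimal sketch follows, assuming a lexicon that maps each word to a single pronunciation and a caller-supplied `score` function standing in for the phonetic recognizer; both, along with the name `recognize_with_nbest`, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pronunciation = Tuple[str, ...]


def recognize_with_nbest(
    lexicon: Dict[str, Pronunciation],
    score: Callable[[Pronunciation], float],
    n: int,
) -> Tuple[str, List[str]]:
    """Return the highest-scoring word and the next `n` highest-scoring words.

    `score` is a stand-in for the phonetic recognizer: it rates how well a
    pronunciation matches the current utterance (higher is better).
    """
    if not lexicon:
        raise ValueError("lexicon is empty")
    ranked = sorted(lexicon, key=lambda word: score(lexicon[word]), reverse=True)
    # ranked[0] is the recognized transcript; the following n are the N-best list.
    return ranked[0], ranked[1 : n + 1]
```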
  • the user typically selects either (i) the recognized transcript 128 or (ii) one of the members of the N-best list 130 of alternatives, as what he actually uttered. However, in some cases, neither the recognized transcript nor the N-best list includes 131 what the user actually uttered. In those cases, the user may either enter the word/phrase manually 132, effectively bypassing the speech recognition functionality, or simply utter 134 the word or phrase again.
  • the speech processor does not update its lexicon, and waits for the next utterance. If the user selects an alternative from the N-best list 130 or manually enters the word/phrase, the speech recognizer generates 100 an alternative pronunciation from the initial pronunciation as described above.
  • the speech recognizer compares the score of the user's alternative (i.e., the initial pronunciation) to the score of the alternate pronunciation. If 140 the score of the alternate pronunciation is greater than the score of the initial pronunciation by a threshold, the speech recognizer replaces 142 the phonetic representation of the initial pronunciation in the lexicon with the alternative pronunciation generated 100 by the speech recognizer.
  • Updating the lexicon to replace the initial pronunciation as described above removes that initial phonetic representation from future consideration by the speech processor.
  • Other users of the cell phone may pronounce words in such a way that would produce a better score on the original phonetic representation that was replaced than on the updated phonetic representation. Therefore another way to update the lexicon in the above-described procedure is to add the highest scoring phonetic representation to the lexicon without eliminating the original pronunciation, so that both pronunciations are included in the lexicon for future consideration by the speech processor.
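  • The two update policies described above (replace the initial pronunciation, or keep it and add the new one) can be sketched as a single function. This is a hedged illustration: the lexicon layout (word mapped to a list of pronunciations), the function name `update_lexicon`, and the `replace` flag are assumptions, and the scores are taken as given rather than computed.

```python
from typing import Dict, List, Tuple

Pronunciation = Tuple[str, ...]


def update_lexicon(
    lexicon: Dict[str, List[Pronunciation]],
    word: str,
    initial: Pronunciation,
    alternative: Pronunciation,
    initial_score: float,
    alternative_score: float,
    threshold: float,
    replace: bool = True,
) -> bool:
    """Update `lexicon` in place; return True if an update was made."""
    # Only update when the alternative beats the initial score by the threshold.
    if alternative_score - initial_score <= threshold:
        return False

    pronunciations = lexicon.setdefault(word, [])
    if replace and initial in pronunciations:
        # Replacement removes the initial pronunciation from future consideration.
        pronunciations.remove(initial)
    if alternative not in pronunciations:
        # Augmentation (or the second half of a replacement) adds the new one.
        pronunciations.append(alternative)
    return True
```

  • Calling the function with `replace=False` keeps both pronunciations, which suits a phone shared by speakers with different pronunciations.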
  • the cell phone may provide the user with the option of whether or not to allow update.
  • This option may be on a case-by-case basis, so that each time a potential update is available, the user may affirmatively allow or disallow the update via a keystroke or spoken command.
  • This option can also be selected as an enable/disable function, so that all updates are allowed when the user enables the function, and all updates are disallowed when the user disables the function.
  • the speech recognizer may be able to further improve the pronunciation through an iterative process. For example, if the score of the alternative pronunciation is better than the initial pronunciation by a predetermined threshold, the speech recognizer generates yet another pronunciation by taking the previously determined alternative pronunciation and finding a new, higher-scoring alternative pronunciation that differs from the previously determined alternative pronunciation by only one phoneme. This iterative process continues until the improvement drops below the predetermined threshold, indicating that the improvement is leveling off.
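  • The iterative process can be sketched as a simple hill climb. In this illustration the `neighbors` and `score` callables are placeholders for the one-phoneme neighborhood search and the phonetic recognizer's scoring, and `max_rounds` is an added safety cap; none of these names appear in the application.

```python
from typing import Callable, Iterable, Tuple

Pronunciation = Tuple[str, ...]


def refine_pronunciation(
    initial: Pronunciation,
    neighbors: Callable[[Pronunciation], Iterable[Pronunciation]],
    score: Callable[[Pronunciation], float],
    threshold: float,
    max_rounds: int = 10,
) -> Pronunciation:
    """Repeatedly move to the best one-phoneme neighbor while the score
    improves by more than `threshold`."""
    current = initial
    current_score = score(current)
    for _ in range(max_rounds):
        # Best-scoring pronunciation one phoneme away from the current one.
        best = max(neighbors(current), key=score, default=current)
        best_score = score(best)
        if best_score - current_score <= threshold:
            # The improvement is leveling off, so stop.
            break
        current, current_score = best, best_score
    return current
```

  • Each round changes at most one phoneme, so after k accepted rounds the result can differ from the initial pronunciation by up to k phonemes.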
  • a smartphone 200 is a typical platform that can provide such speech recognition functionality via embedded application software.
  • the described method of updating the phonetic lexicon may also be implemented in other portable phones, and in other hand held devices in general.
  • Smartphone 200 is a Microsoft PocketPC-powered phone which includes at its core a baseband DSP 202 (digital signal processor) for handling the cellular communication functions (including for example voiceband and channel coding functions) and an applications processor 204 (e.g. Intel StrongArm SA-110) on which the PocketPC operating system runs.
  • the phone supports GSM voice calls, SMS (Short Messaging Service) text messaging, wireless email, and desktop-like web browsing along with more traditional PDA features.
  • An RF synthesizer 206 and an RF radio transceiver 208, followed by a power amplifier module 210, implement the transmit and receive functions.
  • the power amplifier module handles the final-stage RF transmit duties through an antenna 212 .
  • An interface ASIC 214 and an audio CODEC 216 provide interfaces to a speaker, a microphone, and other input/output devices provided in the phone such as a numeric or alphanumeric keypad (not shown) for entering commands and information.
  • DSP 202 uses a flash memory 218 for code store.
  • a Li-Ion (lithium-ion) battery 220 powers the phone and a power management module 222 coupled to DSP 202 manages power consumption within the phone.
  • SDRAM 224 and flash memory 226 provide volatile and non-volatile memory, respectively, for applications processor 204. This arrangement of memory holds the code for the operating system, the code for customizable features such as the phone directory, and the code for any embedded applications software in the smartphone, including the voice recognition software described above.
  • the visual display device for the smartphone includes LCD driver chip 228 that drives LCD display 230 .
  • Clock module 232 provides the clock signals for the other devices within the phone and provides an indicator of real time. All of the above-described components are packaged within an appropriately designed housing 234.
  • Smartphone 200 described above represents the general internal structure of a number of different commercially available smartphones, and the internal circuit design of those phones is generally known in the art.
  • an application running on the applications processor 204 performs the process of updating the phonetic lexicon as described in FIGS. 1, 2a, and 2b.

Abstract

A method for a portable device includes receiving a spoken utterance of a word or phrase, generating a plurality of alternative pronunciations of the spoken utterance, scoring one or more pronunciations of the plurality of alternative pronunciations using the spoken utterance, and updating a lexicon with at least one scored pronunciation.

Description

  • This application is a continuation of pending U.S. patent application Ser. No. 10/939,942, filed Sep. 13, 2004, which in turn claims priority from U.S. Provisional Patent Application 60/502,084, filed Sep. 11, 2003, which are incorporated herein by reference in their entirety.
  • TECHNICAL FIELD
  • This invention relates generally to wireless communication devices with speech recognition capabilities.
  • BACKGROUND ART
  • Wireless communications devices, such as cellular telephones (cell phones), commonly employ speech recognition tools to simplify the user interface. For example, many cell phones can recognize and execute user commands to initiate an outgoing phone call, or answer an incoming phone call. Many cell phones can recognize a spoken name from a phone book, and automatically initiate a phone call to the number associated with the spoken name.
  • Handheld electronic devices (e.g., mobile phones, PDAs, etc., referred to herein as “handhelds”) typically provide for user input via a keypad or similar interface, through which the user manually enters commands and/or alphanumeric data. Manually entering information may require the user to divert his attention from other important activities such as driving. One solution to this problem is to equip the handheld with an embedded speech recognizer.
  • Due to numerous factors, the speech recognizer may occasionally incorrectly decode the utterance from the user. To deal with such errors, some speech recognizers generate a list of N alternatives for the recognized transcript (i.e., the word or words corresponding to what the user uttered), referred to herein as the choice list (also known in the art as an N-best list), from which the user may choose the correct version. One factor contributing to incorrect recognitions that is particularly relevant in the following description is variations in user pronunciation. A user with a certain dialect or accent may utter a word that does not score well with the phonetic representation of that word stored in the lexicon of the speech recognizer.
  • SUMMARY
  • The described embodiment generates an alternative phonetic representation (i.e., alternative pronunciation) of an initial pronunciation of a word (or phrase). In general, the initial pronunciation of the word is not the highest-scoring word provided by the speech recognizer, but is rather a word chosen by the user from an N-best list of alternatives or entered manually. The alternative phonetic representation is then stored as either a replacement for, or in addition to, the existing phonetic representation in the phonetic lexicon.
  • In the described embodiment, a speech recognizer processes an utterance from a user and generates a recognized transcript, along with an N-best list of alternatives. For an initial transcript, the user chooses one of the alternatives to the recognized transcript, or enters an alternative transcript manually (if the correct transcript is not available from the speech recognizer). The speech recognizer is constrained to recognize a hypothesis that differs from the initial transcript by no more than one phoneme. The score of this hypothesis thus represents the best scoring alternate pronunciation with respect to the utterance that is different from the initial pronunciation by at most one phoneme. If the score of this alternate pronunciation is higher than that of the initial pronunciation by some threshold, the speech recognizer updates its lexicon by replacing the initial pronunciation currently in the lexicon with the alternate pronunciation. Alternatively, instead of replacing the pronunciation, the speech recognizer may add the new pronunciation, so that both pronunciations are in the lexicon.
  • If the score of the new pronunciation is not higher than the score of the initial pronunciation by more than some threshold, the speech recognizer does not update its lexicon.
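  • The update rule in the two paragraphs above can be written compactly. The symbols are introduced here for illustration only and do not appear in the application: u is the utterance, p_0 the initial pronunciation, N_1(p_0) the set of pronunciations within one phoneme change of p_0, S(p | u) the recognizer's score of pronunciation p against u, and τ the threshold.

```latex
\[
  p^{*} \;=\; \operatorname*{arg\,max}_{p \,\in\, \mathcal{N}_1(p_0)} S(p \mid u),
  \qquad
  \text{update the lexicon with } p^{*}
  \iff
  S(p^{*} \mid u) - S(p_0 \mid u) > \tau .
\]
```

  • Because p_0 itself belongs to N_1(p_0), the difference on the right is never negative, and a positive τ keeps the lexicon unchanged when the gain is insignificant.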
  • In one aspect, a method of generating an alternative pronunciation for a word or phrase given an initial pronunciation and a spoken example of the word or phrase includes providing the initial pronunciation of the word or phrase, generating the alternative pronunciation by searching a neighborhood of pronunciations about the initial pronunciation, and selecting a highest scoring pronunciation within the neighborhood of pronunciations. The neighborhood may include pronunciations that differ from the initial pronunciation by some limited number or amount of speech sub-units, such as phonemes, syllables, diphones, triphones, or other such sub-units of speech known in the art.
  • The method includes searching the neighborhood of pronunciations that differ from the initial pronunciation by at most one phoneme, for example by using a speech recognition system to perform phoneme recognition with a constraint.
  • The method further includes using a phonetic recognizer to associate a score with each of the initial and/or the alternative pronunciations, and using one or both of these scores to decide whether to add the new pronunciation to the lexicon.
  • The method includes updating the associated lexicon by replacing the initial pronunciation in the lexicon with the highest-scoring alternative pronunciation, or by augmenting the lexicon by adding the alternative pronunciation. The user may have an option of allowing or disallowing the update of the lexicon.
  • In another aspect, a method of generating an alternative pronunciation of an initial pronunciation includes generating an initial pronunciation corresponding to a spoken utterance, generating one or more potential alternative pronunciations by changing the initial pronunciation by one phoneme, and selecting a highest scoring potential alternative pronunciation with respect to the spoken utterance as the alternative pronunciation of the initial pronunciation.
  • In another aspect, a computer readable medium with stored instructions adapted for generating an alternative pronunciation of an initial pronunciation includes instructions for generating an initial pronunciation corresponding to a spoken utterance. The medium further includes instructions for generating one or more potential alternative pronunciations by changing the initial pronunciation by one phoneme, and instructions for selecting a highest scoring potential alternative pronunciation with respect to the spoken utterance as the alternative pronunciation of the initial pronunciation.
  • In another aspect, a method of updating a lexicon used by a speech recognizer includes selecting a phonetic representation of a spoken utterance, generating a set of alternate phonetic representations by changing one or more phonemes in the phonetic representation, and scoring the set of alternate phonetic representations as to how well each one matches the spoken utterance, so as to produce a highest-scoring phonetic representation. The method further includes updating the lexicon with the highest scoring phonetic representation.
  • In another aspect, a method of generating an alternative pronunciation for a word or phrase given an initial pronunciation and a spoken example of the word or phrase includes providing the initial pronunciation of the word or phrase. The method further includes generating the alternative pronunciation by searching a neighborhood of pronunciations about the initial pronunciation via a constrained search. The neighborhood includes pronunciations that differ from the initial pronunciation by at most one phoneme. The method also includes selecting a highest scoring pronunciation within the neighborhood of pronunciations.
  • In another aspect, a method of generating an alternative pronunciation of an initial pronunciation includes generating an initial pronunciation corresponding to a spoken utterance, generating one or more potential alternative pronunciations by constructing one or more hypotheses constrained so as to match the initial pronunciation except for one phoneme, and selecting a highest scoring potential alternative pronunciation with respect to the spoken utterance as the alternative pronunciation of the initial pronunciation.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows a constraint (finite-state machine) used in phoneme recognition to find the best-scoring pronunciation that differs from the original pronunciation by at most one phoneme.
  • FIGS. 2a and 2b show, in flow diagram form, the operation of the described embodiment.
  • FIG. 3 shows a high-level block diagram of a smartphone.
  • DETAILED DESCRIPTION
  • The described embodiment is a cell phone with embedded speech recognition functionality that allows a user to bypass the manual keypad and enter commands and data via spoken words. Embedded application software in the cellular telephone provides the speech recognition functionality (also referred to as a speech recognizer). The speech recognizer includes a process for updating its phonetic lexicon to better match a user's pronunciation.
  • When the user utters a word or phrase, the speech recognizer searches a lexicon of phonetic representations for the highest scoring match of the acoustic utterance, and provides a recognized transcript corresponding to that highest scoring phonetic representation. The speech recognizer also provides the user with a list of alternatives to the recognized transcript (i.e., the N-best list). The N-best list corresponds to the next N highest scoring phonetic representations (with respect to the utterance) in the lexicon.
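  • The selection of a recognized transcript and an N-best list can be illustrated with a minimal sketch. It assumes the lexicon entries have already been scored against the utterance; the function name `rank_lexicon`, the example entries, and the score values are hypothetical and are not taken from the described embodiment.

```python
from typing import Dict, List, Tuple


def rank_lexicon(scores: Dict[str, float], n: int) -> Tuple[str, List[str]]:
    """Return the highest-scoring entry and the next n entries (the N-best list).

    `scores` maps each lexicon entry to its score against one utterance,
    with higher values meaning a better match (an assumption of this sketch).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, key=lambda entry: scores[entry], reverse=True)
    return ranked[0], ranked[1:n + 1]


# Hypothetical scores for one utterance.
example_scores = {
    "call home": -12.5,
    "call Holmes": -13.1,
    "call Rome": -14.0,
    "tall dome": -19.2,
}
recognized, n_best = rank_lexicon(example_scores, n=2)
```

  • In this sketch `recognized` plays the role of the recognized transcript and `n_best` the list of alternatives offered to the user.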
  • If the user selects an alternative from the N-best list instead of the recognized transcript, or if the user manually enters an alternative because the correct choice is not available in the recognized transcript or the N-best list, the speech recognizer may update its phonetic lexicon with an alternative pronunciation that is within a neighborhood of the alternative transcript (referred to herein as the “initial transcript”) chosen by the user.
  • The speech recognizer searches the space of all pronunciations that differ from the initial pronunciation by no more than one phoneme. If the score of the pronunciation output by the speech recognizer is greater than the score of the initial pronunciation (by a predetermined threshold), the speech recognizer updates the lexicon with the new pronunciation. The particular value of the threshold is selected to result in desired performance without changing the lexicon for insignificant variations of pronunciation. The threshold thus allows for filtering small pronunciation changes that do not provide a beneficial impact. Updating the lexicon includes replacing the initial pronunciation. Updating the lexicon may alternatively include augmenting the lexicon with the new pronunciation, without removing or otherwise replacing the initial pronunciation.
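  • The threshold test and the two update modes (replacement and augmentation) might be sketched as follows. This is an illustration under stated assumptions, not the embodiment's code: the lexicon is modeled as a dictionary from a word to a list of phoneme tuples, and the name `maybe_update` and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple

Pron = Tuple[str, ...]  # a pronunciation as a sequence of phoneme labels


def maybe_update(lexicon: Dict[str, List[Pron]], word: str,
                 initial: Pron, candidate: Pron,
                 initial_score: float, candidate_score: float,
                 threshold: float, replace: bool = True) -> bool:
    """Update the lexicon only if the candidate beats the initial
    pronunciation by more than `threshold`. Returns True on update."""
    if candidate_score <= initial_score + threshold:
        return False  # improvement too small; leave the lexicon unchanged
    prons = lexicon.setdefault(word, [])
    if replace and initial in prons:
        prons.remove(initial)    # replacement: drop the initial pronunciation
    if candidate not in prons:
        prons.append(candidate)  # augmentation, or second half of replacement
    return True
```

  • Calling `maybe_update(..., replace=False)` keeps both pronunciations in the lexicon, which corresponds to the augmentation alternative described above.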
  • FIGS. 1, 2 a, and 2 b show flow diagrams describing how the described embodiment updates its lexicon as generally set forth above. We then present a description of a typical cell phone system in which the general functionality can be implemented.
  • In the most general sense, each of the embodiments described herein takes an utterance, i.e., a spoken example of a word or phrase, along with an initial pronunciation of that utterance (e.g., a pronunciation corresponding to a recognized transcript or an alternative to that transcript, or some other source of a pronunciation), and generates an alternative pronunciation that is within a “neighborhood” of the initial pronunciation. In the described embodiment, this neighborhood is defined by a variation in the phonemes of the initial pronunciation (e.g., one phoneme different), but in general the neighborhood could be defined by any variation of the initial pronunciation that changes how well the changed pronunciation matches the utterance. Any pronunciation sub-unit, e.g., syllables, diphones, triphones, etc., as an alternative to phonemes, may be used to define these variations. Further, the neighborhood could be defined by a combination of such variations. Also in this embodiment the initial pronunciation comes from a cell phone user's choice of an alternative recognized transcript, but in general the initial pronunciation could come from other sources. The concepts described herein merely require an initial pronunciation and a corresponding spoken example of that pronunciation. For a cell phone with a phonetic lexicon, all that is required is a spoken example of a word or phrase and a spelling of that word or phrase that can be used to find a pronunciation in the lexicon.
  • FIG. 1 shows the constraint (finite-state machine) used in the phoneme recognition including a first row 102 of states with the states constrained to phonemes p1 through p7 as shown, and an initial silence state s1 and a final silence state s2. The phonemes p1 through p7 represent the initial pronunciation described above. Below the first row of states 102 is a second row of states 104, which is essentially a duplicate of the initial pronunciation states in the first row 102 starting with the second phoneme. Between the first row 102 and the second row 104 are a number of “any phoneme” states (A) that can take on any particular phoneme identity. Potential transition paths are shown with arrowed lines. The first row thus represents the sequence of phonemes in the initial pronunciation with no changes, and the second row 104 represents the sequence of phonemes with one phoneme different. In the second row 104, where a node has more than one input, the recognizer chooses the highest scoring input, i.e., the path that best matches the spoken utterance. Possible hypothesis paths into the nth node of the second row 104 include (i) the (n-1)th state of the second row 104, (ii) the (n-1)th “any phoneme” state, so that a different phoneme replaces the (n-1)th phoneme of the initial pronunciation, (iii) the (n-2)th phoneme of the initial pronunciation, effectively deleting the previous phoneme, or (iv) the nth “any phoneme” state, thereby inserting an additional phoneme into the hypothesis.
  • With this architecture, regardless of the path taken from the initial silence s1 to the end of the second row 104, the recognized hypothesis will include at most one phoneme change (substitution, insertion, or deletion), and will represent the highest scoring hypothesis with at most one phoneme different. The score at s2 therefore corresponds to the best scoring pronunciation with at most one phoneme different from the initial pronunciation, which is used as the alternative pronunciation. States p7 and s2 are shown in broken lines, because they have no input to the second row 104 result. In the preferred embodiment, insertions are excluded at the beginning and end of the utterance.
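  • The set of pronunciations that the constraint admits can be made concrete with a small sketch. Note the difference in technique: the finite-state machine finds the best path during recognition, whereas the code below simply enumerates the neighborhood explicitly and picks the maximum under a caller-supplied scoring function. The names `one_edit_neighborhood` and `best_alternative`, the phoneme inventory argument, and the decision to discard an empty pronunciation are assumptions made for illustration.

```python
from typing import Callable, Iterable, Set, Tuple

Pron = Tuple[str, ...]  # a pronunciation as a sequence of phoneme labels


def one_edit_neighborhood(initial: Pron, phonemes: Iterable[str]) -> Set[Pron]:
    """All pronunciations within one substitution, insertion, or deletion
    of `initial`. Insertions before the first and after the last phoneme
    are excluded, mirroring the preferred embodiment."""
    if not initial:
        raise ValueError("initial pronunciation must not be empty")
    inventory = list(phonemes)
    neighborhood: Set[Pron] = {initial}  # zero changes is allowed
    for i in range(len(initial)):
        neighborhood.add(initial[:i] + initial[i + 1:])  # deletion
        for p in inventory:
            neighborhood.add(initial[:i] + (p,) + initial[i + 1:])  # substitution
    for i in range(1, len(initial)):  # interior positions only
        for p in inventory:
            neighborhood.add(initial[:i] + (p,) + initial[i:])  # insertion
    neighborhood.discard(())  # assumption: an empty pronunciation is not useful
    return neighborhood


def best_alternative(initial: Pron, phonemes: Iterable[str],
                     score: Callable[[Pron], float]) -> Pron:
    """Highest-scoring member of the one-edit neighborhood."""
    return max(one_edit_neighborhood(initial, phonemes), key=score)
```

  • Explicit enumeration grows with the pronunciation length times the inventory size, so it is practical only for short pronunciations; the constrained search of FIG. 1 avoids building the set at all.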
  • The process for updating the speech recognizer lexicon in the described embodiment is shown in FIGS. 2 a and 2 b. The process begins when the user utters a word or phrase 120 (i.e., an utterance). The speech recognizer evaluates 122 its phonetic lexicon of standard pronunciations with respect to the utterance using a phonetic recognizer, and selects 124 the highest-scoring member. The speech recognizer presents 126 the highest scoring member to the user as the recognized transcript, and also presents 127 the next N highest scoring members as an N-best list of alternatives to the recognized transcript.
  • The user typically selects either (i) the recognized transcript 128 or (ii) one of the members of the N-best list 130 of alternatives, as what he actually uttered. However in some cases, neither the recognized transcript nor the N-best list includes 131 what the user actually uttered. In those cases, the user may either enter the word/phrase manually 132, effectively bypassing the speech recognition functionality, or simply utter 134 the word or phrase again.
  • If the user selects the recognized transcript 128, the speech processor does not update its lexicon, and waits for the next utterance. If the user selects an alternative from the N-best list 130 or manually enters the word/phrase, the speech recognizer generates 100 an alternative pronunciation from the initial pronunciation as described above.
  • The speech recognizer compares the score of the user's alternative (i.e., the initial pronunciation) to the score of the alternate pronunciation. If 140 the score of the alternate pronunciation is greater than the score of the initial pronunciation by a threshold, the speech recognizer replaces 142 the phonetic representation of the initial pronunciation in the lexicon with the alternative pronunciation generated 100 by the speech recognizer.
  • Updating the lexicon to replace the initial pronunciation as described above removes that initial phonetic representation from future consideration by the speech processor. Other users of the cell phone, however, may pronounce words in such a way that would produce a better score on the original phonetic representation that was replaced than on the updated phonetic representation. Therefore another way to update the lexicon in the above-described procedure is to add the highest scoring phonetic representation to the lexicon without eliminating the original pronunciation, so that both pronunciations are included in the lexicon for future consideration by the speech processor.
  • In either case of updating the lexicon (i.e., by replacement or augmentation), the cell phone may provide the user with the option of whether or not to allow the update. This option may be on a case-by-case basis, so that each time a potential update is available, the user may affirmatively allow or disallow the update via a keystroke or spoken command. This option can also be selected as an enable/disable function, so that all updates are allowed when the user enables the function, and all updates are disallowed when the user disables the function.
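  • One way to model the user option is a three-valued setting that is consulted before any lexicon change. The sketch below is hypothetical: the enum `UpdatePolicy`, its member names, and the `confirm` callback (standing in for a keystroke or spoken command) are illustrative assumptions.

```python
from enum import Enum
from typing import Callable


class UpdatePolicy(Enum):
    ENABLED = "enabled"        # all updates are allowed
    DISABLED = "disabled"      # all updates are disallowed
    ASK_EACH_TIME = "ask"      # the user decides case by case


def update_allowed(policy: UpdatePolicy, confirm: Callable[[], bool]) -> bool:
    """Decide whether a pending lexicon update may proceed.

    `confirm` is only called for the case-by-case policy and should
    return True if the user affirmatively allows the update."""
    if policy is UpdatePolicy.ENABLED:
        return True
    if policy is UpdatePolicy.DISABLED:
        return False
    return confirm()
```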
  • The speech recognizer may be able to further improve the pronunciation through an iterative process. For example, if the score of the alternative pronunciation is better than the initial pronunciation by a predetermined threshold, the speech recognizer generates yet another pronunciation by taking the previously determined alternative pronunciation and finding a new, higher-scoring alternative pronunciation that differs from the previously determined alternative pronunciation by only one phoneme. This iterative process continues until the improvement drops below the predetermined threshold, indicating that the improvement is leveling off.
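  • The iterative process can be sketched as a loop that repeatedly moves to the best one-phoneme neighbor until the gain no longer exceeds the threshold. This is a minimal sketch under assumptions: the neighborhood is enumerated explicitly rather than searched with the finite-state constraint, `score` is a caller-supplied function where higher is better, and the names `neighbors`, `refine`, and the `max_rounds` safety cap are hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple

Pron = Tuple[str, ...]  # a pronunciation as a sequence of phoneme labels


def neighbors(pron: Pron, inventory: List[str]) -> Set[Pron]:
    """Pronunciations within one substitution, deletion, or interior insertion."""
    out: Set[Pron] = {pron}
    for i in range(len(pron)):
        out.add(pron[:i] + pron[i + 1:])                    # deletion
        for p in inventory:
            out.add(pron[:i] + (p,) + pron[i + 1:])         # substitution
    for i in range(1, len(pron)):
        for p in inventory:
            out.add(pron[:i] + (p,) + pron[i:])             # interior insertion
    out.discard(())  # assumption: never propose an empty pronunciation
    return out


def refine(initial: Pron, phonemes: Iterable[str],
           score: Callable[[Pron], float],
           threshold: float, max_rounds: int = 10) -> Pron:
    """Apply one-phoneme improvements until the gain is at most `threshold`."""
    if not initial:
        raise ValueError("initial pronunciation must not be empty")
    inventory = list(phonemes)
    current, current_score = initial, score(initial)
    for _ in range(max_rounds):  # safety cap, an assumption of this sketch
        candidate = max(neighbors(current, inventory), key=score)
        candidate_score = score(candidate)
        if candidate_score - current_score <= threshold:
            break  # improvement is leveling off
        current, current_score = candidate, candidate_score
    return current
```

  • Each round changes at most one phoneme, so a pronunciation that differs from the initial one by several phonemes can only be reached if every intermediate step clears the threshold.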
  • A smartphone 200, as shown in FIG. 3, is a typical platform that can provide such speech recognition functionality via embedded application software. In fact, the described method of updating the phonetic lexicon may also be implemented in other portable phones, and in other hand held devices in general.
  • Smartphone 200 is a Microsoft PocketPC-powered phone which includes at its core a baseband DSP 202 (digital signal processor) for handling the cellular communication functions (including for example voiceband and channel coding functions) and an applications processor 204 (e.g. Intel StrongArm SA-110) on which the PocketPC operating system runs. The phone supports GSM voice calls, SMS (Short Messaging Service) text messaging, wireless email, and desktop-like web browsing along with more traditional PDA features.
  • An RF synthesizer 206 and an RF radio transceiver 208, followed by a power amplifier module 210, implement the transmit and receive functions. The power amplifier module handles the final-stage RF transmit duties through an antenna 212. An interface ASIC 214 and an audio CODEC 216 provide interfaces to a speaker, a microphone, and other input/output devices provided in the phone such as a numeric or alphanumeric keypad (not shown) for entering commands and information.
  • DSP 202 uses a flash memory 218 for code store. A Li-Ion (lithium-ion) battery 220 powers the phone and a power management module 222 coupled to DSP 202 manages power consumption within the phone. SDRAM 224 and flash memory 226 provide volatile and non-volatile memory, respectively, for applications processor 204. This arrangement of memory holds the code for the operating system, the code for customizable features such as the phone directory, and the code for any embedded applications software in the smartphone, including the voice recognition software described above. The visual display device for the smartphone includes LCD driver chip 228 that drives LCD display 230. Clock module 232 provides the clock signals for the other devices within the phone and provides an indicator of real time. All of the above-described components are packaged within an appropriately designed housing 234.
  • Smartphone 200 described above represents the general internal structure of a number of different commercially available smartphones, and the internal circuit design of those phones is generally known in the art.
  • In the described embodiment, an application running on the applications processor 204 performs the process of updating the phonetic lexicon as described in FIGS. 1, 2 a, and 2 b.
  • Other aspects, modifications, and embodiments are within the scope of the following claims.

Claims (18)

What is claimed is:
1. A method comprising:
receiving a spoken utterance of a word or phrase;
generating a plurality of alternative pronunciations of the spoken utterance;
scoring one or more pronunciations of the plurality of alternative pronunciations using the spoken utterance; and
updating a lexicon with at least one scored pronunciation.
2. The method of claim 1, wherein the at least one scored pronunciation is the highest scoring pronunciation of the scored pronunciations.
3. The method of claim 1, further comprising using a finite-state machine to generate the plurality of alternative pronunciations.
4. The method of claim 1, wherein updating the lexicon includes replacing an existing pronunciation with the at least one scored pronunciation.
5. The method of claim 1, wherein updating the lexicon includes adding a phonetic representation of the at least one scored pronunciation to an existing representation.
6. The method of claim 1, wherein the alternative pronunciations are generated by searching a neighborhood of pronunciations about an initial pronunciation of the spoken utterance.
7. A system comprising:
at least one processor; and
a memory device operatively connected to the at least one processor;
wherein, responsive to execution of program instructions accessible to the at least one processor, the at least one processor is configured to:
receive a spoken utterance of a word or phrase;
generate a plurality of alternative pronunciations of the spoken utterance;
score one or more pronunciations of the plurality of alternative pronunciations using the spoken utterance; and
update a lexicon with at least one scored pronunciation.
8. The system of claim 7, wherein the at least one scored pronunciation is the highest scoring pronunciation of the scored pronunciations.
9. The system of claim 7, wherein the at least one processor is configured to use a finite-state machine to generate the plurality of alternative pronunciations.
10. The system of claim 7, wherein the at least one processor is configured to update the lexicon by replacing an existing pronunciation with the at least one scored pronunciation.
11. The system of claim 7, wherein the at least one processor is configured to update the lexicon by adding a phonetic representation of the at least one scored pronunciation to an existing representation.
12. The system of claim 7, wherein the at least one processor is configured to generate the alternative pronunciations by searching a neighborhood of pronunciations about an initial pronunciation of the spoken utterance.
13. A computer program product encoded in a non-transitory computer-readable medium, which when executed by a computer causes the computer to perform the following operations:
receiving a spoken utterance of a word or phrase;
generating a plurality of alternative pronunciations of the spoken utterance;
scoring one or more pronunciations of the plurality of alternative pronunciations using the spoken utterance; and
updating a lexicon with at least one scored pronunciation.
14. The computer program product of claim 13, wherein the at least one scored pronunciation is the highest scoring pronunciation of the scored pronunciations.
15. The computer program product of claim 13, wherein the computer uses a finite-state machine to generate the plurality of alternative pronunciations.
16. The computer program product of claim 13, wherein the computer updates the lexicon by replacing an existing pronunciation with the at least one scored pronunciation.
17. The computer program product of claim 13, wherein the computer updates the lexicon by adding a phonetic representation of the at least one scored pronunciation to an existing representation.
18. The computer program product of claim 13, wherein the computer generates the alternative pronunciations by searching a neighborhood of pronunciations about an initial pronunciation of the spoken utterance.
US14/041,857 2003-09-11 2013-09-30 Pronunciation Discovery for Spoken Words Abandoned US20140032216A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/041,857 US20140032216A1 (en) 2003-09-11 2013-09-30 Pronunciation Discovery for Spoken Words

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US50208403P 2003-09-11 2003-09-11
US10/939,942 US8577681B2 (en) 2003-09-11 2004-09-13 Pronunciation discovery for spoken words
US14/041,857 US20140032216A1 (en) 2003-09-11 2013-09-30 Pronunciation Discovery for Spoken Words

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/939,942 Continuation US8577681B2 (en) 2003-09-11 2004-09-13 Pronunciation discovery for spoken words

Publications (1)

Publication Number Publication Date
US20140032216A1 true US20140032216A1 (en) 2014-01-30

Family

ID=34312351

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/939,942 Active 2032-09-07 US8577681B2 (en) 2003-09-11 2004-09-13 Pronunciation discovery for spoken words
US14/041,857 Abandoned US20140032216A1 (en) 2003-09-11 2013-09-30 Pronunciation Discovery for Spoken Words

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/939,942 Active 2032-09-07 US8577681B2 (en) 2003-09-11 2004-09-13 Pronunciation discovery for spoken words

Country Status (2)

Country Link
US (2) US8577681B2 (en)
WO (1) WO2005027093A1 (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8793127B2 (en) * 2002-10-31 2014-07-29 Promptu Systems Corporation Method and apparatus for automatically determining speaker characteristics for speech-directed advertising or other enhancement of speech-controlled devices or services
WO2005027093A1 (en) * 2003-09-11 2005-03-24 Voice Signal Technologies, Inc. Generation of an alternative pronunciation
US7266495B1 (en) * 2003-09-12 2007-09-04 Nuance Communications, Inc. Method and system for learning linguistically valid word pronunciations from acoustic data
US7315811B2 (en) * 2003-12-31 2008-01-01 Dictaphone Corporation System and method for accented modification of a language model
US8543393B2 (en) * 2008-05-20 2013-09-24 Calabrio, Inc. Systems and methods of improving automated speech recognition accuracy using statistical analysis of search terms
CN102117614B (en) * 2010-01-05 2013-01-02 索尼爱立信移动通讯有限公司 Personalized text-to-speech synthesis and personalized speech feature extraction
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US8949125B1 (en) * 2010-06-16 2015-02-03 Google Inc. Annotating maps with user-contributed pronunciations
TWI564736B (en) * 2010-07-27 2017-01-01 Iq Tech Inc Method of merging single word and multiple words
US9640175B2 (en) * 2011-10-07 2017-05-02 Microsoft Technology Licensing, Llc Pronunciation learning from user correction
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US10957310B1 (en) 2012-07-23 2021-03-23 Soundhound, Inc. Integrated programming framework for speech and text understanding with meaning parsing
US20140067394A1 (en) * 2012-08-28 2014-03-06 King Abdulaziz City For Science And Technology System and method for decoding speech
US9697827B1 (en) * 2012-12-11 2017-07-04 Amazon Technologies, Inc. Error reduction in speech processing
US9626963B2 (en) * 2013-04-30 2017-04-18 Paypal, Inc. System and method of improving speech recognition using context
WO2014197334A2 (en) * 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
TWI508057B (en) * 2013-07-15 2015-11-11 Chunghwa Picture Tubes Ltd Speech recognition system and method
US11295730B1 (en) 2014-02-27 2022-04-05 Soundhound, Inc. Using phonetic variants in a local context to improve natural language understanding
CN104142909B (en) * 2014-05-07 2016-04-27 腾讯科技(深圳)有限公司 A kind of phonetic annotation of Chinese characters method and device
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9922643B2 (en) * 2014-12-23 2018-03-20 Nice Ltd. User-aided adaptation of a phonetic dictionary
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
DK179560B1 (en) 2017-05-16 2019-02-18 Apple Inc. Far-field extension for digital assistant services
US11280829B1 (en) * 2019-12-19 2022-03-22 Xlnx, Inc. System-on-chip having secure debug mode

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030110035A1 (en) * 2001-12-12 2003-06-12 Compaq Information Technologies Group, L.P. Systems and methods for combining subword detection and word detection for processing a spoken input
US20040215449A1 (en) * 2002-06-28 2004-10-28 Philippe Roy Multi-phoneme streamer and knowledge representation speech recognition system and method
US20050143970A1 (en) * 2003-09-11 2005-06-30 Voice Signal Technologies, Inc. Pronunciation discovery for spoken words

Family Cites Families (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5369727A (en) * 1991-05-16 1994-11-29 Matsushita Electric Industrial Co., Ltd. Method of speech recognition with correlation of similarities
US5805772A (en) * 1994-12-30 1998-09-08 Lucent Technologies Inc. Systems, methods and articles of manufacture for performing high resolution N-best string hypothesization
US5680511A (en) * 1995-06-07 1997-10-21 Dragon Systems, Inc. Systems and methods for word recognition
US5855000A (en) * 1995-09-08 1998-12-29 Carnegie Mellon University Method and apparatus for correcting and repairing machine-transcribed input using independent or cross-modal secondary input
US5822728A (en) * 1995-09-08 1998-10-13 Matsushita Electric Industrial Co., Ltd. Multistage word recognizer based on reliably detected phoneme similarity regions
US5712957A (en) * 1995-09-08 1998-01-27 Carnegie Mellon University Locating and correcting erroneously recognized portions of utterances by rescoring based on two n-best lists
US6064959A (en) * 1997-03-28 2000-05-16 Dragon Systems, Inc. Error correction in speech recognition
JP3535292B2 (en) * 1995-12-27 2004-06-07 Kddi株式会社 Speech recognition system
DE19639844A1 (en) * 1996-09-27 1998-04-02 Philips Patentverwaltung Method for deriving at least one sequence of words from a speech signal
US5950160A (en) * 1996-10-31 1999-09-07 Microsoft Corporation Method and system for displaying a variable number of alternative words during speech recognition
US5829000A (en) * 1996-10-31 1998-10-27 Microsoft Corporation Method and system for correcting misrecognized spoken words or phrases
US6137863A (en) * 1996-12-13 2000-10-24 At&T Corp. Statistical database correction of alphanumeric account numbers for speech recognition and touch-tone recognition
US5933804A (en) * 1997-04-10 1999-08-03 Microsoft Corporation Extensible speech recognition system that provides a user with audio feedback
US6076053A (en) * 1998-05-21 2000-06-13 Lucent Technologies Inc. Methods and apparatus for discriminative training and adaptation of pronunciation networks
US6243680B1 (en) * 1998-06-15 2001-06-05 Nortel Networks Limited Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US6192337B1 (en) * 1998-08-14 2001-02-20 International Business Machines Corporation Apparatus and methods for rejecting confusible words during training associated with a speech recognition system
US6684185B1 (en) 1998-09-04 2004-01-27 Matsushita Electric Industrial Co., Ltd. Small footprint language and vocabulary independent word recognizer using registration by word spelling
US6233553B1 (en) * 1998-09-04 2001-05-15 Matsushita Electric Industrial Co., Ltd. Method and system for automatically determining phonetic transcriptions associated with spelled words
US6363342B2 (en) * 1998-12-18 2002-03-26 Matsushita Electric Industrial Co., Ltd. System for developing word-pronunciation pairs
KR100310339B1 (en) * 1998-12-30 2002-01-17 윤종용 Voice recognition dialing method of mobile phone terminal
DE60026637T2 (en) * 1999-06-30 2006-10-05 International Business Machines Corp. Method for expanding the vocabulary of a speech recognition system
US6766069B1 (en) * 1999-12-21 2004-07-20 Xerox Corporation Text selection from images of documents using auto-completion
US6389394B1 (en) * 2000-02-09 2002-05-14 Speechworks International, Inc. Method and apparatus for improved speech recognition by modifying a pronunciation dictionary based on pattern definitions of alternate word pronunciations
US6963841B2 (en) * 2000-04-21 2005-11-08 Lessac Technology, Inc. Speech training method with alternative proper pronunciation database
US7149970B1 (en) * 2000-06-23 2006-12-12 Microsoft Corporation Method and system for filtering and selecting from a candidate list generated by a stochastic input method
US6856956B2 (en) * 2000-07-20 2005-02-15 Microsoft Corporation Method and apparatus for generating and displaying N-best alternatives in a speech recognition system
GB0027178D0 (en) * 2000-11-07 2000-12-27 Canon Kk Speech processing system
US6754625B2 (en) * 2000-12-26 2004-06-22 International Business Machines Corporation Augmentation of alternate word lists by acoustic confusability criterion
TW495736B (en) * 2001-02-21 2002-07-21 Ind Tech Res Inst Method for generating candidate strings in speech recognition
EP1239459A1 (en) * 2001-03-07 2002-09-11 Sony International (Europe) GmbH Adaptation of a speech recognizer to a non native speaker pronunciation
US6910012B2 (en) * 2001-05-16 2005-06-21 International Business Machines Corporation Method and system for speech recognition using phonetically similar word alternatives
US20020184019A1 (en) 2001-05-31 2002-12-05 International Business Machines Corporation Method of using empirical substitution data in speech recognition
US7809574B2 (en) * 2001-09-05 2010-10-05 Voice Signal Technologies Inc. Word recognition using choice lists
WO2004023455A2 (en) * 2002-09-06 2004-03-18 Voice Signal Technologies, Inc. Methods, systems, and programming for performing speech recognition
US7099828B2 (en) * 2001-11-07 2006-08-29 International Business Machines Corporation Method and apparatus for word pronunciation composition
US7089188B2 (en) * 2002-03-27 2006-08-08 Hewlett-Packard Development Company, L.P. Method to expand inputs for word or document searching
US7266495B1 (en) * 2003-09-12 2007-09-04 Nuance Communications, Inc. Method and system for learning linguistically valid word pronunciations from acoustic data

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10565997B1 (en) 2011-03-01 2020-02-18 Alice J. Stiebel Methods and systems for teaching a hebrew bible trope lesson
US11380334B1 (en) 2011-03-01 2022-07-05 Intelligible English LLC Methods and systems for interactive online language learning in a pandemic-aware world
US11062615B1 (en) 2011-03-01 2021-07-13 Intelligibility Training LLC Methods and systems for remote language learning in a pandemic-aware world
US10019995B1 (en) 2011-03-01 2018-07-10 Alice J. Stiebel Methods and systems for language learning based on a series of pitch patterns
US20140006029A1 (en) * 2012-06-29 2014-01-02 Rosetta Stone Ltd. Systems and methods for modeling l1-specific phonological errors in computer-assisted pronunciation training system
US10679616B2 (en) 2012-06-29 2020-06-09 Rosetta Stone Ltd. Generating acoustic models of alternative pronunciations for utterances spoken by a language learner in a non-native language
US10068569B2 (en) * 2012-06-29 2018-09-04 Rosetta Stone Ltd. Generating acoustic models of alternative pronunciations for utterances spoken by a language learner in a non-native language
US20140136210A1 (en) * 2012-11-14 2014-05-15 At&T Intellectual Property I, L.P. System and method for robust personalization of speech recognition
WO2017136028A1 (en) * 2016-02-03 2017-08-10 Google Inc. Learning personalized entity pronunciations
US10152965B2 (en) * 2016-02-03 2018-12-11 Google Llc Learning personalized entity pronunciations
CN107039038A (en) * 2016-02-03 2017-08-11 谷歌公司 Learn personalised entity pronunciation
US20170221475A1 (en) * 2016-02-03 2017-08-03 Google Inc. Learning personalized entity pronunciations
US10559296B2 (en) 2016-12-29 2020-02-11 Google Llc Automated speech pronunciation attribution
US10013971B1 (en) 2016-12-29 2018-07-03 Google Llc Automated speech pronunciation attribution
US11081099B2 (en) 2016-12-29 2021-08-03 Google Llc Automated speech pronunciation attribution

Also Published As

Publication number Publication date
US8577681B2 (en) 2013-11-05
WO2005027093A1 (en) 2005-03-24
US20050143970A1 (en) 2005-06-30

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION