WO2009064281A1 - Method and system for providing speech recognition - Google Patents
Method and system for providing speech recognition
- Publication number
- WO2009064281A1 (PCT/US2007/079413)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- name
- user
- speech recognition
- audio input
- database
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims description 48
- 230000004044 response Effects 0.000 claims abstract description 14
- 238000012790 confirmation Methods 0.000 claims 3
- 238000013459 approach Methods 0.000 abstract description 3
- 230000008569 process Effects 0.000 description 28
- 238000004891 communication Methods 0.000 description 27
- 238000010586 diagram Methods 0.000 description 8
- 238000012545 processing Methods 0.000 description 8
- 230000000153 supplemental effect Effects 0.000 description 8
- 238000012795 verification Methods 0.000 description 8
- 230000003287 optical effect Effects 0.000 description 7
- 230000005540 biological transmission Effects 0.000 description 4
- 230000002452 interceptive effect Effects 0.000 description 4
- 230000015572 biosynthetic process Effects 0.000 description 3
- 230000001413 cellular effect Effects 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 3
- 230000001419 dependent effect Effects 0.000 description 3
- 230000006870 function Effects 0.000 description 3
- 238000003786 synthesis reaction Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 230000005236 sound signal Effects 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 238000013518 transcription Methods 0.000 description 2
- 230000035897 transcription Effects 0.000 description 2
- 230000001154 acute effect Effects 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000003750 conditioning effect Effects 0.000 description 1
- 230000008878 coupling Effects 0.000 description 1
- 238000010168 coupling process Methods 0.000 description 1
- 238000005859 coupling reaction Methods 0.000 description 1
- 238000013480 data collection Methods 0.000 description 1
- 238000013479 data entry Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000002996 emotional effect Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 238000007781 pre-processing Methods 0.000 description 1
- 230000002441 reversible effect Effects 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 230000011664 signaling Effects 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 239000013598 vector Substances 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
Definitions
- By way of example, using the simple YESNOGRAMMAR grammar described below (which maps "yes" to true and "no" to false), the IVR system 107 prompts the user with the question, "Have you ever been to Colorado?" If the user responds "yes," the speech recognition system 101 recognizes the utterance and passes a "true" result to the interpretation generator 309 for output to the appropriate device, e.g., the voice browser 205, for system processing. If instead the user responded "maybe," the utterance would not compare to either the "yes" or "no" values within the grammar. As such, no recognition would result, and the edge comparison module 305 would produce an "out of grammar" condition and require the user to re-input their utterance.
- Grammars are used to limit users to those values defined within the grammar, i.e., expected utterances. For instance, if a user was asked to utter a numerical identifier, such as a social security number (SSN), a grammar could limit the first digit to numbers zero through seven, since no SSN begins with an eight or a nine. Accordingly, if a user uttered an SSN beginning with an eight, when the utterance is analyzed by the speech recognition system 101 and compared against the limited grammar, the result will inevitably be an "out of grammar" condition. Unfortunately, user utterances cannot always be "pigeonholed" into expected utterances.
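- A minimal sketch of the SSN first-digit constraint just described (the pattern and names below are illustrative only, not part of the patent) could look like:

    import re

    # First SSN digit limited to 0-7, followed by eight more digits.
    SSN_GRAMMAR = re.compile(r"[0-7]\d{8}")

    def check_ssn_utterance(digits):
        """Return the recognized value, or flag an out-of-grammar condition."""
        if SSN_GRAMMAR.fullmatch(digits):
            return digits               # in grammar: pass the value onward
        return "out_of_grammar"         # the caller would be re-prompted

    print(check_ssn_utterance("123456789"))  # accepted
    print(check_ssn_utterance("912345678"))  # out_of_grammar (starts with 9)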
- The speech recognition system 101, utilizing the YESNOGRAMMAR grammar, would not recognize a user utterance equating to the spoken word "affirmative" in place of "yes" or "negative" in place of "no."
- By way of further example, a simple name grammar, entitled SURNAMES, can be defined in which each entry (i.e., grammar value) includes a name and a spelling of that name.
- FIGs. 4A and 4B are flowcharts of a speech recognition process, according to an embodiment of the present invention.
- Initially, data (e.g., account information, a social security number, or other personalized information) is obtained from the user. Based on this data, the name associated with the account can be retrieved, per step 403.
- the user is prompted for a name, as in step 405.
- the user is requested to say and spell the name.
- the resultant audio input from the user is received in response to the name prompt.
- the process applies, as in step 409, speech recognition to the audio input using a primary name grammar database, such as the name grammar database 103.
- It is determined, per step 411, whether an out of grammar condition exists. If such a condition occurs, the user is re-prompted for the name, as in step 413. This time, the process applies a high confidence database to output the recognized name (step 415). That is, the process utilizes a secondary name grammar database of high confidence (e.g., confidence database 105) to output the latest recognized name.
- the names from an N-best list are combined with the name associated with the account or social security number to generate a supplemental name grammar; this process can be performed dynamically. Decoy names similar to the actual name can also be added to this supplemental name grammar.
- the level of confidence - i.e., "high" - can be predetermined or pre-set according to the application.
- The process determines whether the recognized name matches the retrieved name (as obtained in step 403), per step 417. If a match exists, the latest recognized name is confirmed with the user, per step 421. To confirm, the process, for example, can provide a simple prompt as follows: "I heard <name>. Is that correct?"
- the speech recognition process confirms the latest recognized name with the user, and reassesses the name wording (step 423).
- The process, for example, can provide a more directed prompt as follows: "I heard <name>. Are you sure that is the name of the account?"
- The expected result is not revealed to the caller; the caller must say the expected result and confirm. If the name is not correct, as determined in step 425, the process returns to step 413 to re-prompt the user. This process can be iterated any number of times (e.g., three times); that is, the number of iterations is configurable. If the user exceeds the maximum number of retries, the call can end with a failure event. Upon acknowledging that the name is correct, the process ends.
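- The overall two-pass flow of FIGs. 4A and 4B can be sketched as follows; every callable and name here is an illustrative stand-in rather than an interface defined by the patent:

    def capture_name(prompt, recognize_primary, recognize_supplemental,
                     confirm, expected_name, max_retries=3):
        """Sketch of the two-pass say-and-spell name capture."""
        name = recognize_primary(prompt("Please say and spell your name."))
        for _ in range(max_retries):
            if name is not None and name.lower() == expected_name.lower():
                if confirm("I heard " + name + ". Is that correct?"):
                    return name
            # Out-of-grammar result or mismatch: re-prompt against the smaller,
            # dynamically built supplemental (confidence) grammar.
            name = recognize_supplemental(prompt("Please say and spell your name again."))
        return None  # failure event after exhausting retries

    # Toy usage with canned stand-ins for the IVR and the recognizer:
    answers = iter([None, "jones"])              # first pass fails, second succeeds
    result = capture_name(
        prompt=lambda text: text,                # pretend the prompt returns audio
        recognize_primary=lambda audio: next(answers),
        recognize_supplemental=lambda audio: next(answers),
        confirm=lambda text: True,
        expected_name="Jones",
    )
    print(result)  # -> jones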
- the speech recognition process is now explained with respect to three scenarios related to an application for reporting of wages using SSNs as the personalized information.
- the first scenario involves using only the primary name grammar database 103, without the need to utilize the confidence database 105 (Table 1).
- the second scenario depicts the case in which the supplemental grammar database, e.g., confidence database 105, is required (Table 2).
- the last scenario as shown in Table 3, shows a failed condition.
- The speech recognition process of FIGs. 4A and 4B, therefore, can be utilized to improve conventional say-and-spell name capture.
- This approach allows the user's or caller's name to be acquired using another piece of information, or a data combination, such as a birth date and account or social security number.
- This actual name may be obtained and used in a supplemental name grammar to aid in the recognition of the caller's name.
- The processes described herein for providing speech recognition may be implemented via software, hardware (e.g., general processor, Digital Signal Processing (DSP) chip, an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Arrays (FPGAs), etc.), firmware, or a combination thereof.
- FIG. 5 illustrates a computer system 500 upon which an embodiment according to the present invention can be implemented.
- the computer system 500 includes a bus 501 or other communication mechanism for communicating information and a processor 503 coupled to the bus 501 for processing information.
- the computer system 500 also includes main memory 505, such as a random access memory (RAM) or other dynamic storage device, coupled to the bus 501 for storing information and instructions to be executed by the processor 503.
- Main memory 505 can also be used for storing temporary variables or other intermediate information during execution of instructions by the processor 503.
- the computer system 500 may further include a read only memory (ROM) 507 or other static storage device coupled to the bus 501 for storing static information and instructions for the processor 503.
- a storage device 509 such as a magnetic disk or optical disk, is coupled to the bus 501 for persistently storing information and instructions.
- the computer system 500 may be coupled via the bus 501 to a display 511, such as a cathode ray tube (CRT), liquid crystal display, active matrix display, or plasma display, for displaying information to a computer user.
- An input device 513 such as a keyboard including alphanumeric and other keys, is coupled to the bus 501 for communicating information and command selections to the processor 503.
- A cursor control 515, such as a mouse, a trackball, or cursor direction keys, is coupled to the bus 501 for communicating direction information and command selections to the processor 503 and for controlling cursor movement on the display 511.
- the processes described herein are performed by the computer system 500, in response to the processor 503 executing an arrangement of instructions contained in main memory 505.
- Such instructions can be read into main memory 505 from another computer-readable medium, such as the storage device 509.
- Execution of the arrangement of instructions contained in main memory 505 causes the processor 503 to perform the process steps described herein.
- processors in a multi-processing arrangement may also be employed to execute the instructions contained in main memory 505.
- hard-wired circuitry may be used in place of or in combination with software instructions to implement the embodiment of the present invention.
- embodiments of the present invention are not limited to any specific combination of hardware circuitry and software.
- the computer system 500 also includes a communication interface 517 coupled to bus 501.
- the communication interface 517 provides a two-way data communication coupling to a network link 519 connected to a local network 521.
- the communication interface 517 may be a digital subscriber line (DSL) card or modem, an integrated services digital network (ISDN) card, a cable modem, a telephone modem, or any other communication interface to provide a data communication connection to a corresponding type of communication line.
- communication interface 517 may be a local area network (LAN) card (e.g., for Ethernet™ or an Asynchronous Transfer Mode (ATM) network) to provide a data communication connection to a compatible LAN.
- Wireless links can also be implemented.
- communication interface 517 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
- the communication interface 517 can include peripheral interface devices, such as a Universal Serial Bus (USB) interface, a PCMCIA (Personal Computer Memory Card International Association) interface, etc.
- the network link 519 typically provides data communication through one or more networks to other data devices.
- the network link 519 may provide a connection through local network 521 to a host computer 523, which has connectivity to a network 525 (e.g. a wide area network (WAN) or the global packet data communication network now commonly referred to as the "Internet") or to data equipment operated by a service provider.
- the local network 521 and the network 525 both use electrical, electromagnetic, or optical signals to convey information and instructions.
- the signals through the various networks and the signals on the network link 519 and through the communication interface 517, which communicate digital data with the computer system 500, are exemplary forms of carrier waves bearing the information and instructions.
- the computer system 500 can send messages and receive data, including program code, through the network(s), the network link 519, and the communication interface 517.
- a server (not shown) might transmit requested code belonging to an application program for implementing an embodiment of the present invention through the network 525, the local network 521 and the communication interface 517.
- the processor 503 may execute the transmitted code while being received and/or store the code in the storage device 509, or other non-volatile storage for later execution. In this manner, the computer system 500 may obtain application code in the form of a carrier wave.
- Non-volatile media include, for example, optical or magnetic disks, such as the storage device 509.
- Volatile media include dynamic memory, such as main memory 505.
- Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 501. Transmission media can also take the form of acoustic, optical, or electromagnetic waves, such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, CDRW, DVD, any other optical medium, punch cards, paper tape, optical mark sheets, any other physical medium with patterns of holes or other optically recognizable indicia, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
- the instructions for carrying out at least part of the present invention may initially be borne on a magnetic disk of a remote computer.
- the remote computer loads the instructions into main memory and sends the instructions over a telephone line using a modem.
- a modem of a local computer system receives the data on the telephone line and uses an infrared transmitter to convert the data to an infrared signal and transmit the infrared signal to a portable computing device, such as a personal digital assistant (PDA) or a laptop.
- An infrared detector on the portable computing device receives the information and instructions borne by the infrared signal and places the data on a bus.
- the bus conveys the data to main memory, from which a processor retrieves and executes the instructions.
- the instructions received by main memory can optionally be stored on storage device either before or after execution by processor.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Abstract
An approach for providing speech recognition is disclosed. A name is retrieved for a user based on data provided by the user. The user is prompted for a name of the user. A first audio input is received from the user in response to the prompt. Speech recognition is applied to the first audio input using a name grammar database to output a recognized name. A determination is made whether the recognized name matches the retrieved name. If no match is found, the user is re-prompted for the name of the user to obtain a second audio input. Speech recognition is applied to the second audio input using a confidence database having fewer entries than the name grammar database.
Description
METHOD AND SYSTEM FOR PROVIDING SPEECH RECOGNITION
RELATED APPLICATION
The present application claims priority of U.S. Patent Application Serial Number 11/526,395 filed September 25, 2006 (attorney docket number COS06005), the contents of which are hereby incorporated by reference.
BACKGROUND INFORMATION
Speech recognition plays an important role in communication systems, for both gathering and supplying information to users. Traditionally, interactive voice response (IVR) systems have relied upon a combination of dual-tone multi-frequency (DTMF) and speech inputs to acquire and process information. However, for complicated transactions requiring a quantity of numbers, letters, and words to be input, the concept of an IVR system has been more appealing than its execution. Namely, typical DTMF interfaces have proven to be impractically slow for complex data entry. As such, organizations are becoming ever reliant upon voice based systems to augment DTMF inputs. Unfortunately, voice based systems have introduced new, more challenging issues pertaining to the intricacies of spoken language and the infinite variations on human utterance. Accordingly, IVR systems implementing speech recognition technology have proven to be unacceptably inaccurate at converting a spoken utterance to a corresponding textual string or other equivalent symbolic representation.
Therefore, there is a need for an improved approach for providing speech recognition.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
FIG. 1 is a diagram illustrating a communication system capable of providing speech recognition to acquire a name, in accordance with an embodiment of the present invention;
FIG. 2 is a diagram of an exemplary interactive voice response (IVR) unit, according to an embodiment of the present invention;
FIG. 3 is a diagram of a speech recognition system, in accordance with an embodiment of the present invention;
FIGs. 4A and 4B are flowcharts of a speech recognition process, according to an embodiment of the present invention; and
FIG. 5 is a diagram of a computer system that can be used to implement various embodiments of the present invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
An apparatus, method, and software for providing speech recognition are described. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It is apparent, however, to one skilled in the art that the present invention may be practiced without these specific details or with an equivalent arrangement. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Although the various embodiments of the present invention are described with respect to speech recognition of a proper noun (e.g., a name), it is contemplated that these embodiments have applicability to generalized speech recognition using equivalent interfaces and operations.
FIG. 1 is a diagram illustrating a communication system capable of providing speech recognition to acquire a name, in accordance with an embodiment of the present invention. A communication system 100 includes a speech recognition system (or logic) 101 that utilizes a name grammar database 103, a confidence database 105. The speech recognition system 101 operates with an interactive voice response (IVR) unit (or system) 107, which receives a voice call from a station 109 over a telephony network 111. The telephony network 111 can be a circuit-switched system or a packetized voice network (e.g., Voice over Internet Protocol (VoIP) network). The packetized voice network 111 can be accessed by a suitable station 109 - e.g., computer, workstation, or other device (e.g., personal digital assistant (PDA), etc.) supporting microphone and speaker functionality. The IVR system 107, among other functions, collects and provides data to users. The IVR system 107 is more fully explained in FIG. 2. Data collection is supported by a data repository 113.
For the purposes of illustration, the speech recognition system 101 is described with respect to the recognition of audio signals representing names. A user's name is arguably the most routinely gathered, commonly used piece of information. Unfortunately, acquiring a user's name can be a difficult task for conventional systems, which utilize dual-tone multi-frequency (DTMF) input interfaces. For instance, DTMF interfaces become increasingly more impractical
as the quantity of letters contained within an individual's name increases. Also, many phone designs (notably cellular phones) require the speaker and the dial-pad to be constructed together, such that it is inconvenient for the user to use the dial-pad while listening to voice prompts. As a result, speech recognition has been introduced to supplement DTMF interfaces.
Traditional speech recognition interfaces are highly dependent upon grammatical context and ordinary pronunciation rules to achieve accurate conversion results. However, with user names (or any proper nouns), these techniques have proven to be inadequate because these types of words generally have no significant grammatical context that can be used to differentiate among possible conversion alternatives. Further, ordinary pronunciation rules provide little, if any, beneficial value since proper nouns contain a disproportionately large number of nonstandard pronunciation variations. Thus, phonetic variability is exemplified not only by the loss of context but also by the acoustic differences between phonemes themselves.
Further, speech recognition technology is hindered by a set of characteristic complexities independent from the types of utterances being converted. For instance, acoustic variability introduced by environmental background noise, microphone positioning, as well as transducer quality, add to the loss of conversion accuracy. In addition, speaker variability resulting from physical and emotional states, speaking rates, voice quality and intensity, sociolinguistic background, dialect, as well as vocal tract size and shape also contribute to the loss of recognition accuracy.
Returning to FIG. 1, the speech recognition system 101, which is more fully described below with respect to FIG. 3, can support a myriad of applications involving interaction with a human user, such as call flow processing, directory assistance, commerce transactions (e.g., airline ticketing, stock brokering, banking, order placement, etc.), browsing/collecting information, and the like.
Although not shown, the IVR system 107 can access the data repository 113 via a data network, which can include a local area network (LAN), a wide area network (WAN), a cellular or satellite network, the Internet, etc. Further, those of ordinary skill in the art will appreciate that data repository 113 can be directly linked to or included within IVR system 107. As such,
data repository 113 can be any type of information store (e.g., database, server, computer, etc) that associates personalized information with user names. This personalized information can include any one or combination of a birth date, an account number (e.g., bank, credit card, billing codes, etc.), a social security number (SSN), an address (e.g., work, home, internet protocol (IP), media access control (MAC), etc.), telephone listing (home, work, cellular, etc.), as well as any other form of uniquely identifiable datum, e.g., biometric code, voice print, etc.
In one embodiment of the present invention, the data repository 113 is configured to allow reverse searching for a user's name using one or more of the above listed personalized information forms. Moreover, data repository 113 can be automatically updated and maintained by any source, including third party vendors.
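As a minimal sketch of such a reverse search (the record fields, sample values, and function name below are illustrative stand-ins, not structures defined by the patent), a small in-memory repository might be queried by any piece of personalized information:

    # Hypothetical in-memory stand-in for the data repository 113.
    RECORDS = [
        {"name": "Maria Garcia", "ssn": "123456789", "account": "40-1122"},
        {"name": "John Smith",   "ssn": "987654321", "account": "40-3344"},
    ]

    def reverse_lookup_name(**criteria):
        """Return the first name whose record matches every supplied field."""
        for record in RECORDS:
            if all(record.get(field) == value for field, value in criteria.items()):
                return record["name"]
        return None

    print(reverse_lookup_name(ssn="123456789"))    # -> Maria Garcia
    print(reverse_lookup_name(account="40-3344"))  # -> John Smith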
Although the speech recognition system 101 is shown as a separate component, it is contemplated that the speech recognition system 101 can be integrated with the IVR system 107.
FIG. 2 is a diagram of an exemplary interactive voice response (IVR) system, according to an embodiment of the present invention. In this example, the IVR system 107 includes a telephony interface 201, a resource manager 203, and a voice browser 205. The IVR system 107 utilizes the telephony interface 201 for communicating with one or more users over the telephony network 111. In alternative embodiments, other interfaces are utilized depending on the access method of the user. Moreover, although the IVR system components are shown as separate, distributed entities, the IVR system 107 can incorporate some or all of the functionalities into a single network element.
As shown, the resource manager 203 provides various speech resources, such as a verification system 207, an automatic speech recognizer (ASR) 209, and a text-to-speech (TTS) engine 211. The TTS engine 211 converts textual information (digital signal) from the voice browser 205 to speech (analog signal) for playback to a user. The TTS engine 211 accomplishes this transition through a front-end input and a back-end output. The input converts raw text into its written-out word equivalent through text normalization, pre-processing, and/or tokenization. Words are then assigned phonetic transcriptions and divided into prosodic units, e.g., phrases, clauses, and/or sentences. Using this combination of phonetic transcriptions and prosody
arrangements, the front-end input communicates a symbolic linguistic representation to the back-end output for synthesizing. Based on the desired level of naturalness or intelligibility, the back-end output is capable of generating speech waveforms through any one of the following synthesis processes: concatenative, unit selection, diphone, domain-specific, formant, articulatory, Hidden Markov Model (HMM), and other like methods, as well as any hybrid combination thereof. Through the synthesis process, the back-end output generates the actual sound output transmitted to the user.
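The front-end stage can be pictured with a toy sketch such as the following; the lexicon, expansion rules, and function names are invented for illustration, and no back-end waveform synthesis is attempted:

    # Illustrative TTS front end only: normalize raw text, then assign
    # ARPAbet-style phonetic transcriptions from a toy lexicon.
    NUMBER_WORDS = {"2": "two", "4": "four"}
    LEXICON = {"dr.": "D AA K T ER", "smith": "S M IH TH", "is": "IH Z",
               "in": "IH N", "office": "AO F AH S", "four": "F AO R"}

    def normalize(text):
        """Expand raw text into written-out, lower-case word tokens."""
        return [NUMBER_WORDS.get(tok, tok) for tok in text.lower().split()]

    def transcribe(tokens):
        """Assign a phonetic transcription to each token ('<oov>' if unknown)."""
        return [LEXICON.get(tok, "<oov>") for tok in tokens]

    words = normalize("Dr. Smith is in office 4")
    print(words)              # ['dr.', 'smith', 'is', 'in', 'office', 'four']
    print(transcribe(words))  # phoneme strings for each word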
The ASR 209 can effectively behave as the speech recognition system 101, or alternatively be an interface to the speech recognition system 101; the particular embodiment depends on the application. The ASR 209 effectively converts a user's spoken language (represented by analog signals) into textual or an equivalent symbolic form (digital signal) for processing by the voice browser 205 and/or verification system 207.
The voice browser 205 can play pre-recorded sound files to the user in lieu of, or in addition to, use of the TTS engine 211. According to one embodiment of the present invention, the resource manager 203 can include an analog-to-digital and digital-to-analog converter (not shown) for signaling between the station 109, for example, and the voice browser 205. Further, in alternative embodiments, the voice browser 205 may contain speech recognition and synthesis logic (not shown) that implements the above, thereby extracting meaning from the user's spoken utterances and producing acoustic renditions of text directly.
The verification system 207 can be linked to the telephony interface 201, the ASR 209, or both components depending upon the method of authentication desired. Accordingly, a user name, password, code, or other unique identification can be required by the verification system 207 for limiting access to the voice browser 205. In this manner, users can be required to provide this information using either spoken utterances transmitted through the ASR 209 or DTMF signals transmitted via telephony interface 201. Alternatively, the verification system 207 can provide an unobtrusive level of security by positively identifying and screening users based on their voice prints transmitted from telephony interface 201. Thus, in either embodiment, the verification system 207 can keep sensitive transactions secure.
The voice browser 205 functions as a gateway between a call, for example, and a variety of networked applications. The voice browser 205 can employ a microphone, keypad, and a speaker instead of a keyboard, mouse, and monitor of a conventional web-based system. The voice browser 205 processes pages of markup language, such as voice extensible markup language (VoiceXML), speech application language tags (SALT), hypertext markup language (HTML), and others such as wireless markup language (WML) for wireless application protocol (WAP) based cell phone applications, and the World Wide Web (W3) platform for handheld devices, residing on a server (not shown). Since a broad range of markup languages is supported, the voice browser 205 can be configured accordingly, to include a VoiceXML-compliant browser, a SALT-compliant browser, an HTML-compliant browser, a WML-compliant browser, or any other markup-language-compliant browser, for communicating with users. As with standard web services and applications, the voice browser 205 can utilize a standardized networked infrastructure, i.e., hypertext transport protocol (HTTP), cookies, web caches, uniform resource locators (URLs), secure HTTP, etc., to establish and maintain connections.
FIG. 3 is a diagram of a speech recognition system, in accordance with an embodiment of the present invention. The speech recognition system 101 can provide speaker dependent and/or independent automatic voice recognition of acoustic utterances from the user. Accordingly, the speech recognition system 101 processes voice communications transmitted over telephony network 111 to determine whether a word or a speech pattern matches any grammar or vocabulary stored within a database (e.g., name grammar database 103 or confidence database 105). The name grammar database 103 is populated with possible combinations of user names and spellings of those names. According to one embodiment of the present invention, the name grammar database 103 can be built according to the NUANCE™ Say and Spell name grammar.
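A say-and-spell grammar of this kind pairs each spoken-and-spelled utterance with the name value it represents. The short sketch below (the entry format and function name are illustrative, not the actual NUANCE grammar syntax) generates such pairs:

    def say_and_spell_entries(names):
        """Yield (utterance, value) pairs such as ('smith s m i t h', 'Smith')."""
        for name in names:
            spoken = name.lower()
            spelled = " ".join(spoken)
            yield spoken + " " + spelled, name

    for utterance, value in say_and_spell_entries(["Smith", "Nguyen"]):
        print("(" + utterance + ") {" + value + "}")
    # (smith s m i t h) {Smith}
    # (nguyen n g u y e n) {Nguyen}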
In alternative embodiments, the database 103 can include any grammar database including names and spellings of those names as well as a dictionary database, another grammar database, an acoustic model database, and/or a natural language definition database. Dictionary
databases contain phonetic pronunciations for words used in grammar databases. Acoustic model databases define, among other things, the languages that the speech application utilizes.
Moreover, while only one name grammar database 103 and one confidence database are shown, it is recognized that multiple databases may exist controlled by, for instance, a database management system (not shown). In a database management system, data is stored in one or more data containers, each container contains records, and the data within each record is organized into one or more fields. In relational database systems, the data containers are referred to as tables, the records are referred to as rows, and the fields are referred to as columns. In object-oriented databases, the data containers are referred to as object classes, the records are referred to as objects, and the fields are referred to as attributes.
As seen in FIG. 3, a supplemental grammar database 105, denoted as "confidence database," is used in conjunction with the name grammar database 103 to produce accurate recognition of user names. The confidence database 105, in an exemplary embodiment, can be derived from the primary name grammar database 103, such as an N-best list (with N being an integer that can be set according to the particular application). The N-best result can include the expected name result that would increase recognition. In other words, the N-best result is a list of items returned from the grammar that correlate well to the caller's utterance. The N-best list is sorted by likelihood of a match and includes one or more entries. In this process, the correct name is added to this N-best supplemental grammar. According to one embodiment, there is no weighting or preference given to any item in this supplemental name grammar. This smaller subset of the full name grammar, containing both decoy names and the correct name, will allow for better recognition of the caller's name. This supplemental grammar database can be dynamically built, in accordance with one embodiment of the present invention. A decoy application 311 is utilized, according to an exemplary embodiment, to generate variations of the names within the N-best list to enhance the probability of recognition. These generated names, which can possibly include the correct name, are provided as additional entries into the confidence database 105.
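A minimal sketch of building such a supplemental grammar follows; the decoy-generation rule is invented for illustration (the patent does not specify how the decoy application 311 derives its variants), and, as noted above, no weighting is applied to the entries:

    import random

    def build_supplemental_grammar(n_best, expected_name, decoys_per_name=2, seed=0):
        """Combine the N-best list, the expected name, and decoy variants
        into one unweighted supplemental grammar (a plain set here)."""
        rng = random.Random(seed)
        grammar = set(name.lower() for name in n_best)
        grammar.add(expected_name.lower())
        for name in list(grammar):
            for _ in range(decoys_per_name):
                # Crude decoy generation: perturb a single letter of the name.
                i = rng.randrange(len(name))
                grammar.add(name[:i] + rng.choice("abcdeghklmnoprstu") + name[i + 1:])
        return sorted(grammar)

    print(build_supplemental_grammar(["jonas", "johns"], expected_name="Jones"))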
The speech recognition system 101 is configured to process acoustic utterances to determine whether a word or speech pattern matches any name stored within the name grammar database 103 and/or the confidence database 105. When a match is identified for a particular utterance (or set of utterances) of the voice communication, the speech recognition system 101 sends an output signal for implementation by the verification system 207 and/or the voice browser 205. Thus, it is contemplated that the speech recognition system 101 can include speaker dependent and/or independent voice recognition. Further, the speech recognition system 101 can be implemented by any suitable voice recognition system capable of detecting and converting voice communications into text or other equivalent symbolic representations.
As such, the speech recognition system 101 includes a digitizer 301 for digitizing an audio input (e.g., speech), a parsing module 303, and an edge comparison module 305, as well as a confidence value generator 307 and interpretation generator 309. Moreover, the speech recognition system 101 makes use of the name grammar database 103 and the confidence database 105 to aid in more accurately recognizing a user's name; this process is more fully described with respect to FIGs. 4A and 4B.
In operation, the digitizer 301 accepts acoustic or audio signals (i.e., user utterances) from the telephony interface 201 and converts them into digital signals through an analog-to-digital converter. Once digitized, the signal is converted into the frequency domain using known methods, e.g., discrete/fast/short-form Fourier transform, etc., and combined into a frequency spectrum frame for further processing. Since the human ear can only perceive audible acoustics ranging from 20 Hz to 20 kHz, and since the human voice typically produces utterances only within the 500 Hz to 2 kHz range, the digitizer 301 can be optimized to operate within these ranges. It is noted the digitizer 301 can include a host of signal processing components, e.g., filters, amplifiers, modulators, compressors, error detectors/checkers, etc., for conditioning the signal, e.g., removing signal noises like ambient noise, canceling transmission echoing, etc.
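For illustration only, the following sketch shows one way a digitized utterance might be framed and converted into frequency-spectrum frames; the sample rate, frame size, window, and band limits are assumptions made for the example rather than parameters specified for the digitizer 301.

import numpy as np

SAMPLE_RATE = 8000             # assumed telephony sampling rate, in Hz
FRAME_SIZE = 256               # samples per analysis frame (assumed)
VOICE_BAND = (500.0, 2000.0)   # band in which most voiced speech energy falls

def spectrum_frames(samples):
    # Split the digitized signal (a NumPy array) into frames and convert each
    # frame to the frequency domain, keeping only the portion of the spectrum
    # inside the typical voice band.
    frames = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = samples[start:start + FRAME_SIZE] * np.hamming(FRAME_SIZE)
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(FRAME_SIZE, d=1.0 / SAMPLE_RATE)
        mask = (freqs >= VOICE_BAND[0]) & (freqs <= VOICE_BAND[1])
        frames.append(spectrum[mask])
    return frames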
After the digitizer 301 processes the analog signal, a corresponding digital signal is passed to the parsing module 303 for extracting acoustic parameters using known methods, e.g., linear predictive coding. For instance, the parsing module 303 can identify acoustic feature
vectors that include cepstral coefficients that identify the phonetic classifications and word boundaries of a user's utterance. It is recognized that other conventional modeling techniques can be used to extract one or more characteristics and/or patterns that classify distinctive acoustic portions of the digital signal.
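As a rough illustration of feature extraction, the fragment below computes a handful of cepstral coefficients from a magnitude-spectrum frame. It uses a simple FFT-based real cepstrum rather than the linear predictive coding mentioned above, so it is a stand-in for, not a description of, the parsing module 303.

import numpy as np

def cepstral_coefficients(spectrum_frame, num_coefficients=12):
    # Real cepstrum: inverse FFT of the log magnitude spectrum. The low-order
    # coefficients summarize the spectral envelope, which is what helps
    # distinguish one phonetic class from another.
    log_spectrum = np.log(np.asarray(spectrum_frame) + 1e-10)  # avoid log(0)
    cepstrum = np.fft.irfft(log_spectrum)
    return cepstrum[:num_coefficients]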
Once parsed, the various acoustic features defined by the parsing module 303 are input into the edge comparison module 305 for comparison with and identification as recognized words, e.g., first, middle, and/or last names of the user. Accordingly, the edge comparison module 305 can use any known speech recognition method and/or algorithm, e.g., hidden Markov modeling (HMM), as well as the name grammar database 103 and the confidence database 105, to recognize user utterances as words. After the words are identified, the interpretation generator 309 passes an associated equivalent textual or symbolic representation (hereinafter collectively referred to as a "value") to the voice browser 205 and/or verification system 207 for appropriate processing.
In general, a grammar database stores all the possible combinations of user utterances, and associated values, that are validly accepted by a particular speech application. By way of example, a simple grammar, denoted as "YESNOGRAMMAR," can be defined as follows:
YESNOGRAMMAR
[
(yes) {true}
(no) {false}
]
In this example, the contents of the grammar are contained within the [ ] brackets. Items within the ( ) brackets are used by the edge comparison module 305 for comparison against the acoustic features extracted from the user's utterances. When the acoustic features compare favorably to an item within the ( ) brackets, the value contained within the corresponding { } brackets is passed to the interpretation generator 309.
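A minimal sketch of how such a grammar could be represented and consulted is shown below; the dictionary form and the literal string comparison are simplifications standing in for the acoustic comparison performed by the edge comparison module 305.

# Each expected utterance (the "( )" item) maps to the value (the "{ }" item)
# that would be passed to the interpretation generator on a match.
YESNOGRAMMAR = {
    "yes": "true",
    "no": "false",
}

def recognize(utterance, grammar):
    value = grammar.get(utterance.strip().lower())
    if value is None:
        return "OUT_OF_GRAMMAR"   # no item in the grammar matched
    return value

print(recognize("Yes", YESNOGRAMMAR))    # -> "true"
print(recognize("maybe", YESNOGRAMMAR))  # -> "OUT_OF_GRAMMAR"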
The edge comparison module 305 utilizes a confidence value generator 307 to determine the level of confidence that measures the correlation of a recognized utterance to a value of an item within the grammar database. High confidence values imply greater similarity between the
recognized utterance and the value of an item within the grammar database. Conversely, a low confidence value implies a poor similarity. In cases where an utterance is not recognized, i.e., the confidence value generator 307 perceives no similarity to any item within the grammar, the edge comparison module 305 will produce an "out of grammar" condition and require the user to re-input the utterance.
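The confidence-scoring step can be pictured as follows; the threshold value and the scoring callable are assumptions made for the sketch, not figures from the described system.

CONFIDENCE_THRESHOLD = 0.45   # assumed cutoff between a match and "out of grammar"

def best_match(utterance_features, grammar_items, score_fn):
    # score_fn(features, item) returns a confidence in [0, 1]; it stands in
    # for the confidence value generator 307.
    best_confidence, best_value = 0.0, None
    for item, value in grammar_items.items():
        confidence = score_fn(utterance_features, item)
        if confidence > best_confidence:
            best_confidence, best_value = confidence, value
    if best_confidence < CONFIDENCE_THRESHOLD:
        return None   # "out of grammar": the caller will be re-prompted
    return best_value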
Using the simple YESNOGRAMMAR defined above, an exemplary speech recognition process is explained as follows. First, the IVR system 107 prompts the user with the question, "Have you ever been to Colorado?" If the user responds "yes," the speech recognition system 101 recognizes the utterance and passes a "true" result to the interpretation generator 309 for output to the appropriate device, e.g., voice browser 205, for system processing. If instead the user responded "maybe," the utterance would not compare to either the "yes" or "no" values within the grammar YESNOGRAMMAR. As such, no recognition would occur, and the edge comparison module would produce an "out of grammar" condition and require the user to re-input the utterance.
In this regard, grammars are used to limit users to those values defined within the grammar, i.e., expected utterances. For instance, if a user was asked to utter a numerical identifier, such as a social security number (SSN), a grammar could limit the first digit to numbers zero through seven, since no SSN begins with an eight or a nine. Accordingly, if a user uttered an SSN beginning with an eight, when the utterance is analyzed by the speech recognition system 101 and compared against the limited grammar, the result will inevitably be an "out of grammar" condition.

Unfortunately, user utterances cannot always be "pigeonholed" into expected utterances. For instance, the speech recognition system 101, utilizing the above YESNOGRAMMAR grammar, would not recognize a user utterance equating to the spoken words "affirmative" in place of "yes" or "negative" in place of "no." However, attempting to provide every possible alternative utterance to an expected utterance is impractical, especially as the complexity of the expected utterance increases.
An acute subset of this impracticality arises with the speech recognition of proper nouns, or more specifically, with user names. A simple name grammar, entitled SURNAMES, can be defined as illustrated below:
SURNAMES
[
(white w h i t e) {white}
(brimm b r i m m) {brimm}
(cage c a g e) {cage}
(langford l a n g f o r d) {langford}
(whyte w h y t e) {whyte}
]
In this example, each grammar entry includes both a name and a spelling of the name.
Since an almost infinite array of user names exists, typical name grammars contain only a portion, albeit a large percentage, of possible names. Further, those names stored within the name grammar are typically arranged or otherwise "tuned" to account for name popularity. While these features avoid overwhelming system resources and provide good coverage for common names, users who utter unique names not within the grammar will ultimately produce an "out of grammar" condition. Moreover, users who use uncommon spellings of common-sounding names, e.g., "Whyte" instead of "White," will be presented with the wrong name due to the phonetic similarities and "tuned" nature of name grammars. It is this impracticality that the speech recognition system 101 addresses. The operation of the speech recognition system 101 is next described.
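For illustration, a say-and-spell entry like those in SURNAMES can be represented as the spoken name plus its letter sequence; matching on the spelled letters is what lets the system separate phonetically similar names such as "White" and "Whyte." The dictionary form below is a simplified sketch introduced for the example, not the grammar engine itself.

# Hypothetical in-memory form of the SURNAMES grammar: each entry pairs the
# spoken name with its spelled-out letters and maps to the returned value.
SURNAMES = {
    ("white", "w h i t e"): "white",
    ("whyte", "w h y t e"): "whyte",
    ("brimm", "b r i m m"): "brimm",
}

def match_say_and_spell(spoken_name, spelled_letters):
    # Phonetically, "white" and "whyte" may be indistinguishable; the spelled
    # letters resolve the ambiguity.
    for (name, spelling), value in SURNAMES.items():
        if spelling == spelled_letters:
            return value
    return "OUT_OF_GRAMMAR"

print(match_say_and_spell("white", "w h y t e"))   # -> "whyte"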
FIGs. 4A and 4B are flowcharts of a speech recognition process, according to an embodiment of the present invention. In step 401, data (e.g., account information, social security number, or other personalized information) is received from the user, as part of an application or call flow of the IVR system 107, for instance. Through use of more readily recognizable data, such as an account or social security number, the name associated with the account can be retrieved, per step 403. Next, the user is prompted for a name, as in step 405. The user is requested to say and spell the name.
In step 407, the resultant audio input from the user is received in response to the name prompt. The process then applies, as in step 409, speech recognition to the audio input using a primary name grammar database, such as the name grammar database 103. It is determined, per step 411, whether an out of grammar condition exists. If such a condition occurs, the user is re-prompted for the name, as in step 413. This time, the process applies a high confidence database to output the recognized name (step 415). That is, the process utilizes a secondary name grammar database of high confidence (e.g., confidence database 105) to output the latest recognized name. In one embodiment, the names from an N-best list are combined with the name associated with the account or social security number to generate a supplemental name grammar; this process can be performed dynamically. Decoy names similar to the actual name can also be added to this supplemental name grammar. The level of confidence, i.e., "high," can be predetermined or pre-set according to the application.
Thereafter, the process determines whether the recognized name matches the retrieved name (as obtained in step 403), per step 417. If a match exists, the latest recognized name is confirmed with the user, per step 421. To confirm, the process, for example, can provide a simple prompt as follows: "I heard <name>. Is that correct?"
If there is not a match, as determined per step 419, the speech recognition process confirms the latest recognized name with the user, and reassesses the name wording (step 423). To confirm, the process, for example, can provide a more directed prompt as follows: "I heard <name>. Are you sure that is the name of the account?"
According to one embodiment, for security purposes, the expected result is not revealed to the caller; the caller must say the expected result and confirm it. If the name is not correct, as determined in step 425, the process returns to step 413 to re-prompt the user. This process can be iterated any number of times (e.g., three times); that is, the number of iterations is configurable. If the user exceeds the maximum number of retries, the call can end with a failure event. Upon acknowledging that the name is correct, the process ends.
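The re-prompt and confirmation loop of FIGs. 4A and 4B can be sketched as follows. All of the callables, and the retry limit of three, are hypothetical stand-ins introduced for the example.

MAX_RETRIES = 3   # configurable number of re-prompt iterations

def name_capture_flow(prompt_name, recognize_primary, recognize_confidence, confirm):
    # prompt_name() returns audio; recognize_*(audio) return a recognized name
    # or None on an out-of-grammar condition; confirm(name) returns True/False.
    audio = prompt_name()                      # steps 405-407
    recognized = recognize_primary(audio)      # step 409 (primary name grammar)
    for _ in range(MAX_RETRIES):
        if recognized is None:                 # out of grammar (step 411)
            audio = prompt_name()              # re-prompt (step 413)
            recognized = recognize_confidence(audio)   # confidence database (step 415)
            continue
        if confirm(recognized):                # confirmation (steps 417-425)
            return recognized                  # success: process ends
        audio = prompt_name()                  # name rejected: re-prompt (step 413)
        recognized = recognize_confidence(audio)
    return None   # failure event after the maximum number of retries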
For the purposes of illustration, this speech recognition process is now explained with respect to three scenarios related to an application for reporting of wages using SSNs as the
personalized information. The first scenario involves using only the primary name grammar database 103, without the need to utilize the confidence database 105 (Table 1). The second scenario depicts the case in which the supplemental grammar database, e.g., confidence database 105, is required (Table 2). The last scenario, as shown in Table 3, shows a failed condition.
The speech recognition process of FIGs. 4A and 4B, therefore, can be utilized to improve conventional say-and-spell name capture in speech recognition. This approach allows the user's or caller's name to be acquired using another piece of information, or a data combination, such as a birth date and an account or social security number. This actual name may then be obtained and used in a supplemental name grammar to aid in the recognition of the caller's name.
The processes described herein for providing speech recognition may be implemented via software, hardware (e.g., general processor, Digital Signal Processing (DSP) chip, an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Arrays (FPGAs), etc.), firmware, or a combination thereof. Such exemplary hardware for performing the described functions is detailed below.
FIG. 5 illustrates a computer system 500 upon which an embodiment according to the present invention can be implemented. For example, the processes described herein can be
implemented using the computer system 500. The computer system 500 includes a bus 501 or other communication mechanism for communicating information and a processor 503 coupled to the bus 501 for processing information. The computer system 500 also includes main memory 505, such as a random access memory (RAM) or other dynamic storage device, coupled to the bus 501 for storing information and instructions to be executed by the processor 503. Main memory 505 can also be used for storing temporary variables or other intermediate information during execution of instructions by the processor 503. The computer system 500 may further include a read only memory (ROM) 507 or other static storage device coupled to the bus 501 for storing static information and instructions for the processor 503. A storage device 509, such as a magnetic disk or optical disk, is coupled to the bus 501 for persistently storing information and instructions.
The computer system 500 may be coupled via the bus 501 to a display 511, such as a cathode ray tube (CRT), liquid crystal display, active matrix display, or plasma display, for displaying information to a computer user. An input device 513, such as a keyboard including alphanumeric and other keys, is coupled to the bus 501 for communicating information and command selections to the processor 503. Another type of user input device is a cursor control 515, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to the processor 503 and for controlling cursor movement on the display 511.
According to one embodiment of the invention, the processes described herein are performed by the computer system 500, in response to the processor 503 executing an arrangement of instructions contained in main memory 505. Such instructions can be read into main memory 505 from another computer-readable medium, such as the storage device 509. Execution of the arrangement of instructions contained in main memory 505 causes the processor 503 to perform the process steps described herein. One or more processors in a multi-processing arrangement may also be employed to execute the instructions contained in main memory 505. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the embodiment of the present invention. Thus, embodiments
of the present invention are not limited to any specific combination of hardware circuitry and software.
The computer system 500 also includes a communication interface 517 coupled to bus 501. The communication interface 517 provides a two-way data communication coupling to a network link 519 connected to a local network 521. For example, the communication interface 517 may be a digital subscriber line (DSL) card or modem, an integrated services digital network (ISDN) card, a cable modem, a telephone modem, or any other communication interface to provide a data communication connection to a corresponding type of communication line. As another example, communication interface 517 may be a local area network (LAN) card (e.g., for Ethernet™ or an Asynchronous Transfer Mode (ATM) network) to provide a data communication connection to a compatible LAN. Wireless links can also be implemented. In any such implementation, communication interface 517 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information. Further, the communication interface 517 can include peripheral interface devices, such as a Universal Serial Bus (USB) interface, a PCMCIA (Personal Computer Memory Card International Association) interface, etc. Although a single communication interface 517 is depicted in FIG. 5, multiple communication interfaces can also be employed.
The network link 519 typically provides data communication through one or more networks to other data devices. For example, the network link 519 may provide a connection through local network 521 to a host computer 523, which has connectivity to a network 525 (e.g. a wide area network (WAN) or the global packet data communication network now commonly referred to as the "Internet") or to data equipment operated by a service provider. The local network 521 and the network 525 both use electrical, electromagnetic, or optical signals to convey information and instructions. The signals through the various networks and the signals on the network link 519 and through the communication interface 517, which communicate digital data with the computer system 500, are exemplary forms of carrier waves bearing the information and instructions.
The computer system 500 can send messages and receive data, including program code, through the network(s), the network link 519, and the communication interface 517. In the Internet example, a server (not shown) might transmit requested code belonging to an application program for implementing an embodiment of the present invention through the network 525, the local network 521, and the communication interface 517. The processor 503 may execute the transmitted code while it is being received and/or store the code in the storage device 509, or other non-volatile storage, for later execution. In this manner, the computer system 500 may obtain application code in the form of a carrier wave.
The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to the processor 503 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non- volatile media include, for example, optical or magnetic disks, such as the storage device 509. Volatile media include dynamic memory, such as main memory 505. Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 501. Transmission media can also take the form of acoustic, optical, or electromagnetic waves, such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, CDRW, DVD, any other optical medium, punch cards, paper tape, optical mark sheets, any other physical medium with patterns of holes or other optically recognizable indicia, a RAM, a PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
Various forms of computer-readable media may be involved in providing instructions to a processor for execution. For example, the instructions for carrying out at least part of the present invention may initially be borne on a magnetic disk of a remote computer. In such a scenario, the remote computer loads the instructions into main memory and sends the instructions over a telephone line using a modem. A modem of a local computer system receives the data on the telephone line and uses an infrared transmitter to convert the data to an infrared signal and
transmit the infrared signal to a portable computing device, such as a personal digital assistant (PDA) or a laptop. An infrared detector on the portable computing device receives the information and instructions borne by the infrared signal and places the data on a bus. The bus conveys the data to main memory, from which a processor retrieves and executes the instructions. The instructions received by main memory can optionally be stored on storage device either before or after execution by processor.
In the preceding specification, various preferred embodiments have been described with reference to the accompanying drawings. It will, however, be evident that various modifications and changes may be made thereto, and additional embodiments may be implemented, without departing from the broader scope of the invention as set forth in the claims that follow. The specification and the drawings are accordingly to be regarded in an illustrative rather than restrictive sense.
Claims
1. A method, comprising: retrieving a name from a user based on data provided by the user; prompting the user for a name of the user; receiving a first audio input from the user in response to the prompt; applying speech recognition to the first audio input using a name grammar database to output a recognized name; determining whether the recognized name matches the retrieved name; re-prompting the user for the name of the user, if no match is determined; receiving a second audio input from the user in response to the re-prompt; and applying speech recognition to the second audio input using a confidence database having entries less than the name grammar database.
2. A method according to claim 1, further comprising: prompting the user for the data, wherein the data includes one of business information or personal information.
3. A method according to claim 1, further comprising: confirming the recognized name with the user.
4. A method according to claim 3, wherein the confirmation is performed by aurally providing the recognized name to the user.
5. A method according to claim 1, further comprising: determining a failed condition if no match is found with the retrieved name after a predetermined number of iteratively re-prompting the user for a name.
6. A method according to claim 1, wherein the confidence database has entries derived from the name grammar database, the entries being ranked by confidence level.
7. A method according to claim 6, further comprising: determining additional entries for the confidence database using a decoy application.
8. A method according to claim 1, further comprising: determining a confidence level of a comparison between the retrieved name and the recognized name associated with either the first audio input or the second audio input.
9. An apparatus, comprising: a speech recognition logic configured to receive a first audio input from a user, wherein the first audio input represents an uttered name provided by the user in response to a prompt, wherein a retrieved name of the user is previously retrieved based on data provided by the user, the speech recognition logic being further configured to apply speech recognition to the first audio input using a name grammar database to output a recognized name, and to determine whether the recognized name matches the retrieved name, wherein the user is re-prompted for an uttered name of the user for a second audio input, if no match is determined, the speech recognition logic further applying speech recognition to the second audio input using a confidence database having entries less than the name grammar database.
10. An apparatus according to claim 9, wherein the user is prompted for the data, and the data includes one of business information or personal information.
11. An apparatus according to claim 9, wherein the recognized name is confirmed with the user.
12. An apparatus according to claim 11, wherein the confirmation is performed by aurally providing the recognized name to the user.
13. An apparatus according to claim 9, wherein the speech recognition logic is further configured to determine a failed condition if no match is found with the retrieved name after a predetermined number of iteratively re-prompting the user for a name.
14. An apparatus according to claim 9, wherein the confidence database has entries derived from the name grammar database, the entries being ranked by confidence level.
15. An apparatus according to claim 14, wherein additional entries for the confidence database are determined using a decoy application.
16. An apparatus according to claim 9, wherein the speech recognition logic is further configured to determine a confidence level of a comparison between the retrieved name and the recognized name associated with either the first audio input or the second audio input.
17. A system, comprising: a voice response unit configured to retrieve a name from a user based on data provided by the user, and to prompt the user for a name of the user; and a speech recognition logic configured to receive a first audio input from the user in response to the prompt, and to apply speech recognition to the first audio input using a name grammar database to output a recognized name, the speech recognition logic being further configured to determine whether the recognized name matches the retrieved name, wherein the voice response unit is further configured to re-prompt the user for the name of the user for a second audio input, if no match is determined, wherein the speech recognition logic is further configured to apply speech recognition to the second audio input using a confidence database having entries less than the name grammar database.
18. A system according to claim 17, wherein the voice response unit is further configured to prompt the user for the data, wherein the data includes one of business information or personal information.
19. A system according to claim 17, wherein the recognized name is confirmed with the user.
20. A system according to claim 19, wherein the confirmation is performed by aurally providing the recognized name to the user.
21. A system according to claim 17, wherein the speech recognition logic is further configured to determine a failed condition if no match is found with the retrieved name after a predetermined number of iteratively re-prompting the user for a name.
22. A system according to claim 17, wherein the confidence database has entries derived from the name grammar database, the entries being ranked by confidence level.
23. A system according to claim 22, wherein additional entries for the confidence database are determined using a decoy application.
24. A system according to claim 17, wherein the speech recognition logic is further configured to determine a confidence level of a comparison between the retrieved name and the recognized name associated with either the first audio input or the second audio input.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2007800431458A CN101542591B (en) | 2006-09-25 | 2007-09-25 | Method and system for providing speech recognition |
EP07875229A EP2104935A4 (en) | 2006-09-25 | 2007-09-25 | Method and system for providing speech recognition |
HK09110147.2A HK1132831A1 (en) | 2006-09-25 | 2009-10-30 | Method and system for providing speech recognition |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/526,395 | 2006-09-25 | ||
US11/526,395 US8190431B2 (en) | 2006-09-25 | 2006-09-25 | Method and system for providing speech recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009064281A1 true WO2009064281A1 (en) | 2009-05-22 |
Family
ID=39226162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2007/079413 WO2009064281A1 (en) | 2006-09-25 | 2007-09-25 | Method and system for providing speech recognition |
Country Status (5)
Country | Link |
---|---|
US (2) | US8190431B2 (en) |
EP (1) | EP2104935A4 (en) |
CN (1) | CN101542591B (en) |
HK (1) | HK1132831A1 (en) |
WO (1) | WO2009064281A1 (en) |
Families Citing this family (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1933302A1 (en) * | 2006-12-12 | 2008-06-18 | Harman Becker Automotive Systems GmbH | Speech recognition method |
US20080243504A1 (en) * | 2007-03-30 | 2008-10-02 | Verizon Data Services, Inc. | System and method of speech recognition training based on confirmed speaker utterances |
US20080243498A1 (en) * | 2007-03-30 | 2008-10-02 | Verizon Data Services, Inc. | Method and system for providing interactive speech recognition using speaker data |
US20080243499A1 (en) * | 2007-03-30 | 2008-10-02 | Verizon Data Services, Inc. | System and method of speech recognition training based on confirmed speaker utterances |
US9197746B2 (en) * | 2008-02-05 | 2015-11-24 | Avaya Inc. | System, method and apparatus for authenticating calls |
US9736207B1 (en) * | 2008-06-13 | 2017-08-15 | West Corporation | Passive outdial support for mobile devices via WAP push of an MVSS URL |
US9183834B2 (en) * | 2009-07-22 | 2015-11-10 | Cisco Technology, Inc. | Speech recognition tuning tool |
EP2643832A4 (en) * | 2010-11-22 | 2016-10-12 | Listening Methods Llc | System and method for pattern recognition and analysis |
CN103187053B (en) * | 2011-12-31 | 2016-03-30 | 联想(北京)有限公司 | Input method and electronic equipment |
US9361878B2 (en) * | 2012-03-30 | 2016-06-07 | Michael Boukadakis | Computer-readable medium, system and method of providing domain-specific information |
US10255914B2 (en) | 2012-03-30 | 2019-04-09 | Michael Boukadakis | Digital concierge and method |
US20130317805A1 (en) * | 2012-05-24 | 2013-11-28 | Google Inc. | Systems and methods for detecting real names in different languages |
US20140036023A1 (en) * | 2012-05-31 | 2014-02-06 | Volio, Inc. | Conversational video experience |
US9123340B2 (en) * | 2013-03-01 | 2015-09-01 | Google Inc. | Detecting the end of a user question |
CN104238379B (en) * | 2013-06-07 | 2017-07-28 | 艾默生过程控制流量技术有限公司 | Transmitter, field instrument and the method for controlling transmitter |
EP2851896A1 (en) | 2013-09-19 | 2015-03-25 | Maluuba Inc. | Speech recognition using phoneme matching |
CN104580282B (en) * | 2013-10-12 | 2018-04-03 | 深圳市赛格导航科技股份有限公司 | A kind of vehicle-mounted voice system and method |
US9601108B2 (en) * | 2014-01-17 | 2017-03-21 | Microsoft Technology Licensing, Llc | Incorporating an exogenous large-vocabulary model into rule-based speech recognition |
DK3176779T3 (en) * | 2015-12-02 | 2019-04-08 | Tata Consultancy Services Ltd | SYSTEMS AND METHODS FOR SENSITIVE AUDIO ZONE RANGE |
CN106875943A (en) * | 2017-01-22 | 2017-06-20 | 上海云信留客信息科技有限公司 | A kind of speech recognition system for big data analysis |
US10388282B2 (en) * | 2017-01-25 | 2019-08-20 | CliniCloud Inc. | Medical voice command device |
CN109616123A (en) * | 2018-11-21 | 2019-04-12 | 安徽云融信息技术有限公司 | Based on the visually impaired people of big data with browser voice interactive method and device |
WO2020242595A1 (en) * | 2019-05-31 | 2020-12-03 | Apple Inc. | Voice identification in digital assistant systems |
EP3959714B1 (en) * | 2019-05-31 | 2024-04-17 | Apple Inc. | Voice identification in digital assistant systems |
DK201970511A1 (en) | 2019-05-31 | 2021-02-15 | Apple Inc | Voice identification in digital assistant systems |
CN110517675B (en) * | 2019-08-08 | 2021-12-03 | 出门问问信息科技有限公司 | Interaction method and device based on voice recognition, storage medium and electronic equipment |
CN118171655B (en) * | 2024-05-13 | 2024-07-12 | 北京中关村科金技术有限公司 | Name generation method and device, electronic equipment and computer program product |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6208965B1 (en) * | 1997-11-20 | 2001-03-27 | At&T Corp. | Method and apparatus for performing a name acquisition based on speech recognition |
US20040240633A1 (en) * | 2003-05-29 | 2004-12-02 | International Business Machines Corporation | Voice operated directory dialler |
US20050144014A1 (en) * | 2003-11-26 | 2005-06-30 | International Business Machines Corporation | Directory dialer name recognition |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU5803394A (en) * | 1992-12-17 | 1994-07-04 | Bell Atlantic Network Services, Inc. | Mechanized directory assistance |
CN1163869C (en) * | 1997-05-06 | 2004-08-25 | 语音工程国际公司 | System and method for developing interactive speech applications |
US5897616A (en) * | 1997-06-11 | 1999-04-27 | International Business Machines Corporation | Apparatus and methods for speaker verification/identification/classification employing non-acoustic and/or acoustic models and databases |
US6483896B1 (en) * | 1998-02-05 | 2002-11-19 | At&T Corp. | Speech recognition using telephone call parameters |
US7031925B1 (en) * | 1998-06-15 | 2006-04-18 | At&T Corp. | Method and apparatus for creating customer specific dynamic grammars |
US6389397B1 (en) * | 1998-12-23 | 2002-05-14 | Lucent Technologies, Inc. | User identification system using improved voice print identification processing |
US6978238B2 (en) * | 1999-07-12 | 2005-12-20 | Charles Schwab & Co., Inc. | Method and system for identifying a user by voice |
JP3312335B2 (en) * | 1999-07-30 | 2002-08-05 | 株式会社コムスクエア | User authentication method, user authentication system and recording medium |
US6510411B1 (en) * | 1999-10-29 | 2003-01-21 | Unisys Corporation | Task oriented dialog model and manager |
GB9928011D0 (en) * | 1999-11-27 | 2000-01-26 | Ibm | Voice processing system |
US6934684B2 (en) * | 2000-03-24 | 2005-08-23 | Dialsurf, Inc. | Voice-interactive marketplace providing promotion and promotion tracking, loyalty reward and redemption, and other features |
US6728348B2 (en) * | 2000-11-30 | 2004-04-27 | Comverse, Inc. | System for storing voice recognizable identifiers using a limited input device such as a telephone key pad |
US20020072917A1 (en) * | 2000-12-11 | 2002-06-13 | Irvin David Rand | Method and apparatus for speech recognition incorporating location information |
US6816578B1 (en) * | 2001-11-27 | 2004-11-09 | Nortel Networks Limited | Efficient instant messaging using a telephony interface |
US6714631B1 (en) * | 2002-10-31 | 2004-03-30 | Sbc Properties, L.P. | Method and system for an automated departure strategy |
GB2409561A (en) * | 2003-12-23 | 2005-06-29 | Canon Kk | A method of correcting errors in a speech recognition system |
US7899671B2 (en) * | 2004-02-05 | 2011-03-01 | Avaya, Inc. | Recognition results postprocessor for use in voice recognition systems |
US20060217978A1 (en) * | 2005-03-28 | 2006-09-28 | David Mitby | System and method for handling information in a voice recognition automated conversation |
-
2006
- 2006-09-25 US US11/526,395 patent/US8190431B2/en not_active Expired - Fee Related
-
2007
- 2007-09-25 CN CN2007800431458A patent/CN101542591B/en not_active Expired - Fee Related
- 2007-09-25 WO PCT/US2007/079413 patent/WO2009064281A1/en active Application Filing
- 2007-09-25 EP EP07875229A patent/EP2104935A4/en not_active Withdrawn
-
2009
- 2009-10-30 HK HK09110147.2A patent/HK1132831A1/en not_active IP Right Cessation
-
2011
- 2011-11-30 US US13/308,032 patent/US8457966B2/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6208965B1 (en) * | 1997-11-20 | 2001-03-27 | At&T Corp. | Method and apparatus for performing a name acquisition based on speech recognition |
US20040240633A1 (en) * | 2003-05-29 | 2004-12-02 | International Business Machines Corporation | Voice operated directory dialler |
US20050144014A1 (en) * | 2003-11-26 | 2005-06-30 | International Business Machines Corporation | Directory dialer name recognition |
Non-Patent Citations (1)
Title |
---|
See also references of EP2104935A4 * |
Also Published As
Publication number | Publication date |
---|---|
US8190431B2 (en) | 2012-05-29 |
EP2104935A1 (en) | 2009-09-30 |
US8457966B2 (en) | 2013-06-04 |
US20080077409A1 (en) | 2008-03-27 |
US20120143609A1 (en) | 2012-06-07 |
CN101542591B (en) | 2013-02-06 |
CN101542591A (en) | 2009-09-23 |
HK1132831A1 (en) | 2010-03-05 |
EP2104935A4 (en) | 2010-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8190431B2 (en) | Method and system for providing speech recognition | |
US8488750B2 (en) | Method and system of providing interactive speech recognition based on call routing | |
US20080243504A1 (en) | System and method of speech recognition training based on confirmed speaker utterances | |
US10027662B1 (en) | Dynamic user authentication | |
US6910012B2 (en) | Method and system for speech recognition using phonetically similar word alternatives | |
US6470315B1 (en) | Enrollment and modeling method and apparatus for robust speaker dependent speech models | |
Kumar et al. | A Hindi speech recognition system for connected words using HTK | |
US10163436B1 (en) | Training a speech processing system using spoken utterances | |
EP1171871B1 (en) | Recognition engines with complementary language models | |
CN1655235B (en) | Automatic identification of telephone callers based on voice characteristics | |
US6937983B2 (en) | Method and system for semantic speech recognition | |
US6571210B2 (en) | Confidence measure system using a near-miss pattern | |
US7533023B2 (en) | Intermediary speech processor in network environments transforming customized speech parameters | |
US10170107B1 (en) | Extendable label recognition of linguistic input | |
US7089184B2 (en) | Speech recognition for recognizing speaker-independent, continuous speech | |
EP1047046A2 (en) | Distributed architecture for training a speech recognition system | |
KR102097710B1 (en) | Apparatus and method for separating of dialogue | |
US20050137868A1 (en) | Biasing a speech recognizer based on prompt context | |
US11062711B2 (en) | Voice-controlled communication requests and responses | |
Thimmaraja Yadava et al. | Enhancements in automatic Kannada speech recognition system by background noise elimination and alternate acoustic modelling | |
US20080243499A1 (en) | System and method of speech recognition training based on confirmed speaker utterances | |
Żelasko et al. | AGH corpus of Polish speech | |
Patel et al. | Development of Large Vocabulary Speech Recognition System with Keyword Search for Manipuri. | |
US20040006469A1 (en) | Apparatus and method for updating lexicon | |
US20080243498A1 (en) | Method and system for providing interactive speech recognition using speaker data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 200780043145.8 Country of ref document: CN |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
REEP | Request for entry into the european phase |
Ref document number: 2007875229 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2007875229 Country of ref document: EP |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07875229 Country of ref document: EP Kind code of ref document: A1 |