WO1999023641A1 - System for enhancing spoken name dialing - Google Patents

System for enhancing spoken name dialing

Info

Publication number
WO1999023641A1
Authority
WO
WIPO (PCT)
Prior art keywords
circuitry
models
phrases
model
generating
Prior art date
Application number
PCT/US1998/022295
Other languages
English (en)
Inventor
Thomas D. Fisher
Dearborn R. Mowry
Jeffrey J. Spiess
Original Assignee
Alcatel Usa Sourcing, L.P.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alcatel Usa Sourcing, L.P.
Priority to AU11935/99A
Publication of WO1999023641A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Definitions

  • TECHNICAL FIELD. This invention relates in general to telecommunications and, more particularly, to a system for enhancing spoken name dialing.
  • Voice recognition can be either "speaker dependent", i.e., the recognition is based on parameters acquired from a known individual, or "speaker independent", i.e., the recognition is based on parameters acquired from a class of people.
  • Telecommunications companies use voice recognition for a number of services, such as spoken name speed dialing.
  • In spoken name speed dialing, the caller maintains a directory of frequently called numbers in a voice directory, which stores a representation of the caller's voice speaking the names and the associated telephone numbers.
  • The enrolled names can subsequently be called by merely speaking the name into the phone.
  • A call can be placed by saying "call the boss", "call Fred" or, more simply, by saying "the boss" or "Fred" into the phone's receiver.
  • If the caller uses a name which is not in the directory, the voice recognition system should reject that name (Out Of Vocabulary Rejection or "OOVR") and respond with a prompt such as "This name is not enrolled." If the caller uses a name which is in the directory, the voice recognition system will retrieve the number associated with the name and complete the connection.
  • The recognition is not perfect all of the time. If a name is not in the directory, the voice recognition system may nonetheless determine that it is close enough to an enrolled name to warrant a match. Conversely, a name which is in the directory may nonetheless be rejected. While some errors in recognition are to be expected, it is desirable to reduce errors as much as possible. In particular, it has been found that the OOVR rate of present day systems is only on the order of 60-65%, while the IV (in vocabulary) matching is on the order of 97-99%.
  • In the present invention, speech is recognized using models based on the respective users' phrases. Transitions of each model are weighted based upon the length of the associated phrase. An acoustic input is evaluated with respect to the models to find the best match.
  • The present invention provides a significant improvement in out of vocabulary rejections with very slight changes with regard to in vocabulary matching.
  • Figure 1 illustrates a block diagram of a spoken name speed dialing system;
  • Figure 2 illustrates a functional diagram of the name recognizing function;
  • Figure 3 illustrates a prior art structure for a speaker dependent Hidden Markov Model (HMM);
  • Figure 4 illustrates a prior art application model;
  • Figure 5 illustrates a weighted HMM used to increase the OOVR performance of a name recognizing system;
  • Figure 6 illustrates a modified application model using the modified HMM of Figure 5;
  • Figure 7 illustrates a chart of exemplary penalization factors to modify standard HMMs relative to name length;
  • Figure 8 illustrates a flow chart describing the model initialization function.
  • Figure 1 illustrates a block diagram of a voice recognition system for spoken name dialing.
  • Telephones 10 are coupled to a telecommunications network 12.
  • The telecommunications network 12 is coupled to a media server 14 via a digital carrier 15, such as a T1 line.
  • The media server 14 may be a general purpose workstation and has a bus interface which allows its processing circuitry 16 (including one or more processors, hard drives and communications interfaces) to communicate with a plurality of DSPs (digital signal processors) 18.
  • The voice recognition system works as follows. When an off-hook condition occurs, a special tone is sent to the phone to indicate that spoken name calling is enabled. At this point, the caller may either dial a number or speak the name of a desired party. The caller could say "call Bob" or simply "Bob". Other voice commands such as "directory" or "call forward" could be issued audibly.
  • The audio data is transmitted to the telecommunications network 12, where the signal is digitized and multiplexed onto the digital carrier 15.
  • Media server 14 receives the voice signal from the caller as digitized audio data over the digital carrier 15.
  • The media server 14 functions to recognize commands and names which have been enrolled into the user's voice dial directory. Assuming the audio input was a name or a "call <name>" command, the media server 14 would determine whether <name> had been enrolled in the caller's file and, assuming the name was enrolled, look up the number associated with the name in its internal database. The media server 14 would then output the DTMF (dual tone multi frequency) tones needed to complete the call.
  • Figure 2 illustrates a functional block diagram of the operation of the media server 14.
  • The functions of the media server 14 are separated into the application domain 24 and the resource domain 26.
  • The call manager application 28 is the host level application that accesses prompts, models, and spoken speed dial templates (SSDTs) from the database 30 and downloads this data via the resource bus 32 to the DSPs 18.
  • The prompts are digital voice data used to communicate with the caller.
  • The SSDTs are the caller specific GSF (Generalized Speech Features) vector files which represent the audio information from the caller for each name enrolled in the spoken speed dial database.
  • For a given caller, an SSDT file would contain the GSFs for each of that caller's enrolled names.
  • The models include speaker independent models and the application models which link speaker independent and speaker dependent models together (see Figures 4 and 6).
  • The DSP task management function 34 notifies the speech task manager 36 about an impending recognition session.
  • The speech task manager 36 performs system initialization 38 (such as initializing counters and parameters and initializing the DSP) and then proceeds to perform model initialization 40.
  • During model initialization 40, user specific speech data from the SSDT files are converted into standard models.
  • The standard models may be modified in order to penalize the transition probabilities associated with shorter names stored in the SSDT. All of the models reside in the model region 42, which is a memory associated with the DSP performing the recognizing function.
  • The DSP task management function 34 engages the DSP front end 44 to begin processing the data from the digital carrier 15.
  • The DSP front end converts the digitized audio data from the digital carrier 15 into LPC (linear predictive code) vectors and places the LPC vectors into LPC buffer 46.
  • The HG (Hierarchical Grammar) alignment engine 48 begins concurrent processing of the LPC vectors from the LPC buffer 46 and continues until the end point detection function 50 determines that the utterance is complete.
  • The feature extraction function 52 creates the GSFs from the LPC data in the LPC buffer 46.
  • The HG alignment engine 48 is the core of the recognizer. This component matches the LPC based feature vectors from feature extraction function 52 to the models in model region 42.
  • The recognition result is generated by the speech task manager 36 and communicated through the DSP task management function 34 to the call manager 28. The call manager 28 then decides on the next step of processing.
  • Figure 3 illustrates an example of a standard Hidden Markov Model (HMM) 60, in this case an FSA1 ("FSA" stands for "Finite State Automata") model.
  • Each feature corresponds to a frame of data, with the frame corresponding to 20 milliseconds of digitized audio.
  • To limit the length of the model (i.e., the number of states in the model), every other feature is dropped prior to storage, although this is an optional step.
  • An HMM 60 is generated from the stored features in the SSDT for each enrolled name.
  • Each feature of a stored name corresponds to a state 62 of the HMM 60.
  • Each frame is associated with a pair of states 64, where each state of the pair 64 points to the same feature.
  • Transition probabilities are associated with transitions 65 between states.
  • In the example shown, the transition probability between consecutive states is 0.8, while the transition probability for skipping a state is 0.2.
  • During recognition, the features derived from the digital audio data from the digital carrier are evaluated with respect to the features associated with the states 62.
  • The evaluation at a state provides a score based on two factors: (1) how closely the respective feature vectors are aligned and (2) the probability of the transition to that state. Alignment of two vectors (x1,y1,z1) and (x2,y2,z2) can be calculated as the Euclidean distance.
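Spelled out with the vector components named above, the Euclidean distance used for the alignment term is:

```latex
d\bigl((x_1, y_1, z_1),\,(x_2, y_2, z_2)\bigr) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}
```

A smaller distance indicates a closer alignment between the input feature vector and the stored feature vector at that state.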
  • The HMM models for each name are generated during the model initialization stage 40 from the feature vectors stored in the SSDT.
  • The model structure remains the same for each enrolled name, varying only in length, which depends upon the number of feature vectors of the enrolled name.
  • The transition probabilities are standard from HMM to HMM for a speaker dependent model; while the transition probabilities may differ from the 0.8 and 0.2 probabilities shown in Figure 3, they remain consistent across the speaker dependent models.
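As a concrete illustration of this model generation, the sketch below builds the Figure 3 structure from one name's stored feature vectors: a pair of states per stored feature, with the standard 0.8 (advance) and 0.2 (skip) transition probabilities. The function name, data layout, and the choice of Python are illustrative assumptions, not the patent's implementation.

```python
def build_name_hmm(features, p_next=0.8, p_skip=0.2):
    """Sketch: generate a standard speaker dependent FSA1-style HMM
    from the GSF feature vectors stored in the SSDT for one name."""
    # Each stored feature yields a pair of states pointing at the same
    # feature (recreating the frame count after every other feature
    # was dropped at enrollment).
    states = []
    for feature in features:
        states.extend([feature, feature])
    # Standard transitions, identical for every enrolled name: advance
    # one state with probability p_next, skip a state with p_skip.
    transitions = []
    for i in range(len(states)):
        if i + 1 < len(states):
            transitions.append((i, i + 1, p_next))
        if i + 2 < len(states):
            transitions.append((i, i + 2, p_skip))
    return states, transitions
```

The model for a longer name simply has more states and transitions; the probabilities themselves do not vary from name to name, which is the property the penalization scheme below exploits.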
  • The media server 14 uses speaker independent models as well, to recognize command phrases such as "call" or "directory".
  • The speaker independent models could be FSA2 models or FSA1 models.
  • FSA2 models are similar to FSA1 models, except the features are associated with the transitions instead of the states.
  • Speaker independent models are trained with data over a class of speakers. Accordingly, unlike the speaker dependent models used herein, the transition probabilities are not standard but are generated during the training process.
  • Figure 4 illustrates an application model for spoken name speed dialing which includes a speaker independent model 66 ("call"), several speaker dependent models 68 ("Bob", "Lester", "Ernie Kravitz"), and two "garbage" models 70 and 72 ("garbage1" and "garbage2", respectively).
  • The garbage models are used to detect out of vocabulary speech, and are designed to have a higher evaluation score than the enrolled names for a spoken name which has not been enrolled.
  • The garbage models are trained as speaker independent models.
  • The DSPs 18 evaluate the incoming audio data to find the best match with the models.
  • For the application model of Figure 4, the DSP evaluates eight possibilities: (1) "call Bob", (2) "call Lester", (3) "call Ernie Kravitz", (4) "call <garbage1>", (5) "Bob", (6) "Lester", (7) "Ernie Kravitz", (8) "<garbage2>".
  • The recognition algorithm assigns a score to each of the possibilities, and the highest score is considered to be the closest match. In this example, if "call <garbage1>" or "<garbage2>" receives the highest score, an OOVR results.
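The candidate enumeration can be sketched as follows. Here `score_fn` stands in for the HG alignment engine's evaluation of the utterance against a sequence of models, following the convention of the preceding paragraph (highest score wins); the function signature and names are illustrative assumptions.

```python
def recognize(utterance, name_models, call_model, garbage1, garbage2, score_fn):
    """Sketch: score every candidate phrase of the Figure 4 application
    model and return the best match, or None for an out of vocabulary
    rejection (OOVR)."""
    candidates = {}
    for name, model in name_models.items():
        candidates["call " + name] = score_fn(utterance, [call_model, model])
        candidates[name] = score_fn(utterance, [model])
    candidates["call <garbage1>"] = score_fn(utterance, [call_model, garbage1])
    candidates["<garbage2>"] = score_fn(utterance, [garbage2])
    best = max(candidates, key=candidates.get)  # highest score = closest match
    return None if "<garbage" in best else best
```

With the three enrolled names of Figure 4 this produces exactly the eight possibilities listed above.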
  • The present invention improves upon the OOVR performance of the system by adjusting parameters during the model initialization stage 40, while the speaker dependent models are being initialized prior to recognition.
  • The standard probability factors for each speaker dependent model (0.8 and 0.2 in Figure 3) are modified as the model is being generated, such that the transition probability factors are dependent upon the length of each enrolled name.
  • The probability factors are modified by using a penalty factor corresponding to the length of an enrolled name.
  • Figure 5 illustrates the model of Figure 3 after being assigned a penalty factor of 0.5.
  • The transition probability factors of Figure 3 have each been multiplied by 0.5, resulting in a model with transition probabilities of 0.4 and 0.1, instead of 0.8 and 0.2, respectively.
  • Figure 6 illustrates an application model corresponding to the application model of Figure 4 showing exemplary penalty factors for each enrolled name.
  • "Bob", the shortest name, is assigned a penalty factor of 0.3; "Lester" is assigned a penalty factor of 0.6; and "Ernie Kravitz" is assigned a penalty factor of 1.0 (i.e., this name is not penalized).
  • Accordingly, the transition probabilities for "Bob" would be 0.24 and 0.06; the transition probabilities for "Lester" would be 0.48 and 0.12; and the transition probabilities for "Ernie Kravitz" would remain at 0.8 and 0.2.
  • Figure 7 illustrates a penalization mapping which has been found to be effective.
  • Enrolled names of eight frames or less are assigned a penalty factor of 0.3;
  • names greater than eight frames and less than twelve are assigned a penalty factor of 0.6;
  • names having a length greater than twelve frames are assigned a penalty factor of 1.0.
  • While the mapping of Figure 7 has been found to be effective, adjustments such as increasing the number of penalization levels or changing the values of the penalties at specific name lengths could be made to optimize the effect of the mapping (a sketch of this mapping follows).
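A sketch of this mapping, and of its application to a model's standard transition probabilities, follows. The thresholds (eight and twelve frames) and penalty values (0.3, 0.6, 1.0) are taken from the Figure 7 chart as described; the text does not state which band a name of exactly twelve frames falls into, so the sketch assumes the middle band.

```python
def penalty_factor(n_frames):
    """Sketch of the Figure 7 mapping from an enrolled name's length
    (in frames) to a transition-probability penalty factor."""
    if n_frames <= 8:
        return 0.3    # short names are penalized the most
    if n_frames <= 12:
        return 0.6    # boundary case of exactly 12 frames assumed here
    return 1.0        # long names are not penalized

def penalize(transitions, factor):
    """Scale every standard transition probability by the penalty
    factor, as in Figures 5 and 6."""
    return [(src, dst, p * factor) for (src, dst, p) in transitions]
```

Applied to a short name such as "Bob", `penalty_factor` returns 0.3, turning the standard 0.8 and 0.2 probabilities into the 0.24 and 0.06 values given above.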
  • Garbage models designed for the prior art may need to be modified to work optimally with the penalized speaker dependent models described above.
  • With the penalty structure given in Figure 7, it has been found that modifying the transition probabilities of the <garbage1> model by a factor of 0.3 and the <garbage2> model by a factor of 0.8 is effective.
  • Logarithmic transition probabilities are used in the scoring of each phrase to prevent the scores from becoming minuscule through multiple multiplications by factors less than one. As a result, the lowest score represents the best match and the highest score represents the worst match.
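In log-domain terms (a hedged reading: the "lowest score is best" convention suggests the scores accumulate negated log probabilities), a penalty factor w < 1 adds a fixed cost to every transition taken, so the cumulative score of a phrase aligned through N transitions becomes:

```latex
\text{score} = \sum_{t=1}^{N} -\log\bigl(w \, p_t\bigr) = \sum_{t=1}^{N} -\log p_t \;+\; N \bigl(-\log w\bigr)
```

Since -log w > 0 when w < 1, a penalized (shorter-name) model accumulates a larger score, making it less likely to beat the garbage models on out of vocabulary input.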
  • The penalty factor increases evaluation scores associated with shorter names in this scheme. This has been found to improve the OOVR considerably. Results are summarized below.
  • The enhanced version, where short names are penalized, results in an increase in OOVR of approximately 9 percentage points where "call" is used and of 8.3 percentage points where "call" is not used. There is a minimal reduction in the in vocabulary acceptance of 0.5% and 0.3%, respectively, for the "call" and "no call" situations.
  • Figure 8 is a flow chart describing operation of the model initialization stage, using the present invention.
  • First, the HMM header is initialized, which sets the pointers in the top level model structure and initializes other data in the application model.
  • The speaker independent models are then initialized in block 82 through a loop which initializes the four primary types of model structures: FSA1 (block 84), FSA2 (block 86), Frame (block 88) and NULL (block 90).
  • The NULL model is essentially a score which serves as an upper bound for extremely poor matches.
  • The Frame model is an entity representing the "acoustic", generally a vector. Unlike speaker dependent models, in which only the GSF vectors are stored and the HMMs are derived from the GSF vectors for each session, speaker independent models are stored with both the GSF vectors and the HMM structure (including transition probabilities).
  • Next, a loop begins to generate an HMM structure for every active speaker dependent phrase for the active caller.
  • Each such HMM is of the FSA1 model type.
  • The speaker dependent models are initialized, and in block 96 the SSDT for the active caller is decompressed. Blocks 100 through 110 are repeated to create an HMM for each name in the SSDT for the active caller.
  • For each name, the transitions which define a mapping from one state of the HMM to another are first generated.
  • The penalty for the current speaker dependent model is then generated based on the number of frames (NFRAMES) in the corresponding speaker dependent name.
  • The penalty is determined by comparing NFRAMES to a set of thresholds T1 and T2, where T1 < T2.
  • The penalty should be high for short names and low for long names. A high penalty corresponds to a low multiplication factor.
  • The standard transition probabilities are multiplied by the penalty to arrive at the final transition probabilities for the HMM, which are attached to the HMM.
  • The HMM is attached to the overall application model, along with the speaker independent models. With the current speaker dependent model attached (via pointers), the construction of the HMM continues by associating the states of the HMM with transitions, along with the transition probabilities, in block 108.
  • The speaker dependent features from the SSDT are then associated with the various states, via pointers, making the current HMM complete.
  • Blocks 100-110 are repeated if there are any more HMMs associated with the current SSDT; otherwise, the initialization ends in block 114 (a sketch of this loop follows).
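Pulling the blocks of Figure 8 together, the per-name loop (blocks 96 through 114) might be sketched as below, reusing `build_name_hmm`, `penalty_factor`, and `penalize` from the earlier sketches. The SSDT layout and the use of the stored feature count as a stand-in for NFRAMES are illustrative assumptions.

```python
def initialize_speaker_dependent_models(ssdt, application_model):
    """Sketch of blocks 96-114 of Figure 8: build a penalized FSA1 HMM
    for each enrolled name in the active caller's decompressed SSDT
    and attach it to the application model."""
    for name, gsf_vectors in ssdt.items():
        # Generate the states and standard transitions for this name.
        states, transitions = build_name_hmm(gsf_vectors)
        # Derive the penalty from the name's length (NFRAMES compared
        # against the thresholds T1 and T2 inside penalty_factor).
        factor = penalty_factor(len(gsf_vectors))
        transitions = penalize(transitions, factor)
        # Attach the completed, penalized HMM to the application model.
        application_model[name] = (states, transitions)
    return application_model
```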
  • The present invention provides significant advantages over the prior art. Importantly, it provides a significant increase in OOVR, while having limited effect on in vocabulary acceptance. Further, it can be used to modify existing voice recognition systems.
  • Weighting recognition models based on name or phrase length could also be used for speaker independent recognition of commands and name dialing, and for functions other than spoken name speed dialing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

According to this invention, spoken name speed dialing is performed via a server that compares a digitized audio stream to speaker dependent and speaker independent models. At least some of the speaker dependent models are generated dynamically for each voice recognition session. The speaker dependent models are penalized according to their respective lengths so as to improve the rejection of names not enrolled in the directory and to reduce the side effects on the recognition of enrolled names.
PCT/US1998/022295 1997-11-04 1998-10-21 System for enhancing spoken name dialing WO1999023641A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU11935/99A AU1193599A (en) 1997-11-04 1998-10-21 System for enrollment of a spoken name dialing service

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US6420497P 1997-11-04 1997-11-04
US60/064,204 1997-11-04
US11364898A 1998-07-10 1998-07-10
US09/113,648 1998-07-10

Publications (1)

Publication Number Publication Date
WO1999023641A1 (fr) 1999-05-14

Family

ID=26744263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/022295 WO1999023641A1 (fr) 1997-11-04 1998-10-21 System for enhancing spoken name dialing

Country Status (2)

Country Link
AU (1) AU1193599A (fr)
WO (1) WO1999023641A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5679001A (en) * 1992-11-04 1997-10-21 The Secretary Of State For Defence In Her Britannic Majesty's Government Of The United Kingdom Of Great Britain And Northern Ireland Children's speech training aid
EP0762709A2 * 1995-09-12 1997-03-12 Texas Instruments Incorporated Method and system for entering names into a database using speech recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DOBLER S ET AL: "A ROBUST CONNECTED-WORDS RECOGNIZER", SPEECH PROCESSING 1, SAN FRANCISCO, MAR. 23 - 26, 1992, vol. 1, no. CONF. 17, 23 March 1992 (1992-03-23), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 245 - 248, XP000341129 *

Also Published As

Publication number Publication date
AU1193599A (en) 1999-05-24

Similar Documents

Publication Publication Date Title
US5895448A (en) Methods and apparatus for generating and using speaker independent garbage models for speaker dependent speech recognition purpose
US6076054A (en) Methods and apparatus for generating and using out of vocabulary word models for speaker dependent speech recognition
US6061653A (en) Speech recognition system using shared speech models for multiple recognition processes
US5983177A (en) Method and apparatus for obtaining transcriptions from multiple training utterances
Rabiner Applications of speech recognition in the area of telecommunications
US5732187A (en) Speaker-dependent speech recognition using speaker independent models
US5893059A (en) Speech recognition methods and apparatus
US7917364B2 (en) System and method using multiple automated speech recognition engines
CA2486125C (fr) System and method for utilizing metadata in speech processing
US8694316B2 (en) Methods, apparatus and computer programs for automatic speech recognition
US6701293B2 (en) Combining N-best lists from multiple speech recognizers
US5912949A (en) Voice-dialing system using both spoken names and initials in recognition
CA2117932C (fr) Speech recognition with weighted decision
EP2523441B1 (fr) Large-scale, user-independent and device-independent system for converting voice messages to text
US7450698B2 (en) System and method of utilizing a hybrid semantic model for speech recognition
US5930336A (en) Voice dialing server for branch exchange telephone systems
US6014624A (en) Method and apparatus for transitioning from one voice recognition system to another
US6058363A (en) Method and system for speaker-independent recognition of user-defined phrases
US20030120493A1 (en) Method and system for updating and customizing recognition vocabulary
JPH07210190A (ja) Speech recognition method and system
JPH08234788A (ja) Method and apparatus for bias equalization in speech recognition
WO2000068933A1 (fr) Adaptation of a speech recognition system across multiple remote sessions with a speaker
CN1613108A (zh) Network-accessible speaker-dependent voice models of multiple speakers
US7031923B1 (en) Verbal utterance rejection using a labeller with grammatical constraints
US20050049858A1 (en) Methods and systems for improving alphabetic speech recognition accuracy

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: CA

122 Ep: pct application non-entry in european phase