WO2001052238A1 - System and method for speech processing with limited training data - Google Patents

System and method for speech processing with limited training data Download PDF

Info

Publication number
WO2001052238A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
model
training data
speech
language model
Prior art date
Application number
PCT/US2001/001110
Other languages
French (fr)
Inventor
Yuen Yee Lo
Pascale Fung
Original Assignee
Weniwen Technologies, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weniwen Technologies, Inc. filed Critical Weniwen Technologies, Inc.
Publication of WO2001052238A1 publication Critical patent/WO2001052238A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling

Definitions

  • the present invention relates to automated processing of speech.
  • Embodiments of the present invention relate to automated keyword spotting, automated dictation from speech, or other automated speech recognition.
  • Automated speech processing (ASP) systems include, for example, automated keyword spotting systems, automated dictation systems, other automated speech recognition systems, and the like. Automated keyword spotting systems attempt to detect the presence or absence of particular word(s) or phrase(s) in an utterance of speech input, using data processing. Automated dictation systems attempt to convert an utterance of speech into its corresponding text, using data processing.
  • ASP systems typically include models or parameters that are established based on training data.
  • Training data may include, for example, collections of text or recorded speech, for the language or dialect that is to be handled by the speech processing system.
  • a method for establishing a language model to recognize speech of a first language comprises: obtaining language model parameters for a second language that is not the first language; obtaining training data for the first language; determining a language model based on the language model parameters for the second language and the training data for the target language; and making the language model available for use to handle speech of the first language.
  • a speech processing system includes a processor, a memory, a language model established by the above- described method, and speech recognition software configured to use the language model for recognizing speech.
  • a speech processing system includes: a processor; memory coupled to the processor; and logic stored in the memory configured to control the processor to: obtain training data for the first language; determine a language model based on the training data for the target language and on an existing language model for a second language; wherein the language model is for use in recognition of speech of the first language.
  • a speech processing system that is capable of recognizing words of a primarily colloquial language in an input utterance includes: a model for a word; a model for filler phrases, wherein the model for filler phrases was trained using online newsgroup articles; and speech recognition software configured to recognize an input utterance based on the model for a word and on the model for filler phrases.
  • FIG. 1A is a block diagram of a computer system in which the present invention may be embodied.
  • FIG. 1B is a block diagram of a software system of the present invention for controlling operation of the system of FIG. 1A.
  • FIG. 2 is a schematic diagram of a typical continuous speech recognizer from which the present invention may be implemented.
  • FIG. 3 is a flow diagram that illustrates a method for establishing a colloquial language model in a target language.
  • the following description will focus on the currently-preferred embodiment of the present invention, which is operative in an environment typically including desktop computers, server computers, and portable computing devices, occasionally or permanently connected to one another.
  • the currently-preferred embodiment of the present invention may be implemented in an application operating in an Internet-connected environment and running under an operating system, such as the Microsoft® Windows operating system, on an IBM-compatible Personal Computer (PC) configured as an Internet server.
  • the present invention is not limited to any particular environment, device, or application. Instead, those skilled in the art will find that the present invention may be advantageously applied to any environment.
  • the present invention may be advantageously embodied on a variety of different platforms, including Macintosh, Linux, EPOC, BeOS, Solaris, UNIX, NextStep, and the like.
  • the present invention is especially suited to establishing an ASP system for a "target" language or dialect (for example, Cantonese Chinese) that is similar to, or is a dialect of, a "baseline” language or dialect (for example, Mandarin Chinese) for which portions of an ASP system (e.g., a language-model portion) has already been established.
  • the preferred approach of the present invention is to start with an existing baseline portion of an actual or hypothetical ASP system for the baseline language and to modify the baseline portion for use with the target language, based on a limited amount of training data that has been obtained for the target language.
  • the baseline portion may include, for example, models or parameters of the type used for an ASP system. These models or parameters may be, for example, models and parameters of a language model for the baseline language.
  • the limited amount of training data is preferably collected at least in part from online newsgroup articles, which tend to use colloquial vernacular, as opposed to newspaper articles, which tend to use more formal written language style.
  • Cantonese Chinese, or the like is the target language
  • Mandarin Chinese, or the like is the baseline language
  • the Chinese language includes four language groups (Mandarin, Cantonese, Fukienese, and Hakka) and many dialects. These four groups may be considered to be four languages.
  • a dialect is a language variety used by a particular population of speakers. While there are some similarities between the Chinese languages and dialects, there are many differences. In the present document, two Chinese languages, Cantonese and Mandarin, will be discussed in detail.
  • Spontaneous speech consists of colloquial words, corrections, hesitations, dis-fluency, short pauses, ill-formed grammar, and words that are not generally listed in the standard Chinese lexicon.
  • the present invention may be embodied on an information processing system such as the system 300 of FIG. 1A, which comprises a central processor 301, a main memory 302, an input/output (I/O) controller 303, a keyboard 304, a pointing device 305 (e.g., a mouse, pen device, or the like), a screen or display device 306, a mass storage 307 (e.g., hard disk, removable floppy disk, optical disk, magneto-optical disk, or flash memory, etc.), an audio input device 308 (e.g., a microphone, e.g., as found on a telephone that is coupled to the bus system 310), and an interface 309.
  • a real-time system clock is included with the system 300, in a conventional manner.
  • the various components of the system 300 communicate through a system bus 310 or similar architecture.
  • the system 300 may communicate with other devices through the interface or communication port 309, which may be an RS-232 serial port or the like.
  • Devices which will be commonly connected, occasionally or on a full-time basis, to the interface 309 include a network 351 (e.g., LANs or the Internet), a laptop 352, a handheld organizer 354 (e.g., the Palm organizer, available from Palm Computing, Inc., a subsidiary of 3Com Corp. of Santa Clara, California), a modem 353, and the like.
  • program logic (implementing the methodology described below) is loaded from the storage device or mass storage 307 into the main memory 302, for execution by the processor 301.
  • the user enters commands and data through (a) the keyboard 304, (b) the pointing device 305, which is typically a mouse, a track ball, or the like, (c) the audio input device by voice input, and/or (d) the like.
  • the computer system displays text and/or graphic images and other data on the display device 306, such as a cathode-ray tube or an LCD display.
  • a hard copy of the displayed information, or other information within the system 300 may be printed to other output devices (e.g., a printer), not shown, which would be connected to the bus system 310.
  • the computer system 300 includes an IBM PC-compatible personal computer (available from a variety of vendors, including IBM of Armonk, New York) running a Unix operating system (e.g., Linux, which is available from Red Hat Software, of Durham, North Carolina, U.S.A.).
  • the system 300 is an Internet or intranet or other type of network server, e.g., one connected to a worldwide publicly accessible communication network, and receives input (e.g., digitized audio voice input) from, and sends output to, a remote user via the interface 309 according to standard techniques and protocols.
  • a computer software system 320 is provided for directing the operation of the computer system 300.
  • Software system 320 which is stored in system memory 302 and on storage (e.g., disk memory) 307, includes a kernel or operating system (OS) 340 and a windows shell 350.
  • One or more application programs, such as client application software or "programs” 345 may be "loaded” (i.e., transferred from storage 307 into memory 302) for execution by the system 300.
  • System 320 includes a user interface (UI) 360, preferably a Graphical User Interface (GUI).
  • OS 340 and windows shell 350 together comprise Microsoft Windows software (e.g., Windows 9x or Windows NT, available from Microsoft Corporation of Redmond, Washington).
  • OS 340 is the Unix operating system (e.g., the Linux operating system).
  • One application program 200 is a speech recognition system, according to the present invention, that uses colloquial language models, e.g., for spontaneous Cantonese speech, which will be described in further detail. While the invention is described in some detail with specific reference to preferred embodiments and certain alternatives, there is no intent to limit the invention to that particular embodiment or those specific alternatives.
  • the present invention may be built upon a standard ASP system, e.g., one that uses Hidden Markov models (HMMs), by adding the structures, method steps, and computations described in the present document.
  • ASP systems such as automated keyword spotting systems or automated speech recognition systems, and HMMs, are well known in the relevant art, and are described, for example, in the book, Fundamentals of Speech Recognition, by Lawrence Rabiner & Biing-Hwang Juang, published by Prentice Hall (Signal Processing Series), Englewood Cliffs NJ, 1993, ISBN 0-13-015157-2, hereinafter referred to as RABINER 93.
  • FIG. 2 is a schematic diagram, adapted from FIG. 8.7 of RABINER 93, of a typical continuous speech recognizer 400.
  • the following description of the typical continuous speech recognizer 400, and its processing, is adapted from description in Section 8.8 of RABINER 93.
  • the first step in the processing is spectral analysis in a spectral analysis module 405 to derive the feature vectors 410 used to characterize the spectral properties of speech input 415.
  • the second step in the recognizer 400 is a combined word-level/sentence-level match, in a matching module 417 that includes a word-level match module 420 and a sentence-level match module 425.
  • a set of word models 440 is created by a word model composition module 445.
  • the set of word models 440 is created by concatenating each of the subword unit HMMs as specified in the word lexicon 435, in a conventional manner.
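The concatenation step can be sketched as follows; the lexicon entries and the representation of a subword HMM as a plain list of state labels are illustrative assumptions, not the patent's actual data structures (a real system would also carry transition and emission parameters):

```python
# Hypothetical subword-unit HMMs: unit name -> list of HMM state labels.
# "g" is an initial part (3 states); "ong" is a final part (4 states).
subword_hmms = {
    "g": ["g_1", "g_2", "g_3"],
    "ong": ["ong_1", "ong_2", "ong_3", "ong_4"],
}

# Hypothetical word lexicon: word -> sequence of subword units.
word_lexicon = {
    "gong": ["g", "ong"],
}

def compose_word_models(lexicon, hmms):
    """Build each word model by concatenating the state sequences of
    its subword-unit HMMs, in the order the lexicon specifies."""
    word_models = {}
    for word, units in lexicon.items():
        states = []
        for unit in units:
            states.extend(hmms[unit])
        word_models[word] = states
    return word_models

word_models = compose_word_models(word_lexicon, subword_hmms)
```

The same composition applies for any lexicon size; only the unit inventory and entries change.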
  • the way in which the sentence-level match is done is via a finite state network (FSN) realization of a word grammar 450 (the syntax of the system) and semantics 455, as expressed in a composite FSN language model 460.
  • FSN finite state network
  • the implementation of the combined word-level match/sentence-level match is via any conventional manner, for example, via any of the structures described in Chapter 7 of RABINER 93.
  • systems use structures similar to the conventional frame synchronous level-building method (usually with some type of beam search to restrict the range of paths) to solve for the best recognized sentence 465 (the result).
  • the preferred dictation system, ultimately to be adapted for Cantonese Chinese, is based on phoneme continuous-density hidden Markov models, with 16 Gaussian mixtures per state.
  • Each subword unit is modeled, for Chinese, as an initial part and a final part.
  • the initial part is modeled using a 3-state left-to-right HMM with no state skips.
  • the final part is modeled using a 4-state HMM.
  • initial parts are modeled by right context-dependent models.
  • Final parts are modeled by context-independent models.
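A left-to-right HMM with no state skips permits only two moves from each state: stay, or advance to the next state. A sketch of that transition structure (the probability values are placeholders, not trained parameters):

```python
def left_to_right_transitions(n_states, self_loop=0.5):
    """Transition matrix for a left-to-right HMM with no state skips:
    from state i the model may only stay in i or move to i + 1."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i < n_states - 1:
            A[i][i] = self_loop
            A[i][i + 1] = 1.0 - self_loop
        else:
            A[i][i] = 1.0  # last state self-loops until the model exits
    return A

A_initial = left_to_right_transitions(3)  # 3-state model for initial parts
A_final = left_to_right_transitions(4)    # 4-state model for final parts
```

Note that every entry above the first superdiagonal is zero, which is exactly the "no state skips" constraint.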
  • the total units are 195 subword models, for Cantonese Chinese, which may be termed phone models.
  • the recognizer feature vector consists of the following 39 parameters: 12 Mel-warped frequency-based cepstral coefficients (MFCC), 12 delta cepstral coefficients, 12 delta-delta cepstral coefficients, energy, and the delta and delta-delta of the energy parameters.
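The 39 parameters arise from 13 static features (12 MFCCs plus energy) extended with their first ("delta") and second ("delta-delta") time differences. The sketch below uses a simple two-frame difference; practical front ends usually use a longer regression window (see, e.g., RABINER 93):

```python
def deltas(frames):
    """First-order time differences, using adjacent-frame differences
    with the edge frames repeated at the boundaries."""
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

def full_feature_vectors(static):
    """static: list of 13-dim frames (12 MFCC + energy).
    Returns 39-dim frames: static + delta + delta-delta."""
    d = deltas(static)
    dd = deltas(d)
    return [s + di + ddi for s, di, ddi in zip(static, d, dd)]

frames = [[float(t)] * 13 for t in range(5)]  # toy 13-dim static frames
feats = full_feature_vectors(frames)
```

Computing the MFCCs themselves (mel filterbank plus cepstral transform) is omitted; only the stacking into 39 dimensions is shown.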
  • the initial units used for Cantonese are: b, c, d, f, g, gw, h, j, k, kw, l, m, n, ng, p, s, t, and z.
  • the final units used for Cantonese are: aa, aai, aak, aam, aan, aang, aap, aat, aau, ai, ak, am, an, ang, ap, at, au, e, ei, ek, en, eng, eoi, eon, eot, et, i, ik, im, in, ing, ip, it, iu, m, ng, o, oe, oek, oen, oeng, oi, ok, on, ong, ot, ou, u, ui, uk, un, ung, ut, yu, yun, and yut.
  • the language model includes a trained uni-gram distribution and a trained bi-gram distribution, per conventional practice.
  • the uni-gram distribution is the probability P(w1) that a given word (i.e., segment) in a sentence is actually the word w1 of the lexicon.
  • the bi-gram distribution is the probability P(w2 | w1) that a given word in a sentence is the word w2 of the lexicon, given that the preceding word is w1.
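Given a segmented corpus, both distributions can be estimated by simple counting. The sketch below is a minimal, unsmoothed version (a deployed system would smooth to handle unseen words and word pairs):

```python
from collections import Counter

def train_ngrams(sentences):
    """Estimate uni-gram P(w) and bi-gram P(w2 | w1) by counting over
    segmented sentences (each sentence is a list of words)."""
    uni = Counter()
    bi = Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words[:-1], words[1:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}
    # Conditional probability: count(w1, w2) / count(w1).
    p_bi = {(w1, w2): c / uni[w1] for (w1, w2), c in bi.items()}
    return p_uni, p_bi

# Toy segmented corpus.
corpus = [["I", "want", "news"], ["I", "want", "weather"]]
p_uni, p_bi = train_ngrams(corpus)
```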
  • the language model is established, according to the present invention, for the target language, as will be further discussed in a later section.
  • the preferred keyword spotting system is based on phoneme continuous density hidden Markov models.
  • 16 Gaussian mixtures are used per state in a keyword spotting system.
  • the keyword spotting system also includes a garbage model with conventional structure, for example, with five HMM states and 20 frame sizes, with 16 Gaussian mixtures per state.
  • Each subword unit is modeled in the same manner as discussed above for the dictation system.
  • the keyword spotting system is established for the target language, according to the present invention, as will be further discussed in a later section.
  • An automated dictation system for the target language is established by establishing a language model and an acoustic model according to the structure discussed in an earlier section. Establishing the language model will be discussed shortly in a later subsection. Given the language model, the acoustic models are established by being conventionally trained using acoustic training data (i.e., voice recordings) that are collected for the target language, namely, Cantonese. Collection of such acoustic training data will be further discussed in a later section.
  • FIG. 3 is a flow diagram that illustrates a method 500 for establishing a colloquial language model in the target language, namely, Cantonese.
  • language model parameters are trained for a baseline language, using training data for the baseline language.
  • the baseline language is preferably a written language.
  • the baseline language is preferably Mandarin.
  • the baseline language model parameters are preferably from a baseline language model, for Mandarin, that is trained using large numbers of Mandarin newspaper articles, as will be further discussed in a later section.
  • training data of a target language is obtained.
  • the target language is a colloquial language that is similar to the (written) baseline language (e.g, similar due to sharing some words and some syntax).
  • the target language is preferably Cantonese.
  • training data of the target language is preferably obtained from Cantonese online newsgroup articles, as will be further discussed in a later section. Preferably, at least about 12 million bytes of such training data is obtained; of course, more training data would be even better.
  • in a step 520, words of the target language are obtained that are not in the lexicon of the baseline language. For example, these words may be determined from the training data of the target language, as will be further discussed in a later section. Preferably, at least about 600 colloquial phrases or terms are determined.
  • the words are added to the lexicon of the baseline language to obtain an updated lexicon. For example, the at least 600 colloquial phrases or terms may be added to the lexicon as words.
  • in a step 530, language model parameters for the target language are trained based on the training data of the target language, for example, using the training data obtained for the target language and the updated lexicon. For example, a limited amount of training data of the target language is segmented using the updated lexicon, and then frequencies of occurrence of lexicon entries in the limited amount of training data are counted, according to conventional practice, to obtain uni-gram and bi-gram distributions, which are examples of N-gram distributions.
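Chinese text has no spaces between words, so segmenting with a lexicon is commonly done by greedy longest-match ("maximum matching"); whether the patent uses this exact scheme is not stated, so the following is a sketch under that assumption, with a toy lexicon over Latin letters standing in for Chinese characters:

```python
def segment(text, lexicon, max_len=6):
    """Greedy left-to-right longest-match segmentation: at each position,
    take the longest lexicon entry that matches; fall back to a single
    character for out-of-lexicon material."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"ab", "abc", "cd"}  # toy "updated lexicon"
```

After segmentation, the lexicon-entry frequencies are counted as in the uni-gram/bi-gram estimation shown earlier.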
  • a language model is determined for the target language based on the language model parameters for the baseline language and on the language model parameters determined in the step 530 for the target language. For example, the language model parameters determined in the step 530 for the target language are combined with the baseline language model for the baseline language to obtain a language model for the target language. For example, the combination is performed using language model adaptation by linear interpolation.
  • the language model parameters are trained for the preferred baseline language, Mandarin, using training data that includes Mandarin newspaper articles.
  • the language model parameters reflect Mandarin Chinese and form a baseline language model.
  • the training corpus size is about 40 million bytes.
  • the baseline language model includes uni-gram and bi-gram language models. These models are established by applying a Mandarin lexicon to segment the corpus, and then counting the uni-gram and bi-gram frequencies of occurrence within the corpus, in conventional manner.
  • the Mandarin lexicon may have, for example, roughly 36,000 entries.
  • a language model is determined for the target language.
  • the language model is preferably determined using linear interpolation language model adaptation.
  • the adapted uni-gram Padp(wi) is established using linear interpolation as follows: Padp(wi) = a · PM(wi) + b · PC(wi), where PM(wi) is the uni-gram of the baseline (Mandarin) corpus and PC(wi) is the uni-gram of the limited Cantonese corpus (newsgroups).
  • the determined uni-gram for the target language is Padp(wi). The determined bi-gram is likewise determined using linear interpolation: Padp(w2 | w1) = a · PM(w2 | w1) + b · PC(w2 | w1), where a and b are the linear combination factors; a may be set to 0.9 and b may be set to 0.1.
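This linear-interpolation adaptation can be sketched directly over uni-gram tables; probabilities absent from one model are treated as zero here, though a real system would smooth or back off (the toy probabilities below are illustrative only):

```python
def interpolate(p_baseline, p_target, a=0.9, b=0.1):
    """Linear-interpolation language model adaptation:
    P_adp(x) = a * P_baseline(x) + b * P_target(x)."""
    keys = set(p_baseline) | set(p_target)
    return {k: a * p_baseline.get(k, 0.0) + b * p_target.get(k, 0.0)
            for k in keys}

# Toy uni-grams: a baseline (Mandarin-like) model and a target model
# trained on the limited Cantonese corpus, including a colloquial word
# ("mou5", hypothetical) absent from the baseline.
p_mandarin = {"news": 0.6, "weather": 0.4}
p_cantonese = {"news": 0.2, "weather": 0.3, "mou5": 0.5}
p_adapted = interpolate(p_mandarin, p_cantonese)
```

Because b is small (0.1), the adapted model stays close to the well-trained baseline while still assigning nonzero probability to colloquial words seen only in the target corpus. The same function applies unchanged to bi-gram tables keyed by word pairs.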
  • collocations of existing words are evaluated as candidate new words.
  • a collocation is a pair of words which appear together significantly more frequently than would be expected by chance.
  • a collocation is accepted as a new word if its strength is higher than a predetermined threshold and its spread is also higher than a predetermined threshold.
  • the thresholds are chosen based on the preference of the system builder, for example, after simple hand tuning by the system builder using various thresholds.
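The patent does not give the exact strength and spread statistics, so the sketch below uses one common choice, in the spirit of Smadja's collocation work: strength as a PMI-style log-ratio of observed to chance co-occurrence, and spread as the variance of the per-offset co-occurrence counts (a pair locked to one relative position is sharply peaked at that offset and scores high):

```python
from collections import Counter
import math

def collocation_stats(sentences, w1, w2, window=5):
    """Score a candidate collocation (w1, w2) within a word window."""
    uni = Counter()
    offsets = []           # relative positions of w2 around w1
    n = 0
    for words in sentences:
        uni.update(words)
        n += len(words)
        for i, w in enumerate(words):
            if w != w1:
                continue
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i and words[j] == w2:
                    offsets.append(j - i)
    if not offsets:
        return 0.0, 0.0
    # Strength: log-ratio of observed co-occurrence rate to chance.
    p_joint = len(offsets) / n
    strength = math.log2(p_joint / ((uni[w1] / n) * (uni[w2] / n)))
    # Spread: variance of the per-offset counts across the window.
    counts = Counter(offsets)
    positions = [p for p in range(-window, window + 1) if p != 0]
    freqs = [counts.get(p, 0) for p in positions]
    mean_f = sum(freqs) / len(freqs)
    spread = sum((f - mean_f) ** 2 for f in freqs) / len(freqs)
    return strength, spread

# Toy corpus in which "hong" and "kong" always occur adjacently.
corpus = [["hong", "kong", "a"], ["hong", "kong", "b"], ["c", "hong", "kong"]]
strength, spread = collocation_stats(corpus, "hong", "kong")
```

A pair passing both thresholds would then be added to the lexicon as a new word, per the step above.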
  • the established automated dictation system for the target language may be used, for example, within a distributed speech recognition system, or other speech recognition system, for example, as discussed in the co-owned and co-pending U.S. patent application serial no. 09/613,472, filed on July 11, 2000 and entitled "SYSTEM AND METHODS FOR ACCEPTING USER INPUT IN A DISTRIBUTED ENVIRONMENT IN A SCALABLE MANNER", hereinafter referred to as the USER INPUT REFERENCE, which is hereby incorporated by reference in its entirety for all purposes.
  • the established automated dictation system, for use within the speech recognition system(s) of the USER INPUT REFERENCE, is preferably set up as discussed in the USER INPUT REFERENCE.
  • the keyword spotting system is established using the structures as discussed above.
  • a garbage model is established to absorb non-lexical word sounds such as "urn”, “ha”, “er”, and the like.
  • the garbage model is trained using Cantonese training data.
  • Filler phrases are determined from the Cantonese training data, preferably such training data that is collected as is further discussed below in a later section.
  • at least about 90 filler phrases are determined from the Cantonese training data that is collected via the Wizard-of-Oz collection system, as discussed below in a later section.
  • These filler phrases are modeled from the Cantonese training data.
  • at least about 600 keywords are modeled. These keywords preferably cover different domains, for example, domains including local or foreign news, weather information, travel information, computer games, restaurants, movies, education, and so forth for the keyword spotting task.
  • the established automated keyword spotting system for the target language may be used, for example, within a distributed speech recognition system, or other speech recognition system, for example, as discussed in the incorporated USER INPUT REFERENCE.
  • the established automated keyword spotting system, for use within the speech recognition system(s) of the USER INPUT REFERENCE, is preferably set up as discussed in the USER INPUT REFERENCE.
  • the established automated keyword spotting system for the target language may also be used, for example, within the utterance verification systems that are discussed in a U.S. patent application (serial no. not furnished at time of publication), hereinafter referred to as the UTTERANCE VERIFICATION REFERENCE, which is hereby incorporated by reference in its entirety for all purposes.
  • the established automated keyword spotting system, for use within the speech recognition system(s) of the UTTERANCE VERIFICATION REFERENCE, is preferably set up as discussed in the UTTERANCE VERIFICATION REFERENCE.
  • the speakers are told to speak naturally to the web browser to thereby surf the web for information.
  • the speakers' spontaneously spoken voice commands are recorded to form a spontaneous speech database.
  • the speakers can control the web browser, go to any link on the current page, or surf the Internet with their spontaneous speech input.
  • the web browser is controlled remotely by a human operator (the "Wizard") who directs web browser operations using a mouse and/or keyboard in response to the speakers' verbal instructions. Because the speakers are not aware of the existence of the operator and believe that a mere machine is responding to them, they give natural verbal commands.
  • the Wizard-of-Oz system uses at least eleven basic command keywords in both Chinese and English. Each speaker is instructed to surf using spoken requests. The speakers can continue to search for information on the web or read the web pages by spontaneously speaking.
  • the collected speech generally includes "garbage” and “filler” speech, characteristic of spontaneous speech.
  • Garbage speech includes extraneous utterances such as “um”, “ah”, “er”, short pauses, and out of vocabulary words.
  • Filler speech includes, for example, corrections and phrases such as “would you please …”, “I want to know about …”, and “I want …”.
  • the recording may be carried out using a noise-cancelling uni-directional microphone.
  • the format of the wave files may be as follows: 16 kHz sampling rate and 16 bits per sample. In one embodiment of the present invention, a total of 39 speakers (25 male and 14 female) were used. Each speaker spoke for approximately one hour. A total of 4,150 utterances were collected and transcribed. In general, of course, using even more training data would be even better.
  • the Wizard-of-Oz database collection system can be used to collect real spontaneous speech for analysis, i.e., for use in constructing a language model (bi-grams, tri-grams, and the like), but it takes a lot of effort to segment and to transcribe the data. Also, it is difficult to collect sufficient speech data for each specific task.
  • the newsgroup articles are up-to-date and include many colloquial phrases, new proper nouns or new compound words. For example, about six months of Hong Kong newsgroup data may be collected in various domains such as travel, music, computers, entertainment, politics, free talk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A system and method are provided. According to an embodiment of the invention, a method for establishing a language model to recognize speech of a first language comprises: obtaining language model parameters for a second language that is not the first language; obtaining training data for the first language; determining a language model based on the language model parameters for the second language and the training data for the target language (450); and making the language model available for use to handle speech of the first language (460).

Description

SYSTEM AND METHOD FOR SPEECH PROCESSING WITH LIMITED TRAINING DATA
RELATED APPLICATIONS The present application is related to, and claims the benefit of priority from, the following commonly-owned U.S. patent applications by the same inventors, the disclosures of which are hereby incorporated by reference in their entirety, including any incorporations-by-reference, appendices, or attachments thereof, for all purposes: serial no. 60/175,368, filed on January 10, 2000 and entitled SYSTEM AND METHODS FOR COLLOQUIAL LANGUAGE MODELING FOR SPONTANEOUS CANTONESE SPEECH; and serial no. <attorney docket number WIW-002.01>, filed on January 9, 2001 and entitled SYSTEM AND METHOD FOR SPEECH PROCESSING WITH LIMITED TRAINING DATA.
TECHNICAL FIELD The present invention relates to automated processing of speech. Embodiments of the present invention relate to automated keyword spotting, automated dictation from speech, or other automated speech recognition.
BACKGROUND OF THE INVENTION Automated speech processing (ASP) systems include, for example, automated keyword spotting systems, automated dictation systems, other automated speech recognition systems, and the like. Automated keyword spotting systems attempt to detect the presence or absence of particular word(s) or phrase(s) in an utterance of speech input, using data processing. Automated dictation systems attempt to convert an utterance of speech into its corresponding text, using data processing.
ASP systems typically include models or parameters that are established based on training data. Training data may include, for example, collections of text or recorded speech, for the language or dialect that is to be handled by the speech processing system. In general, insufficient training data, or training data that is not in some sense similar to the expected input speech for the ASP system, leads to poor system performance and/or to a less powerful speech processing system.
In general, for any given language, it costs significant money, effort, and other resources to collect or otherwise obtain sufficient and suitable training data from which to establish models or parameters. For languages or dialects (for example, English or Mandarin Chinese) that are used by great numbers of people, the cost of obtaining sufficient and suitable training data is typically justified by the existence of a large potential user base for the speech processing system that is contemplated. For other languages or dialects (for example, Cantonese Chinese) that are used by relatively few people, the cost of obtaining sufficient and suitable training data is typically more difficult to justify. Furthermore, for languages or dialects (for example, Cantonese Chinese) that are primarily spoken languages and are used relatively infrequently as written languages, the cost of obtaining sufficient and suitable training text data is typically especially high. For the above reasons, there tend to be fewer, if any, "state of the art" speech processing systems that can handle certain less popular languages or dialects, especially when such languages and dialects are primarily spoken languages.
SUMMARY OF THE INVENTION According to an embodiment of the invention, a method for establishing a language model to recognize speech of a first language comprises: obtaining language model parameters for a second language that is not the first language; obtaining training data for the first language; determining a language model based on the language model parameters for the second language and the training data for the target language; and making the language model available for use to handle speech of the first language.
According to another embodiment of the invention, a speech processing system includes a processor, a memory, a language model established by the above- described method, and speech recognition software configured to use the language model for recognizing speech.
According to another embodiment of the invention, a speech processing system includes: a processor; memory coupled to the processor; and logic stored in the memory configured to control the processor to: obtain training data for the first language; determine a language model based on the training data for the target language and on an existing language model for a second language; wherein the language model is for use in recognition of speech of the first language.
According to another embodiment of the invention, a speech processing system that is capable of recognizing words of a primarily colloquial language in an input utterance includes: a model for a word; a model for filler phrases, wherein the model for filler phrases was trained using online newsgroup articles; and speech recognition software configured to recognize an input utterance based on the model for a word and on the model for filler phrases.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is a block diagram of a computer system in which the present invention may be embodied.
FIG. 1B is a block diagram of a software system of the present invention for controlling operation of the system of FIG. 1A.
FIG. 2 is a schematic diagram of a typical continuous speech recognizer with which the present invention may be implemented.
FIG. 3 is a flow diagram that illustrates a method for establishing a colloquial language model in a target language.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
The following description will focus on the currently-preferred embodiment of the present invention, which is operative in an environment typically including desktop computers, server computers, and portable computing devices, occasionally or permanently connected to one another. The currently-preferred embodiment of the present invention may be implemented in an application operating in an Internet-connected environment and running under an operating system, such as the Microsoft® Windows operating system, on an IBM-compatible Personal Computer (PC) configured as an Internet server. The present invention, however, is not limited to any particular environment, device, or application. Instead, those skilled in the art will find that the present invention may be advantageously applied to any environment. For example, the present invention may be advantageously embodied on a variety of different platforms, including Macintosh, Linux, EPOC, BeOS, Solaris, UNIX, NextStep, and the like. For another example, although the following description will describe preferred embodiments that are adapted for the Cantonese Chinese language, the invention itself is not limited to the Cantonese Chinese language, and indeed may be embodied for other languages or dialects. The description of the exemplary embodiments which follows is, therefore, for the purpose of illustration and not limitation.
I. Overview
As described in the Background section, it is sometimes difficult to justify the expense of obtaining such sufficient amounts of suitable training data for establishing a high quality speech processing system for a language or dialect, especially for a language or dialect that has relatively few users or for a language or dialect that is primarily a spoken language. What is needed are means and methods by which the amount of resources needed for obtaining such sufficient amounts of suitable training data is reduced. For example, what is needed are means and methods by which collection of such training data is made easier and means and methods by which the amount of such training data, and therefore its cost, is reduced for establishing an ASP system for a language or dialect. The present invention satisfies these and other needs.
The present invention is especially suited to establishing an ASP system for a "target" language or dialect (for example, Cantonese Chinese) that is similar to, or is a dialect of, a "baseline" language or dialect (for example, Mandarin Chinese) for which portions of an ASP system (e.g., a language-model portion) have already been established. The preferred approach of the present invention is to start with an existing baseline portion of an actual or hypothetical ASP system for the baseline language and to modify the baseline portion for use with the target language, based on a limited amount of training data that has been obtained for the target language. The baseline portion may include, for example, models or parameters of the type used for an ASP system. These models or parameters may be, for example, models and parameters of a language model for the baseline language. For establishing an ASP system that can handle spontaneous speech, for a language that is primarily a spoken language, the limited amount of training data is preferably collected at least in part from online newsgroup articles, which tend to use colloquial vernacular, as opposed to newspaper articles, which tend to use more formal written language style.
In the preferred embodiment of the present invention, Cantonese Chinese, or the like, is the target language, and Mandarin Chinese, or the like, is the baseline language. The Chinese language includes four language groups (Mandarin, Cantonese, Fukienese, and Hakka) and many dialects. These four groups may be considered to be four languages. A dialect is a language variety used by a particular population of speakers. While there are some similarities between the Chinese languages and dialects, there are many differences. In the present document, two Chinese languages, Cantonese and Mandarin, will be discussed in detail. Spontaneous speech consists of colloquial words, corrections, hesitations, disfluency, short pauses, ill-formed grammar, and words that are not generally listed in the standard Chinese lexicon.
II. System Hardware
The present invention may be embodied on an information processing system such as the system 300 of FIG. 1A, which comprises a central processor 301, a main memory 302, an input/output (I/O) controller 303, a keyboard 304, a pointing device 305 (e.g., a mouse, pen device, or the like), a screen or display device 306, a mass storage 307 (e.g., hard disk, removable floppy disk, optical disk, magneto-optical disk, or flash memory, etc.), an audio input device 308 (e.g., a microphone, e.g., as found on a telephone that is coupled to the bus system 310), and an interface 309. Although not shown separately, a real-time system clock is included with the system 300, in a conventional manner. The various components of the system 300 communicate through a system bus 310 or similar architecture. In addition, the system 300 may communicate with other devices through the interface or communication port 309, which may be an RS-232 serial port or the like. Devices which will be commonly connected, occasionally or on a full-time basis, to the interface 309 include a network 351 (e.g., LANs or the Internet), a laptop 352, a handheld organizer 354 (e.g., the Palm organizer, available from Palm Computing, Inc., a subsidiary of 3Com Corp. of Santa Clara, California), a modem 353, and the like. In operation, program logic (implementing the methodology described below) is loaded from the storage device or mass storage 307 into the main memory 302, for execution by the processor 301. During operation of the program (logic), the user enters commands and data through (a) the keyboard 304, (b) the pointing device 305, which is typically a mouse, a track ball, or the like, and/or (c) the audio input device by voice input, and/or (d) the like. The computer system displays text and/or graphic images and other data on the display device 306, such as a cathode-ray tube or an LCD display.
A hard copy of the displayed information, or other information within the system 300, may be printed to other output devices (e.g., a printer), not shown, which would be connected to the bus system 310. In a preferred embodiment, the computer system 300 includes an IBM PC-compatible personal computer (available from a variety of vendors, including IBM of Armonk, New York) running a Unix operating system (e.g., Linux, which is available from Red Hat Software, of Durham, North Carolina, U.S.A.). In a preferred embodiment, the system 300 is an Internet or intranet or other type of network server, e.g., one connected to a worldwide publicly accessible communication network, and receives input from (e.g., digitized audio voice input), and sends output to, a remote user via the interface 309 according to standard techniques and protocols.
III. System Software
As illustrated in FIG. 1B, a computer software system 320 is provided for directing the operation of the computer system 300. Software system 320, which is stored in system memory 302 and on storage (e.g., disk memory) 307, includes a kernel or operating system (OS) 340 and a windows shell 350. One or more application programs, such as client application software or "programs" 345 may be "loaded" (i.e., transferred from storage 307 into memory 302) for execution by the system 300. System 320 includes a user interface (UI) 360, preferably a Graphical User Interface (GUI), for receiving user commands and data and for producing output to the user. These inputs, in turn, may be acted upon by the system 300 in accordance with instructions from operating system module 340, windows module 350, and/or client application module(s) 345. The UI 360 also serves to display user prompts and results of operation from the OS 340, windows 350, and application(s) 345, whereupon the user may supply additional inputs or terminate the session. In a specific embodiment, OS 340 and windows 350 together comprise Microsoft Windows software (e.g., Windows 9x or Windows NT, available from Microsoft Corporation of Redmond, Washington). In the preferred embodiment, OS 340 is the Unix operating system (e.g., the Linux operating system). Although shown conceptually as a separate module, the UI is typically provided by interaction of the application modules with the windows shell and the OS 340. One application program 200 is a speech recognition system, according to the present invention, that uses colloquial language models, e.g., for spontaneous Cantonese speech, which will be described in further detail. While the invention is described in some detail with specific reference to preferred embodiments and certain alternatives, there is no intent to limit the invention to that particular embodiment or those specific alternatives.
IV. System Structure
A. Overview
The present invention may be built upon a standard ASP system, e.g., one that uses Hidden Markov Models (HMMs), by adding the structures, method steps, and computations described in the present document. ASP systems such as automated keyword spotting systems or automated speech recognition systems, and HMMs, are well known in the relevant art, and are described, for example, in the book Fundamentals of Speech Recognition, by Lawrence Rabiner & Biing-Hwang Juang, published by Prentice Hall (Signal Processing Series), Englewood Cliffs NJ, 1993, ISBN 0-13-015157-2, hereinafter referred to as RABINER 93.
B. The Dictation System
1. Overview
The dictation system of the preferred embodiment of the present invention is built along the lines shown in FIG. 2. FIG. 2 is a schematic diagram, adapted from FIG. 8.7 of RABINER 93, of a typical continuous speech recognizer 400. The following description of the typical continuous speech recognizer 400, and its processing, is adapted from description in Section 8.8 of RABINER 93. The first step in the processing is spectral analysis in a spectral analysis module 405 to derive the feature vectors 410 used to characterize the spectral properties of speech input 415. The second step in the recognizer 400 is a combined word-level/sentence-level match, in a matching module 417 that includes a word-level match module 420 and a sentence-level match module 425. The way this combined match is accomplished is as follows. Using a set of subword HMMs 430 and a word lexicon 435, a set of word models 440 is created by a word model composition module 445. The set of word models 440 is created by concatenating each of the subword unit HMMs as specified in the word lexicon 435, in a conventional manner. The way in which the sentence-level match is done is via a finite state network (FSN) realization of a word grammar 450 (the syntax of the system) and semantics 455, as expressed in a composite FSN language model 460. The implementation of the combined word-level match/sentence-level match is via any conventional manner, for example, via any of the structures described in Chapter 7 of RABINER 93. Typically, systems use structures similar to the conventional frame synchronous level-building method (usually with some type of beam search to restrict the range of paths) to solve for the best recognized sentence 465 (the result).
2. Acoustic Model
The preferred dictation system, ultimately to be adapted for Cantonese Chinese, is based on phoneme continuous density hidden Markov models, with 16 Gaussian mixtures per state. Each subword unit is modeled, for Chinese, as an initial part and a final part. The initial part is modeled using a 3-state left-to-right HMM with no state skips. The final part is modeled using a 4-state HMM. There are 19 initial parts and 54 final parts if the tone information is ignored, for Cantonese Chinese. In the system, initial parts are modeled by right context-dependent models. Final parts are modeled by context-independent models. In total, there are 195 subword models for Cantonese Chinese, which may be termed phone models.
The recognizer feature vector consists of the following 39 parameters: 12 Mel-warped frequency-based cepstral coefficients (MFCC), 12 delta cepstral coefficients, 12 delta-delta cepstral coefficients, energy, and the delta and delta-delta of the energy parameters. The initial units used for Cantonese are: b, c, d, f, g, gw, h, j, k, kw, l, m, n, ng, p, s, t, and z. The final units used for Cantonese are: aa, aai, aak, aam, aan, aang, aap, aat, aau, ai, ak, am, an, ang, ap, at, au, e, ei, ek, en, eng, eoi, eon, eot, et, i, ik, im, in, ing, ip, it, iu, m, ng, o, oe, oek, oen, oeng, oi, ok, on, ong, ot, ou, u, ui, uk, un, ung, ut, yu, yun, and yut.
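The delta and delta-delta parameters are time derivatives of the corresponding static features. The following sketch shows one way such a 39-dimensional vector could be assembled; the two-frame central difference used for the derivatives, and the grouping of the 13 static values (12 MFCC plus energy) ahead of their deltas, are illustrative assumptions rather than the exact layout of the preferred embodiment:

```python
def deltas(frames):
    """First-order time derivatives via a simple two-frame central difference.

    frames: list of per-frame feature vectors (lists of floats).
    The first and last frames are repeated as padding at the edges.
    """
    padded = [frames[0]] + frames + [frames[-1]]
    return [[(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
            for t in range(len(frames))]

def full_feature_vectors(mfcc, energy):
    """Assemble 39-dimensional vectors: 12 MFCC + energy (13 static values),
    plus their deltas (13) and delta-deltas (13)."""
    static = [c[:12] + [e] for c, e in zip(mfcc, energy)]
    d = deltas(static)
    dd = deltas(d)
    return [s + di + ddi for s, di, ddi in zip(static, d, dd)]
```

Each input frame thus expands from 13 numbers to the 39 recognizer parameters listed above.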
3. Language Model
The language model includes a trained uni-gram distribution and a trained bi-gram distribution, per conventional practice. The uni-gram distribution is the probability P(w1) that a given word (i.e., segment) in a sentence is actually the word w1 of the lexicon. The bi-gram distribution is the probability P(w2|w1), which is the probability that a given word (i.e., segment) in a sentence is the word w2 of the lexicon, given that the word (i.e., segment) was preceded in the sentence by the word w1 of the lexicon. The language model is established, according to the present invention, for the target language, as will be further discussed in a later section.
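As a minimal sketch of how such distributions are trained by relative frequency over an already-segmented corpus (real systems would add smoothing for unseen words, omitted here for brevity):

```python
from collections import Counter

def train_ngrams(sentences):
    """Estimate uni-gram P(w) and bi-gram P(w2|w1) by relative frequency.

    sentences: lists of already-segmented words.
    Returns (unigram, bigram); bigram maps (w1, w2) -> P(w2|w1).
    Note: sentence-final words are counted as histories too, a small
    simplification versus tracking history counts separately.
    """
    uni, bi = Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    total = sum(uni.values())
    unigram = {w: n / total for w, n in uni.items()}
    bigram = {(w1, w2): n / uni[w1] for (w1, w2), n in bi.items()}
    return unigram, bigram
```

These are the uni-gram and bi-gram "language model parameters" referred to throughout the present document.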
C. Keyword Spotting System
Like the preferred dictation system, the preferred keyword spotting system is based on phoneme continuous density hidden Markov models. In an embodiment of the invention, 16 Gaussian mixtures are used per state in a keyword spotting system. The keyword spotting system also includes a garbage model with conventional structure, for example, with five HMM states and 20 frame sizes, with 16 Gaussian mixtures per state. Each subword unit is modeled in the same manner as discussed above for the dictation system. The keyword spotting system is established for the target language, according to the present invention, as will be further discussed in a later section.
V. Establishing an Automated Dictation System for the Target Language
A. Establishing an Acoustic Model
An automated dictation system for the target language is established by establishing a language model and an acoustic model according to the structure discussed in an earlier section. Establishing the language model will be discussed shortly in a later subsection. Given the language model, the acoustic models are established by being conventionally trained using acoustic training data (i.e., voice recordings) that are collected for the target language, namely, Cantonese. Collection of such acoustic training data will be further discussed in a later section.
B. Establishing the Language Model for the Target Language
1. Overview
From analysis of Cantonese data collected as discussed in a later section, it is seen that content words in Mandarin and Cantonese are very similar, but filler or colloquial phrases (defined in a later section) are very different. To establish an automated dictation system for Cantonese, the target language, it is necessary to solve the following two problems. The first problem is that some colloquial or filler words in the Cantonese corpus never appear in a Mandarin corpus nor in a traditional Chinese lexicon. The second problem is that the spoken syntax of Cantonese and Mandarin are different, and thus the "correct" N-gram counts for Cantonese and Mandarin should be different.
In order to train an effective language model for colloquial Cantonese speech, a lot of Cantonese text data is required. For Mandarin, there is a sufficient text database to train a language model, but for Cantonese, there is not sufficient Cantonese text data. To solve the above problem, a conventional baseline Mandarin language model is obtained and is then adapted to form a colloquial Cantonese language model. FIG. 3 is a flow diagram that illustrates a method 500 for establishing a colloquial language model in the target language, namely, Cantonese. In a step 510, language model parameters are trained for a baseline language, using training data for the baseline language. For example, the baseline language is preferably a written language. For example, the baseline language is preferably Mandarin. For example, the baseline language model parameters are preferably from a baseline language model, for Mandarin, that is trained using large numbers of Mandarin newspaper articles, as will be further discussed in a later section. In a step 515, training data of a target language is obtained. For example, the target language is a colloquial language that is similar to the (written) baseline language (e.g., similar due to sharing some words and some syntax). For example, the target language is preferably Cantonese. For example, training data of the target language is preferably obtained from Cantonese online newsgroup articles, as will be further discussed in a later section. Preferably, at least about 12 million bytes of such training data is obtained; of course, more training data would be even better. However, thanks to the present invention, even less than 12 million bytes of such training data is sufficient to establish an ASP with satisfactory performance. In a step 520, words of the target language are obtained that are not in the lexicon of the baseline language.
For example, these words may be determined from the training data of the target language, as will be further discussed in a later section. Preferably, at least about 600 colloquial phrases or terms are determined. In a step 525, the words are added to the lexicon of the baseline language to obtain an updated lexicon. For example, the at least 600 colloquial phrases or terms may be added to the lexicon as words. In a step 530, language model parameters for the target language are trained based on the training data of the target language, for example, using the training data of the target language (obtained in the step 515) and the updated lexicon (from the step 525). For example, a limited amount of training data of the target language is segmented using the updated lexicon and then frequencies of occurrence of lexicon entries in the limited amount of training data are counted, according to conventional practice, to obtain uni-gram and bi-gram distributions, which are examples of N-gram distributions. In a step 535, a language model is determined for the target language based on the language model parameters for the baseline language and on the language model parameters determined in the step 530 for the target language. For example, the language model parameters determined in the step 530 for the target language are combined with the baseline language model for the baseline language to obtain a language model for the target language. For example, the combination is performed using language model adaptation by linear interpolation.
2. Training a Baseline Language Model, for the Baseline Language
As mentioned above, in the step 510, the language model parameters are trained for the preferred baseline language, Mandarin, using training data that includes
Hong Kong newspaper articles which are written in formal Chinese (Mandarin). The language model parameters reflect Mandarin Chinese and form a baseline language model.
In an embodiment of the invention, the training corpus size is about 40 million bytes.
Generally, an even larger corpus would be still better. Per conventional practice, the baseline language model includes uni-gram and bi-gram language models. These models are established by applying a Mandarin lexicon to segment the corpus, and then counting the uni-gram and bi-gram frequencies of occurrence within the corpus, in conventional manner. The Mandarin lexicon may have, for example, roughly 36,000 entries.
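The lexicon-based segmentation step can be sketched as a greedy forward maximum-matching pass. This is an assumed, simplified stand-in for whatever conventional segmenter is actually used; in the example, Latin letters stand in for Chinese characters:

```python
def segment(text, lexicon, max_len=6):
    """Greedy forward maximum-matching segmentation against a fixed lexicon.

    At each position, take the longest lexicon entry that matches;
    otherwise fall back to a single character (a candidate OOV item).
    """
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon or n == 1:
                words.append(text[i:i + n])
                i += n
                break
    return words
```

For instance, `segment("abcd", {"ab", "abc", "d"})` yields `["abc", "d"]`: the longer entry "abc" wins over "ab". The uni-gram and bi-gram frequencies are then counted over the segmented output.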
3. Adapting the Language Model for the Target Language
As mentioned above, in the step 535, a language model is determined for the target language. The language model is preferably determined using linear interpolation language model adaptation. The adapted uni-gram Padp(wi) is established as follows:

Padp(wi) = λ1 PM(wi) + (1 − λ1) PC(wi)

where λ1 and (1 − λ1) are the combination factors; PM(wi) is the uni-gram of the large Mandarin corpus (newspapers); PC(wi) is the uni-gram of the limited Cantonese corpus (newsgroups); and 0 ≤ λ1 ≤ 1.

The adapted bi-gram Padp(wi|wj) is established as follows:

Padp(wi|wj) = λ2 PM(wi|wj) + (1 − λ2) PC(wi|wj)

where λ2 and (1 − λ2) are the combination factors; PM(wi|wj) is the bi-gram of the large Mandarin corpus; PC(wi|wj) is the bi-gram of the limited Cantonese corpus; and 0 ≤ λ2 ≤ 1.

The determined uni-gram Pfinal(wi) is Padp(wi). The determined bi-gram Pfinal(wi|wj) is determined using linear interpolation as follows:

Pfinal(wi|wj) = a Padp(wi|wj) + b Padp(wi)

where a and b are the linear combination factors, and a may be set to 0.9 and b may be set to 0.1.
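The linear-interpolation adaptation described above can be sketched in a few lines. The dictionary representation is an assumption for illustration, and the combination factors λ1 and λ2 are free parameters not fixed by the text, while a = 0.9 and b = 0.1 follow the values given above:

```python
def interpolate(p_baseline, p_target, lam):
    """Padp(x) = lam * PM(x) + (1 - lam) * PC(x), with 0 <= lam <= 1.

    p_baseline / p_target: dicts mapping an N-gram key to its probability.
    """
    keys = set(p_baseline) | set(p_target)
    return {k: lam * p_baseline.get(k, 0.0) + (1 - lam) * p_target.get(k, 0.0)
            for k in keys}

def final_bigram(adp_bigram, adp_unigram, a=0.9, b=0.1):
    """Pfinal(wi|wj) = a * Padp(wi|wj) + b * Padp(wi).

    adp_bigram maps (wj, wi) -> Padp(wi|wj); adp_unigram maps wi -> Padp(wi).
    """
    return {(wj, wi): a * p + b * adp_unigram.get(wi, 0.0)
            for (wj, wi), p in adp_bigram.items()}
```

The same `interpolate` helper serves for both the uni-gram adaptation (keys are words) and the bi-gram adaptation (keys are word pairs).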
C. Determining New Words to Add to the Baseline Lexicon
As Chinese words are not separated by spaces, the newsgroup articles are segmented based on the Mandarin lexicon. After segmentation, unknown Chinese characters and single characters that appear with a high frequency are used as candidates for determining new words, i.e., out-of-vocabulary (OOV) words. A statistical tool, such as CXtract, may be used to determine longer OOV words automatically. CXtract is further discussed in Pascale FUNG and Dekai WU, "Statistical Augmentation of a Chinese Machine-Readable Dictionary", in WVLC-2, Second Annual Workshop on Very Large Corpora (COLING-94), 1994, which is incorporated by reference in its entirety for all purposes. Terms that are longer than a single Chinese character are identified using statistical methods. For example, collocations of existing words, e.g., of existing single-character words, are evaluated as candidate new words. A collocation is a pair of words which appear together significantly more frequently than would be expected by chance. A collocation is accepted as a new word if its strength is higher than a predetermined threshold and its spread is also higher than a predetermined threshold. The thresholds are chosen based on the preference of the system builder, for example, after simple hand tuning by the system builder using various thresholds.
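A rough sketch of such candidate extraction follows. The strength and spread statistics used here (a z-score of a pair's frequency against all pairs, and a count of distinct documents containing the pair) are simplified stand-ins chosen for illustration, not CXtract's exact definitions:

```python
from collections import Counter
from math import sqrt

def candidate_collocations(docs, strength_min=1.0, spread_min=2):
    """Propose new-word candidates from adjacent-unit pairs.

    docs: list of documents, each a list of single characters/words.
    strength: how many standard deviations a pair's corpus frequency
    sits above the mean pair frequency.
    spread (simplified): number of distinct documents containing the pair.
    """
    pair_freq, pair_docs = Counter(), {}
    for d, doc in enumerate(docs):
        for pair in zip(doc, doc[1:]):
            pair_freq[pair] += 1
            pair_docs.setdefault(pair, set()).add(d)
    freqs = list(pair_freq.values())
    mean = sum(freqs) / len(freqs)
    std = sqrt(sum((f - mean) ** 2 for f in freqs) / len(freqs)) or 1.0
    return [p for p, f in pair_freq.items()
            if (f - mean) / std > strength_min and len(pair_docs[p]) >= spread_min]
```

Accepted pairs would then be merged into single lexicon entries; iterating the process can grow candidates longer than two characters.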
D. Using the Established Automated Dictation System
The established automated dictation system for the target language may be used, for example, within a distributed speech recognition system, or other speech recognition system, for example, as discussed in the co-owned and co-pending U.S. patent application serial no. 09/613,472, filed on July 11, 2000 and entitled "SYSTEM AND METHODS FOR ACCEPTING USER INPUT IN A DISTRIBUTED ENVIRONMENT IN A SCALABLE MANNER", hereinafter referred to as the USER INPUT REFERENCE, which is hereby incorporated by reference in its entirety for all purposes. The established automated dictation system, for use within the speech recognition system(s) of the USER INPUT REFERENCE, is preferably set up as discussed in the USER INPUT REFERENCE.
VI. Establishing an Automated Keyword Spotting System for the Target Language
The keyword spotting system is established using the structures as discussed above. A garbage model is established to absorb non-lexical word sounds such as "um", "ha", "er", and the like. The garbage model is trained using Cantonese training data. Filler phrases are determined from the Cantonese training data, preferably such training data that is collected as is further discussed below in a later section. Preferably, at least about 90 filler phrases are determined from the Cantonese training data that is collected via the Wizard-of-Oz collection system, as discussed below in a later section. These filler phrases are modeled from the Cantonese training data. Preferably, at least about 600 keywords are modeled. These keywords preferably cover different domains, for example, domains including local or foreign news, weather information, travel information, computer games, restaurants, movies, education, and so forth for the keyword spotting task.
The established automated keyword spotting system for the target language may be used, for example, within a distributed speech recognition system, or other speech recognition system, for example, as discussed in the incorporated USER INPUT REFERENCE. The established automated keyword spotting system, for use within the speech recognition system(s) of the USER INPUT REFERENCE, is preferably set up as discussed in the USER INPUT REFERENCE. The established automated keyword spotting system for the target language may also be used, for example, within the utterance verification systems that are discussed in U.S. patent application serial no. , attorney docket number WIW-001.01, filed on January 9, 2001 and entitled "SYSTEM AND METHOD FOR UTTERANCE VERIFICATION OF CHINESE LONG AND SHORT KEYWORDS", hereinafter referred to as the UTTERANCE VERIFICATION REFERENCE, which is hereby incorporated by reference in its entirety for all purposes. The established automated keyword spotting system, for use within the speech recognition system(s) of the UTTERANCE VERIFICATION REFERENCE, is preferably set up as discussed in the UTTERANCE VERIFICATION REFERENCE.
VII. Collection of a Limited Amount of Target-language Training Data
A. Speech Training Data: "Wizard of Oz"
To develop a spontaneous speech understanding system, a spontaneous speech database is collected for modeling the characteristics of spontaneous speech. It would be better to collect phrases from real life situations, but doing so is not easy. If a written script is given to a person, and the person is asked to speak naturally, then the resulting utterance would not be natural and would not capture true spontaneous speech characteristics. To overcome this problem, a semi-automatic database collection system is set up using a conventional Wizard-of-Oz scheme, adapted to collect spontaneous spoken commands to an envisioned voice-controlled web browser. A web browser is presented to human speakers. The speakers are told that the browser automatically responds to voice commands. The speakers are told to speak naturally to the web browser to thereby surf the web for information. The speakers' spontaneously spoken voice commands are recorded to form a spontaneous speech database. The speakers can control the web browser, go to any link on the current page, or surf the Internet with their spontaneous speech input. In reality, the web browser is controlled remotely by a human operator (the "Wizard") who directs web browser operations using a mouse and/or keyboard in response to the speakers' verbal instructions. Because the speakers are not aware of the existence of the operator and believe that a mere machine is responding to them, they give natural verbal commands.
In an embodiment of the invention, the Wizard-of-Oz system uses at least eleven basic command keywords in both Chinese and English. Each speaker is instructed to surf using spoken requests. The speakers can continue to search for information on the web or read the web pages by spontaneously speaking. The collected speech generally includes "garbage" and "filler" speech, characteristic of spontaneous speech. Garbage speech includes extraneous utterances such as "um", "ah", "er", short pauses, and out-of-vocabulary words. Filler speech includes, for example, corrections, and phrases such as "would you please ...", "I want to know about ...", and "I want ...".
The recording may be carried out using a noise-cancelling uni-directional microphone. The format of the wave files may be as follows: 16 kHz sampling rate and 16 bits per sample. In one embodiment of the present invention, a total of 39 speakers (25 male and 14 female) was used. Each speaker spoke for approximately one hour. A total of 4,150 utterances were collected and transcribed. In general, of course, using even more training data would be even better.
B. Text Training Data: Online Discussion Groups
The Wizard-of-Oz database collection system can be used to collect real spontaneous speech for analysis, i.e., for use in constructing a language model (bi-grams, tri-grams, and the like), but it takes a lot of effort to segment and to transcribe the data. Also, it is difficult to collect sufficient speech data for each specific task.
In order to cover more topics, a huge amount of data is desired. It is more convenient to collect text-based data on various topics than to collect speech data. Chinese Cantonese, primarily a spoken language, is very different from Chinese Mandarin, which is a written language as well as a spoken language. Local (Hong Kong) newspapers are not suitable for modeling the Cantonese language because such newspapers primarily use formal Chinese Mandarin. There is not enough Cantonese text available. However, with the advent of online newsgroup articles, Cantonese text can be collected. In most newsgroup articles and FAQs (Frequently Asked Questions), people usually ask questions or express their views using written forms of colloquial language. Moreover, the written styles are very similar to language used in real life situations. The newsgroup articles are up-to-date and include many colloquial phrases, new proper nouns or new compound words. For example, about six months of Hong Kong newsgroup data may be collected in various domains such as travel, music, computers, entertainment, politics, free talk, and the like.
VIII. Further Comments
While the invention is described in some detail with specific reference to preferred embodiments and certain alternatives, there is no intent to limit the invention to those particular embodiments or specific alternatives. Thus, the true scope of the present invention is not limited to any one of the foregoing exemplary embodiments but is instead defined by the appended claims.

Claims

WHAT IS CLAIMED IS:
1. In a data processing environment, a method for establishing a language model to recognize speech of a first language, the method comprising: obtaining language model parameters for a second language that is not the first language; obtaining training data for the first language; determining a language model based on the language model parameters for the second language and the training data for the first language; and making the language model available for use to handle speech of the first language.
2. The method of claim 1 wherein the step of obtaining training data for the first language comprises collecting text from online discussion group articles, wherein the training data is reflective of colloquial language usage.
3. The method of claim 1 wherein the training data for the first language is limited, and performing speech recognition using the language model leads to better recognition performance than would be obtained by using another language model that is determined using the training data for the first language without using the language model parameters for the second language.
4. The method of claim 3 wherein the training data for the first language consists of no more than 12 million bytes of text.
5. The method of claim 1 wherein the first language is Cantonese Chinese, and the second language is Mandarin Chinese.
6. The method of claim 1 wherein the first language is more of a colloquial language than a written language.
7. The method of claim 1 wherein the determining step comprises training language model parameters for the first language based on the training data of the first language.
8. The method of claim 7 wherein the determining step further comprises combining the language model parameters for the first language with language model parameters for the second language.
9. The method of claim 8 wherein the combining step comprises linearly interpolating N-grams parameters.
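By way of illustration only, the linear interpolation of N-gram parameters recited in claim 9 can be sketched as combining two probability tables with a mixing weight. The weight value and the toy bigram probabilities below are assumptions for the example, not values from the specification.

```python
def interpolate(p_first, p_second, lam=0.5):
    """Linearly interpolate two n-gram probability tables:
    P(w|h) = lam * P_first(w|h) + (1 - lam) * P_second(w|h).
    An n-gram absent from one table contributes probability 0 there."""
    keys = set(p_first) | set(p_second)
    return {k: lam * p_first.get(k, 0.0) + (1 - lam) * p_second.get(k, 0.0)
            for k in keys}

# Toy bigram probabilities for a first-language (Cantonese) model trained
# on limited data and a second-language (Mandarin) model.
cantonese = {("ngo", "soeng"): 0.4}
mandarin = {("ngo", "soeng"): 0.1, ("wo", "xiang"): 0.3}
combined = interpolate(cantonese, mandarin, lam=0.7)
print(round(combined[("ngo", "soeng")], 2))  # 0.31
```

A larger weight on the first-language model favors the limited in-language training data while still backing off to the second-language statistics for n-grams it never observed.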
10. The method of claim 8 wherein the determining step further comprises identifying words in the training data that are not in a lexicon of the second language; and adding the words to the lexicon of the second language to obtain an updated lexicon.
11. The method of claim 7 wherein the language model parameters for the second language comprise a language model for the second language, and the determining step further comprises adapting the language model for the second language to reflect the first language.
12. The method of claim 1 further comprising: receiving input speech; evaluating the input speech using the language model; and recognizing the input speech based on result of the evaluating step.
13. A speech processing system comprising a processor, a memory, a language model established by the method of claim 1, and speech recognition software configured to use the language model for recognizing speech.
14. The method of claim 1 wherein the first and second languages have words in common.
15. A speech processing system comprising:
a processor;
memory coupled to the processor; and
logic stored in the memory configured to control the processor to:
obtain training data for a first language; and
determine a language model based on the training data for the first language and on an existing language model for a second language;
wherein the language model is for use in recognition of speech of the first language.
16. A speech processing system that is capable of recognizing words of a primarily colloquial language in an input utterance, the system comprising:
a model for a word;
a model for filler phrases, wherein the model for filler phrases was trained using online discussion articles; and
speech recognition software configured to recognize an input utterance based on the model for a word and on the model for filler phrases.
PCT/US2001/001110 2000-01-10 2001-01-10 System and method for speech processing with limited training data WO2001052238A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US17536800P 2000-01-10 2000-01-10
US60/175,368 2000-01-10
US75803001A 2001-01-09 2001-01-09
US09/758,030 2001-01-09

Publications (1)

Publication Number Publication Date
WO2001052238A1 true WO2001052238A1 (en) 2001-07-19

Family

ID=26871141

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/001110 WO2001052238A1 (en) 2000-01-10 2001-01-10 System and method for speech processing with limited training data

Country Status (1)

Country Link
WO (1) WO2001052238A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130084976A1 (en) * 2011-10-01 2013-04-04 Microsoft Corporation Game paradigm for language learning and linguistic data generation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5764851A (en) * 1996-07-24 1998-06-09 Industrial Technology Research Institute Fast speech recognition method for mandarin words
US5787230A (en) * 1994-12-09 1998-07-28 Lee; Lin-Shan System and method of intelligent Mandarin speech input for Chinese computers

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CN KR