WO2001052238A1 - System and method for speech processing with limited training data - Google Patents

System and method for speech processing with limited training data Download PDF

Info

Publication number
WO2001052238A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
model
training data
speech
language model
Prior art date
Application number
PCT/US2001/001110
Other languages
French (fr)
Inventor
Yuen Yee Lo
Pascale Fung
Original Assignee
Weniwen Technologies, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weniwen Technologies, Inc. filed Critical Weniwen Technologies, Inc.
Publication of WO2001052238A1 publication Critical patent/WO2001052238A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling

Definitions

  • the present invention relates to automated processing of speech.
  • Embodiments of the present invention relate to automated keyword spotting, automated dictation from speech, or other automated speech recognition.
  • Automated speech processing (ASP) systems include, for example, automated keyword spotting systems, automated dictation systems, other automated speech recognition systems, and the like. Automated keyword spotting systems attempt to detect the presence or absence of particular word(s) or phrase(s) in an utterance of speech input, using data processing. Automated dictation systems attempt to convert an utterance of speech into its corresponding text, using data processing.
  • ASP systems typically include models or parameters that are established based on training data.
  • Training data may include, for example, collections of text or recorded speech, for the language or dialect that is to be handled by the speech processing system.
  • a method for establishing a language model to recognize speech of a first language comprises: obtaining language model parameters for a second language that is not the first language; obtaining training data for the first language; determining a language model based on the language model parameters for the second language and the training data for the target language; and making the language model available for use to handle speech of the first language.
  • a speech processing system includes a processor, a memory, a language model established by the above- described method, and speech recognition software configured to use the language model for recognizing speech.
  • a speech processing system includes: a processor; memory coupled to the processor; and logic stored in the memory configured to control the processor to: obtain training data for the first language; determine a language model based on the training data for the target language and on an existing language model for a second language; wherein the language model is for use in recognition of speech of the first language.
  • a speech processing system that is capable of recognizing words of a primarily colloquial language in an input utterance includes: a model for a word; a model for filler phrases, wherein the model for filler phrases was trained using online newsgroup articles; and speech recognition software configured to recognize an input utterance based on the model for a word and on the model for filler phrases.
  • FIG. 1A is a block diagram of a computer system in which the present invention may be embodied.
  • FIG. 1B is a block diagram of a software system of the present invention for controlling operation of the system of FIG. 1A.
  • FIG. 2 is a schematic diagram of a typical continuous speech recognizer from which the present invention may be implemented.
  • FIG. 3 is a flow diagram that illustrates a method for establishing a colloquial language model in a target language.
  • the following description will focus on the currently-preferred embodiment of the present invention, which is operative in an environment typically including desktop computers, server computers, and portable computing devices, occasionally or permanently connected to one another.
  • the currently-preferred embodiment of the present invention may be implemented in an application operating in an Internet-connected environment and running under an operating system, such as the Microsoft® Windows operating system, on an IBM-compatible Personal Computer (PC) configured as an Internet server.
  • the present invention is not limited to any particular environment, device, or application. Instead, those skilled in the art will find that the present invention may be advantageously applied to any environment.
  • the present invention may be advantageously embodied on a variety of different platforms, including Macintosh, Linux, EPOC, BeOS, Solaris, UNIX, NextStep, and the like.
  • the present invention is especially suited to establishing an ASP system for a "target" language or dialect (for example, Cantonese Chinese) that is similar to, or is a dialect of, a "baseline” language or dialect (for example, Mandarin Chinese) for which portions of an ASP system (e.g., a language-model portion) has already been established.
  • the preferred approach of the present invention is to start with an existing baseline portion of an actual or hypothetical ASP system for the baseline language and to modify the baseline portion for use with the target language, based on a limited amount of training data that has been obtained for the target language.
  • the baseline portion may include, for example, models or parameters of the type used for an ASP system. These models or parameters may be, for example, models and parameters of a language model for the baseline language.
  • the limited amount of training data is preferably collected at least in part from online newsgroup articles, which tend to use colloquial vernacular, as opposed to newspaper articles, which tend to use more formal written language style.
  • Cantonese Chinese, or the like is the target language
  • Mandarin Chinese, or the like is the baseline language
  • the Chinese language includes four language groups (Mandarin, Cantonese, Fukienese, and Hakka) and many dialects. These four groups may be considered to be four languages.
  • a dialect is a language variety used by a particular population of speakers. While there are some similarities between the Chinese languages and dialects, there are many differences. In the present document, two Chinese languages, Cantonese and Mandarin, will be discussed in detail.
  • Spontaneous speech consists of colloquial words, corrections, hesitations, dis-fluency, short pauses, ill-formed grammar, and words that are not generally listed in the standard Chinese lexicon.
  • the present invention may be embodied on an information processing system such as the system 300 of FIG. 1A, which comprises a central processor 301, a main memory 302, an input/output (I/O) controller 303, a keyboard 304, a pointing device 305 (e.g., a mouse, pen device, or the like), a screen or display device 306, a mass storage 307 (e.g., hard disk, removable floppy disk, optical disk, magneto-optical disk, or flash memory, etc.), an audio input device 308 (e.g., a microphone, e.g., as found on a telephone that is coupled to the bus system 310), and an interface 309.
  • a real-time system clock is included with the system 300, in a conventional manner.
  • the various components of the system 300 communicate through a system bus 310 or similar architecture.
  • the system 300 may communicate with other devices through the interface or communication port 309, which may be an RS-232 serial port or the like.
  • Devices which will be commonly connected, occasionally or on a full-time basis, to the interface 309 include a network 351 (e.g., LANs or the Internet), a laptop 352, a handheld organizer 354 (e.g., the Palm organizer, available from Palm Computing, Inc., a subsidiary of 3Com Corp. of Santa Clara, California), a modem 353, and the like.
  • program logic (implementing the methodology described below) is loaded from the storage device or mass storage 307 into the main memory 302, for execution by the processor 301.
  • the user enters commands and data through (a) the keyboard 304, (b) the pointing device 305, which is typically a mouse, a track ball, or the like, (c) the audio input device by voice input, and/or (d) the like.
  • the computer system displays text and/or graphic images and other data on the display device 306, such as a cathode-ray tube or an LCD display.
  • a hard copy of the displayed information, or other information within the system 300 may be printed to other output devices (e.g., a printer), not shown, which would be connected to the bus system 310.
  • the computer system 300 includes an IBM PC-compatible personal computer (available from a variety of vendors, including IBM of Armonk, New York) running a Unix operating system (e.g., Linux, which is available from Red Hat Software, of Durham, North Carolina, U.S.A.).
  • the system 300 is an Internet or intranet or other type of network server, e.g., one connected to a worldwide publicly accessible communication network, and receives input (e.g., digitized audio voice input) from, and sends output to, a remote user via the interface 309 according to standard techniques and protocols.
  • a computer software system 320 is provided for directing the operation of the computer system 300.
  • Software system 320 which is stored in system memory 302 and on storage (e.g., disk memory) 307, includes a kernel or operating system (OS) 340 and a windows shell 350.
  • One or more application programs, such as client application software or "programs” 345 may be "loaded” (i.e., transferred from storage 307 into memory 302) for execution by the system 300.
  • System 320 includes a user interface (UI) 360, preferably a Graphical User Interface (GUI).
  • OS 340 and windows shell 350 together comprise Microsoft Windows software (e.g., Windows 9x or Windows NT, available from Microsoft Corporation of Redmond, Washington).
  • OS 340 is the Unix operating system (e.g., the Linux operating system).
  • One application program 200 is a speech recognition system, according to the present invention, that uses colloquial language models, e.g., for spontaneous Cantonese speech, which will be described in further detail. While the invention is described in some detail with specific reference to preferred embodiments and certain alternatives, there is no intent to limit the invention to that particular embodiment or those specific alternatives.
  • the present invention may be built upon a standard ASP system, e.g., one that uses Hidden Markov models (HMMs), by adding the structures, method steps, and computations described in the present document.
  • ASP systems such as automated keyword spotting systems or automated speech recognition systems, and HMMs, are well known in the relevant art, and are described, for example, in the book, Fundamentals of Speech Recognition, by Lawrence Rabiner & Biing-Hwang Juang, published by Prentice Hall (Signal Processing Series), Englewood Cliffs NJ, 1993, ISBN 0-13-015157-2, hereinafter referred to as RABINER 93.
  • FIG. 2 is a schematic diagram, adapted from FIG. 8.7 of RABINER 93, of a typical continuous speech recognizer 400.
  • the following description of the typical continuous speech recognizer 400, and its processing, is adapted from description in Section 8.8 of RABINER 93.
  • the first step in the processing is spectral analysis in a spectral analysis module 405 to derive the feature vectors 410 used to characterize the spectral properties of speech input 415.
  • the second step in the recognizer 400 is a combined word-level/sentence-level match, in a matching module 417 that includes a word-level match module 420 and a sentence-level match module 425.
  • a set of word models 440 is created by a word model composition module 445.
  • the set of word models 440 is created by concatenating each of the subword unit HMMs as specified in the word lexicon 435, in a conventional manner.
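The concatenation step can be sketched as follows; the lexicon entries and the representation of a subword HMM as a plain list of state labels are illustrative assumptions, not the patent's actual data structures (a real system would also carry transition and emission parameters):

```python
# Hypothetical subword-unit HMMs: unit name -> list of HMM state labels.
# "g" is an initial part (3 states); "ong" is a final part (4 states).
subword_hmms = {
    "g": ["g_1", "g_2", "g_3"],
    "ong": ["ong_1", "ong_2", "ong_3", "ong_4"],
}

# Hypothetical word lexicon: word -> sequence of subword units.
word_lexicon = {
    "gong": ["g", "ong"],
}

def compose_word_models(lexicon, hmms):
    """Build each word model by concatenating the state sequences of
    its subword-unit HMMs, in the order the lexicon specifies."""
    word_models = {}
    for word, units in lexicon.items():
        states = []
        for unit in units:
            states.extend(hmms[unit])
        word_models[word] = states
    return word_models

word_models = compose_word_models(word_lexicon, subword_hmms)
```

The same composition applies for any lexicon size; only the unit inventory and entries change.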
  • the way in which the sentence-level match is done is via a finite state network (FSN) realization of a word grammar 450 (the syntax of the system) and semantics 455, as expressed in a composite FSN language model 460.
  • FSN finite state network
  • the implementation of the combined word-level match/sentence-level match is via any conventional manner, for example, via any of the structures described in Chapter 7 of RABINER 93.
  • systems use structures similar to the conventional frame synchronous level-building method (usually with some type of beam search to restrict the range of paths) to solve for the best recognized sentence 465 (the result).
  • the preferred dictation system, ultimately to be adapted for Cantonese Chinese, is based on phoneme continuous-density hidden Markov models, with 16 Gaussian mixtures per state.
  • Each subword unit is modeled, for Chinese, as an initial part and a final part.
  • the initial part is modeled using a 3-state left-to-right HMM with no state skips.
  • the final part is modeled using a 4-state HMM.
  • initial parts are modeled by right context-dependent models.
  • Final parts are modeled by context-independent models.
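A left-to-right HMM with no state skips permits only two moves from each state: stay, or advance to the next state. A sketch of that transition structure (the probability values are placeholders, not trained parameters):

```python
def left_to_right_transitions(n_states, self_loop=0.5):
    """Transition matrix for a left-to-right HMM with no state skips:
    from state i the model may only stay in i or move to i + 1."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i < n_states - 1:
            A[i][i] = self_loop
            A[i][i + 1] = 1.0 - self_loop
        else:
            A[i][i] = 1.0  # last state self-loops until the model exits
    return A

A_initial = left_to_right_transitions(3)  # 3-state model for initial parts
A_final = left_to_right_transitions(4)    # 4-state model for final parts
```

Note that every entry above the first superdiagonal is zero, which is exactly the "no state skips" constraint.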
  • the total units are 195 subword models, for Cantonese Chinese, which may be termed phone models.
  • the recognizer feature vector consists of the following 39 parameters: 12 Mel-warped frequency-based cepstral coefficients (MFCC), 12 delta cepstral coefficients, 12 delta-delta cepstral coefficients, energy, and the delta and delta-delta of the energy parameters.
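The 39 parameters arise from 13 static features (12 MFCCs plus energy) extended with their first ("delta") and second ("delta-delta") time differences. The sketch below uses a simple two-frame difference; practical front ends usually use a longer regression window (see, e.g., RABINER 93):

```python
def deltas(frames):
    """First-order time differences, using adjacent-frame differences
    with the edge frames repeated at the boundaries."""
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out

def full_feature_vectors(static):
    """static: list of 13-dim frames (12 MFCC + energy).
    Returns 39-dim frames: static + delta + delta-delta."""
    d = deltas(static)
    dd = deltas(d)
    return [s + di + ddi for s, di, ddi in zip(static, d, dd)]

frames = [[float(t)] * 13 for t in range(5)]  # toy 13-dim static frames
feats = full_feature_vectors(frames)
```

Computing the MFCCs themselves (mel filterbank plus cepstral transform) is omitted; only the stacking into 39 dimensions is shown.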
  • the initial units used for Cantonese are: b, c, d, f, g, gw, h, j, k, kw, l, m, n, ng, p, s, t, and z.
  • the final units used for Cantonese are: aa, aai, aak, aam, aan, aang, aap, aat, aau, ai, ak, am, an, ang, ap, at, au, e, ei, ek, en, eng, eoi, eon, eot, et, i, ik, im, in, ing, ip, it, iu, m, ng, o, oe, oek, oen, oeng, oi, ok, on, ong, ot, ou, u, ui, uk, un, ung, ut, yu, yun, and yut.
  • the language model includes a trained uni-gram distribution and a trained bi-gram distribution, per conventional practice.
  • the uni-gram distribution is the probability P(w1) that a given word (i.e., segment) in a sentence is actually the word w1 of the lexicon.
  • the bi-gram distribution is the probability P(w2 | w1) that a given word in a sentence is the word w2 of the lexicon, given that the preceding word is w1.
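Given a segmented corpus, both distributions can be estimated by simple counting. The sketch below is a minimal, unsmoothed version (a deployed system would smooth to handle unseen words and word pairs):

```python
from collections import Counter

def train_ngrams(sentences):
    """Estimate uni-gram P(w) and bi-gram P(w2 | w1) by counting over
    segmented sentences (each sentence is a list of words)."""
    uni = Counter()
    bi = Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words[:-1], words[1:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}
    # Conditional probability: count(w1, w2) / count(w1).
    p_bi = {(w1, w2): c / uni[w1] for (w1, w2), c in bi.items()}
    return p_uni, p_bi

# Toy segmented corpus.
corpus = [["I", "want", "news"], ["I", "want", "weather"]]
p_uni, p_bi = train_ngrams(corpus)
```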
  • the language model is established, according to the present invention, for the target language, as will be further discussed in a later section.
  • the preferred keyword spotting system is based on phoneme continuous density hidden Markov models.
  • 16 Gaussian mixtures are used per state in a keyword spotting system.
  • the keyword spotting system also includes a garbage model with conventional structure, for example, with five HMM states and 20 frame sizes, with 16 Gaussian mixtures per state.
  • Each subword unit is modeled in the same manner as discussed above for the dictation system.
  • the keyword spotting system is established for the target language, according to the present invention, as will be further discussed in a later section.
  • An automated dictation system for the target language is established by establishing a language model and an acoustic model according to the structure discussed in an earlier section. Establishing the language model will be discussed shortly in a later subsection. Given the language model, the acoustic models are established by being conventionally trained using acoustic training data (i.e., voice recordings) that are collected for the target language, namely, Cantonese. Collection of such acoustic training data will be further discussed in a later section.
  • FIG. 3 is a flow diagram that illustrates a method 500 for establishing a colloquial language model in the target language, namely, Cantonese.
  • language model parameters are trained for a baseline language, using training data for the baseline language.
  • the baseline language is preferably a written language.
  • the baseline language is preferably Mandarin.
  • the baseline language model parameters are preferably from a baseline language model, for Mandarin, that is trained using large numbers of Mandarin newspaper articles, as will be further discussed in a later section.
  • training data of a target language is obtained.
  • the target language is a colloquial language that is similar to the (written) baseline language (e.g, similar due to sharing some words and some syntax).
  • the target language is preferably Cantonese.
  • training data of the target language is preferably obtained from Cantonese online newsgroup articles, as will be further discussed in a later section. Preferably, at least about 12 million bytes of such training data is obtained; of course, more training data would be even better.
  • in a step 520, words of the target language are obtained that are not in the lexicon of the baseline language. For example, these words may be determined from the training data of the target language, as will be further discussed in a later section. Preferably, at least about 600 colloquial phrases or terms are determined.
  • the words are added to the lexicon of the baseline language to obtain an updated lexicon. For example, the at least 600 colloquial phrases or terms may be added to the lexicon as words.
  • in a step 530, language model parameters for the target language are trained based on the training data of the target language, for example, using the training data obtained for the target language and the updated lexicon. For example, a limited amount of training data of the target language is segmented using the updated lexicon, and then frequencies of occurrence of lexicon entries in the limited amount of training data are counted, according to conventional practice, to obtain uni-gram and bi-gram distributions, which are examples of N-gram distributions.
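Chinese text has no spaces between words, so segmenting with a lexicon is commonly done by greedy longest-match ("maximum matching"); whether the patent uses this exact scheme is not stated, so the following is a sketch under that assumption, with a toy lexicon over Latin letters standing in for Chinese characters:

```python
def segment(text, lexicon, max_len=6):
    """Greedy left-to-right longest-match segmentation: at each position,
    take the longest lexicon entry that matches; fall back to a single
    character for out-of-lexicon material."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"ab", "abc", "cd"}  # toy "updated lexicon"
```

After segmentation, the lexicon-entry frequencies are counted as in the uni-gram/bi-gram estimation shown earlier.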
  • a language model is determined for the target language based on the language model parameters for the baseline language and on the language model parameters determined in the step 530 for the target language. For example, the language model parameters determined in the step 530 for the target language are combined with the baseline language model for the baseline language to obtain a language model for the target language. For example, the combination is performed using language model adaptation by linear interpolation.
  • the language model parameters are trained for the preferred baseline language, Mandarin, using training data that includes Mandarin newspaper articles.
  • the language model parameters reflect Mandarin Chinese and form a baseline language model.
  • the training corpus size is about 40 million bytes.
  • the baseline language model includes uni-gram and bi-gram language models. These models are established by applying a Mandarin lexicon to segment the corpus, and then counting the uni-gram and bi-gram frequencies of occurrence within the corpus, in conventional manner.
  • the Mandarin lexicon may have, for example, roughly 36,000 entries.
  • a language model is determined for the target language.
  • the language model is preferably determined using linear interpolation language model adaptation.
  • the adapted uni-gram Padp(wi) is established using linear interpolation as follows: Padp(wi) = a · PM(wi) + b · PC(wi), where PM(wi) is the uni-gram of the baseline (Mandarin) corpus and PC(wi) is the uni-gram of the limited Cantonese corpus (newsgroups).
  • the determined uni-gram for the target language is Padp(wi). The determined bi-gram is likewise determined using linear interpolation: Padp(w2 | w1) = a · PM(w2 | w1) + b · PC(w2 | w1), where a and b are the linear combination factors; a may be set to 0.9 and b may be set to 0.1.
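This linear-interpolation adaptation can be sketched directly over uni-gram tables; probabilities absent from one model are treated as zero here, though a real system would smooth or back off (the toy probabilities below are illustrative only):

```python
def interpolate(p_baseline, p_target, a=0.9, b=0.1):
    """Linear-interpolation language model adaptation:
    P_adp(x) = a * P_baseline(x) + b * P_target(x)."""
    keys = set(p_baseline) | set(p_target)
    return {k: a * p_baseline.get(k, 0.0) + b * p_target.get(k, 0.0)
            for k in keys}

# Toy uni-grams: a baseline (Mandarin-like) model and a target model
# trained on the limited Cantonese corpus, including a colloquial word
# ("mou5", hypothetical) absent from the baseline.
p_mandarin = {"news": 0.6, "weather": 0.4}
p_cantonese = {"news": 0.2, "weather": 0.3, "mou5": 0.5}
p_adapted = interpolate(p_mandarin, p_cantonese)
```

Because b is small (0.1), the adapted model stays close to the well-trained baseline while still assigning nonzero probability to colloquial words seen only in the target corpus. The same function applies unchanged to bi-gram tables keyed by word pairs.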
  • collocations of existing words are evaluated as candidate new words.
  • a collocation is a pair of words which appear together significantly more frequently than would be expected by chance.
  • a collocation is accepted as a new word if its strength is higher than a predetermined threshold and its spread is also higher than a predetermined threshold.
  • the thresholds are chosen based on the preference of the system builder, for example, after simple hand tuning by the system builder using various thresholds.
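The patent does not give the exact strength and spread statistics, so the sketch below uses one common choice, in the spirit of Smadja's collocation work: strength as a PMI-style log-ratio of observed to chance co-occurrence, and spread as the variance of the per-offset co-occurrence counts (a pair locked to one relative position is sharply peaked at that offset and scores high):

```python
from collections import Counter
import math

def collocation_stats(sentences, w1, w2, window=5):
    """Score a candidate collocation (w1, w2) within a word window."""
    uni = Counter()
    offsets = []           # relative positions of w2 around w1
    n = 0
    for words in sentences:
        uni.update(words)
        n += len(words)
        for i, w in enumerate(words):
            if w != w1:
                continue
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i and words[j] == w2:
                    offsets.append(j - i)
    if not offsets:
        return 0.0, 0.0
    # Strength: log-ratio of observed co-occurrence rate to chance.
    p_joint = len(offsets) / n
    strength = math.log2(p_joint / ((uni[w1] / n) * (uni[w2] / n)))
    # Spread: variance of the per-offset counts across the window.
    counts = Counter(offsets)
    positions = [p for p in range(-window, window + 1) if p != 0]
    freqs = [counts.get(p, 0) for p in positions]
    mean_f = sum(freqs) / len(freqs)
    spread = sum((f - mean_f) ** 2 for f in freqs) / len(freqs)
    return strength, spread

# Toy corpus in which "hong" and "kong" always occur adjacently.
corpus = [["hong", "kong", "a"], ["hong", "kong", "b"], ["c", "hong", "kong"]]
strength, spread = collocation_stats(corpus, "hong", "kong")
```

A pair passing both thresholds would then be added to the lexicon as a new word, per the step above.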
  • the established automated dictation system for the target language may be used, for example, within a distributed speech recognition system, or other speech recognition system, for example, as discussed in the co-owned and co-pending U.S. patent application serial no. 09/613,472, filed on July 11, 2000 and entitled "SYSTEM AND METHODS FOR ACCEPTING USER INPUT IN A DISTRIBUTED ENVIRONMENT IN A SCALABLE MANNER", hereinafter referred to as the USER INPUT REFERENCE, which is hereby incorporated by reference in its entirety for all purposes.
  • the established automated dictation system, for use within the speech recognition system(s) of the USER INPUT REFERENCE, is preferably set up as discussed in the USER INPUT REFERENCE.
  • the keyword spotting system is established using the structures as discussed above.
  • a garbage model is established to absorb non-lexical word sounds such as "urn”, “ha”, “er”, and the like.
  • the garbage model is trained using Cantonese training data.
  • Filler phrases are determined from the Cantonese training data, preferably such training data that is collected as is further discussed below in a later section.
  • at least about 90 filler phrases are determined from the Cantonese training data that is collected via the Wizard-of-Oz collection system, as discussed below in a later section.
  • These filler phrases are modeled from the Cantonese training data.
  • at least about 600 keywords are modeled. These keywords preferably cover different domains, for example, domains including local or foreign news, weather information, travel information, computer games, restaurants, movies, education, and so forth for the keyword spotting task.
  • the established automated keyword spotting system for the target language may be used, for example, within a distributed speech recognition system, or other speech recognition system, for example, as discussed in the incorporated USER INPUT REFERENCE.
  • the established automated keyword spotting system, for use within the speech recognition system(s) of the USER INPUT REFERENCE, is preferably set up as discussed in the USER INPUT REFERENCE.
  • the established automated keyword spotting system for the target language may also be used, for example, within the utterance verification systems that are discussed in a U.S. patent application (serial no. not furnished at time of publication), hereinafter referred to as the UTTERANCE VERIFICATION REFERENCE, which is hereby incorporated by reference in its entirety for all purposes.
  • the established automated keyword spotting system, for use within the speech recognition system(s) of the UTTERANCE VERIFICATION REFERENCE, is preferably set up as discussed in the UTTERANCE VERIFICATION REFERENCE.
  • the speakers are told to speak naturally to the web browser to thereby surf the web for information.
  • the speakers' spontaneously spoken voice commands are recorded to form a spontaneous speech database.
  • the speakers can control the web browser, go to any link on the current page, or surf the Internet with their spontaneous speech input.
  • the web browser is controlled remotely by a human operator (the "Wizard") who directs web browser operations using a mouse and/or keyboard in response to the speakers' verbal instructions. Because the speakers are not aware of the existence of the operator and believe that a mere machine is responding to them, they give natural verbal commands.
  • the Wizard-of-Oz system uses at least eleven basic command keywords in both Chinese and English. Each speaker is instructed to surf using spoken requests. The speakers can continue to search for information on the web or read the web pages by spontaneously speaking.
  • the collected speech generally includes "garbage” and “filler” speech, characteristic of spontaneous speech.
  • Garbage speech includes extraneous utterances such as “um”, “ah”, “er”, short pauses, and out of vocabulary words.
  • Filler speech includes, for example, corrections and phrases such as “would you please …”, “I want to know about …”, and “I want …”.
  • the recording may be carried out using a noise-cancelling uni-directional microphone.
  • the format of the wave files may be as follows: 16 kHz sampling rate and 16 bits per sample. In one embodiment of the present invention, a total of 39 speakers (25 male and 14 female) were used. Each speaker spoke for approximately one hour. A total of 4,150 utterances were collected and transcribed. In general, of course, using even more training data would be even better.
  • the Wizard-of-Oz database collection system can be used to collect real spontaneous speech for analysis, i.e., for use in constructing a language model (bi-grams, tri-grams, and the like), but it takes a lot of effort to segment and to transcribe the data. Also, it is difficult to collect sufficient speech data for each specific task.
  • the newsgroup articles are up-to-date and include many colloquial phrases, new proper nouns or new compound words. For example, about six months of Hong Kong newsgroup data may be collected in various domains such as travel, music, computers, entertainment, politics, free talk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A system and method are provided. According to an embodiment of the invention, a method for establishing a language model to recognize speech of a first language comprises: obtaining language model parameters for a second language that is not the first language; obtaining training data for the first language; determining a language model based on the language model parameters for the second language and the training data for the target language (450); and making the language model available for use to handle speech of the first language (460).

Description

SYSTEM AND METHOD FOR SPEECH PROCESSING WITH LIMITED TRAINING DATA
RELATED APPLICATIONS The present application is related to, and claims the benefit of priority from, the following commonly-owned U.S. patent applications by the same inventors, the disclosures of which are hereby incorporated by reference in their entirety, including any incorporations-by-reference, appendices, or attachments thereof, for all purposes: serial no. 60/175,368, filed on January 10, 2000 and entitled SYSTEM AND METHODS FOR COLLOQUIAL LANGUAGE MODELING FOR SPONTANEOUS CANTONESE SPEECH; and serial no. <attorney docket number WIW-002.01>, filed on January 9, 2001 and entitled SYSTEM AND METHOD FOR SPEECH PROCESSING WITH LIMITED TRAINING DATA.
TECHNICAL FIELD The present invention relates to automated processing of speech. Embodiments of the present invention relate to automated keyword spotting, automated dictation from speech, or other automated speech recognition.
BACKGROUND OF THE INVENTION Automated speech processing (ASP) systems include, for example, automated keyword spotting systems, automated dictation systems, other automated speech recognition systems, and the like. Automated keyword spotting systems attempt to detect the presence or absence of particular word(s) or phrase(s) in an utterance of speech input, using data processing. Automated dictation systems attempt to convert an utterance of speech into its corresponding text, using data processing.
ASP systems typically include models or parameters that are established based on training data. Training data may include, for example, collections of text or recorded speech, for the language or dialect that is to be handled by the speech processing system. In general, insufficient training data, or training data that is not in some sense similar to the expected input speech for the ASP system, leads to poor system performance and/or to a less powerful speech processing system.
In general, for any given language, it costs significant money, effort, and other resources to collect or otherwise obtain sufficient and suitable training data from which to establish models or parameters. For languages or dialects (for example, English or Mandarin Chinese) that are used by great numbers of people, the cost of obtaining sufficient and suitable training data is typically justified by the existence of a large potential user base for the speech processing system that is contemplated. For other languages or dialects (for example, Cantonese Chinese) that are used by relatively few people, the cost of obtaining sufficient and suitable training data is typically more difficult to justify. Furthermore, for languages or dialects (for example, Cantonese Chinese) that are primarily spoken languages and are used relatively infrequently as written languages, the cost of obtaining sufficient and suitable training text data is typically especially high. For the above reasons, there tend to be fewer, if any, "state of the art" speech processing systems that can handle certain less popular languages or dialects, especially when such languages and dialects are primarily spoken languages.
SUMMARY OF THE INVENTION According to an embodiment of the invention, a method for establishing a language model to recognize speech of a first language comprises: obtaining language model parameters for a second language that is not the first language; obtaining training data for the first language; determining a language model based on the language model parameters for the second language and the training data for the target language; and making the language model available for use to handle speech of the first language.
According to another embodiment of the invention, a speech processing system includes a processor, a memory, a language model established by the above- described method, and speech recognition software configured to use the language model for recognizing speech.
According to another embodiment of the invention, a speech processing system includes: a processor; memory coupled to the processor; and logic stored in the memory configured to control the processor to: obtain training data for the first language; determine a language model based on the training data for the target language and on an existing language model for a second language; wherein the language model is for use in recognition of speech of the first language.
According to another embodiment of the invention, a speech processing system that is capable of recognizing words of a primarily colloquial language in an input utterance includes: a model for a word; a model for filler phrases, wherein the model for filler phrases was trained using online newsgroup articles; and speech recognition software configured to recognize an input utterance based on the model for a word and on the model for filler phrases.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is a block diagram of a computer system in which the present invention may be embodied.
FIG. 1B is a block diagram of a software system of the present invention for controlling operation of the system of FIG. 1A.
FIG. 2 is a schematic diagram of a typical continuous speech recognizer with which the present invention may be implemented.
FIG. 3 is a flow diagram that illustrates a method for establishing a colloquial language model in a target language.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
The following description will focus on the currently-preferred embodiment of the present invention, which is operative in an environment typically including desktop computers, server computers, and portable computing devices, occasionally or permanently connected to one another. The currently-preferred embodiment of the present invention may be implemented in an application operating in an Internet-connected environment and running under an operating system, such as the Microsoft® Windows operating system, on an IBM-compatible Personal Computer (PC) configured as an Internet server. The present invention, however, is not limited to any particular environment, device, or application. Instead, those skilled in the art will find that the present invention may be advantageously applied to any environment. For example, the present invention may be advantageously embodied on a variety of different platforms, including Macintosh, Linux, EPOC, BeOS, Solaris, UNIX, NextStep, and the like. For another example, although the following description will describe preferred embodiments that are adapted for the Cantonese Chinese language, the invention itself is not limited to the Cantonese Chinese language, and indeed may be embodied for other languages or dialects. The description of the exemplary embodiments which follows is, therefore, for the purpose of illustration and not limitation.
I. Overview
As described in the Background section, it is sometimes difficult to justify the expense of obtaining such sufficient amounts of suitable training data for establishing a high quality speech processing system for a language or dialect, especially for a language or dialect that has relatively few users or for a language or dialect that is primarily a spoken language. What is needed are means and methods by which the amount of resources needed for obtaining such sufficient amounts of suitable training data is reduced. For example, what is needed are means and methods by which collection of such training data is made easier and means and methods by which the amount of such training data, and therefore its cost, is reduced for establishing an ASP system for a language or dialect. The present invention satisfies these and other needs.
The present invention is especially suited to establishing an ASP system for a "target" language or dialect (for example, Cantonese Chinese) that is similar to, or is a dialect of, a "baseline" language or dialect (for example, Mandarin Chinese) for which portions of an ASP system (e.g., a language-model portion) have already been established. The preferred approach of the present invention is to start with an existing baseline portion of an actual or hypothetical ASP system for the baseline language and to modify the baseline portion for use with the target language, based on a limited amount of training data that has been obtained for the target language. The baseline portion may include, for example, models or parameters of the type used for an ASP system. These models or parameters may be, for example, models and parameters of a language model for the baseline language. For establishing an ASP system that can handle spontaneous speech, for a language that is primarily a spoken language, the limited amount of training data is preferably collected at least in part from online newsgroup articles, which tend to use colloquial vernacular, as opposed to newspaper articles, which tend to use more formal written language style.
In the preferred embodiment of the present invention, Cantonese Chinese, or the like, is the target language, and Mandarin Chinese, or the like, is the baseline language. The Chinese language includes four language groups (Mandarin, Cantonese, Fukienese, and Hakka) and many dialects. These four groups may be considered to be four languages. A dialect is a language variety used by a particular population of speakers. While there are some similarities between the Chinese languages and dialects, there are many differences. In the present document, two Chinese languages, Cantonese and Mandarin, will be discussed in detail. Spontaneous speech consists of colloquial words, corrections, hesitations, disfluency, short pauses, ill-formed grammar, and words that are not generally listed in the standard Chinese lexicon.
II. System Hardware
The present invention may be embodied on an information processing system such as the system 300 of FIG. 1A, which comprises a central processor 301, a main memory 302, an input/output (I/O) controller 303, a keyboard 304, a pointing device 305 (e.g., a mouse, pen device, or the like), a screen or display device 306, a mass storage 307 (e.g., hard disk, removable floppy disk, optical disk, magneto-optical disk, or flash memory, etc.), an audio input device 308 (e.g., a microphone, e.g., as found on a telephone that is coupled to the bus system 310), and an interface 309. Although not shown separately, a real-time system clock is included with the system 300, in a conventional manner. The various components of the system 300 communicate through a system bus 310 or similar architecture. In addition, the system 300 may communicate with other devices through the interface or communication port 309, which may be an RS-232 serial port or the like. Devices which will be commonly connected, occasionally or on a full-time basis, to the interface 309 include a network 351 (e.g., LANs or the Internet), a laptop 352, a handheld organizer 354 (e.g., the Palm organizer, available from Palm Computing, Inc., a subsidiary of 3Com Corp. of Santa Clara, California), a modem 353, and the like. In operation, program logic (implementing the methodology described below) is loaded from the storage device or mass storage 307 into the main memory 302, for execution by the processor 301. During operation of the program (logic), the user enters commands and data through (a) the keyboard 304, (b) the pointing device 305, which is typically a mouse, a track ball, or the like, and/or (c) the audio input device by voice input, and/or (d) the like. The computer system displays text and/or graphic images and other data on the display device 306, such as a cathode-ray tube or an LCD display.
A hard copy of the displayed information, or other information within the system 300, may be printed to other output devices (e.g., a printer), not shown, which would be connected to the bus system 310. In a preferred embodiment, the computer system 300 includes an IBM PC-compatible personal computer (available from a variety of vendors, including IBM of Armonk, New York) running a Unix operating system (e.g., Linux, which is available from Red Hat Software, of Durham, North Carolina, U.S.A.). In a preferred embodiment, the system 300 is an Internet or intranet or other type of network server, e.g., one connected to a worldwide publicly accessible communication network, and receives input from (e.g., digitized audio voice input), and sends output to, a remote user via the interface 309 according to standard techniques and protocols.
III. System Software
As illustrated in FIG. 1B, a computer software system 320 is provided for directing the operation of the computer system 300. Software system 320, which is stored in system memory 302 and on storage (e.g., disk memory) 307, includes a kernel or operating system (OS) 340 and a windows shell 350. One or more application programs, such as client application software or "programs" 345 may be "loaded" (i.e., transferred from storage 307 into memory 302) for execution by the system 300. System 320 includes a user interface (UI) 360, preferably a Graphical User Interface (GUI), for receiving user commands and data and for producing output to the user. These inputs, in turn, may be acted upon by the system 300 in accordance with instructions from operating system module 340, windows module 350, and/or client application module(s) 345. The UI 360 also serves to display user prompts and results of operation from the OS 340, windows 350, and application(s) 345, whereupon the user may supply additional inputs or terminate the session. In a specific embodiment, OS 340 and windows 350 together comprise Microsoft Windows software (e.g., Windows 9x or Windows NT, available from Microsoft Corporation of Redmond, Washington). In the preferred embodiment, OS 340 is the Unix operating system (e.g., the Linux operating system). Although shown conceptually as a separate module, the UI is typically provided by interaction of the application modules with the windows shell and the OS 340. One application program 200 is a speech recognition system, according to the present invention, that uses colloquial language models, e.g., for spontaneous Cantonese speech, which will be described in further detail. While the invention is described in some detail with specific reference to preferred embodiments and certain alternatives, there is no intent to limit the invention to that particular embodiment or those specific alternatives.
IV. System Structure
A. Overview
The present invention may be built upon a standard ASP system, e.g., one that uses Hidden Markov Models (HMMs), by adding the structures, method steps, and computations described in the present document. ASP systems such as automated keyword spotting systems or automated speech recognition systems, and HMMs, are well known in the relevant art, and are described, for example, in the book Fundamentals of Speech Recognition, by Lawrence Rabiner & Biing-Hwang Juang, published by Prentice Hall (Signal Processing Series), Englewood Cliffs NJ, 1993, ISBN 0-13-015157-2, hereinafter referred to as RABINER 93.
B. The Dictation System
1. Overview
The dictation system of the preferred embodiment of the present invention is built along the lines shown in FIG. 2. FIG. 2 is a schematic diagram, adapted from FIG. 8.7 of RABINER 93, of a typical continuous speech recognizer 400. The following description of the typical continuous speech recognizer 400, and its processing, is adapted from description in Section 8.8 of RABINER 93. The first step in the processing is spectral analysis in a spectral analysis module 405 to derive the feature vectors 410 used to characterize the spectral properties of speech input 415. The second step in the recognizer 400 is a combined word-level/sentence-level match, in a matching module 417 that includes a word-level match module 420 and a sentence-level match module 425. The way this combined match is accomplished is as follows. Using a set of subword HMMs 430 and a word lexicon 435, a set of word models 440 is created by a word model composition module 445. The set of word models 440 is created by concatenating each of the subword unit HMMs as specified in the word lexicon 435, in a conventional manner. The way in which the sentence-level match is done is via a finite state network (FSN) realization of a word grammar 450 (the syntax of the system) and semantics 455, as expressed in a composite FSN language model 460. The implementation of the combined word-level match/sentence-level match is via any conventional manner, for example, via any of the structures described in Chapter 7 of RABINER 93. Typically, systems use structures similar to the conventional frame synchronous level-building method (usually with some type of beam search to restrict the range of paths) to solve for the best recognized sentence 465 (the result).
2. Acoustic Model
The preferred dictation system, ultimately to be adapted for Cantonese Chinese, is based on phoneme continuous density hidden Markov models, with 16 Gaussian mixtures per state. Each subword unit is modeled, for Chinese, as an initial part and a final part. The initial part is modeled using a 3-state left-to-right HMM with no state skips. The final part is modeled using a 4-state HMM. There are 19 initial parts and 54 final parts if the tone information is ignored, for Cantonese Chinese. In the system, initial parts are modeled by right context-dependent models. Final parts are modeled by context-independent models. In total, there are 195 subword models for Cantonese Chinese, which may be termed phone models.
The recognizer feature vector consists of the following 39 parameters: 12 Mel-warped frequency-based cepstral coefficients (MFCC), 12 delta cepstral coefficients, 12 delta-delta cepstral coefficients, energy, and the delta and delta-delta of the energy parameters. The initial units used for Cantonese are: b, c, d, f, g, gw, h, j, k, kw, l, m, n, ng, p, s, t, and z. The final units used for Cantonese are: aa, aai, aak, aam, aan, aang, aap, aat, aau, ai, ak, am, an, ang, ap, at, au, e, ei, ek, en, eng, eoi, eon, eot, et, i, ik, im, in, ing, ip, it, iu, m, ng, o, oe, oek, oen, oeng, oi, ok, on, ong, ot, ou, u, ui, uk, un, ung, ut, yu, yun, and yut.
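The delta and delta-delta parameters are time derivatives of the corresponding static features. The following sketch shows one way such a 39-dimensional vector could be assembled; the two-frame central difference used for the derivatives, and the grouping of the 13 static values (12 MFCC plus energy) ahead of their deltas, are illustrative assumptions rather than the exact layout of the preferred embodiment:

```python
def deltas(frames):
    """First-order time derivatives via a simple two-frame central difference.

    frames: list of per-frame feature vectors (lists of floats).
    The first and last frames are repeated as padding at the edges.
    """
    padded = [frames[0]] + frames + [frames[-1]]
    return [[(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
            for t in range(len(frames))]

def full_feature_vectors(mfcc, energy):
    """Assemble 39-dimensional vectors: 12 MFCC + energy (13 static values),
    plus their deltas (13) and delta-deltas (13)."""
    static = [c[:12] + [e] for c, e in zip(mfcc, energy)]
    d = deltas(static)
    dd = deltas(d)
    return [s + di + ddi for s, di, ddi in zip(static, d, dd)]
```

Each input frame thus expands from 13 numbers to the 39 recognizer parameters listed above.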
3. Language Model
The language model includes a trained uni-gram distribution and a trained bi-gram distribution, per conventional practice. The uni-gram distribution is the probability P(w1) that a given word (i.e., segment) in a sentence is actually the word w1 of the lexicon. The bi-gram distribution is the probability P(w2|w1), which is the probability that a given word (i.e., segment) in a sentence is the word w2 of the lexicon, given that the word (i.e., segment) was preceded in the sentence by the word w1 of the lexicon. The language model is established, according to the present invention, for the target language, as will be further discussed in a later section.
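As a minimal sketch of how such distributions are trained by relative frequency over an already-segmented corpus (real systems would add smoothing for unseen words, omitted here for brevity):

```python
from collections import Counter

def train_ngrams(sentences):
    """Estimate uni-gram P(w) and bi-gram P(w2|w1) by relative frequency.

    sentences: lists of already-segmented words.
    Returns (unigram, bigram); bigram maps (w1, w2) -> P(w2|w1).
    Note: sentence-final words are counted as histories too, a small
    simplification versus tracking history counts separately.
    """
    uni, bi = Counter(), Counter()
    for words in sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    total = sum(uni.values())
    unigram = {w: n / total for w, n in uni.items()}
    bigram = {(w1, w2): n / uni[w1] for (w1, w2), n in bi.items()}
    return unigram, bigram
```

These are the uni-gram and bi-gram "language model parameters" referred to throughout the present document.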
C. Keyword Spotting System
Like the preferred dictation system, the preferred keyword spotting system is based on phoneme continuous density hidden Markov models. In an embodiment of the invention, 16 Gaussian mixtures are used per state in a keyword spotting system. The keyword spotting system also includes a garbage model with conventional structure, for example, with five HMM states and 20 frame sizes, with 16 Gaussian mixtures per state. Each subword unit is modeled in the same manner as discussed above for the dictation system. The keyword spotting system is established for the target language, according to the present invention, as will be further discussed in a later section.
V. Establishing an Automated Dictation System for the Target Language
A. Establishing an Acoustic Model
An automated dictation system for the target language is established by establishing a language model and an acoustic model according to the structure discussed in an earlier section. Establishing the language model will be discussed shortly in a later subsection. Given the language model, the acoustic models are established by being conventionally trained using acoustic training data (i.e., voice recordings) that are collected for the target language, namely, Cantonese. Collection of such acoustic training data will be further discussed in a later section.
B. Establishing the Language Model for the Target Language
1. Overview
From analysis of Cantonese data collected as discussed in a later section, it is seen that content words in Mandarin and Cantonese are very similar, but filler or colloquial phrases (defined in a later section) are very different. To establish an automated dictation system for Cantonese, the target language, it is necessary to solve the following two problems. The first problem is that some colloquial or filler words in the Cantonese corpus never appear in a Mandarin corpus nor in a traditional Chinese lexicon. The second problem is that the spoken syntax of Cantonese and Mandarin are different, and thus the "correct" N-gram counts for Cantonese and Mandarin should be different.
In order to train an effective language model for colloquial Cantonese speech, a lot of Cantonese text data is required. For Mandarin, there is a sufficient text database to train a language model, but for Cantonese, there is not sufficient Cantonese text data. To solve the above problem, a conventional baseline Mandarin language model is obtained and is then adapted to form a colloquial Cantonese language model. FIG. 3 is a flow diagram that illustrates a method 500 for establishing a colloquial language model in the target language, namely, Cantonese. In a step 510, language model parameters are trained for a baseline language, using training data for the baseline language. For example, the baseline language is preferably a written language. For example, the baseline language is preferably Mandarin. For example, the baseline language model parameters are preferably from a baseline language model, for Mandarin, that is trained using large numbers of Mandarin newspaper articles, as will be further discussed in a later section. In a step 515, training data of a target language is obtained. For example, the target language is a colloquial language that is similar to the (written) baseline language (e.g., similar due to sharing some words and some syntax). For example, the target language is preferably Cantonese. For example, training data of the target language is preferably obtained from Cantonese online newsgroup articles, as will be further discussed in a later section. Preferably, at least about 12 million bytes of such training data is obtained; of course, more training data would be even better. However, thanks to the present invention, even less than 12 million bytes of such training data is sufficient to establish an ASP with satisfactory performance. In a step 520, words of the target language are obtained that are not in the lexicon of the baseline language.
For example, these words may be determined from the training data of the target language, as will be further discussed in a later section. Preferably, at least about 600 colloquial phrases or terms are determined. In a step 525, the words are added to the lexicon of the baseline language to obtain an updated lexicon. For example, the at least 600 colloquial phrases or terms may be added to the lexicon as words. In a step 530, language model parameters for the target language are trained based on the training data of the target language, for example, using the training data of the target language (obtained in the step 515) and the updated lexicon (from the step 525). For example, a limited amount of training data of the target language is segmented using the updated lexicon and then frequencies of occurrence of lexicon entries in the limited amount of training data are counted, according to conventional practice, to obtain uni-gram and bi-gram distributions, which are examples of N-gram distributions. In a step 535, a language model is determined for the target language based on the language model parameters for the baseline language and on the language model parameters determined in the step 530 for the target language. For example, the language model parameters determined in the step 530 for the target language are combined with the baseline language model for the baseline language to obtain a language model for the target language. For example, the combination is performed using language model adaptation by linear interpolation.
2. Training a Baseline Language Model, for the Baseline Language
As mentioned above, in the step 510, the language model parameters are trained for the preferred baseline language, Mandarin, using training data that includes
Hong Kong newspaper articles which are written in formal Chinese (Mandarin). The language model parameters reflect Mandarin Chinese and form a baseline language model.
In an embodiment of the invention, the training corpus size is about 40 million bytes.
Generally, an even larger corpus would be still better. Per conventional practice, the baseline language model includes uni-gram and bi-gram language models. These models are established by applying a Mandarin lexicon to segment the corpus, and then counting the uni-gram and bi-gram frequencies of occurrence within the corpus, in conventional manner. The Mandarin lexicon may have, for example, roughly 36,000 entries.
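The lexicon-based segmentation step can be sketched as a greedy forward maximum-matching pass. This is an assumed, simplified stand-in for whatever conventional segmenter is actually used; in the example, Latin letters stand in for Chinese characters:

```python
def segment(text, lexicon, max_len=6):
    """Greedy forward maximum-matching segmentation against a fixed lexicon.

    At each position, take the longest lexicon entry that matches;
    otherwise fall back to a single character (a candidate OOV item).
    """
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon or n == 1:
                words.append(text[i:i + n])
                i += n
                break
    return words
```

For instance, `segment("abcd", {"ab", "abc", "d"})` yields `["abc", "d"]`: the longer entry "abc" wins over "ab". The uni-gram and bi-gram frequencies are then counted over the segmented output.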
3. Adapting the Language Model for the Target Language
As mentioned above, in the step 535, a language model is determined for the target language. The language model is preferably determined using linear interpolation language model adaptation. The adapted uni-gram Padp(wi) is established as follows:

Padp(wi) = λ1 PM(wi) + (1 − λ1) PC(wi)

where λ1 and (1 − λ1) are the combination factors; PM(wi) is the uni-gram of the large Mandarin corpus (newspapers); PC(wi) is the uni-gram of the limited Cantonese corpus (newsgroups); and 0 ≤ λ1 ≤ 1.

The adapted bi-gram Padp(wi|wj) is established as follows:

Padp(wi|wj) = λ2 PM(wi|wj) + (1 − λ2) PC(wi|wj)

where λ2 and (1 − λ2) are the combination factors; PM(wi|wj) is the bi-gram of the large Mandarin corpus; PC(wi|wj) is the bi-gram of the limited Cantonese corpus; and 0 ≤ λ2 ≤ 1.

The determined uni-gram Pfinal(wi) is Padp(wi). The determined bi-gram Pfinal(wi|wj) is determined using linear interpolation as follows:

Pfinal(wi|wj) = a Padp(wi|wj) + b Padp(wi)

where a and b are the linear combination factors, and a may be set to 0.9 and b may be set to 0.1.
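The linear-interpolation adaptation described above can be sketched in a few lines. The dictionary representation is an assumption for illustration, and the combination factors λ1 and λ2 are free parameters not fixed by the text, while a = 0.9 and b = 0.1 follow the values given above:

```python
def interpolate(p_baseline, p_target, lam):
    """Padp(x) = lam * PM(x) + (1 - lam) * PC(x), with 0 <= lam <= 1.

    p_baseline / p_target: dicts mapping an N-gram key to its probability.
    """
    keys = set(p_baseline) | set(p_target)
    return {k: lam * p_baseline.get(k, 0.0) + (1 - lam) * p_target.get(k, 0.0)
            for k in keys}

def final_bigram(adp_bigram, adp_unigram, a=0.9, b=0.1):
    """Pfinal(wi|wj) = a * Padp(wi|wj) + b * Padp(wi).

    adp_bigram maps (wj, wi) -> Padp(wi|wj); adp_unigram maps wi -> Padp(wi).
    """
    return {(wj, wi): a * p + b * adp_unigram.get(wi, 0.0)
            for (wj, wi), p in adp_bigram.items()}
```

The same `interpolate` helper serves for both the uni-gram adaptation (keys are words) and the bi-gram adaptation (keys are word pairs).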
C. Determining New Words to Add to the Baseline Lexicon
As Chinese words are not separated by spaces, the newsgroup articles are segmented based on the Mandarin lexicon. After segmentation, unknown Chinese characters and single characters that appear with a high frequency are used as candidates for determining new words, i.e., out-of-vocabulary (OOV) words. A statistical tool, such as CXtract, may be used to determine longer OOV words automatically. CXtract is further discussed in Pascale FUNG and Dekai WU, "Statistical Augmentation of a Chinese Machine-Readable Dictionary", in WVLC-2, Second Annual Workshop on Very Large Corpora (COLING-94), 1994, which is incorporated by reference in its entirety for all purposes. Terms that are longer than a single Chinese character are identified using statistical methods. For example, collocations of existing words, e.g., of existing single-character words, are evaluated as candidate new words. A collocation is a pair of words which appear together significantly more frequently than would be expected by chance. A collocation is accepted as a new word if its strength is higher than a predetermined threshold and its spread is also higher than a predetermined threshold. The thresholds are chosen based on the preference of the system builder, for example, after simple hand tuning by the system builder using various thresholds.
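A rough sketch of such candidate extraction follows. The strength and spread statistics used here (a z-score of a pair's frequency against all pairs, and a count of distinct documents containing the pair) are simplified stand-ins chosen for illustration, not CXtract's exact definitions:

```python
from collections import Counter
from math import sqrt

def candidate_collocations(docs, strength_min=1.0, spread_min=2):
    """Propose new-word candidates from adjacent-unit pairs.

    docs: list of documents, each a list of single characters/words.
    strength: how many standard deviations a pair's corpus frequency
    sits above the mean pair frequency.
    spread (simplified): number of distinct documents containing the pair.
    """
    pair_freq, pair_docs = Counter(), {}
    for d, doc in enumerate(docs):
        for pair in zip(doc, doc[1:]):
            pair_freq[pair] += 1
            pair_docs.setdefault(pair, set()).add(d)
    freqs = list(pair_freq.values())
    mean = sum(freqs) / len(freqs)
    std = sqrt(sum((f - mean) ** 2 for f in freqs) / len(freqs)) or 1.0
    return [p for p, f in pair_freq.items()
            if (f - mean) / std > strength_min and len(pair_docs[p]) >= spread_min]
```

Accepted pairs would then be merged into single lexicon entries; iterating the process can grow candidates longer than two characters.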
D. Using the Established Automated Dictation System
The established automated dictation system for the target language may be used, for example, within a distributed speech recognition system, or other speech recognition system, for example, as discussed in the co-owned and co-pending U.S. patent application serial no. 09/613,472, filed on July 11, 2000 and entitled "SYSTEM AND METHODS FOR ACCEPTING USER INPUT IN A DISTRIBUTED ENVIRONMENT IN A SCALABLE MANNER", hereinafter referred to as the USER INPUT REFERENCE, which is hereby incorporated by reference in its entirety for all purposes. The established automated dictation system, for use within the speech recognition system(s) of the USER INPUT REFERENCE, is preferably set up as discussed in the USER INPUT REFERENCE.
VI. Establishing an Automated Keyword Spotting System for the Target Language
The keyword spotting system is established using the structures as discussed above. A garbage model is established to absorb non-lexical word sounds such as "um", "ha", "er", and the like. The garbage model is trained using Cantonese training data. Filler phrases are determined from the Cantonese training data, preferably such training data that is collected as is further discussed below in a later section. Preferably, at least about 90 filler phrases are determined from the Cantonese training data that is collected via the Wizard-of-Oz collection system, as discussed below in a later section. These filler phrases are modeled from the Cantonese training data. Preferably, at least about 600 keywords are modeled. These keywords preferably cover different domains, for example, domains including local or foreign news, weather information, travel information, computer games, restaurants, movies, education, and so forth for the keyword spotting task.
The established automated keyword spotting system for the target language may be used, for example, within a distributed speech recognition system, or other speech recognition system, for example, as discussed in the incorporated USER INPUT REFERENCE. The established automated keyword spotting system, for use within the speech recognition system(s) of the USER INPUT REFERENCE, is preferably set up as discussed in the USER INPUT REFERENCE. The established automated keyword spotting system for the target language may also be used, for example, within the utterance verification systems that are discussed in U.S. patent application serial no. , attorney docket number WIW-001.01, filed on January 9, 2001 and entitled "SYSTEM AND METHOD FOR UTTERANCE VERIFICATION OF CHINESE LONG AND SHORT KEYWORDS", hereinafter referred to as the UTTERANCE VERIFICATION REFERENCE, which is hereby incorporated by reference in its entirety for all purposes. The established automated keyword spotting system, for use within the speech recognition system(s) of the UTTERANCE VERIFICATION REFERENCE, is preferably set up as discussed in the UTTERANCE VERIFICATION REFERENCE.
VII. Collection of a Limited Amount of Target-language Training Data
A. Speech Training Data: "Wizard of Oz"
To develop a spontaneous speech understanding system, a spontaneous speech database is collected for modeling the characteristics of spontaneous speech. It would be better to collect phrases from real life situations, but doing so is not easy. If a written script is given to a person, and the person is asked to speak naturally, then the resulting utterance would not be natural and would not capture true spontaneous speech characteristics. To overcome this problem, a semi-automatic database collection system is set up using a conventional Wizard-of-Oz scheme, adapted to collect spontaneous spoken commands to an envisioned voice-controlled web browser. A web browser is presented to human speakers. The speakers are told that the browser automatically responds to voice commands. The speakers are told to speak naturally to the web browser to thereby surf the web for information. The speakers' spontaneously spoken voice commands are recorded to form a spontaneous speech database. The speakers can control the web browser, go to any link on the current page, or surf the Internet with their spontaneous speech input. In reality, the web browser is controlled remotely by a human operator (the "Wizard") who directs web browser operations using a mouse and/or keyboard in response to the speakers' verbal instructions. Because the speakers are not aware of the existence of the operator and believe that a mere machine is responding to them, they give natural verbal commands.
In an embodiment of the invention, the Wizard-of-Oz system uses at least eleven basic command keywords in both Chinese and English. Each speaker is instructed to surf using spoken requests. The speakers can continue to search for information on the web or read the web pages by spontaneously speaking. The collected speech generally includes "garbage" and "filler" speech, characteristic of spontaneous speech. Garbage speech includes extraneous utterances such as "um", "ah", "er", short pauses, and out-of-vocabulary words. Filler speech includes, for example, corrections, and phrases such as "would you please ...", "I want to know about ...", and "I want ...".
The recording may be carried out using a noise-cancelling uni-directional microphone. The format of the wave files may be as follows: 16 kHz sampling rate and 16 bits per sample. In one embodiment of the present invention, a total of 39 speakers (25 male and 14 female) was used. Each speaker spoke for approximately one hour. A total of 4,150 utterances were collected and transcribed. In general, of course, using even more training data would be even better.
B. Text Training Data: Online Discussion Groups
The Wizard-of-Oz database collection system can be used to collect real spontaneous speech for analysis, i.e., for use in constructing a language model (bi-grams, tri-grams, and the like), but it takes a lot of effort to segment and to transcribe the data. Also, it is difficult to collect sufficient speech data for each specific task.
In order to cover more topics, a huge amount of data is desired. It is more convenient to collect text-based data on various topics than to collect speech data. Chinese Cantonese, primarily a spoken language, is very different from Chinese Mandarin, which is a written language as well as a spoken language. Local (Hong Kong) newspapers are not suitable for modeling the Cantonese language because such newspapers primarily use formal Chinese Mandarin. There is not enough Cantonese text available. However, with the advent of online newsgroup articles, Cantonese text can be collected. In most newsgroup articles and FAQs (Frequently Asked Questions), people usually ask questions or express their views using written forms of colloquial language. Moreover, the written styles are very similar to language used in real life situations. The newsgroup articles are up-to-date and include many colloquial phrases, new proper nouns or new compound words. For example, about six months of Hong Kong newsgroup data may be collected in various domains such as travel, music, computers, entertainment, politics, free talk, and the like.
VIII. Further Comments
While the invention is described in some detail with specific reference to preferred embodiments and certain alternatives, there is no intent to limit the invention to those particular embodiments or specific alternatives. Thus, the true scope of the present invention is not limited to any one of the foregoing exemplary embodiments but is instead defined by the appended claims.

Claims

WHAT IS CLAIMED IS:
1. In a data processing environment, a method for establishing a language model to recognize speech of a first language, the method comprising: obtaining language model parameters for a second language that is not the first language; obtaining training data for the first language; determining a language model based on the language model parameters for the second language and the training data for the first language; and making the language model available for use to handle speech of the first language.
2. The method of claim 1 wherein the step of obtaining training data for the first language comprises collecting text from online discussion group articles, wherein the training data is reflective of colloquial language usage.
3. The method of claim 1 wherein the training data for the first language is limited, and performing speech recognition using the language model leads to better recognition performance than would be obtained by using another language model that is determined using the training data for the first language without using the language model parameters for the second language.
4. The method of claim 3 wherein the training data for the first language consists of no more than 12 million bytes of text.
5. The method of claim 1 wherein the first language is Cantonese Chinese, and the second language is Mandarin Chinese.
6. The method of claim 1 wherein the first language is more of a colloquial language than a written language.
7. The method of claim 1 wherein the determining step comprises training language model parameters for the first language based on the training data of the first language.
8. The method of claim 7 wherein the determining step further comprises combining the language model parameters for the first language with language model parameters for the second language.
9. The method of claim 8 wherein the combining step comprises linearly interpolating N-grams parameters.
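By way of illustration only, the linear interpolation of N-gram parameters recited in claim 9 can be sketched as combining two probability tables with a mixing weight. The weight value and the toy bigram probabilities below are assumptions for the example, not values from the specification.

```python
def interpolate(p_first, p_second, lam=0.5):
    """Linearly interpolate two n-gram probability tables:
    P(w|h) = lam * P_first(w|h) + (1 - lam) * P_second(w|h).
    An n-gram absent from one table contributes probability 0 there."""
    keys = set(p_first) | set(p_second)
    return {k: lam * p_first.get(k, 0.0) + (1 - lam) * p_second.get(k, 0.0)
            for k in keys}

# Toy bigram probabilities for a first-language (Cantonese) model trained
# on limited data and a second-language (Mandarin) model.
cantonese = {("ngo", "soeng"): 0.4}
mandarin = {("ngo", "soeng"): 0.1, ("wo", "xiang"): 0.3}
combined = interpolate(cantonese, mandarin, lam=0.7)
print(round(combined[("ngo", "soeng")], 2))  # 0.31
```

A larger weight on the first-language model favors the limited in-language training data while still backing off to the second-language statistics for n-grams it never observed.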
10. The method of claim 8 wherein the determining step further comprises identifying words in the training data that are not in a lexicon of the second language; and adding the words to the lexicon of the second language to obtain an updated lexicon.
11. The method of claim 7 wherein the language model parameters for the second language comprise a language model for the second language, and the determining step further comprises adapting the language model for the second language to reflect the first language.
12. The method of claim 1 further comprising: receiving input speech; evaluating the input speech using the language model; and recognizing the input speech based on result of the evaluating step.
13. A speech processing system comprising a processor, a memory, a language model established by the method of claim 1, and speech recognition software configured to use the language model for recognizing speech.
14. The method of claim 1 wherein the first and second languages have words in common.
15. A speech processing system comprising:
a processor;
memory coupled to the processor; and
logic stored in the memory configured to control the processor to:
obtain training data for a first language; and
determine a language model based on the training data for the first language and on an existing language model for a second language;
wherein the language model is for use in recognition of speech of the first language.
16. A speech processing system that is capable of recognizing words of a primarily colloquial language in an input utterance, the system comprising:
a model for a word;
a model for filler phrases, wherein the model for filler phrases was trained using online discussion articles; and
speech recognition software configured to recognize an input utterance based on the model for a word and on the model for filler phrases.
PCT/US2001/001110 2000-01-10 2001-01-10 System and method for speech processing with limited training data WO2001052238A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US17536800P 2000-01-10 2000-01-10
US60/175,368 2000-01-10
US75803001A 2001-01-09 2001-01-09
US09/758,030 2001-01-09

Publications (1)

Publication Number Publication Date
WO2001052238A1 true WO2001052238A1 (en) 2001-07-19

Family

ID=26871141

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/001110 WO2001052238A1 (en) 2000-01-10 2001-01-10 System and method for speech processing with limited training data

Country Status (1)

Country Link
WO (1) WO2001052238A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130084976A1 (en) * 2011-10-01 2013-04-04 Microsoft Corporation Game paradigm for language learning and linguistic data generation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5764851A (en) * 1996-07-24 1998-06-09 Industrial Technology Research Institute Fast speech recognition method for mandarin words
US5787230A (en) * 1994-12-09 1998-07-28 Lee; Lin-Shan System and method of intelligent Mandarin speech input for Chinese computers

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CN KR