WO2006083690A2 - Language engine coordination and switching - Google Patents

Language engine coordination and switching

Info

Publication number
WO2006083690A2
Authority
WO
WIPO (PCT)
Prior art keywords
language
translation
speech
text
voice
Prior art date
Application number
PCT/US2006/002838
Other languages
French (fr)
Other versions
WO2006083690A9 (en)
WO2006083690A3 (en)
Inventor
David Carroll
James Albers
Original Assignee
Embedded Technologies, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Embedded Technologies, Llc
Publication of WO2006083690A2
Publication of WO2006083690A3
Publication of WO2006083690A9

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Definitions

  • Figure 9 is a schematic illustration of an adaptive phrase cache enhancement to a language translation system, according to an embodiment of the invention.
  • Figure 10 is a schematic illustration of simultaneous listening by one or more elements, according to an embodiment of the invention.
  • Figure 11 is a schematic illustration of parallel processing with multiple elements, according to an embodiment of the invention.
  • aspects of the invention provide highly effective and efficient voice-to-voice language translation engine coordination and switching.
  • Coordination and quick-swapping devices and methods, implemented via software, computer-readable media or the like, provide an optimal or near-optimal "mix" of engines and processing capabilities, depending on a variety of characteristics and requirements.
  • Hybrid free-speech translation systems with adaptive phrase cache and optional audio stream splitting provide additional advantages. Portability and mobility are emphasized, making embodiments of the invention especially advantageous for use with wearable computing devices, such as those covered by the ViA, Inc. patents incorporated by reference above.
  • Figure 3 references a dictionary stacking function, which allows users to add jargon, slang or other non-standard terminology, as well as a user's vernacular or an application-specific vocabulary, to a standard off-the-shelf dictionary or other dictionary installed on a particular device.
  • Dictionary stacking functions are available in connection with e.g. a speech-recognition component and/or other components according to embodiments of the invention. Functionality and control of e.g. how speech is translated are improved.
  • a hybrid translation system uses a phrase cache, i.e. a phrase database, for a first attempt at translation. The system first looks for a high-confidence hit in the cache. Finding such a hit, a predefined translation is output. On a miss, i.e. if the system does not find a predefined phrase or other language component in the cache, the system falls back on a full dictation/translation engine optimally having relevant voice models/dictionaries and the like. At that point, an appropriate user interface can provide a way to capture a free-form recognition/translation for reuse and later cleanup by a language expert for adding to the phrase base/cache.
  • Advantages of a phrase-based front-end include the following: (1) In multiple use scenarios, e.g. interviewing locals or nationals regarding drug traffic, humanitarian relief, security, medical issues, etc., very often there is a core set of questions/instructions. This core set can be documented in appropriate manuals, emergency medical protocols, etc. (2) Phrase-based systems allow for very precise use of context/idiom to translate almost exact meaning. (3) If e.g. 80% or some other relatively high percentage of communication between speakers of different languages falls within a known set of phrases, using phrase-cache recognition and translation increases speed and accuracy. But systems according to embodiments of the invention also can fall back on a full-dictation-engine approach, providing the full benefits of full, free-speech language translation for extended conversations and the like, if needed.
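  • A minimal sketch of this cache-then-fallback flow, in Python. The class name, the similarity scoring via difflib and the 0.90 threshold are illustrative assumptions, not details taken from the patent; a real system would use the recognizer's own confidence scores.

```python
import difflib

class HybridTranslator:
    """Phrase-cache front end with full-engine fallback (illustrative sketch)."""

    def __init__(self, cache, full_engine, threshold=0.90):
        self.cache = cache              # dict: source phrase -> curated translation
        self.full_engine = full_engine  # callable used on a cache miss
        self.threshold = threshold      # minimum similarity for a "hit"
        self.miss_log = []              # captured for later expert cleanup

    def _similarity(self, a, b):
        # Stand-in confidence measure; a real recognizer supplies its own score.
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def translate(self, utterance):
        best = max(self.cache, default=None,
                   key=lambda phrase: self._similarity(utterance, phrase))
        if best is not None and self._similarity(utterance, best) >= self.threshold:
            return self.cache[best]                # high-confidence hit: canned output
        result = self.full_engine(utterance)       # miss: full dictation/translation
        self.miss_log.append((utterance, result))  # candidate for the phrase base
        return result
```

  • For example, `HybridTranslator({"where is the doctor": "wo ist der Arzt"}, full_engine=some_mt_callable)` would answer the cached question directly and route everything else through the full engine while logging the miss.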
  • a Wearable Language Translator is a wearable, Windows 2000 or later personal computer hosting an integrated set of speech and language engines designed to facilitate conversation between an English speaker and a non-English speaker. It should be noted that although certain embodiments of the invention are described with respect to English as the primary (e.g. user/interviewer) language and a non-English language as the secondary (e.g. subject/companion) language, the invention is equally applicable to other languages as well.
  • For bi-directional conversation between an English speaker and a speaker of another language X, a translator according to aspects of the invention uses a SAPI (Speech Application Programming Interface) compliant dictation engine (1) to recognize English (interviewer) questions and statements.
  • An English-to-X translation engine (2) converts the recognized text to X text.
  • an X text-to-speech engine (3) speaks the result in language X.
  • embodiments of the invention use either an X SAPI-compliant recognition engine (4), or an X ASR (Automatic Speech Recognition) engine (5) with a fixed vocabulary to recognize a companion's speech.
  • An X-to-English translation engine (6) translates to English text, which is then spoken by a text-to-speech engine (7).
  • the entire system is managed and controlled e.g. with English commands, using an English ASR engine (8), according to one embodiment.
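  • The numbered engine chain lends itself to a simple coordinator that switches direction per utterance. The sketch below is an assumption about how such wiring could look; the engine objects are plain callables standing in for the dictation (1), translation (2)/(6), text-to-speech (3)/(7) and X-recognition (4)/(5) components named above.

```python
def make_coordinator(en_dictation, en2x_mt, x_tts, x_recognizer, x2en_mt, en_tts):
    # Each direction is a recognize -> translate -> speak chain of engines.
    chains = {
        "en->x": (en_dictation, en2x_mt, x_tts),   # engines (1), (2), (3)
        "x->en": (x_recognizer, x2en_mt, en_tts),  # engines (4)/(5), (6), (7)
    }

    def run(direction, audio):
        recognize, translate, speak = chains[direction]
        return speak(translate(recognize(audio)))

    return run
```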
  • software and/or other design features permit allocation of language engines to one or more backend servers, and of the audio/GUI user interface to thinner clients, e.g. connected over a network.
  • a wireless network presents particular advantages where mobility of users is important.
  • a client-server design permits deployment of a robust multiple-language translation system at lower cost, where multiple client translation "terminals" can share the high-end processing power of a single server.
  • Processing load splitting/sharing that improves speed and timing of the processing in the application tasks by using multiple, wirelessly or otherwise connected processing devices.
  • Application processing is tasked to one or more devices based on e.g. interactive voice-to-voice demand.
  • Translation tasks are inherently "bursty," in the sense that times of relatively high processing demand are interspersed with human "think time” or non-verbal communications. Note e.g. Figure 5 in this regard, which shows processing demand vs. time for translation tasks occurring during communication between two users, e.g. users with wearable or mobile personal computers.
  • multiple clients can offload translation tasks, wirelessly or otherwise, to one relatively large server that can efficiently handle multiple conversations, for example, and/or other processing devices can handle different parts of the load as needed, such as one or more on-site or off-site computers, third-party computers, laptops, desktops, palmtops, etc.
  • Processing tasks, or portions thereof, related to the speech of one party to a conversation can be shifted over to processors carried by or otherwise associated with one or more other parties to the conversation, e.g. without the knowledge of the parties if desired. Accordingly, if one looks at the demand levels associated with each of the activities occurring in a voice-to-voice language translation process, such as those referenced above, one sees highs and lows of processing demand as time passes.
  • any processing task or portion thereof can be sent from any processing device to any other available or operable processing device, according to embodiments of the invention.
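  • A toy dispatcher illustrating the idea: each reachable processor (including e.g. a third party's wearable) registers with a current-load figure, and each task or sub-task goes to the least-loaded node, with the result routed back to the originator. The names and the load model are invented for this sketch; in practice the `execute` call would be a wireless RPC.

```python
class Dispatcher:
    """Send each translation sub-task to the least-loaded reachable node."""

    def __init__(self):
        self.nodes = {}   # node name -> {"load": float, "execute": callable}

    def register(self, name, execute, load=0.0):
        self.nodes[name] = {"load": load, "execute": execute}

    def submit(self, task):
        name = min(self.nodes, key=lambda n: self.nodes[n]["load"])
        node = self.nodes[name]
        node["load"] += 1.0                      # bursty demand charges the chosen node
        try:
            return name, node["execute"](task)   # in practice a wireless RPC
        finally:
            node["load"] -= 1.0
```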
  • a room or other environment local to a primary speaker and a secondary speaker may also include a third party, e.g. an observer or a participant who for whatever reason is currently silent or less talkative than the primary and secondary speakers. Any portion of the processing required for speech-to-speech or other translation between the primary and secondary speakers can be sent over to one or more processors associated with the third party, e.g. a processor of a wearable computing device that the third party is wearing.
  • the output or other data is sent "back" to the originating processor/party.
  • any such portion of the processing also can be sent to the primary or secondary speaker themselves, or other destinations as described herein.
  • Figure 6 represents an example of the invention, according to which up to three or more wearable or other computers are in wireless or other communication with each other, with one or more remote processors, with one or more local processors, and/or with one or more additional processors, to achieve the fully distributed processing functions and advantages described herein.
  • any number of processors/computers can be used according to embodiments of the invention, local to or remote from each other or in local/remote combinations, to distribute and share processing tasks or portions of processing tasks in accordance with the invention.
  • one of the many unique features according to embodiments of the invention is the ability to split various portions of required speech-to-speech translation processing tasks and send them as needed to one or more other processors.
  • Embodiments of the invention are not limited to sending entire processing tasks to other processors, in the manner of e.g. a whole processing job being sent to a mainframe.
  • the parties to the translation need not know where the actual processing or portion thereof is occurring.
  • two or more individuals located locally and/or remotely to each other, communicate using one or more translation devices or systems.
  • One or more processors directly associated with the individuals, or remote processors, or a combination share processing tasks based on processing requirements and availability of processing time.
  • With task-sharing/handshaking according to embodiments of the invention, overall translation speed increases, translation delay is reduced, and ease of communication improves.
  • a wearable translation device or system includes a flexible wearable computer equipped with a 600-megahertz or better microprocessor, e.g. a Transmeta or other microprocessor, and runs on a Windows 2000 or better operating system.
  • the ViA II wearable computer, and other products available or soon-to-be available from ViA, Inc., are especially beneficial for use.
  • the computer preferably is compatible with or includes a keyboard, a handheld touch display, voice-recognition software, and/or one or more other interfaces.
  • the computer also preferably includes or is associated with one or more of the following: one or more microphones, e.g. a handheld or headset microphone, or other microphone preferably (though not necessarily) body-worn or body-carried; one or more speakers, e.g. built in to the front of the computer or another portion of the computer, and/or otherwise preferably (though not necessarily) body-worn or body-carried; one or more hard disk drives, e.g. a 2.5-inch (6.35 cm) hard drive with at least 6.2 gigabytes; power controls, located e.g. on a top of the computer for easy access; one or more batteries and battery connectors, e.g. a battery pack containing one or more lithium-ion, rechargeable batteries; one or more PC card slots or other modular/removable component access, e.g. two expansion sockets for two Type II PC cards or one Type III PC card; one or more Universal Serial Bus (USB) ports allowing peripheral devices to be plugged in; one or more AC/DC jacks or other power supply features, allowing use at home/office or on the road, e.g. in an automobile; one or more preferably integrated input/output jacks, for plugging in e.g. a digital display, speaker, touchscreen or other device; and a heat sink, e.g. a magnesium alloy running through the computer to dissipate heat from e.g. the one or more processors of the device. Also see Bonsor, Kevin, "How Universal Translators Will Work," http://www.howstuffworks.com/universal-translator.htm , which is incorporated herein by reference.
  • a microphone design includes control buttons or other features that permit control of the translation phases in a conversation, in connection with or independent of a GUI.
  • When button A is depressed, the system listens for and recognizes speech in English (or another language).
  • When button A is released, the system translates text of the recognized speech to the second language, e.g. language X. Text-to-speech is then used to speak that text in language X.
  • When button B is depressed, the system listens for and recognizes speech in language X.
  • When button B is released, the system translates text of the recognized speech to English text, and then uses text-to-speech to speak that text in English.
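  • The two-button flow above amounts to a small state machine. The sketch below assumes the engines are plain callables and that audio arrives as a single buffer per press; both are simplifications for illustration.

```python
class PushToTalk:
    """Button A: English -> language X. Button B: language X -> English."""

    def __init__(self, recognize_en, en2x, speak_x, recognize_x, x2en, speak_en):
        self.routes = {
            "A": (recognize_en, en2x, speak_x),
            "B": (recognize_x, x2en, speak_en),
        }
        self.active = None
        self.text = None

    def press(self, button, audio):
        recognize = self.routes[button][0]
        self.active = button
        self.text = recognize(audio)        # listen while the button is held

    def release(self):
        _, translate, speak = self.routes[self.active]
        speak(translate(self.text))         # translate and speak on release
        self.active = self.text = None
```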
  • Embodiments of the invention provide one or more microphones, e.g. operably coupled together or used individually, to achieve high ease-of-use and use as a relatively straightforward appliance. Of course, other manual or hands-free input devices besides physical buttons also can be used.
  • an audio splitter handles simultaneous speech recognition (SR) and automatic speech recognition (ASR).
  • Implementation with a hybrid system, as referenced above, is especially advantageous in view of the ability to split the incoming audio stream for input into two or more engines.
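  • A minimal illustration of the stream split, assuming the incoming audio is an iterable of frames and each engine consumes a whole stream; a live system would instead fan frames out to concurrently running engines.

```python
import itertools

def split_audio(frames, consumers):
    """Fan one audio stream out to several engines, e.g. dictation plus ASR."""
    streams = itertools.tee(frames, len(consumers))
    return [consume(stream) for consume, stream in zip(consumers, streams)]

# e.g. split_audio(mic_frames, [dictation_engine, command_engine])
```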
  • Software and/or hardware designs according to embodiments of the invention can perform simultaneous translation from one language to multiple target languages, from multiple languages to one target language, or combinations thereof.
  • Software and/or hardware designs according to embodiments of the invention can quickly accommodate new advances in speech engines, for example engines for languages not currently supported, machine and other translation engines that handle idioms and context in better ways, text translation that uses a "translation memory" approach, etc.
  • one or more processors can be associated with each individual for whom translation is being performed, with each application engine, or according to other processor-distribution designs.
  • a free-speech, bi-directional translation system is hosted on a single mobile computer, multiple mobile computers, and/or mobile computers in combination with centralized or other servers or processors.
  • Use with wearable computers and flexible wearable computers is contemplated, as is use with non-wearable or relatively non-mobile computers, e.g. desktop computers.
  • Architecture provides for speech- component plug-ins, e.g. into DCOM/SAPI and/or DCOM/MTAPI sockets.
  • Architecture according to embodiments of the invention also explicitly supports offloading any component to backend or other servers or processors, in the manner of e.g. inherently distributed processing that results in a faster overall system or components thereof.
  • Adaptive Phrase Cache architecture provides additional advantages. If a phrase or other language component provides a hit in a corresponding cache, a known, high-quality, "hand-crafted" translation results. If a miss occurs instead, sophisticated speech recognition and translation occurs. If on a cache miss a useful phrase translation is produced, the system can be told to add the translation to its phrase cache. Over time, therefore, personnel operating in a particular area and/or doing a particular job can build a large repository of high-quality phrase pairs that can be distributed to all personnel working with the same language in similar circumstances.
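  • A hypothetical curation step matching this description: free-form translations captured on cache misses are reviewed, promoted into the phrase cache, and merged into packs for distribution. The function names and the dict-based cache are assumptions of the sketch.

```python
def promote_misses(miss_log, cache, approve):
    """Move expert-approved (utterance, translation) pairs into the phrase cache."""
    for source, translation in list(miss_log):
        if approve(source, translation):   # the language expert's cleanup/sign-off
            cache[source] = translation
    miss_log.clear()
    return cache

def merge_phrase_packs(*packs):
    """Combine phrase packs for distribution; later packs override earlier ones."""
    merged = {}
    for pack in packs:
        merged.update(pack)
    return merged
```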
  • Embodiments of the invention permit hand-tuning of recognized phrases using text editing commands; hence the user has great control over producing good input into the translator. Additionally, interspersed translated conversation and voice commands can control the software.
  • one or more speakers communicate with one or more listeners. Each speaker and/or listener optionally communicates in a different language.
  • Embodiments of the invention are implemented using a collection of one or more engines, such as speech-recognition/dictation engines (SR), translation engines (MT), text-to-speech engines (TTS) and/or command and control engines (CC), that are configurable in desired ways.
  • Other types of engines, e.g. voice over IP engines, also are contemplated for use.
  • Each engine or "node" as illustrated in Figure 8 is optionally allocated by software or other control to multiple computing devices or machines, based e.g. on processing requirements/power at a particular machine.
  • the allocation also optionally is based on proximity to the speaker(s) or listener(s). For example, if a particular audio interface is very close to a particular speaker or listener, the highest quality audio input, and thus highest recognition accuracy, likely will occur using that audio interface and/or an associated engine. Audio input and/or output optionally is distributed. Other features and advantages according to aspects of the invention are apparent from Figure 8.
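  • One way to picture the allocation rule: score each candidate node by spare processing power and by proximity to the speaker, then pick the best. The weights and the coordinate-based distance are invented for this sketch.

```python
def allocate_engine(nodes, speaker_pos, w_power=0.5, w_distance=0.5):
    """Pick the node balancing spare CPU against distance to the speaker."""
    def score(node):
        dx = node["pos"][0] - speaker_pos[0]
        dy = node["pos"][1] - speaker_pos[1]
        return w_power * node["spare_cpu"] - w_distance * (dx * dx + dy * dy) ** 0.5
    return max(nodes, key=score)

best = allocate_engine(
    [{"name": "wearable-1", "spare_cpu": 0.2, "pos": (0.0, 1.0)},
     {"name": "backend-server", "spare_cpu": 0.9, "pos": (40.0, 3.0)}],
    speaker_pos=(0.0, 0.0))   # here the nearby wearable wins despite less CPU
```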
  • phrase caches can be specifically customized for particular local or regional languages, accents, etc. If a spoken word, phrase or other utterance is recognized as being one of the "canned" phrases in the phrase cache, then a "hit" is indicated. Otherwise, upon a "miss", SR/MT translation occurs.
  • the phrase cache is dynamically tuned, according to embodiments of the invention: words, phrases, etc. can be added, modified, improved, etc. in the cache, so that over time, in a particular language domain (e.g. particular speech in a particular region), higher and higher hit rates occur in the cache, quality of output phraseology is improved, and/or other advantages accrue. Other features and advantages according to aspects of the invention are apparent from Figure 9.
  • a command and control engine, regular dictation engine, or other "agent" is conferenced into a verbal communication path, e.g. an analog, digital or other voice connection.
  • A personal information manager (PIM), personal digital assistant, other computing device or computer, or other e.g. digital-based helper or agent, for example, is commanded to find the telephone number for a third-party individual.
  • the computer, for example, speaks back the telephone number, the spoken command, or other information.
  • the computer or other helper is listening during the conversation over the communication path and participating in that conversation as needed.
  • the helper or other agent is akin to, or is in addition to, a speech or other engine, listening while a conversation occurs over the communication path. It listens in order to satisfy commands that it picks up as the communication is occurring, for example.
  • the Figure 10 embodiments and other embodiments described herein illustrate the advantages that accrue with the convergence of command and control and manipulation of a computing device, for example, with the speech associated with various forms of normal communications between human beings (e.g. analog cellular phone, digital cellular phone, voice over IP, or other forms of communications). Other features and advantages according to aspects of the invention are apparent from Figure 10.
  • a "broker” such as a recognizer is used to take an utterance from one or more speakers and hand it simultaneously to multiple speech recognition engines (SR). Multiple results are returned, and the best result is chosen.
  • the best result or case can be chosen based e.g. on previous retries, for example, if speaker A is speaking to speaker B and an avenue is provided for speaker A to indicate that the recognition is inaccurate (e.g. a voice input to recognize the phrase "No," or "No, that's not what I said," or "No, that's not what I meant because that's not what I said").
  • Over time, one or more of the SR engines is judged to be the best for a particular speaker, for example.
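  • A sketch of such a broker: the utterance goes to all registered SR engines in parallel, each returns a (text, confidence) pair, and a per-speaker weight, adjusted downward whenever the speaker rejects an engine's output, biases the final choice. The callable-based engine interface is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def broker(utterance, engines, speaker_weights=None):
    """Hand one utterance to several SR engines and keep the best hypothesis."""
    weights = speaker_weights or {name: 1.0 for name in engines}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, utterance) for name, fn in engines.items()}
    results = {name: future.result() for name, future in futures.items()}
    best = max(results, key=lambda name: results[name][1] * weights[name])
    return best, results[best][0]

# On a spoken correction such as "No, that's not what I said", the caller can
# decay the winning engine's weight for this speaker: speaker_weights[best] *= 0.9
```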
  • the results of translation can be recorded and analyzed post facto, and correlated with the engine combinations that were used: e.g. SR 1 with MT 1, SR 1 with MT 2, SR 2 with MT 3, etc., through all combinations or some subset of combinations.
  • the best combinations of engines can be determined for a particular speaker-listener pair/combination, or other variable.
  • a human domain expert can be used to conduct the analysis, for example.
  • the resulting data/analysis can be added back into the scoring/weighting to determine which engine or combination of engines to use in a particular situation. For example, the best domain performance occurring within a particular time limit, e.g. a two-second response time, can be chosen.
  • the most accurate engine or combination of engines might be avoided, for example, if it is outside the chosen time bounds. Historically less-accurate results can be favored in order to improve response time. Other variables or factors alternatively can be chosen for a particular environment, language, or other situation.
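  • The selection rule described here can be stated compactly. The record format below is invented for illustration; the rule is the one in the text: among SR/MT combinations measured to respond within the time bound, prefer the most accurate, and fall outside the bound only when nothing qualifies.

```python
def pick_combination(records, time_bound=2.0):
    # records: e.g. {"sr": "SR 1", "mt": "MT 2", "accuracy": 0.87, "seconds": 1.4}
    within_bound = [r for r in records if r["seconds"] <= time_bound]
    if not within_bound:
        return min(records, key=lambda r: r["seconds"])   # nothing fast enough
    return max(within_bound, key=lambda r: r["accuracy"])
```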
  • Bi-directional voice-to-voice language translation for continuous speech is functional for the German/English language pair.
  • the delay time between speaking a phrase and the translated phrase being "spoken" by the system is approximately six seconds.
  • Hardware for the system includes a ViA II 180 MHz wearable computer with 64 MB of RAM and a 3.2 GB hard drive, a Tandy 20-009 clip-on speaker and, depending on the noise environment in which the translator was being used, a Shure Prologue directional microphone or a Telex Verba headset.
  • the computer weighs approximately 22 ounces and is worn around the waist. It is powered by two 15-ounce batteries, with a run time of approximately four hours on a single charge.
  • the software elements include Lernout & Hauspie's VoiceXpress speech recognition software, Globalink's Power Translator Professional text-to-text translator, Lernout & Hauspie's TTS300 speech synthesis software, and ViA's speech engine enhancement code and operator interface and system integration software.
  • ViA developed a near real-time, mobile, lightweight, robust and low-cost language translation system that can be operated with minimal training in a hands- free manner.
  • This system, which is shown in Figure 1, supports the English/German language pair. A listing of the individual components is provided in Table 1. The system was designed so it can be readily expanded to include additional language pairs.
  • the ViA II computer selected for the language translator is shown in Figure 1.
  • the ViA II consists of two modules connected with a flexible circuit. It is approximately 9 % inches in length, 3 inches in height, and one inch thick. Its total weight, including batteries for four hours of continuous operations, is approximately 3.7 pounds. Interfaces on the ViA II include USB, PS/2, serial, two PC Card slots and an AC/DC power port.
  • the ViA II also has a docking station that allows the computer to be used as a desktop PC. This docking station, which is shown in Figure 2, provides standard desktop interfaces (e.g., monitor, speaker, microphone, keyboard, mouse, serial and USB devices).
  • Two Molicel ME202 lithium-ion rechargeable batteries are used to power the language translator system. These batteries, which are shown in Figure 3, provide approximately four hours of continuous operation. In normal operation, the system is not continuously used. The built-in power management cycles down the CPU into a stand-by mode when speech recognition is not required. Thus, the average battery life will be greater than four hours. In our user tests, the battery life was typically around 5 1/2 hours.
  • Anti-noise-canceling (ANC) headsets (e.g., Andrea's ANC-1000), collar mounts (e.g., Labtec's LVA-7370), wrist mounts (e.g., ViA's Wrist Interactive Device), handheld directional units (e.g., Logicon's ABF-4) and intelligent remote microphones were investigated for their suitability in the mobile translator system.
  • Six headsets and two directional units were selected for actual testing with the Language Translator system. These tests were conducted in noise environments up to 100 dB.
  • the six headsets were Knowles' VR-3264, Telex's Verba, Andrea's 600 and 601 and VXI's VS4 and Parrot 20.
  • the two handheld directional microphones were Andrea's Far-field Array (which is sold in Logicon's ABF-4 system) and Shure's Prologue.
  • the directional microphone selected was Andrea's linear array system, which is included in Logicon's ABF-4 microphone. This is shown in Figure 6.
  • the ABF-4 is designed to assist hearing-impaired individuals. In its current format, it uses an FM wireless link to connect to a user's hearing aid.
  • the speaker system must provide clear audio of spoken words, be small and lightweight, robust enough to survive outdoor use (e.g., water and dust resistant), be low in cost and not require high power.
  • ViA's current touchscreen displays, which work very well for detailed images such as diagrams and maps, are approximately 8.5" x 5" x 0.75".
  • One such display, with a 6.5" screen, is shown in Figure 9.
  • An 8,4" unit with a highly reflective color display that will be readable in bright sunlight is currently being developed and will be commercially available
  • a linear array microphone could be embedded into such a display to provide mounting for the microphone/speaker.
  • ViA also has developed a prototype pocketable display called the Optical Viewer. This display is shown in Figure 10. When positioned approximately one inch from the eye, this system provides the equivalent viewing capabilities of a 17" diagonal desktop display.
  • B.6.3 Wrist-Mounted Displays: ViA is developing a wireless wrist-mounted interface.
  • the system, shown in Figure 11, uses a low-power RF interface to communicate from the wrist to the "belt" (a wearable computer).
  • the screen itself will be readable in bright sunlight.
  • the microphone/speaker will be embedded directly into the device. This system is expected to be available
  • the display that was selected for the Language Translator is ViA's 6.5" VGA-color touchscreen. This selection will be replaced by other displays as they become commercially available. Some of the potential candidates include the previously mentioned highly reflective 8.4" VGA color touchscreen, the Wrist Interactive Device and/or a new wireless 4" display that ViA is just beginning to develop.
  • ViA developed two operator interface modes for the Language Translation system: touchscreen and voice.
  • An example of the touchscreen interface is provided in Figure 12.
  • the touchscreen interface is used to configure the system to the desired parameters and to view the transcription of the spoken words. Parameters that can be set with this interface include selecting previously entered user profiles (e.g., age, native language, etc.), selecting the direction of the translation (e.g., from German into English or from English into German) and changing the gender of the computer-synthesized audio output. The need for each of these three elements became evident during our user trials and commercialization studies and, thus, they were included in the resulting system.
  • the voice-based interface for the system includes processing any spoken phrase and listening for a pause of greater than three seconds.
  • the pause informs the system that the user is finished speaking and the translator should start processing the speech.
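  • A toy end-of-utterance detector following the three-second rule described above, assuming fixed-length frames and a caller-supplied energy function; real endpointing would adapt the silence threshold to ambient noise.

```python
def detect_utterance_end(frames, energy, silence_threshold=0.01,
                         frame_seconds=0.02, pause_limit=3.0):
    """Collect frames until silence has lasted longer than pause_limit seconds."""
    silence = 0.0
    collected = []
    for frame in frames:
        collected.append(frame)
        silence = silence + frame_seconds if energy(frame) < silence_threshold else 0.0
        if silence >= pause_limit:
            break                      # pause detected: hand off to the translator
    return collected
```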
  • all of the parameters that can be set using the touchscreen will also be accessible using just voice. For example, you will be able to say "set the voice to female" to select the type of voice that the computer should synthesize in generating the output wavefiles.
  • the system uses a three-step process to produce the audible translation: voice-to-text speech recognition; text-to-text translation; and text-to-speech voice synthesis.
  • Commercially available software for each of these three modules was identified, evaluated and selected for use in the Language Translator. These three modules are enhanced and integrated using a ViA-developed software package. All four of these modules are described in this section.
  • SAPI engines were evaluated: Lernout & Hauspie's (L&H) VoiceXpress, Conversa's Lingo, IBM's ViaVoice, Dragon's NaturallySpeaking and Microsoft's Whisper. Each of these systems was rated as to its suitability for the language translator system.
  • the performance parameters included robustness in noisy environments, speed, accuracy, product cost, hardware requirements and, of special importance, the product's ability to support foreign languages.
  • Terminology repository: Terminology managers serve as a collection point for gathering and storing domain-specific words and their translations.
  • Rapid term lookup: Basic terminology managers translate domain-specific words in a unidirectional, one-to-one correspondence. More sophisticated term managers store objects in a "concept" orientation with multilingual mapping in multiple directions. Some allow narrative term definition/description and even the storage of graphics to represent the concept. Searching mechanisms can range from matching on simple word look-up to more advanced approaches that employ "fuzzy" searching techniques looking for matches at a conceptual level.
  • Terminology extraction: Tools with this feature will linguistically analyze source and target documents of previous translations to more easily identify and extract terminology for import into the terminology manager.
  • Machine Translation tools linguistically process source documents to create a translation "from scratch.” Up until several years ago, these tools required large mainframe computer platforms for timely execution. However, with recent advances in PC and UNIX based systems, many of these high-end solutions are available in affordable versions with quality and accuracy that compares favorably with their mainframe parents.
  • MT solutions are best applied in the following areas:
  • TM tools are based on the automated re-use of previously translated terms and sentences. These tools assist, rather than replace, the translator. For example, when using a TM-based tool, typically 20-50% or more of a document will require manual translation. With TM tools, the level of benefit is directly proportional to the amount of repetition in the document. Therefore, long technical manuals tend to be good candidates for TM, whereas the use of TM for a mobile language translator is very limited. Thus, TM tools will not be used for the language translator. They will be included in ViA's survey for the sake of completeness, but will not be tested to the same extent as the other software packages. TM tools are especially helpful in translating updated versions of previously translated documents. Other benefits include:
  • TM-based systems are less sensitive to language directions than the other approaches, and thus a wide range of languages is supported.
  • Research efforts on TM systems are being conducted both in industry and academia.
  • One such effort is the Deductive and Object-Oriented Databases project being developed by the University of Toronto's Computer Science department and the Computer Systems Research Institute.
  • Representative commercial products in this category include EUROLANG Optimizer, Trados Workbench and IBM Translation Manager.
  • L&H is developing neural networks that will perform postprocessing of Machine Translations. This capability, if successful, will make significant strides in completely automating the translation process.
  • the neural nets are constructed by comparing the final version of a document that is manually translated by L&H's Mendez division with that of the same document processed by the Machine Translator. By forming this comparison, translation errors are detected and algorithms developed (i.e., a neural net) to automatically perform the post-editing process.
  • Pangloss is being jointly developed by Carnegie Mellon University, New Mexico State University and the University of Southern California. This system combines three different translation engines to formulate a "best-output" translation. The goal of this effort is to develop software for direct speech-to-speech translation.
    Table 4 - Comparison of Language Translation Software
  • Text-to-speech is also referred to as speech synthesis.
  • speech synthesis is the technology the computer uses to produce the sounds an individual would make if he/she were reading the text aloud.
  • speech synthesis requires the least computing power.
  • the alternate approach uses a mathematical model of the human vocal tract to reproduce the correct sounds.
  • the technology is based on parameterized segment concatenation algorithms, where human voice samples such as diphones, triphones and tetraphones are stored and used to convert the text into speech.
  • In-depth linguistic processing is used to intelligently convert text to its correct pronunciation, combined with advanced prosody rules that provide natural-sounding intonation.
  • An example of this technology is L&H's TruVoice TTS3000/M software. This saves disk space, at the expense of increasing the computational requirements. Both of these approaches were investigated to determine which one is best suited for use in the mobile translation system.
  • A volume-control routine provides instantaneous compensation for ambient background noise. This allows the language translator to be used in noisy environments (e.g., outdoors, in airports, etc.).
  • Echo Canceling Automatic switching between full- and half-duplexing modes of operation provides an improved echo canceling capability over other commercially available products. This further enhances the robustness of the speech recognition software.
  • Remote Program Support Tools: A web-based format allows remote loading of data and new program files. Thus, if additional words need to be added to a dictionary (e.g., words specific to a particular application), this can be accomplished wirelessly in a mode that is transparent to the operator.
    B.10 System Integration
  • the first approach was to use an intermediate application to transfer results from one step to the next.
  • Most voice engines and translation engines offer integration support for Microsoft Word and Corel WordPerfect. With this approach, ViA would have needed to develop an intermediate application that watches any active Microsoft Word documents for incoming text and then passes the text to the translation process.
  • This approach has two serious drawbacks. The first arises due to the intermediate application: if the translator utilizes Microsoft Word, the speed of the application is severely limited.
  • The second is that no text-to-speech engines offer support outside of specialized programmer interfaces. As a result, one-third of the application (the speech synthesis audio output) would not be well served by this approach. Because of these issues, this approach was not implemented.
  • the second method, which was the selected approach, took the text directly from one software development kit and passed it to the next software development kit, continuing the process until all steps had been completed.
  • The resulting application integrated all three software development kits (voice recognition, translation and text-to-speech) into a single, seamless environment.
  • the software supports both single and multiple platforms. Each system is loaded with all three software development kits.
  • the application allows users to dictate any amount of text. When the user finishes speaking, the translation begins.
  • the touchscreen interface is used to change the direction of translation, for example, to go from German to English versus English to German.
  • voice recognition software will be used to determine who is speaking (e.g., the German speaker or the English speaker), and the system will then automatically determine the direction to use for translation.
  • the distributed approach reduces the response time of the system. As a user speaks, all systems within network range of the primary machine receive the text, untranslated. In this approach, the receiver is responsible for translation and playback. This allows near real-time use of the system. Multiple users can be speaking at once. Also, this approach allows for faster operation.
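  • A sketch of this receiver-side-translation mode, with the wireless transport reduced to direct method calls for illustration; each listening machine translates into its own language and plays the result locally.

```python
class Receiver:
    def __init__(self, translate, speak):
        self.translate = translate     # into this receiver's own language
        self.speak = speak

    def on_text(self, text, sender):
        self.speak(self.translate(text))   # translation happens receiver-side

def broadcast(text, sender, receivers):
    """One send of untranslated text; every receiver handles its own playback."""
    for receiver in receivers:
        receiver.on_text(text, sender)
```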
  • Multi-platform support: This includes the ability to run the translator using wireless connectivity between multiple platforms and to have simultaneous translation of original spoken phrases into multiple languages.
  • Improved microphone and speaker capabilities, including support for dual microphones on a single platform.
  • a voice-based interface that has the same functionality as the touchscreen interface. In other words, every command that is accessible by using the touchscreen will also be accessible using voice commands.
  • the ViA II has two PC Card slots. Because of this fact, integrating wireless communications is not a significant issue. This is a low-risk task. ViA has frequently deployed systems that use such a capability.
  • the wireless communication links are typically used to only send data. In this application, the communication link will be used to transmit both data and voice.
  • Each user will have the ability to select a group of people he/she wishes to speak with. This may be the entire group of people within range of the wireless communications link, or it may be a small subset. Selecting these people can be done using either the touchscreen or naming them by voice. If the touchscreen is being used, then a picture of the individual that is currently talking will be displayed. Another option, which will be user selectable, is to have the individual's name spoken each time a phrase they have stated is translated and synthesized for output. For example, each time “Chris” states a phrase, the word “Chris” would be inserted at the beginning of the resulting wavefile.
  • Another feature that will be developed in this effort is to provide an encrypted communications link. This will ensure that other people cannot “listen in” on private conversations.
  • This feature will allow the Language Translator to be used in environments where multiple languages are being spoken.
  • One such example is the United Nations.
  • the Multi-platform Wireless Communications capabilities (see Section E.5.1) will be used to transmit untranslated wavefiles to the remote platforms.
  • the remote platform will then perform the translation and voice synthesis tasks.
  • ViA will also develop a voice-only interface to the system.
  • the features described throughout Section E, with the exception of showing an image of the individual speaking, will be included in this interface. Parameters will be set by stating voice commands, for example, "set the language pair to English and Chinese."
  • the system produced provides voice-to-voice translation for English/German. This language pair was selected to demonstrate the feasibility of producing a voice-to-voice language translator. During demonstrations of this system, potential users were asked which languages they would like to see included in this system. As a result of these interviews and research that was performed on the commercial potential of a language translator, eight additional languages were selected. These languages, in addition to German, are Spanish, Italian, Portuguese, French,
  • ViA's software is designed to accept any SAPI (Speech Application Programming Interface) compliant speech recognition engine.
  • ViA's software is designed to accept any translation software that is compatible with PCs. This will be expanded to include translation software that can run on any computing platform as long as the server is within range of the wireless communications link. Because the performance of translation software is rapidly improving and the program will support ten languages, instead of just two, this evaluation process will need to be expanded to additional languages and repeated for both German and English to ensure that the best commercial products are being used.
  • ViA's software is designed to accept any speech synthesis software that is SAPI compliant.
  • each dictation engine keeps an accurate profile of the user's age bracket and gender, which the synthesized voice would ideally reflect.
  • the text-to-speech engine should at a minimum support both male and female sounding synthesis. This allows some personalisation when using the system.
  • the goal of the language translator project is to develop a near real-time, two-way, mobile, lightweight, robust and low-cost multilingual translation device that can be operated in a hands-free manner.
  • A.2.2 Specific Design
    A.2.2.1 Usage
    a. Upon installation, a brief voice-profile training process must be undergone in order to guarantee accurate recognition.
    b. A user profile will also be configured that will include an approximate age group for the user, as well as gender. This will increase the recognition capabilities.
    c. The user that is speaking the native language (herein referred to as the primary user) will speak either English or German.
    d. The computer will then receive the spoken data and, with no interaction from either the primary user or the user that desires the translated text (herein referred to as the secondary user), translate the recognized data to the opposite language of the pair (English <--> German).
    e. Upon successful translation, the language translator will then speak the translated data, using a voice synthesis product, to the secondary user in the translated language.
    f. The system will be full duplex; therefore either user could speak as they receive a translated voice response.
  • A.2.2.3 Audio Headset
    a. Since the system is designed to be mobile, external battery-powered speakers will be used to broadcast the translated speech.
    b. A mobile array microphone will be used to facilitate a more natural mobile environment.
  • A.2.2.4 Hardware Platform
    a. The system will be robust enough, and optimised, to run on a combination of multiple ViA II computers, but ideally will run locally on a single machine.
  • Applications include all individuals who require multi-lingual capabilities.
  • the mobile translator will benefit a wide range of individuals including military personnel, airport employees, border patrol and customs agents, police, fire fighters, retail clerks, bank tellers, delivery personnel, phone operators, tourists and any industry that sells, develops or manufactures products to/in global markets or employs individuals that do not speak the native language.
  • the mobile translator system provides a common platform, based entirely on established PC standards, upon which researchers, developers and application designers can easily integrate their particular element with the overall system. Thus, as new language translation technologies, both software and hardware, become available, they can be readily integrated with the mobile translator.
  • By using the ViA wearable PC, a continued path for incorporating the latest advances in PC technology is provided. This ensures that the design will not become outdated.
  • the computer for the mobile translator is a full function PC that can be worn to provide a mobile solution, or placed in a dock to provide a desktop solution.
  • the system will be multifunctional - meeting the user's computing needs well beyond that of the language translation capability.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Translation systems and methods as illustrated and/or described herein, including speech-to-speech, text-to-text, speech-to-text and/or text-to-speech translation systems, devices, engines, applications and associated methods.

Description

LANGUAGE ENGINE COORDINATION AND SWITCHING
STATEMENT OF FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT
Embodiments of the invention were made with U.S. government support under Contract No. N00014-99-C-0437 by the Office of Naval Research. The government has certain rights in the invention.
BACKGROUND OF THE INVENTION 1. Field of the Invention
Embodiments of the invention relate to language engine coordination and switching, e.g. in connection with language translation devices and methods. Certain embodiments of the invention relate to speech-to-speech translation devices and methods that allocate computing power and translation capabilities in more effective and more efficient ways than previously known, especially in mobile, wireless environments.
2. Description of Related Art
Known computer prototypes can translate non-complex speech as it occurs in conversation. Field tests of such prototypes are planned and/or completed, for example in encounters between soldiers and nationals of foreign countries. Conversations consisting of simple questions and declarative statements can be translated with relatively high, or at least acceptable, degrees of accuracy. Versions of the DARPA One-Way System, for example, have been used in Bosnia and Kosovo on a limited basis, for maritime intercept operations (MIO) in the Arabian Gulf, and for medical support in the Marine Corps Urban Warrior exercise. The system has enabled U.S. soldiers to communicate with local inhabitants regarding previously unknown mine fields, for example, and to discern other information. Reported response has been very positive, and the need for easy-to-use systems clearly demonstrated.
The U.S. Army's Forward Area Language Converter (FALCon) system has allowed soldiers to convert foreign-language documents into approximate English-language translations. Speech recognition is reportedly being merged into the FALCon system, e.g. through the One-Way System, the Navy's Multi-Lingual Interview System, and components of Carnegie Mellon University's DIPLOMAT speech-to-speech translator.
Speech-to-speech translation presents numerous difficulties, however. Translation often requires "guessing" what is meant by a spoken word or phrase. People change verb tense in mid-thought, stutter, mispronounce words or pronounce them in different dialects, and otherwise make speech-to-speech translation much more difficult than mere text-to-text translation, for example. Ambient noise, interference, acoustics and other factors further complicate the process.
Speech-to-speech (or voice-to-voice) language translation typically requires application engines for speech-to-text, text-to-text, and then text-to-speech operations. According to one example, an English speaker approaches a non-English speaker and makes a relatively simple, declarative request, such as "I would like two tickets to Vienna." A translation device hears the request. A speech-to-text engine, involving e.g. voice-recognition software, converts the speech data into text. This conversion can occur not just word-by-word, but also by looking at phrases in context, e.g. for idiom detection and translation. A text-to-text engine converts the English text into foreign-language text. Finally, a text-to-speech engine converts the foreign-language text into speech that is audible to the foreign-language speaker. The translation device then reverses the process when the non-English speaker answers or addresses the English speaker. See, for example, Orenstein, David, "Stick It In Your Ear," Business 2.0 Magazine, May 29, 2001, which is incorporated herein by reference.
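As a minimal illustration of this three-engine chain, the following sketch composes placeholder speech-to-text, text-to-text and text-to-speech callables; the function names are assumptions made for illustration, not components disclosed by the application.

```python
def speech_to_speech(audio, speech_to_text, text_to_text, text_to_speech):
    source_text = speech_to_text(audio)      # e.g. "I would like two tickets..."
    target_text = text_to_text(source_text)  # English text -> foreign-language text
    return text_to_speech(target_text)       # audible speech for the listener
```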
Various vendors provide these engines, which are continually evolving as computing technology improves and the need for global communications grows. Accordingly, such application engines are not standardised for working together efficiently. They are also not readily inserted into the most efficient and capable mix to accomplish voice-to-voice language translation, e.g. in a mobile environment.
U.S. Patent No. 6,173,259 to Bijl, incorporated herein by reference, discloses a speech-to-text conversion system in which a single task may be subdivided into multiple individual work packages, to exploit multiple resources. The task of automatic speech recognition can be divided across several automatic speech recognition processors, creating a type of parallel processing that allows reduced processing turnaround time. Additionally, a single correction operation can be subdivided across many correction terminals, possibly operating independently and in different locations and different time zones. Faster or cheaper document turnaround is stated as an advantage of the disclosed device. However, this reference discusses only speech recognition as a divided task among multiple processors.
According to U.S. Patent No. 5,774,854 to Sharman, incorporated herein by reference, a text-to-speech system includes a linguistic processor and an acoustic processor. The system can include two microprocessors, the linguistic processor operating on one microprocessor and the acoustic processor operating essentially in parallel therewith on the other microprocessor. By effectively running the linguistic processor and acoustic processor independently, the processing in them can be performed asynchronously and in parallel. The linguistic processor is typically run on a host workstation, while the acoustic processor runs on a separate digital processing chip on an adapter card attached to the workstation.
U.S. Patent No. 6,161,082, incorporated herein by reference, discloses a network-based language translation system. The disclosed system can perform translation tasks both for a single user and for multiple users, each communicating in a different language. The greater processing power available in the network is stated to allow for translation of communications in both text and speech, and to accommodate increasingly capable language translation software programs.
Wearable computing devices are enjoying increasing popularity in consumer, commercial, military, industrial and other applications and markets. Leading wearable computers, systems, peripherals and associated devices are available from ViA, Inc., Minneapolis, MN. Attention also is directed to U.S. Patents Nos. 6,249,427, 6,108,197, 5,798,907, 5,581,492, 5,572,401, 5,555,490, 5,491,651 and 5,285,398, all of which, among others, are owned by ViA, Inc., and all of which are incorporated herein by reference.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be described with respect to the figures, in which:
Figure 1 shows an overview of bi-directional free-speech translation, according to an embodiment of the invention.
Figure 2 shows bi-directional free-speech translation software architecture, according to an embodiment of the invention.
Figure 3 shows bi-directional free-speech translation configuration/customization data, according to an embodiment of the invention.
Figure 4 shows bi-directional free-speech translation including a hybrid free-speech system with an adaptive phrase cache, according to an embodiment of the invention.
Figure 5 shows processing demand vs. time for a translation example, according to an embodiment of the invention.
Figure 6 is a schematic illustration showing communication and interaction between one or more processors and one or more computers, according to embodiments of the invention.
Figure 7 shows a graphical user interface (GUI) according to an embodiment of the invention.
Figure 8 is a schematic illustration of a language translation system, according to an embodiment of the invention.
Figure 9 is a schematic illustration of an adaptive phrase cache enhancement to a language translation system, according to an embodiment of the invention.
Figure 10 is a schematic illustration of simultaneous listening by one or more elements, according to an embodiment of the invention.
Figure 11 is a schematic illustration of parallel processing with multiple elements, according to an embodiment of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
In view of the above and other issues, aspects of the invention provide highly effective and efficient voice-to-voice language translation engine coordination and switching. Coordination and quick-swapping devices and methods, implemented via software, computer-readable media or the like, provide an optimal or near-optimal "mix" of engines and processing capabilities, depending on a variety of characteristics and requirements. Hybrid free-speech translation systems with adaptive phrase cache and optional audio stream splitting provide additional advantages. Portability and mobility are emphasized, making embodiments of the invention especially advantageous for use with wearable computing devices, such as those covered by the ViA, Inc. patents incorporated by reference above.
Embodiments of the invention are described with reference to the accompanying figures and text:
• Figure 1 shows an overview of bi-directional free-speech translation, according to an embodiment of the invention.
• Figure 2 shows bi-directional free-speech translation software architecture, according to an embodiment of the invention.
• Figure 3 shows bi-directional free-speech translation configuration/customization data, according to an embodiment of the invention. Figure 3 references a dictionary stacking function, which allows users to add jargon, slang or other non-standard terminology, as well as a user's vernacular or an application-specific vocabulary, to a standard off-the-shelf dictionary or other dictionary installed on a particular device. Similarly,
personal, task and general vocabularies are available in connection with e.g. a speech-recognition component and/or other components according to embodiments of the invention. Functionality and control over e.g. how speech is translated are thereby improved.
• Figure 4 shows bi-directional free-speech translation including a hybrid free-speech system with an adaptive phrase cache, according to an embodiment of the invention. According to embodiments of the invention, a hybrid translation system uses a phrase cache, i.e. a phrase database, for a first attempt at translation. The system first looks for a high-confidence hit in the cache. Finding such a hit, a predefined translation is output. On a miss, i.e. if the system does not find a predefined phrase or other language component in the cache, the system falls back on a full dictation/translation engine optimally having relevant voice models/dictionaries and the like. At that point, an appropriate user interface can provide a way to capture a free-form recognition/translation for reuse and later cleanup by a language expert for adding to the phrase base/cache. A sketch of this hybrid flow follows the advantages listed below.
Advantages of supporting a phrase-based front-end include the following: (1) In multiple use scenarios, e.g. interviewing locals or nationals regarding drug traffic, humanitarian relief, security, medical issues, etc., very often there is a core set of questions/instructions. This core set can be documented in appropriate manuals, emergency medical protocols, etc. (2) Phrase-based systems allow for very precise use of context/idiom to translate almost exact meaning. (3) If e.g. 80% or some other relatively high percentage of communication between speakers of different languages falls within a known set of phrases, using phrase-cache recognition and translation increases speed and accuracy. But systems according to embodiments of the invention also can go to a full-dictation-engine approach when needed, providing the full benefits of free-speech language translation for extended conversations and the like.
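By way of a non-limiting sketch, the lookup-then-fallback flow described above might be expressed as follows in Python. The cached phrases, the similarity scoring (standing in here for recognition confidence), and the translation_engine callable are all hypothetical illustrations, not elements of any particular embodiment:

# Hypothetical sketch of the hybrid phrase-cache flow described above.
from difflib import SequenceMatcher

HIT_THRESHOLD = 0.9  # assumed confidence required to accept a cached phrase

phrase_cache = {
    "where is the hospital": "wo ist das krankenhaus",
    "do you need medical help": "brauchen sie medizinische hilfe",
}

def translate_utterance(english_text, translation_engine):
    # First attempt: look for a high-confidence hit in the phrase cache.
    best_phrase, best_score = None, 0.0
    for phrase in phrase_cache:
        score = SequenceMatcher(None, english_text.lower(), phrase).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_score >= HIT_THRESHOLD:
        return phrase_cache[best_phrase]   # predefined, hand-crafted translation
    # Miss: fall back on the full dictation/translation engine.
    translation = translation_engine(english_text)
    # Capture the free-form result for later cleanup by a language expert.
    phrase_cache[english_text.lower()] = translation
    return translation

Because misses are captured back into the cache for later cleanup, hit rates in a given language domain can improve over time, as described above.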
With reference to the figures herein, for example, an administrative interface permits installation and management of up to at least eight separate application engines in a single system. Note engines (1) - (8), for example, described below. According to one specific embodiment, a Wearable Language Translator according to an embodiment of the invention is a wearable, Windows 2000 or later personal computer hosting an integrated set of speech and language engines designed to facilitate conversation between an English speaker and a non-English speaker. It should be noted that although certain embodiments of the invention are described with respect to English as the primary (e.g. user/interviewer) language and a non-English language as the secondary (e.g. subject/companion) language, the invention is equally applicable to other languages as well.
• For bi-directional conversation between an English speaker and a speaker of another language X, a translator according to aspects of the invention uses a SAPI (Speech Application Programming Interface) compliant dictation engine (1) to recognize English (interviewer) questions and statements. An English-to-X translation engine (2) converts the recognized text to X text, and an X text-to-speech engine (3) speaks the result in language X.
• For responses, embodiments of the invention use either an X SAPI-compliant recognition engine (4), or an X ASR (Automatic Speech Recognition) engine (5) with a fixed vocabulary, to recognize a companion's speech. An X-to-English translation engine (6) translates to English text, which is then spoken by an English text-to-speech engine (7).
• The entire system is managed and controlled e.g. with English commands, using an English ASR engine (8), according to one embodiment.
• According to an aspect of the invention, software and/or other design features permit allocation of language engines to one or more backend servers, and of the audio/GUI user interface to thinner clients, e.g. connected over a network. A wireless network presents particular advantages where mobility of users is important. A client-server design permits deployment of a robust multiple-language translation system at lower cost, where multiple client translation "terminals" can share the high-end processing power of a single server.
Embodiments of the invention provide voice-to-voice language translation
processing load splitting/sharing that improves speed and timing of the processing in the application tasks by using multiple, wirelessly or otherwise connected processing devices. Application processing is tasked to one or more devices based on e.g. interactive voice-to-voice demand. Translation tasks are inherently "bursty," in the sense that times of relatively high processing demand are interspersed with human "think time" or non-verbal communications. Note e.g. Figure 5 in this regard, which shows processing demand vs. time for translation tasks occurring during communication between two users, e.g. users with wearable or mobile personal computers. According to aspects of the invention, multiple clients can offload translation tasks, wirelessly or otherwise, to one relatively large server that can efficiently handle multiple conversations, for example, and/or other processing devices can handle different parts of the load as needed, such as one or more on-site or off-site computers, third-party computers, laptops, desktops, palmtops, etc. Processing tasks, or portions thereof, related to the speech of one party to a conversation can be shifted over to processors carried by or otherwise associated with one or more other parties to the conversation, e.g. without the knowledge of the parties if desired. Accordingly, if one looks at the demand levels associated with each of the activities occurring in a voice-to-voice language translation process, such as those referenced above, one sees highs and lows of processing demand as time passes. If processing is confined to a single processing device, in accordance with e.g. U.S. Patent No. 6,161,082, referenced above, then the various tasks performed by each of the application engines, and their relative timing, likely will cause many results/outputs to be delayed. This delay makes communication more awkward, difficult and inefficient. In a military environment, for example, such delay could have serious consequences. The need for processing time among the various engines is likely to overlap to a significant extent; processing time requirements bump into each other. Therefore, by sending all or part of one or more processing tasks out to one or more other processors, e.g. one or more wearable processors on one or more other users, one or more third-party processors, one or more local or remote servers or processors, etc., tremendous efficiencies are achieved according to embodiments of the invention. Any processing task or portion thereof can be sent from any processing device to any other available or operable processing device, according to embodiments of the invention. A sketch of this dispatching approach follows this list.
• According to one particular embodiment, a room or other environment local to a primary speaker and a secondary speaker may also include a third party, e.g. an observer or a participant who for whatever reason is currently silent or less talkative than the primary and secondary speakers. Any portion of the processing required for speech-to-speech or other translation between the primary and secondary speakers can be sent over to one or more processors associated with the third party, e.g. a processor of a wearable computing device that the third party is wearing. Once the desired processing occurs, the output or other data is sent "back" to the originating processor/party. Of course, any such portion of the processing also can be sent to the primary or secondary speaker themselves, or other destinations as described herein.
• Figure 6 represents an example of the invention, according to which up to three or more wearable or other computers are in wireless or other communication with each other, with one or more remote processors, with one or more local processors, and/or with one or more additional processors, to achieve the fully distributed processing functions and advantages described herein. Of course, any number of processors/computers can be used according to embodiments of the invention, local to or remote from each other or in local/remote combinations, to distribute and share processing tasks or portions of processing tasks in accordance with the invention.
• As referenced, one of the many unique features according to embodiments of the invention is the ability to split various portions of required speech-to-speech translation processing tasks and send them as needed to one or more other processors, e.g. by dividing the tasks up between multiple remote or local processors. Embodiments of the invention are not limited to sending entire processing tasks to other processors, in the manner of e.g. a whole processing job being sent to a mainframe. Advantageously, according to embodiments of the invention, the parties to the translation need not know where the actual processing or portion thereof is occurring.
• According to another example, two or more individuals, located locally and/or remotely to each other, communicate using one or more translation devices or systems. One or more processors directly associated with the individuals, or remote processors, or a combination, share processing tasks based on processing requirements and availability of processing time. With task-sharing/handshaking according to embodiments of the invention, overall translation times decrease, translation delay is reduced, and ease of communication improves.
• Embodiments of the invention can be implemented in connection with flexible wearable computers (or other wearable or non-wearable computers), e.g. those available from ViA, Inc. According to one embodiment, a wearable translation device or system includes a flexible wearable computer equipped with a 600 megahertz or better microprocessor, e.g. a Transmeta or other microprocessor, and runs on a Windows 2000 or better operating system. The ViA II wearable computer, and other products available or soon-to-be available from ViA, Inc., are especially beneficial for use. The computer preferably is compatible with or includes a keyboard, a handheld touch display, voice-recognition software, and/or one or more other interfaces. The computer also preferably includes or is associated with one or more of the following: one or more microphones, e.g. a handheld or headset microphone, or other microphone preferably (though not necessarily) body-worn or body-carried; one or more speakers, e.g. built in to the front of the computer or another portion of the computer, and/or otherwise preferably (though not necessarily) body-worn or body-carried; one or more hard disk drives, e.g. a 2.5-inch (6.36 cm) hard drive with at least 6.2 gigabytes; power controls, located e.g. on a top of the computer for easy access; one or more batteries and battery connectors, e.g. a battery pack containing one or more lithium-ion rechargeable batteries; one or more PC card slots or other modular/removable component access, e.g. two expansion sockets for two Type II PC cards or one Type III PC card; one or more Universal Serial Bus (USB) ports allowing peripheral devices to be plugged in; one or more AC/DC jacks or other power supply features, allowing use at home/office or on the road, e.g. in an automobile; one or more preferably integrated input/output jacks, for plugging in e.g. a digital display, speaker, touchscreen or other device; and a heat sink, e.g. a magnesium alloy running through the computer to dissipate heat from e.g. the one or more processors of the device. Also see Bonsor, Kevin, "How Universal Translators Will Work," http://www.howstuffworks.com/universal-translator.htm , which is incorporated herein by reference.
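The load splitting/sharing referenced in the list above can be sketched briefly. The following Python fragment is a minimal, non-limiting illustration: the Node class, its pending-task load metric, and the run() call standing in for remote execution (e.g. over a wireless link) are all hypothetical, as are the stage callables passed to translate():

# Hypothetical sketch: dispatch translation stages to whichever
# connected processor currently has spare capacity.
class Node:
    def __init__(self, name):
        self.name = name
        self.pending = 0                  # crude load metric: tasks in flight

    def run(self, task, *args):
        # Stand-in for remote execution, e.g. over a wireless link.
        self.pending += 1
        try:
            return task(*args)
        finally:
            self.pending -= 1

nodes = [Node("wearable-A"), Node("wearable-B"), Node("backend-server")]

def dispatch(task, *args):
    # Send the task (or portion of a task) to the least-loaded processor.
    return min(nodes, key=lambda n: n.pending).run(task, *args)

def translate(speech_to_text, text_to_text, text_to_speech, audio):
    # Each stage of the voice-to-voice pipeline can land on a different node.
    english_text = dispatch(speech_to_text, audio)
    foreign_text = dispatch(text_to_text, english_text)
    return dispatch(text_to_speech, foreign_text)

Since the dispatcher simply picks the least-loaded node for each stage or portion of a stage, the parties to the conversation need not know where any given piece of processing actually runs.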
According to aspects of the invention, a graphical user interface (GUI) provides a simple way to control and display the progress of a conversation. Figure 7 shows an example according to an embodiment of the invention.
A microphone design includes control buttons or other features that permit control of the translation phases in a conversation, in connection with or independent of a GUI. When a button A is depressed, the system listens for and recognizes speech in English (or another language). When button A is released, the system translates text of the recognized speech to the second language, e.g. language X. Text-to-speech is then used to speak that text in language X. When button B is depressed, the system listens for and recognizes speech in language X. When button B is released, the system translates text of the recognized speech to English text, and then uses text-to-speech to speak that text in English. Embodiments of the invention provide one or more microphones, e.g. operably coupled together or used individually, to achieve high ease-of-use and operation as a relatively straightforward appliance. Of course, other manual or hands-free input devices, besides physical or virtual buttons, can be used.
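A minimal control-flow sketch of this two-button scheme follows; the engine objects and their method names are hypothetical placeholders, not a specification of any particular engine API:

# Hypothetical sketch of the two-button, push-to-talk control flow.
class PushToTalkController:
    def __init__(self, english_sr, x_sr, en_to_x, x_to_en, tts_x, tts_en):
        self.english_sr, self.x_sr = english_sr, x_sr
        self.en_to_x, self.x_to_en = en_to_x, x_to_en
        self.tts_x, self.tts_en = tts_x, tts_en

    # Button A: English speaker's side of the conversation.
    def button_a_down(self):
        self.english_sr.start_listening()

    def button_a_up(self):
        text = self.english_sr.stop_and_recognize()
        self.tts_x.speak(self.en_to_x.translate(text))

    # Button B: language-X speaker's side of the conversation.
    def button_b_down(self):
        self.x_sr.start_listening()

    def button_b_up(self):
        text = self.x_sr.stop_and_recognize()
        self.tts_en.speak(self.x_to_en.translate(text))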
• Software and/or hardware design according to embodiments of the invention splits the audio input stream, so that a translation device can simultaneously translate conversational speech while scanning for and acting upon voice commands. Thus, an audio splitter according to an embodiment of the invention handles simultaneous speech recognition (SR) and automatic speech recognition (ASR). A dictation recognizer or other SR engine, and an ASR engine, and/or other speech engines or the like, acquire audio simultaneously. Implementation with a hybrid system, as referenced above, is especially advantageous in view of the ability to split the incoming audio stream for input into two or more engines. A sketch of such a splitter follows this list.
• Software and/or hardware designs according to embodiments of the invention can perform simultaneous translation from one language to multiple target languages, from multiple languages to one target language, or combinations thereof.
• Software and/or hardware designs according to embodiments of the invention can quickly accommodate new advances in speech engines, for example engines for languages not currently supported, machine and other translation engines that handle idioms and context in better ways, text translation that uses a "translation memory" approach, etc.
• According to aspects of the invention, one or more processors can be associated with each individual for whom translation is being performed, with each application engine, or according to other processor-distribution designs.
• Whereas U.S. Patent No. 6,173,259, incorporated by reference above, discusses only speech recognition as a divided task among multiple processors, embodiments of the present invention allow work for voice-to-voice language translation to be divided and performed in more effective and efficient ways than heretofore possible. Different processors at different locations, for example, perform various tasks or portions thereof: one processor can perform speech recognition (speech-to-text translation) independently, while another performs text-to-text translation and another text-to-speech translation, or portions of these tasks, for example.
• Although particular embodiments of the invention have been discussed with respect to a client-server engine allocation, software and hardware design according to embodiments of the invention allows many different configurations and allocations. According to embodiments of the invention, a free-speech, bi-directional translation system is hosted on a single mobile computer, multiple mobile computers, and/or mobile computers in combination with centralized or other servers or processors. Use with wearable computers and flexible wearable computers is contemplated, as is use with non-wearable or relatively non-mobile computers, e.g. desktop computers.
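The audio splitting referenced in the list above can be sketched as follows; the audio source, the buffer framing, and the engine hooks are hypothetical assumptions:

# Hypothetical sketch: duplicate one incoming audio stream so that a
# dictation recognizer and a command recognizer consume it simultaneously.
import queue
import threading

def split_audio(audio_source, consumers):
    # One queue per engine; every captured buffer is copied into each.
    queues = [queue.Queue() for _ in consumers]
    for consumer, q in zip(consumers, queues):
        threading.Thread(target=consumer, args=(q,), daemon=True).start()
    for buffer in audio_source:          # e.g. successive microphone buffers
        for q in queues:
            q.put(buffer)                # each engine gets its own copy
    for q in queues:
        q.put(None)                      # signal end of stream

def dictation_engine(q):
    while (buf := q.get()) is not None:
        pass  # feed buf to the dictation (SR) engine here

def command_engine(q):
    while (buf := q.get()) is not None:
        pass  # scan buf for voice commands (ASR) here

Because each engine consumes its own copy of every buffer, the dictation recognizer and the command recognizer can listen at the same time, as described above.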
Architecture according to embodiments of the invention provides for speech- component plug-ins, e.g. into DCOM/SAPI and/or DCOM/MTAPI sockets. Architecture according to embodiments of the invention also explicitly supports offloading any component to backend or other servers or processors, in the manner of e.g. inherently distributed processing that results in a faster overall system or components thereof.
Adaptive Phrase Cache architecture according to embodiments of the invention provides additional advantages. If a phrase or other language component provides a hit in a corresponding cache, a known, high-quality, "hand-crafted" translation results. If a miss occurs instead, sophisticated speech recognition and translation occurs. If on a cache miss a useful phrase translation is produced, the system can be told to add the translation to its phrase cache. Over time, therefore, personnel operating in a particular area and/or doing a particular job can build a large repository of high-quality phrase pairs that can be distributed to all personnel working with the same language in similar circumstances.
• Embodiments of the invention permit hand-tuning of recognized phrases using text editing commands; hence the user has great control over producing good input into the translator. Additionally, interspersed translated conversation and voice commands can control the software.
• According to embodiments illustrated with respect to Figure 8, one or more speakers communicate with one or more listeners. Each speaker and/or listener optionally communicates in a different language. Embodiments of the invention are implemented using a collection of one or more speech recognition engines such as dictation engines (SR), translation engines (MT), speech engines (TTS) and/or command and control engines (CC) that are configurable in desired ways. Other types of engines, e.g. voice over IP engines, also are contemplated for use. Each engine or "node" as illustrated in Figure 8 is optionally allocated by software or other control to multiple computing devices or machines, based e.g. on processing requirements/power at a particular machine. The allocation also optionally is based on proximity to the speaker(s) or listener(s). For example, if a particular audio interface is very close to a particular speaker or listener, the highest quality audio input, and thus highest recognition, likely will occur using that audio interface and/or an associated engine. Audio input and/or output optionally is distributed. Other features and advantages according to aspects of the invention are apparent from Figure 8.
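By way of illustration only, allocating an engine node to a machine based on processing power and proximity might be sketched as follows; the machine list, the distance and power figures, and the selection rule are invented for the example:

# Hypothetical sketch: place each engine "node" on a machine, weighing
# processing power against proximity to the speaker.
machines = [
    {"name": "wearable", "power": 1.0, "distance_m": 0.5},
    {"name": "server",   "power": 8.0, "distance_m": 40.0},
]

def allocate(engine_kind):
    # Audio-facing engines favor the machine closest to the speaker;
    # compute-heavy engines favor raw processing power.
    if engine_kind in ("SR", "TTS"):
        return min(machines, key=lambda m: m["distance_m"])
    return max(machines, key=lambda m: m["power"])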
According to embodiments illustrated with respect to Figure 9, a phrase-cache lookup occurs. Phrase caches can be specifically customized for particular local or regional languages, accents, etc. If a spoken word, phrase or other utterance is recognized as being one of the "canned" phrases in the phrase cache, then a "hit" is indicated. Otherwise, upon a "miss", SR/MT translation occurs. The phrase cache is dynamically tuned, according to embodiments of the invention: words, phrases, etc. in the cache can be added, modified, improved, etc., so that over time, in a particular language domain (e.g. particular speech in a particular region), higher and higher hit rates occur in the cache, quality of output phraseology is improved, and/or other advantages accrue. Other features and advantages according to aspects of the invention are apparent from Figure 9.
According to Figure 10 embodiments, a command and control engine, regular dictation engine, or other "agent" is conferenced into a verbal communication path, e.g. an analog, digital or other voice connection. According to the illustrated example, a personal information manager (PIM), personal digital assistant, other computing device or computer, or other e.g. digital-based helper or agent, for example, is commanded to find the telephone number for a third-party individual. The e.g. computer speaks back the telephone number, the spoken command, or other information. The computer or other helper is listening during the conversation over the communication path and participating in that conversation as needed. The helper or other agent is akin to or is in addition to a speech or other engine, listening while a conversation occurs over the communication path. It listens in order to satisfy commands that it picks up as the communication is occurring, for example. The Figure 10 embodiments and other embodiments described herein illustrate the advantages that accrue with the convergence of command and control and manipulation of a computing device, for example, with the speech associated with various forms of normal communication between human beings (e.g. analog cellular phone, digital cellular phone, voice over IP, or other forms of communications). Other features and advantages according to aspects of the invention are apparent from Figure 10.
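A minimal sketch of such a conferenced listening agent follows; the command pattern, the phone_book lookup, and the speak callable are hypothetical illustrations rather than a specification of the Figure 10 embodiment:

# Hypothetical sketch: an agent conferenced into a voice connection,
# scanning the recognized transcript for commands it can satisfy.
import re

PHONE_CMD = re.compile(r"find the telephone number for (?P<name>[\w ]+)", re.I)

def agent_listen(transcript_stream, phone_book, speak):
    for utterance in transcript_stream:   # recognized text, one utterance at a time
        match = PHONE_CMD.search(utterance)
        if match:
            name = match.group("name").strip()
            number = phone_book.get(name, "unknown")
            speak(f"The number for {name} is {number}")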
According to the Figure 11 embodiment, a "broker" such as a recognizer is used to take an utterance from one or more speakers and hand it simultaneously to multiple speech recognition engines (SR). Multiple results are returned, and the best result is chosen. The best result or case can be chosen based e.g. on previous retries, for example, if speaker A is speaking to speaker B and an avenue is provided for speaker A to indicate that the recognition is inaccurate (e.g. a voice input to recognize the phrase "No." or "No, that's not what I said" or "No, that's not what I meant because that's not what I said."). Over time, one or more of the SR engines is judged to be the best for a particular speaker, for example. For example, there may be multiple engines from multiple vendors for a particular language, with some working better than others for speakers with certain dialects or regional accents. The "best" engine also can be determined partially or entirely based on the speed with which a result is returned. Similarly, a translator-broker hands input from the recognizer simultaneously to several machine translation engines (MT). Over time, one or more of the MT engines is judged to be the best and thus is favored.
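A minimal sketch of this broker pattern follows; the engine callables and the per-engine weights, which might be lowered when a speaker indicates a misrecognition, are hypothetical:

# Hypothetical sketch: hand one utterance to several SR engines in
# parallel and keep the result from the engine currently judged best.
from concurrent.futures import ThreadPoolExecutor

def recognize_with_broker(utterance_audio, engines, weights):
    # engines maps a name to a recognition callable; run them simultaneously.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, utterance_audio)
                   for name, fn in engines.items()}
    results = {name: f.result() for name, f in futures.items()}
    best = max(results, key=lambda name: weights[name])
    return results[best]

def penalize(weights, name, amount=0.1):
    # Called when the speaker indicates the recognition was inaccurate,
    # e.g. by saying "No, that's not what I said."
    weights[name] -= amount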
Also according to the Figure 11 embodiment, for example in trials or during initialization, the results of translation can be recorded and analyzed post facto, and correlated with the engine combinations that were used: e.g. SR 1 with MT 1, SR 1 with MT 2, SR 2 with MT 3, etc., through all combinations or some subset of combinations. The best combinations of engines can be determined for a particular speaker-listener pair/combination, or other variable. A human domain expert can be used to conduct the analysis, for example. The resulting data/analysis can be added back into the scoring/weighting to determine which engine or combination of engines to use in a particular situation. For example, the best domain performance occurring within a particular time limit, e.g. a two-second response time, can be chosen. The most accurate engine or combination of engines might be avoided, for example, if it is outside the chosen time bounds. Historically less-accurate results can be favored in order to improve response time. Other variables or factors alternatively can be chosen for a particular environment, language, or other situation.
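By way of illustration, selecting an engine combination from recorded trial data under a response-time bound might be sketched as follows; the trial records and all numbers are invented for the example:

# Hypothetical sketch: choose the most accurate SR/MT combination that
# stays within a chosen response-time bound (two seconds here).
trials = [
    {"combo": ("SR1", "MT1"), "accuracy": 0.95, "seconds": 3.1},
    {"combo": ("SR1", "MT2"), "accuracy": 0.90, "seconds": 1.8},
    {"combo": ("SR2", "MT3"), "accuracy": 0.85, "seconds": 1.2},
]

def best_combo(trials, time_bound=2.0):
    in_bounds = [t for t in trials if t["seconds"] <= time_bound] or trials
    # A historically less-accurate combination may win if the most
    # accurate one falls outside the chosen time bound.
    return max(in_bounds, key=lambda t: t["accuracy"])["combo"]

print(best_combo(trials))  # -> ('SR1', 'MT2')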
Many different options exist according to embodiments of the invention for e.g. mobile translators, mobile computer platforms, operator interfaces, and language translation software. Languages and language pairs to be supported, cost, and user requirements such as acceptable system weight and battery life are flexible and adaptable to suit a variety of operational situations and applications. Plug-and-play compatibility with the most recent speech-recognition and language translation software allows systems according to embodiments of the invention to be easily updated and extended, e.g. to additional languages. According to particular embodiments of the invention, a near real-time, two-way, mobile, lightweight, robust and low-cost multilingual language translation device operates with minimal training and in a hands-free manner.
• Attention is directed to the attached Appendix, which describes additional features and advantages according to embodiments of the invention, along with particular non-limiting examples that can optionally be used according to embodiments of the invention.
• Although certain embodiments of the invention have been described with respect to voice-to-voice translation systems, those reading this application should appreciate that the invention is not limited to those embodiments. Text-to-text and other forms of translation will also benefit from the invention. Digital, analog, cellular, voice over IP, and other forms of communication are contemplated for use according to embodiments of the invention. Multiple engines optionally are associated with a single computer or computer system for language translation and/or language interface or communication, according to embodiments of the invention. Engines according to embodiments of the invention are optionally software-only or optionally are hard-coded in ASICs, DSPs, or other hardware. Upon reading this application, those of ordinary skill in the art will appreciate many other variants and changes that are to be considered within the scope of the invention, and that the invention is not necessarily to be considered limited to the specific embodiments disclosed herein.
A. Project Summary
ViA Team Mission Statement
To develop a near real-time, two-way, mobile, lightweight, robust and low-cost multilingual language translation device that can be operated with minimal training in a hands-free manner.
The System Requirements document that was generated at the beginning of this project is included in this document as Appendix A. This document describes the functionality that was required of the Language Translator.
Bi-directional voice-to-voice language translation for continuous speech is functional for the German/English language pair. The delay time between speaking a phrase and the translated phrase being "spoken" by the system is approximately six seconds.
Hardware for the system: a ViA II 180 MHz wearable computer with 64 MB of RAM and a 3.2 GB hard drive, a Tandy 20-009 Clip-on speaker and, depending on the noise environment in which the translator was being used, a Shure Prologue directional microphone or a Telex Verba headset. The computer weighs approximately 22 ounces and is worn around the waist. It is powered by two 15-ounce batteries, with a run time of approximately four hours on a single charge. The software elements include Lernout & Hauspie's VoiceXpress speech recognition software, Globalink's Power Translator Professional text-to-text translator, Lernout & Hauspie's TTS3000 speech synthesis software, and ViA's speech engine enhancement code and operator interface and system integration software.
B. System Description
B.1 Overview
ViA developed a near real-time, mobile, lightweight, robust and low-cost language translation system that can be operated with minimal training in a hands-free manner. This system, which is shown in Figure 1, supports the English/German language pair. A listing of the individual components is provided in Table 1. The system was designed so it can be readily expanded to include additional language pairs.
Figure 1 - Language Translation System with Telex Headset
Table 1 - System Components
B.2 Mobile Computer
The ViA II computer selected for the language translator is shown in Figure 1. The ViA II consists of two modules connected with a flexible circuit. It is approximately 9 % inches in length, 3 inches in height, and one inch thick. Its total weight, including batteries for four hours of continuous operations, is approximately 3.7 pounds. Interfaces on the ViA II include USB, PS/2, serial, two PC Card slots and an AC/DC power port. The ViA II also has a docking station that allows the computer to be used as a desktop PC. This docking station, which is shown in Figure 2, provides standard desktop interfaces (e.g., monitor, speaker, microphone, keyboard, mouse, serial and USB devices).
Figure 2 - ViA II Docking Station
B.3 Battery System
Two Molicel ME202 Lithium-Ion rechargeable batteries are used to power the language translator system. These batteries, which are shown in Figure 3, provide approximately four hours of continuous operation. In normal operation, the system is not continuously used. The built-in power management cycles down the CPU into a stand-by mode when speech recognition is not required. Thus, the average battery life will be greater than four hours. In our user tests, the battery life was typically around 5 1/2 hours.
Figure 3 - Molicel ME202 Li-Ion Rechargeable Battery
Figure 4 - Sample Microphone/Speaker Systems
B.4 Microphone
B.4.1 Overview
Since quality of sound capture is one of the most important aspects of speech recognition, selecting an optimal microphone is one of the critical issues in the performance of the Language Translator. To provide a robust voice-to-text capability that will work in all types of environments, an Anti-Noise Canceling (ANC) microphone is needed. These microphones have the ability to separate spoken words from background noise, which dramatically improves the recognition rate of the voice-to-text software.
B.4.2 Evaluation of Microphone Systems
There are several vendors and research groups that provide ANC systems. Some of these systems must be worn close to the mouth, others involve pointing a microphone towards the speaker, while still others attempt to automatically "lock on" to a particular speaker's voice. Several different configurations, such as headsets (e.g., Andrea's ANC-1000), collar mounts (e.g., Labtec's LVA-7370), wrist mounts (e.g., ViA's Wrist Interactive Device), handheld directional units (e.g., Logicon's ABF-4) and intelligent remote microphones were investigated for their suitability in the mobile translator system. Some examples of these systems are provided in Figure 4. It was determined that headset designs provide the best performance in noise canceling. However, this is an unacceptable form factor for many situations, since either two headsets would be needed or the participants would have to share a single unit. Both situations would make the language translator difficult to use. The alternative approach is to use directional microphones, and thus ViA spent considerable effort investigating these types of microphones. There are two approaches that are used for directional microphones: hardware implementations, where multiple microphones are used to determine the direction of the sound and filter out unwanted noise; and pure software approaches, where neural networks are trained to simulate a human's ability to filter out unwanted noise.
B.4.3 Selected Microphone Systems
Six headsets and two directional units were selected for actual testing with the Language Translator system. These tests were conducted in noise environments up to 100 dB. The six headsets were Knowles' VR-3264, Telex's Verba, Andrea's 600 and 601 and VXI's VS4 and Parrot 20. The two handheld directional microphones were Andrea's Far-field Array (which is sold in Logicon's ABF-4 system) and Shure's Prologue.
The best directional microphone was determined to be Andrea's linear array system that is included in Logicon's ABF-4 microphone. This is shown in Figure 6. The ABF-4 is designed to assist hearing-impaired individuals. In its current format, it uses an FM wireless link to connect to a user's hearing aid.
Figure 6 - Logicon's ABF-4 Directional Microphone
B.5 Speaker
B.5.1 Overview
Most vendors include a speaker system with their ANC headset microphone. However, a directional handheld microphone is the best solution for an unobtrusive interface (see Section CA2), and thus a separate speaker module needed to be selected. The speaker system must provide clear audio of spoken words, be small and lightweight, be robust enough to survive outdoor use (e.g., water and dust resistant), be low in cost and not require high power.
B.5.2 Evaluation of Speaker Systems
Most of the portable speaker systems designed for computers do not have an acceptable form factor for the Language Translator system. They are either designed to be placed on a flat table-top or attached to the edge of a laptop. Some systems with an acceptable form factor, such as HyperSpectral's piezo-electric speaker system, were determined to be unsuitable because of their high power requirements. Four systems were selected for potential use in the Language Translator: Tandy's 20-009 Clip-on, Pryme's SMP-100, Kodel's FlatOut Traveler, and Mouser's Mylar 253-5008. The Mouser speaker is the best-suited design for the Language Translator (see www.mouser.com for further information). It provides sufficient frequency response for a normal speaking voice (550 Hz to 7 kHz), requires low power (100 mW) and is very small in size (0.8" diameter and 0.1" depth).
B.6 Display
B.6.1 Overview
There are several display options that were investigated for use with the mobile translator system. The goal was to make these interfaces unobtrusive (e.g., lightweight, comfortable, easy to access, etc.). In the system, the user needs to use a display and/or keyboard to configure the applications. One example is selecting which gender of voice to use for the speech synthesis software. Thus, in addition to the required microphone and speaker system, some form of display is needed. Interfaces that were investigated include pocketable systems and wireless wrist-mounted designs. Each of these interfaces is described in the following paragraphs.
Note that having a display allows the mobile translator system to be also used as a common PC. Documents can be viewed and edited, databases such as phone numbers
accessed, email exchanged and the web accessed using wireless modems, all providing a multi-dimensional benefit to using the mobile translator.
B.6.2 Pocket-Sized Touchscreen Displays
ViA's current touchscreen displays, which work very well for detailed images such as diagrams and maps, are approximately 8.5" x 5" x 0.75". One such display, with a 6.5" screen, is shown in Figure 9. An 8.4" unit with a highly reflective color display that will be readable in bright sunlight is currently being developed and will be commercially available. A linear array microphone could be embedded into such a display to provide mounting for the microphone/speaker. ViA also has developed a prototype pocketable display called the Optical Viewer. This display is shown in Figure 10. When positioned approximately one inch from the eye, this system provides the equivalent viewing capabilities of a 17" diagonal desktop display.
B.6.3 Wrist-Mounted Displays
ViA is developing a wireless wrist-mounted interface. The system, shown in Figure 11, uses a low power RF interface to communicate from the wrist to the "belt" (a wearable computer). The screen itself will be readable in bright sunlight. The microphone/speaker will be embedded directly into the device. This system is expected to be available
Figure 9 - Touchscreen Display
Figure 10 - Optical Viewer
Figure 11 - Wrist Interactive Device
B.6.4 Selection of Display:
The display that was selected for the Language Translator is ViA's 6.5" VGA-color touchscreen. This selection will be replaced by other displays as they become commercially available. Some of the potential candidates include the previously mentioned highly reflective 8.4" VGA color touchscreen, the Wrist Interactive Device and/or a new wireless 4" display that ViA is just beginning to develop.
B.7 Operator Interface
ViA developed two operator interface modes for the Language Translation system: touchscreen and voice. An example of the touchscreen interface is provided in Figure 12. The touchscreen interface is used to configure the system to the desired parameters and to view the transcription of the spoken words. Parameters that can be set with this interface include selecting previously entered user profiles (e.g., age, native language, etc.), selecting the direction of the translation (e.g., from German into English or from English into German) and changing the gender of the computer-synthesized audio output. The need for each of these three elements became evident during our user trials and commercialization studies and, thus, they were included in the resulting system.
Figure 12 — Touchscreen Interface Showing German Translation of a Phrase Spoken in English
The voice-based interface for the system includes processing any spoken phrase and listening for a pause of greater than three seconds. The pause informs the system that the user is finished speaking and that the translator should start processing the speech. All of the parameters that can be set using the touchscreen will also be accessible using just voice. For example, you will be able to say "set the voice to female" to select the type of voice that the computer should synthesize in generating the output wavefiles.
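A minimal sketch of this pause-triggered processing follows; the frame source, the RMS energy measure, and the silence threshold are hypothetical assumptions:

# Hypothetical sketch: accumulate audio until a three-second pause,
# then hand the utterance to the translation pipeline.
import time

PAUSE_SECONDS = 3.0
SILENCE_THRESHOLD = 500        # assumed RMS energy floor for "silence"

def listen_until_pause(read_frame, frame_rms):
    frames, silence_started = [], None
    while True:
        frame = read_frame()                 # e.g. one 100 ms microphone buffer
        frames.append(frame)
        if frame_rms(frame) < SILENCE_THRESHOLD:
            if silence_started is None:
                silence_started = time.monotonic()
            if time.monotonic() - silence_started >= PAUSE_SECONDS:
                return frames                # pause detected: start translating
        else:
            silence_started = None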
B.8 Language Translation Software
The system uses a three-step process to produce the audible translation: voice-to-text speech recognition; text-to-text translation; and text-to-speech voice synthesis. Commercially available software for each of these three modules was identified, evaluated and selected for use in the Language Translator. These three modules are enhanced and integrated using a ViA-developed software package. All four of these modules are described in this section.
B.8.1 Voice-to-Text Speech Recognition
B.8.1.1 Overview
In the past two years there has been a significant improvement in the performance of voice engines. This is a result of technological advancements along many fronts, such as the voice recognition algorithms, the processing power of PC platforms and anti-noise canceling microphones. Each of these three areas is being investigated to determine the best voice system to use for the language translator. For the language translator, ViA will be coupling its own voice engine enhancement software with commercially available voice recognition engines. These software packages are described in the following two sections.
B.8.1.2 Evaluation of Commercially Available Voice Engines:
As part of the activities, five commercially available SAPI engines were evaluated: Lernout & Hauspie's (L&H) VoiceXpress, Conversa's Lingo, IBM's ViaVoice, Dragon's Naturally Speaking and Microsoft's Whisper. Each of these systems was rated as to its suitability for the language translator system. The performance parameters included robustness in noisy environments, speed, accuracy, product cost, hardware requirements and, of special importance, the product's ability to support foreign languages.
B.8.2 Text-to-Text Translation
B.8.2.1 Overview:
One of the challenges of text-to-text translation is interpreting the context of the phrase. In order to understand this context, the translator must be familiar not only with both languages, but also with the culture and idioms of each language's country and the vocabulary specific to the topic being discussed. Simply translating on a word-for-word basis often results in a translated sentence that incorrectly states the original meaning. Here are two examples that are commonly referenced:
English: "I am full" (as in, after a good meal) French literal translation: "Je suis plein." Meaning of literal translation: "I am pregnant." English: "I am a Berliner."
German translation without cultural context: "Ich bin ein Berliner."
Meaning of translation: "I am a jelly donut"
Developing software that understands context subtleties is an extremely difficult task. However, there are numerous commercially available software packages that are coming close to making this capability a reality. These software packages can be classified into three areas: Terminology Managers, Machine Translation packages, and Translation Memory software.3 Each category has inherent strengths, shortcomings and price points that make it necessary to do a careful assessment of which technology, or which combination of technologies, is the best solution for the mobile translator. Each of these approaches, plus the opportunity to combine them to form a hybrid system, is discussed in the following paragraphs.
B.8.2.2 Terminology Managers:
One of the difficulties in translation is appropriate handling of industry-specific terminology. For example, the military, legal and medical domains each have significant amounts of terminology that are specific to their applications. Translating these terms to a different target language is often a tedious task of researching the word to determine its meaning. Terminology managers assist with this translation process by providing four elements: a terminology repository, rapid term lookup, automated terminology insertion and terminology extraction.
• Terminology repository: Terminology managers serve as a collection point for gathering and storing domain specific words and their translations.
• Rapid term lookup: Basic terminology managers translate domain-specific words in a unidirectional, one-to-one correspondence. More sophisticated term managers store objects in a "concept" orientation with multilingual mapping in multiple directions. Some allow narrative term definition/description and even the storage of graphics to represent the concept. Searching mechanisms can range from matching on simple word look-up to more advanced approaches that employ "fuzzy" searching techniques looking for matches at a conceptual level.
• Automated terminology insertion: Some terminology managers will insert the translated term into the target document without the need to re-type or cut-and-paste.
• Terminology extraction: Tools with this feature will linguistically analyze source and target documents of previous translations to more easily identify and extract terminology for import into the terminology manager.
2 This of course is the phrase spoken by President Kennedy during the Berlin Crisis. The correct phrase that should have been stated is simply "Ich bin Berliner."
3 Several references repeat this breakdown of translation software technologies. One such source is Language Partners International of Evanston, Illinois. Current commercial Terminology Manager products include L&H's VoiceXpress for Medicine, VoiceXpress for Clinical Reporting, VoiceXpress for Legal and VoiceXpress for Safety, MTX's Termex, Trados's MultiTerm and TTT.
B.8.2.3 Machine Translation Software:
Machine Translation (MT) tools linguistically process source documents to create a translation "from scratch." Up until several years ago, these tools required large mainframe computer platforms for timely execution. However, with recent advances in PC and UNIX based systems, many of these high-end solutions are available in affordable versions with quality and accuracy that compares favorably with their mainframe parents.
Because the linguistic rules for parsing and analyzing source text vary by language, the number of languages supported by MT systems is more limited than other approaches. Additionally, there is a need for a sufficiently large core dictionary for the target language to obtain a minimum level of accuracy/quality. MT solutions are best applied in the following areas:
• "Gisting," where the user would like to understand the general meaning of the text.
• Screening large amounts of documentation in order to identify documents that warrant more accurate human translation.
• Conveying simple instructions or non-complex information.
There are numerous groups, both from industry and academia, performing research and development activities on MT. For example, the University of Maryland's Computational Linguistics and Information Processing Laboratory (CLIP) is developing MT systems targeted towards syntactic realizations and the underlying semantics of words across different languages. In particular, they have developed extensive capabilities in Chinese/English language pairs. This work will improve the robustness of MT systems across multiple language domains. Another effort of note is New Mexico State's Artwork Program. Artwork is investigating the machine translation of spoken dialogue. The focus is developing approaches to providing robustness by exploiting models of the task domain and of conversational interaction, to generate relevant expectations against which the input can be interpreted. This effort may provide a solution for a direct speech-to-speech system in the not-too-distant future. Representative commercial products in this category include Langenscheidt's T1, Globalink's GTS Power Translator, Intergraph Transcend, LOGOS Intelligent Translation System, PC Translator and SYSTRAN Professional for Windows.
B.8.2.4 Translation Memory
Translation Memory (TM) tools are based on the automated re-use of previously translated terms and sentences. These tools assist, rather than replace, the translator. For example, when using a TM-based tool, typically 20-50% or more of a document will require manual translation. With TM tools, the level of benefit is directly proportional to the amount of repetition in the document. Therefore long, technical manuals tend to be good candidates for TM, whereas the use of TM for a mobile language translator is very limited. Thus, TM tools will not be used for the language translator. They will be included in ViA's survey for the sake of completeness, but will not be tested to the same extent as the other software packages. TM tools are especially helpful in translating updated versions of previously translated documents. Other benefits include:
• Better translation consistency across an entire document, especially valuable when multiple translators are involved.
• Ability to begin translation projects before source documents have been frozen.
TM-based systems are less sensitive to language directions than the other approaches and thus a wide range of languages are supported.
The development of efficient TM systems is being conducted both in industry and academia. One such effort is the Deductive and Object-Oriented Databases effort being developed by the University of Toronto's Computer Science department and Computer Systems Research Institute. Representative commercial products in this category include EUROLANG Optimizer, Trados Workbench and IBM Translation Manager.
B.8.2.5 Hybrid Systems
Many vendors are coupling aspects of these three approaches into a single package, such as Langenscheidt's T1 Professional and Transcend Natural Language Translator. Additional new approaches to language translation are being developed using artificial intelligence. For example, L&H is developing neural networks that will perform post-processing of Machine Translations. This capability, if successful, will make significant strides in completely automating the translation process. The neural nets are constructed by comparing the final version of a document that is manually translated by L&H's Mendez division with that of the same document processed by the Machine Translator. By forming this comparison, translation errors are detected and algorithms developed (i.e., a neural net) to automatically perform the post-editing process. Another effort of note is Pangloss, which is being jointly developed by Carnegie-Mellon University, New Mexico State University and the University of Southern California. This system combines three different translation engines to formulate a "best-output" translation. The goal of this effort is to develop software for direct speech-to-speech translation.
Table 4 - Comparison of Language Translation Software
B.8.3 Text-to-Speech
B.8.3.1 Overview
Text-to-speech, also referred to as speech synthesis, is the technology the computer uses to produce the sounds an individual would make if he/she were reading the text aloud. Of all the technologies required for the mobile language translation system, speech synthesis requires the least computing power. There are two basic approaches that are used in speech synthesis: pulling voice wavefiles from a database, and processing text-based command strings. For the former, large wavefile databases are assembled with an entry for each word. If different pronunciations of the word are desired (e.g., a male and a female voice), then multiple entries for each word are required. Examples of this type of speech synthesis approach include IBM's ViaVoice Outloud software and Talx's TalxWare. The alternate approach, called "formant synthesis," uses a mathematical model of the human vocal tract to reproduce the correct sounds. The technology is based on parameterized segment concatenation algorithms, where human voice samples such as diphones, triphones and tetraphones are stored and used to convert the text into speech. In-depth linguistic processing is used to intelligently convert spoken text to its correct pronunciation, combined with advanced prosody rules that provide natural-sounding intonation. An example of this technology is L&H's TruVoice TTS3000/M software. This approach saves disk space, at the expense of increased computational requirements. Both of these approaches were investigated to determine which one is best suited for use in the mobile translation system.
With a SAPI engine, or even one that partially supported SAPI, the language translator would be free to choose any SAPI-compliant voice engine. For that reason, ViA has chosen the Lernout & Hauspie TTS3000 to be the voice synthesis engine for this project, and recognizes that a trade has been made for flexibility versus immediate performance. The long-term goal will be to integrate Lernout & Hauspie's RealSpeak product (as availability will dictate).
B.9 System Robustness, Speed and Accuracy
One of the key requirements that was identified during our user interviews was the need for accuracy and the ability to use the system in noisy environments (e.g., in airports, on sidewalks, in factories, etc.). To improve the functionality of the Language Translator in such environments, and to increase the speed of the translation process, ViA integrated software into the Translator. The package improves the performance of any Speech Application Programming Interface (SAPI) compliant voice engine by providing the following capabilities:
• Concurrent Multiple Dictionary Referencing: each context has a set of associated dictionaries. Therefore, when a given context is enabled, all of the necessary dictionaries are loaded, enabled and compiled. By using this approach, all of the required dictionaries are pre-processed, improving the overall speed of the process. This multiple dictionary capability is required for the direct voice-to-voice language translation system that will be developed.
• Automatic Gain Control: a volume control routine provides instantaneous compensation for ambient background noise. This allows the language translator to be used in noisy environments (e.g., outdoors, in airports, etc.).
• Echo Canceling: automatic switching between full- and half-duplex modes of operation provides improved echo canceling over other commercially available products. This further enhances the robustness of the speech recognition software.
• Remote Program Support Tools: a web-based format allows remote loading of data and new program files. Thus, if additional words need to be added to a dictionary (e.g., words specific to a particular application), this can be accomplished wirelessly in a mode that is transparent to the operator.
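By way of illustration only, a minimal sketch of the dictionary-preloading idea behind Concurrent Multiple Dictionary Referencing; the context names, file names and RecognizerContext class are hypothetical, and the actual SAPI compilation step is reduced to a placeholder.

```python
# Hypothetical context-to-dictionary mapping; real deployments would supply
# application-specific dictionaries.
CONTEXT_DICTIONARIES = {
    "airport": ["travel_terms.dic", "city_names.dic"],
    "medical": ["anatomy.dic", "pharmacology.dic"],
}


class RecognizerContext:
    def __init__(self):
        self.compiled = {}  # dictionary path -> compiled form

    def enable(self, context: str) -> None:
        # Load, enable and compile every dictionary tied to the context up
        # front, so recognition never stalls on a lazy load.
        for path in CONTEXT_DICTIONARIES.get(context, []):
            if path not in self.compiled:
                self.compiled[path] = self._load_and_compile(path)

    def _load_and_compile(self, path: str) -> str:
        # Placeholder for the SAPI engine's dictionary-compilation step.
        return f"compiled<{path}>"
```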
B.10 System Integration

ViA's investigation of system integration techniques led to two different methodologies. The first approach was to use an intermediate application to transfer results from one step to the next. Most voice engines and translation engines offer integration support for Microsoft Word and Corel WordPerfect. With this approach, ViA would have needed to develop an intermediate application that watches any active Microsoft Word document for incoming text and then passes the text to the translation process. This approach has two serious drawbacks. The first arises from the intermediate application itself: if the translator runs through Microsoft Word, the speed of the application is severely limited. Secondly, no text-to-speech engines offer support outside of specialized programmer interfaces. As a result, one-third of the application (the speech synthesis audio output) would not be well served by this approach. Because of these issues, this approach was not implemented.
The second method, which was the selected approach, takes the text directly from one software development kit, passes it to the next, and continues the process until all steps have been completed. The resulting application integrates all three software development kits (voice recognition, translation and text-to-speech) into a single seamless environment. To maximize the usability of this system, the software supports both single and multiple platforms. Each system is loaded with all three software development kits. In stand-alone mode, the application allows users to dictate any amount of text; after a three-second pause between sentences, the translation begins. The touchscreen interface is used to change the direction of translation, for example, to go from German to English rather than English to German. Voice recognition software will be used to determine who is speaking (e.g., the German speaker or the English speaker), and the system will then automatically determine the direction to use for translation.
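The chained-SDK flow can be illustrated with a minimal sketch; the recognizer, translator and synthesizer objects are assumed interfaces standing in for the three software development kits, not actual product APIs.

```python
import time

PAUSE_SECONDS = 3.0  # pause between sentences that triggers translation


def run_pipeline(recognizer, translator, synthesizer, src="en", dst="de"):
    """Hand recognized text directly to translation, then to synthesis."""
    buffer, last_heard = [], time.monotonic()
    while True:
        text = recognizer.poll()  # assumed: returns "" when nothing new
        if text:
            buffer.append(text)
            last_heard = time.monotonic()
        elif buffer and time.monotonic() - last_heard >= PAUSE_SECONDS:
            # Pause detected: translate the buffered sentence and speak it.
            translated = translator.translate(" ".join(buffer), src, dst)
            synthesizer.speak(translated, language=dst)
            buffer.clear()
```

The point of the sketch is that no intermediate application (such as a word processor) ever holds the text; each engine's output feeds the next directly.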
The distributed approach reduces the response time of the system. As a user speaks, all systems within network range of the primary machine receive the untranslated text. In this approach, the receiver is responsible for translation and playback, which allows near real-time use of the system and lets multiple users speak at once.
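A minimal sketch of the distributed mode follows, assuming UDP broadcast on a local network; the port, message format and the translator/synthesizer interfaces are illustrative inventions, not the deployed protocol.

```python
import json
import socket

BROADCAST_ADDR, PORT = "255.255.255.255", 50007  # assumed values


def broadcast_utterance(text: str, source_lang: str) -> None:
    """Speaker side: send the untranslated text to every unit in range."""
    msg = json.dumps({"text": text, "lang": source_lang}).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(msg, (BROADCAST_ADDR, PORT))


def receive_and_translate(translator, synthesizer, my_lang: str) -> None:
    """Receiver side: each unit translates into its own language and plays back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("", PORT))
        while True:
            payload, _ = s.recvfrom(4096)
            msg = json.loads(payload)
            translated = translator.translate(msg["text"], msg["lang"], my_lang)
            synthesizer.speak(translated, language=my_lang)
```

Because each receiver translates locally, the translation load is spread across the network rather than concentrated on the speaker's machine.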
D.1 Overview
- Support of at least six bi-directional language pairs (suggested languages are German, Spanish, Italian, Portuguese, French and Cantonese Chinese)
- Support of single-direction translation for Japanese, Arabic, and Korean. In the future, this will be extended to bi-directional support.
- Multi-platform support. This includes the ability to run the translator using wireless connectivity between multiple platforms and to have simultaneous translation of original spoken phrases into multiple languages.
- Increased speed and accuracy. This will be partially accomplished by identifying frequently used phrases and entering these into the speech engine context, and by developing the capability for end users to easily enter application-specific terminology into the translator.
- Improved microphone and speaker capabilities, including support for dual microphones for a single platform.
- Improvements in the touchscreen interface.
- A voice-based interface that has the same functionality as the touchscreen interface. In other words, every command that is accessible by using the touchscreen, will also be accessible using voice commands.
- Use of server clusters for translation.
- The use of voice to identify speakers.
- Automatic generation of typed transcripts of conversations.
D.4.3 Wireless Communications
One of the results of the user interviews was the desire to support multiple users, rather than just two individuals sharing a common platform. This type of situation arises in a conference-room setting, where multiple people may speak. Rather than passing the microphone to each individual, it is desirable to have multiple units, preferably one for each participant in the room. Several items need to be developed to make such a capability a reality, one of which is wireless communication between these devices.
The ViA II has two PC Card slots, so integrating wireless communications is not a significant issue; it is a low-risk task, and ViA has frequently deployed systems that use such a capability. Wireless communication links are typically used only to send data. In this application, the communication link will be used to transmit both data and voice.
D.5.1 Multi-platform Wireless Communications
The multi-platform software allows multiple users to participate in a Language Translation session. Each user will have the ability to select the group of people he/she wishes to speak with. This may be the entire group of people within range of the wireless communications link, or it may be a small subset. Selecting these people can be done using either the touchscreen or by naming them by voice. If the touchscreen is being used, then a picture of the individual who is currently talking will be displayed. Another option, which will be user selectable, is to have the individual's name spoken each time a phrase he or she has stated is translated and synthesized for output. For example, each time "Chris" states a phrase, the word "Chris" would be inserted at the beginning of the resulting wavefile.
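The speaker-announcement option might be sketched as follows; the synthesizer.render interface and the raw byte concatenation (which ignores wavefile headers) are simplifying assumptions.

```python
def build_output_wave(speaker: str, translated_text: str, synthesizer,
                      announce_speaker: bool = True) -> bytes:
    """Optionally prepend the speaker's name to the translated phrase.

    E.g., when Chris speaks, the output begins with the word "Chris"."""
    parts = []
    if announce_speaker:
        parts.append(synthesizer.render(speaker))         # spoken name
    parts.append(synthesizer.render(translated_text))     # translated phrase
    # A real implementation would merge wavefile headers; this sketch simply
    # concatenates the audio payloads.
    return b"".join(parts)
```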
Another feature that will be developed in this effort is to provide an encrypted communications link. This will ensure that other people cannot "listen in" on private conversations.
D.5.2 Simultaneous Translation into Multiple Languages
This feature will allow the Language Translator to be used in environments where multiple languages are being spoken; one such example is the United Nations. To accomplish this, the Multi-platform Wireless Communications capabilities (see Section D.5.1) will be used to transmit untranslated wavefiles to the remote platforms. The remote platform will then perform the translation and voice synthesis tasks.

D.5.4 Server Cluster Translation Capability
Objective: To allow remote servers to perform the text-to-text translation.
During these activities, several text-to-text language translation research projects were identified. The code being developed by many of these research groups will require a high-performance server to produce near real-time translations. Thus, to leverage the efforts of these research groups, ViA will develop software that captures the phrases spoken, transcribes these phrases into text, wirelessly transmits this text to the server for translation, retrieves the translated text, and finally generates a synthesized voice from this retrieved text. All of this will be performed in a manner that is transparent to the end users.
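For illustration, a possible client-side round trip is sketched below; the server URL, JSON fields and HTTP transport are assumptions, as the actual wireless protocol is not specified here.

```python
import json
from urllib import request


def translate_via_server(phrase_text: str, src: str, dst: str,
                         server_url: str = "http://translation-server/translate") -> str:
    """Ship locally transcribed text to a remote server and return its translation."""
    body = json.dumps({"text": phrase_text, "from": src, "to": dst}).encode("utf-8")
    req = request.Request(server_url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["translation"]
```

The local platform would then hand the returned string to the speech synthesis engine, keeping the entire round trip invisible to the users.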
D.5.7 Voice-based Command Set Interface
The use of a touchscreen increases the size, weight, power consumption and cost of the system. For many users, the benefit of using a touchscreen (e.g., the ability to show users an image as part of their discussions) will not offset these costs. Thus, ViA will also develop a voice-only interface to the system. The features described throughout Section D, with the exception of showing an image of the individual speaking, will be included in this interface. Parameters will be set by stating voice commands, for example, "set the language pair to English and Chinese."
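A minimal sketch of parsing such a voice command is given below; the command pattern and language table are illustrative only.

```python
import re

LANGS = {"english", "german", "spanish", "italian", "portuguese",
         "french", "chinese", "japanese", "korean", "arabic"}


def parse_language_pair(command: str):
    """Recognize commands such as 'set the language pair to English and Chinese'."""
    m = re.match(r"set the language pair to (\w+) and (\w+)", command.lower())
    if m and {m.group(1), m.group(2)} <= LANGS:
        return m.group(1), m.group(2)
    return None  # not a recognized parameter command
```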
D.5.8 Voice Recognition of Speakers
Objective: To automatically recognize who is speaking.
The approach used to identify the speaker in the demonstration system was to have each individual take turns speaking a phrase. This allowed the system to be rapidly developed and demonstrated. However, having each individual alternate is not how actual conversations typically proceed; often one user will stop speaking and, after pausing to think for a moment, continue talking. Thus, an alternative approach needs to be implemented in which a set sequence is not predefined.
There are several approaches to recognizing which speaker is talking at any given moment. Three low-to-medium-risk methods are to have each participant wear his/her own platform; to have each participant speak into his/her own microphone but process the information on a single platform; or to use a single platform with a single directional microphone and have the operator press a button located on the microphone to indicate who is speaking. A high-risk approach, but one that would provide an extremely easy-to-use interface, is to use voice prints to identify which speaker is talking. In this approach, a new user would need to register his/her voice by saying a simple phrase such as, "Hello, my name is Chris." From then on, the computer would be able to recognize this individual.
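The voice-print approach could be organized along the following lines; the embedding function, similarity threshold and registry class are all hypothetical components, not a claimed implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoicePrintRegistry:
    def __init__(self, embed, threshold: float = 0.8):
        self.embed = embed          # assumed: audio bytes -> feature vector
        self.prints = {}            # name -> enrollment vector
        self.threshold = threshold  # assumed acceptance threshold

    def enroll(self, name: str, audio: bytes) -> None:
        # One-time registration, e.g. the user says "Hello, my name is Chris."
        self.prints[name] = self.embed(audio)

    def identify(self, audio: bytes):
        """Return the best-matching enrolled speaker, or None if unknown."""
        vec = self.embed(audio)
        scored = [(cosine(vec, p), name) for name, p in self.prints.items()]
        if not scored:
            return None
        score, name = max(scored)
        return name if score >= self.threshold else None
```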
D.5.9 Automatic Generation of Typed Transcripts of Conversations
Objective: Develop the capability to automatically generate typed transcripts of conversations performed using the Language Translator.
One of the features of the approach that ViA is using is that it will be possible to generate a typed transcription of each conversation. Included in this transcription will be who was speaking, what was said, and when the conversation occurred. This, of course, may be beneficial even in conversations where translation is not occurring; thus, an English-to-English mode will also be included.
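As an illustration, transcript entries recording who, what and when might be logged as follows; the file format is an assumption.

```python
from datetime import datetime


def append_transcript(path: str, speaker: str, text: str) -> None:
    """Append one conversation entry: timestamp, speaker, and what was said."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"[{stamp}] {speaker}: {text}\n")
```

In English-to-English mode the same routine would simply log the recognized text without a translation step.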
D.6 Language Translation Software
Objective: To provide accurate and near real-time voice-to-voice language translation for multiple language pairs.
The system produced provides voice-to-voice translation for English/German. This language pair was selected to demonstrate the feasibility of producing a voice-to-voice language translator. During demonstrations of this system, potential users were asked which languages they would like to see included. As a result of these interviews, and of research performed on the commercial potential of a language translator, eight additional languages were selected. These languages, in addition to German, are Spanish, Italian, Portuguese, French, Chinese (Cantonese), Japanese, Korean and Arabic. Each of these languages will be paired with English.

D.6.1 Speech Recognition
Objective: Provide accurate speech recognition of phrases spoken in seven different languages.
ViA's software is designed to accept any SAPI (Speech Application Programmers Interface) compliant speech engine.
D.6.2 Translation Software
Objective: Provide fast and accurate translations of text between English and nine other languages.
ViA's software is designed to accept any translation software that is compatible with PCs. This will be expanded to include translation software that can run on any computing platform, as long as the server is within range of the wireless communications link. Because the performance of translation software is rapidly improving, and the program will support ten languages instead of just two, this evaluation process will need to be expanded to additional languages and repeated for both German and English to ensure that the best commercial products are being used.
D.6.3 Speech Synthesis
Objective: Provide fast, high-clarity speech synthesis for ten different languages.
ViA's software is designed to accept any speech synthesis software that is SAPI (Speech Application Programmers Interface) compliant. The number of languages to be supported will increase from German and English to also include Spanish, Italian, Portuguese, French, Chinese, Japanese, Korean and Arabic. An evaluation of speech synthesis packages for German and English was performed, with TTS3000 being selected. Because the performance of speech synthesis software is rapidly improving, and the program will support ten languages instead of just two, this evaluation process will need to be expanded to additional languages and repeated for both German and English to ensure that the best commercial products are being used.
Another requirement in ViA's design of the system is that each dictation engine keeps an accurate profile of the user's age bracket and gender, which the sound of the synthesized voice would ideally reflect. Thus, the text-to-speech engine should, at a minimum, support both male- and female-sounding synthesis. This allows some personalization when using the system.
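One possible mapping from the stored profile to a synthesis voice is sketched below; the profile fields and voice names are hypothetical.

```python
# Hypothetical voice catalog keyed by (gender, age bracket).
VOICES = {
    ("female", "adult"): "voice_female_adult",
    ("male", "adult"): "voice_male_adult",
    ("female", "senior"): "voice_female_senior",
    ("male", "senior"): "voice_male_senior",
}


def select_voice(profile: dict) -> str:
    """Pick a synthesis voice that roughly matches the dictation profile."""
    key = (profile.get("gender", "female"), profile.get("age_bracket", "adult"))
    return VOICES.get(key, "voice_female_adult")  # fallback voice
```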
D.7 System Integration
Objective: Integrate all of the components described in Sections D.2-D.6 into a single package that is lightweight, robust, comfortable to wear, easy to use and low in cost.
One of the significant challenges of the system is to enable a seamless transition between all of the software modules. The user should not be required to manually "assist" the software in generating the translated phrases. ViA will build on its success to provide a seamless integration for the expanded system.
A-2.1 Design
The goal of the language translator project is to develop a near real-time, two-way, mobile, lightweight, robust and low-cost multilingual translation device that can be operated in a hands-free manner.
A.2.2 Specific Design

A.2.2.1 Usage

a. Upon installation, a brief voice-profile training process must be undergone in order to guarantee accurate recognition.
b. A user profile will also be configured that will include an approximate age group for the user, as well as gender. This will increase the recognition capabilities.
c. The user speaking the native language (herein referred to as the primary user) will speak either English or German.
d. The computer will then receive the spoken data and, with no interaction from either the primary user or the user who desires the translated text (herein referred to as the secondary user), translate the recognized data into the other language of the pair (English <--> German).
e. Upon successful translation, the language translator will then speak the translated data to the secondary user, in the translated language, using a voice synthesis product.
f. The system will be full duplex; therefore, either user could speak as they receive a translated voice response.
A.2.2.2 Time Specifications

a. After the primary user speaks a phrase or sentence, the translation will begin either after the user voices an end-of-sentence sentinel ("period," "question mark," "exclamation point") or after a two-second pause (a sketch of this end-of-utterance rule follows this list).
b. When the translation begins, all external processing will cease in order to facilitate a quick translation. Since a machine translation approach is being used, a single sentence could take between 5 and 10 seconds to translate.
c. Upon translation, the text will immediately begin the synthesis and playback process.
d. All of these described steps will take place with no user interaction.
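A sketch of the end-of-utterance rule in item (a) follows; the token interface and timing source are assumptions, and multi-word sentinels are assumed to arrive as single joined tokens.

```python
import time

SENTINELS = {"period", "question mark", "exclamation point"}
PAUSE_SECONDS = 2.0


def utterance_complete(words, last_word_time: float) -> bool:
    """True when translation should begin for the buffered words."""
    # End on a spoken end-of-sentence sentinel...
    if words and words[-1].lower() in SENTINELS:
        return True
    # ...or after a two-second silence following the last recognized word.
    return bool(words) and (time.monotonic() - last_word_time) >= PAUSE_SECONDS
```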
A.2.2.3 Audio Head-set

a. Since the system is designed to be mobile, external, battery-powered speakers will be used to broadcast the translated speech.
b. A mobile array microphone will be used to facilitate a more natural mobile environment.
A.2.2.4 Hardware Platform

a. The system will be robust enough, and optimized, to run on a combination of multiple ViA II computers, but ideally will run locally on a single machine.
B. Project Summary
Technical Abstract:
Mission Statement
To develop and deploy an easy-to-use, near real-time, two-way, mobile, lightweight, robust and low-cost language translation device that can simultaneously support multiple languages, produce text transcriptions of spoken conversations, support full-duplex audio, and wirelessly connect to remote servers and platforms.
Enhancements will include the following:
Support of at least six bi-directional language pairs (suggested languages are German, Spanish, Italian, Portuguese, French and Cantonese Chinese)
Support of single-direction translation for Japanese, Arabic and Korean.
Multi-platform support. This includes the ability to run the translator using wireless connectivity between multiple platforms and to have simultaneous translation of original spoken phrases into multiple languages.
Increased speed and accuracy.
Improved microphone and speaker capabilities.
Improvements in the touchscreen interface.
A voice-based interface that has the same functionality as the touchscreen interface.
Use of server clusters for translation.
The use of voice to identify speakers.
Automatic generation of typed transcripts of conversations.
Anticipated Benefits/Potential Commercial Applications of the Research or Development:
Applications include all individuals who require multi-lingual capabilities. The mobile translator will benefit a wide range of individuals, including military personnel, airport employees, border patrol and customs agents, police, fire fighters, retail clerks, bank tellers, delivery personnel, phone operators, tourists, and any industry that sells, develops or manufactures products to/in global markets or employs individuals who do not speak the native language.
Key words: Language Translation, Hands-free Interface, Voice Recognition, Wearable Computers

Anticipated benefits include:
1. Improved capabilities and lower operational costs for all applications that require multi-lingual capabilities. This includes a wide range of individuals such as military personnel, government workers, airport employees, border patrol and customs agents, police, fire fighters, retail clerks, bank tellers, delivery personnel, phone operators, tourists, and any industry that sells, develops or manufactures products to/in global markets or employs individuals who do not speak the native language.
2. The mobile translator system provides a common platform, based entirely on established PC standards, upon which researchers, developers and application designers can easily integrate their particular element with the overall system. Thus, as new language translation technologies, both software and hardware, become available, they can be readily integrated with the mobile translator.
3. By using the ViA wearable PC, a continued path for incorporating the latest advances in PC technology is provided. This ensures that the design will not become outdated.
4. The computer for the mobile translator is a full function PC that can be worn to provide a mobile solution, or placed in a dock to provide a desktop solution. Thus, the system will be multifunctional - meeting the user's computing needs well beyond that of the language translation capability.

Claims

WHAT IS CLAIMED IS:
1. Translation systems and methods as illustrated and/or described herein, including speech-to-speech, text-to-text, speech-to-text and/or text-to-speech translation systems, devices, engines, applications and associated methods.
2. A method of facilitating translation between language X and language Y, including:
(a) receiving input in language X;
(b) translating language-X input data, based on the language-X input, to language-Y output data;
(c) using at least one first processor in connection with translating the language-X input;
(d) providing output in language Y, based on the language-Y output data;
(e) receiving input in language Y;
(f) translating language-Y input data, based on the language-Y input, to language-X output data;
(g) using at least one second processor in connection with translating the language-Y input;
(h) providing output in language X, based on the language-X output data; and selectively shifting processing tasks or portions of processing tasks between the at least one first processor and the at least one second processor.
3. The method of claim 2, wherein the language-X input is audible input; further wherein the language-Y output is audible output.
4. The method of claim 3, wherein the language-Y input is audible input; further wherein the language-X output is audible output.
5. The method of claim 2, wherein the at least one first processor and the at least one second processor are respectively disposed within mobile and/or wearable computers.
6. The method of claim 2, wherein at least one of the first and second processors is part of a central server.
7. The method of claim 2, wherein at least one of the first and second processors is part of a remote server.
8. A central server or a remote server implementing the method of claim 6 or claim 7, respectively.
9. The method of claim 2, wherein the shifting occurs via wireless communication.
10. A wireless communications network implementing the method of claim 9.
11. A network implementing the method of claim 2.
12. The method of claim 2, wherein the shifting of processing tasks is based on processing demands on the first and second processing devices and/or availability of processing time.
13. A method of distributing processing tasks associated with a voice-to-voice translation system, the method comprising providing two wearable computers, each associated with the speaker of a different language, and sharing translation-processing tasks, or portions of translation-processing tasks, selectively between the two wearable computers based on processing demand on each wearable computer and/or availability of processing time.
14. The method of claim 13, associated with text-to-text, speech-to-text and/or text-to-speech translation systems instead of a voice-to-voice translation system.
15. The method of claim 13, wherein the sharing of translation-processing tasks occurs between one or both wearable computers and a central or remote server, personal computer or other processing device; further wherein one or more of the computers optionally are non-wearable or relatively non-mobile.
16. A network, one or more servers, and/or one or more computers or processors for implementing the method of claim 13.
17. One or more disks, CD-ROMs, PC cards, other modular elements or other computer-readable media for implementing the methods of any of the above method claims.
18. Processing-task coordination and switching means, substantially as shown or described herein, for sharing translation-processing tasks, or portions of processing tasks, between processors, servers, computers, wearable computers, or other devices.
19. Hybrid translation systems, networks, computers, processors and/or devices, and associated methods, wherein a phrase or other language-component database is used for a first attempt at translation, and a dictation/translation engine is used, with optional voice models/dictionaries, for translation if the first attempt yields less-than-desired results.
20. Devices and methods according to claim 19, wherein an audio input stream is split such that conversational speech can be translated while scanning for and acting upon voice commands, implemented with e.g. an audio splitter in connection with handling simultaneous speech recognition (SR), automatic speech recognition (ASR), and optionally other processes/engines.
PCT/US2006/002838 2005-02-01 2006-01-27 Language engine coordination and switching WO2006083690A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US64889505P 2005-02-01 2005-02-01
US60/648,895 2005-02-01

Publications (3)

Publication Number Publication Date
WO2006083690A2 true WO2006083690A2 (en) 2006-08-10
WO2006083690A3 WO2006083690A3 (en) 2006-10-12
WO2006083690A9 WO2006083690A9 (en) 2007-02-08

Family

ID=36777781

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/002838 WO2006083690A2 (en) 2005-02-01 2006-01-27 Language engine coordination and switching

Country Status (1)

Country Link
WO (1) WO2006083690A2 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5384701A (en) * 1986-10-03 1995-01-24 British Telecommunications Public Limited Company Language translation system
US6356865B1 (en) * 1999-01-29 2002-03-12 Sony Corporation Method and apparatus for performing spoken language translation
US20030216919A1 (en) * 2002-05-13 2003-11-20 Roushar Joseph C. Multi-dimensional method and apparatus for automated language interpretation
US20040122677A1 (en) * 2002-12-23 2004-06-24 Lee Sung-Joo Telephony user interface system for automatic speech-to-speech translation service and controlling method thereof

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2485212A4 (en) * 2009-10-02 2016-12-07 Nat Inst Inf & Comm Tech Speech translation system, first terminal device, speech recognition server device, translation server device, and speech synthesis server device
EP2601596B1 (en) * 2010-08-05 2020-03-25 Google LLC Translating languages
CN108337380B (en) * 2011-09-30 2022-08-19 苹果公司 Automatically adjusting user interface for hands-free interaction
CN108337380A (en) * 2011-09-30 2018-07-27 苹果公司 Adjust automatically user interface is for hands-free interaction
WO2014023308A1 (en) 2012-08-06 2014-02-13 Axel Reddehase Method and system for providing a translation of a voice content from a first audio signal
DE102012213914A1 (en) * 2012-08-06 2014-05-28 Axel Reddehase A method and system for providing a translation of a speech content from a first audio signal
RU2625020C1 (en) * 2013-06-18 2017-07-11 Общество с ограниченной ответственностью "Аби Девелопмент" Devices and methods, which prepare parametered symbols for transforming images of documents into electronic documents
US9183831B2 (en) 2014-03-27 2015-11-10 International Business Machines Corporation Text-to-speech for digital literature
US9330657B2 (en) 2014-03-27 2016-05-03 International Business Machines Corporation Text-to-speech for digital literature
CN113553862A (en) * 2016-09-26 2021-10-26 谷歌有限责任公司 Neural machine translation system
US11250835B2 (en) 2016-12-22 2022-02-15 Volkswagen Aktiengesellschaft Audio response voice of a voice control system
CN113450785B (en) * 2020-03-09 2023-12-19 上海擎感智能科技有限公司 Implementation method, system, medium and cloud server for vehicle-mounted voice processing
CN113450785A (en) * 2020-03-09 2021-09-28 上海擎感智能科技有限公司 Implementation method, system, medium and cloud server for vehicle-mounted voice processing
CN112818707A (en) * 2021-01-19 2021-05-18 传神语联网网络科技股份有限公司 Multi-turn engine cooperative speech translation system and method based on reverse text consensus
CN112818707B (en) * 2021-01-19 2024-02-27 传神语联网网络科技股份有限公司 Reverse text consensus-based multi-turn engine collaborative speech translation system and method
EP4362439A4 (en) * 2021-10-15 2024-10-23 Samsung Electronics Co Ltd Electronic device and control method therefor
CN115662437B (en) * 2022-12-28 2023-04-18 广东保伦电子股份有限公司 Voice transcription method under scene of simultaneous use of multiple microphones
CN115662437A (en) * 2022-12-28 2023-01-31 广州市保伦电子有限公司 Voice transcription method under scene of simultaneous use of multiple microphones

Also Published As

Publication number Publication date
WO2006083690A9 (en) 2007-02-08
WO2006083690A3 (en) 2006-10-12

Similar Documents

Publication Publication Date Title
WO2006083690A2 (en) Language engine coordination and switching
US7860705B2 (en) Methods and apparatus for context adaptation of speech-to-speech translation systems
Nakamura Overcoming the language barrier with speech translation technology
KR101120710B1 (en) Front-end architecture for a multilingual text-to-speech system
JP2004355118A (en) Communication support device, support method and support program
Ming et al. A light-weight method of building an LSTM-RNN-based bilingual TTS system
CN103810993B (en) Text phonetic notation method and device
Suebvisai et al. Thai automatic speech recognition
Prasad et al. BBN TransTalk: Robust multilingual two-way speech-to-speech translation for mobile platforms
Kurian et al. Indian language screen readers and syllable based festival text-to-speech synthesis system
Takezawa et al. Collecting machine-translation-aided bilingual dialogues for corpus-based speech translation.
Ion et al. A dialog manager for micro-worlds
Siminyu et al. Phoneme recognition through fine tuning of phonetic representations: a case study on Luhya language varieties
Tohyama et al. Ciair simultaneous interpretation corpus
Stallard et al. The BBN transtalk speech-to-speech translation system
Zhang Survey of current speech translation research
Dhawan Speech to speech translation: Challenges and future
Carbonell et al. Language technologies for humanitarian aid
Sarich Development and Fielding of the Phraselator® Phrase Translation System
Aiken et al. Automatic interpretation of English speech
Sangle et al. Speech Synthesis Using Android
Griol et al. Military usages of speech and language technologies: A Review
Žganec-Gros et al. The VoiceTran speech-to-speech communicator
Nacheva Current Perspectives on Linux Accessibility Tools for Visually Impaired Users
Cole et al. Workshop on spoken language understanding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06719622

Country of ref document: EP

Kind code of ref document: A2