WO2003028010A1 - Text-to-speech native coding in a communication system - Google Patents


Info

Publication number
WO2003028010A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
phonics
speech
code table
coded speech
Prior art date
Application number
PCT/US2002/026901
Other languages
French (fr)
Inventor
Bin Wu
Fan He
Original Assignee
Motorola, Inc.
Priority date
Filing date
Publication date
Application filed by Motorola, Inc. filed Critical Motorola, Inc.
Priority to EP02750495A priority Critical patent/EP1479067A4/en
Publication of WO2003028010A1 publication Critical patent/WO2003028010A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Definitions

  • If the counter value is less than the starting address, go back to the previous step and read the next byte/word from the flash memory. If the counter value is equal to or greater than the starting address, compare the counter value with the ending address. If the counter value is less than the ending address, move the data from the microprocessor register into the random memory. If the counter value is greater than the ending address, go back to the previous step and finish reading to the end of the current flash memory sector. In this way, the random memory requirement can be limited to about 200 bytes. Thus, no additional random memory is required for even the simplest cellular phone handsets. In this example, the phonics' digitized audio files are stored in a flash memory, which is accessible on a sector-by-sector basis.
  • For example, the text "ABS C" is translated to a memory location array, {3:210:200, 4:1500:180, 3:1000:150}, where each entry gives a phonic's flash sector, its offset within that sector, and its size.
  • A memory buffer to store the digitized audio is created based upon the total size required, in this case the sum of the three phonics (200+180+150) plus a white noise segment for the space.
  • For each flash page, the memory location array is searched to locate all the audio files stored on that page, in this case A and C, which are then copied to their respective locations in the memory buffer.
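  • The buffer assembly in this example can be sketched as follows. The triplets are the example's own sector:offset:size values; the white-noise gap size and the flash read are illustrative stand-ins, and grouping by sector is one way to realize visiting each flash page once:

```python
# Sketch: size the buffer from the three phonic entries plus a noise
# gap, then copy sector by sector so each flash page is visited once.
entries = [(3, 210, 200), (4, 1500, 180), (3, 1000, 150)]  # sector:offset:size
GAP = 120                                  # white-noise segment (placeholder)

total = sum(size for _, _, size in entries) + GAP
buffer = bytearray(total)

# destination offset of each phonic in the buffer
dests, pos = [], 0
for _, _, size in entries:
    dests.append(pos)
    pos += size

for page in sorted({s for s, _, _ in entries}):     # visit sector 3, then 4
    for (sector, offset, size), dest in zip(entries, dests):
        if sector == page:
            buffer[dest:dest + size] = bytes([page]) * size  # stand-in read
print(total)   # 200 + 180 + 150 + 120 = 650
```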
  • The present invention allows the use of the many communication services having a low data rate text format, such as SMS. This can be used to advantage for real-time driving directions, audio news, weather, location services, and real-time sports or breaking newscasts delivered as text.
  • TTS technology also opens a door for voice game applications in cellular phones at very low cost.
  • TTS can use much lower bandwidth with text-based messaging. It will not load the network or worsen the capacity strain on existing or future cellular networks. Further, the present invention allows incumbent network operators to offer a wide range of value-added services with the text messaging capabilities that already exist in their networks, instead of having to purchase licenses for new bandwidth and invest in new equipment. This also applies to third-party service providers, which, under today's and proposed technologies, face even higher obstacles than network operators in providing any kind of data service to cellular phone users. Since TTS can be used with any standard text messaging service, anyone with access to text-messaging gateways can provide a variety of services to millions of cellular phone users. With the technology and equipment barrier removed, many new business opportunities will open up to independent third-party application providers.
  • The mobile TTS application also requires network server support.
  • The server should be optimized based on the data traffic and the cost per user.
  • The major daily cost of the local server is the data traffic.
  • Low data traffic reduces the server's return on investment and its daily cost.
  • The present invention can increase low data traffic and moderate peak data traffic, since text does not need to be sent "on demand" when data traffic bandwidth may be unavailable, but can wait for a period of lower, available data traffic.

Abstract

A method of converting text to speech in a communication device includes providing (102) a code table containing coded speech parameters. Next steps include inputting (104) a text message into a communication device, and dividing (106) the text message into phonics. A next step includes mapping (108) each of the phonics against the code table to find the coded speech parameters corresponding to each of the phonics. A next step includes processing (110) the coded speech parameters corresponding to each of the phonics to provide an audio signal. In this way, text can be mapped directly to a vocoder table without intermediate translation steps.

Description

TEXT-TO-SPEECH NATIVE CODING IN A COMMUNICATION SYSTEM
FIELD OF THE INVENTION
The present invention relates generally to text-to-speech synthesis, and more particularly to text-to-speech synthesis in a communication system using native speech coding.
BACKGROUND OF THE INVENTION
Radio communication devices, such as cellular phones, are no longer viewed as voice-only devices. With the advent of data-based wireless services available to consumers, some serious problems arise for conventional cellular phones. For example, cellular phones are currently only capable of presenting data services in text format on a small screen, which requires screen scrolling or other user manipulation in order to get the data or message. Also, compared to landline systems, a wireless system has a much higher data error rate and faces spectrum constraints, which makes providing real-time streaming audio, i.e. real-audio, to cellular users impractical. One way to deal with these problems is text-to-speech encoding.
The process of converting text to speech is generally broken down into two major blocks: text analysis and speech synthesis. Text analysis is the process by which text is converted into a linguistic description that can be synthesized. This linguistic description generally consists of the pronunciation of the speech to be synthesized along with other properties that determine the prosody of the speech. These other properties can include (1) syllable, word, phrase, and clause boundaries; (2) syllable stress; (3) part-of-speech information; and (4) explicit representations of prosody such as are provided by the ToBI labeling system, as known in the art and further described in the 2nd International Conference on Spoken Language Processing (ICSLP92): "ToBI: A Standard for Labeling English Prosody", Silverman et al. (Oct. 1992). The pronunciation of speech included in the linguistic description is described as a sequence of phonetic units. These phonetic units are generally phones or phonics, which are particular physical speech sounds, or allophones, which are particular ways in which a phoneme may be expressed. (A phoneme is a speech sound perceived by the speakers of a language.) For example, the English phoneme "t" may be expressed as a closure followed by a burst, as a glottal stop, or as a flap. Each of these represents a different allophone of "t". Different sounds that may be produced when "t" is expressed as a flap represent different phonics. Other phonetic units that are sometimes used are demisyllables and diphones: demisyllables are half-syllables, and diphones are sequences of two phonics.
Speech synthesis can be generated from phonics using a rule-based system. For example, the phonetic unit has target phoneme acoustic parameters (such as duration and intonation) for each segment type, and rules for smoothing the parameter transitions between the segments. In a typical concatenation system, the phonetic component has a parametric representation of a segment occurring in natural speech and concatenates these recorded segments, smoothing the boundaries between segments using predefined rules. The speech is then processed through a vocoder for transmission. Voice coders, such as vector-sum or code excited linear prediction (CELP) vocoders, are in general use in digital cellular communication devices. For example, US patent 4,817,157, which is hereby incorporated by reference, describes such a vocoder implementation as used for the Global System for Mobile (GSM) communication system, among others.
Unfortunately, the text-to-speech process as described above is computationally complex and extensive. For example, in existing digital communication devices, vocoder technology already pushes the limits of a device's computational power in order to maintain voice quality at its highest possible level. However, the text-to-speech process described above requires further signal processing in addition to the vocoder processing. In other words, the process of converting text to phonics, applying acoustic parameter rules for each phonic, concatenating to provide a voiced signal, and voice coding requires more processing power than voice coding alone. Accordingly, there is a need for an improved text-to-speech coding system that reduces the amount of signal processing required to provide a voiced output. In particular, it would be of benefit to be able to use the existing native speech coding incorporated into a communication device. It would also be advantageous if current low-cost technology could be used without the requirement for customized hardware.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a flow chart of a text-to-speech system, in accordance with the present invention; and
FIG. 2 shows a simplified block diagram of a text-to-speech system, in accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention provides an improved text-to-speech system that reduces the amount of signal processing required to provide a voiced output by taking advantage of the digital signal processor (DSP) and sophisticated speech coding algorithms that already exist in cellular phones. In particular, the present invention provides a system that converts an incoming text message into a voice output using the native cellular speech coding and existing hardware of a communication device, without an increase in memory requirements or processing power.
Advantageously, the present invention utilizes the existing data interface between the microprocessor and DSP in a cellular radiotelephone, along with existing software capabilities. In addition, the present invention can be used in conjunction with any text-based data service, such as the Short Messaging Service (SMS) as used in the Global System for Mobile (GSM) communication system, for example. Conventional cellular handsets have the following functionalities in place: (a) an air interface to retrieve text messages from remote service providers, (b) software to convert received binary data into an appropriate text format, (c) audio server software to play audio to output devices, such as speakers or earphones, (d) a highly efficient audio compression coding system to generate human voice through digital signal processing, and (e) a hardware interface between a microprocessor and a DSP. When receiving a text-based data message, a conventional cellular handset will convert the signal to text format (ASCII or Unicode), as is known in the art. The present invention converts this formatted text string to speech. Alternatively, a network server of the communication system can convert this formatted text string to speech and transmit the speech to a conventional cellular handset over a voice channel instead of a data channel.
FIGs. 1 and 2 show a method and system for converting text to speech in accordance with the present invention. In a preferred embodiment, the text is converted to coded speech parameters native to the communication system, saving the processing steps of converting text to voice and then running the voice signal through a vocoder. In the method of the present invention, a first step 102 includes providing a code table 202 containing coded speech parameters. Such code tables are known in the art and typically include Code Excited Linear Prediction (CELP) and Vector Sum Excited Linear Prediction (VSELP) code tables, among others. The code table 202 is stored in a memory. In effect, a code table contains compressed audio data representing critical speech parameters. As a result, the digital transfer of audio information can be encoded and decoded using these code tables to reduce bandwidth, providing more efficiency without a noticeable loss in voice quality. A next step 104 in the process is inputting a text message. Preferably, the text message is formatted in an existing format that can be read by the communication system without requiring hardware or software changes.
A next step 106 includes dividing the text message into phonics by an audio server 204. The audio server 204 is realized in the microprocessor or DSP of the cellular handset, or this step can be done in the network server. In particular, the text message is processed in an audio server 204, which is software based on a rule table for a particular language, tailored to recognize the structure and phonemes of that language. The audio server 204 breaks the sentences of the text into words by recognizing spaces and punctuation, and further divides the words into phonics. Of course, a data message may contain other characters besides letters, or may contain abbreviations, contractions, and other deviations from normal text. Therefore, before breaking a text message into sentences, these other characters or symbols, e.g. "$", numbers, and common abbreviations, will be translated into their corresponding words by the audio server. To emulate the pause between words in human speech, white noise is inserted between each word. For example, a 15 ms period of white noise has been found adequate to separate words.
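As a rough sketch of the normalization and splitting just described — the symbol, number, and abbreviation tables and the letter-pair phonic rule below are invented placeholders, not the patent's actual language rule table:

```python
# Hypothetical sketch of the audio server's text normalization:
# symbols, numbers and abbreviations become words, then words are
# split into "phonics". Tables and the two-letter rule are stand-ins.
SYMBOL_WORDS = {"$": " dollars "}
WORD_SUBS = {"Dr.": "Doctor", "5": "five"}

def normalize(text: str) -> list[str]:
    """Translate symbols and abbreviations, then split into words."""
    for sym, word in SYMBOL_WORDS.items():
        text = text.replace(sym, word)
    return [WORD_SUBS.get(tok, tok) for tok in text.split()]

def to_phonics(word: str) -> list[str]:
    """Toy stand-in for the rule table: two-letter chunks as phonics."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]

print(normalize("Send $5 to Dr. Smith"))
# ['Send', 'dollars', 'five', 'to', 'Doctor', 'Smith']
print(to_phonics("manual"))   # ['ma', 'nu', 'al']
```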
Optionally, the text can contain special characters. The special characters include modifying information for the coded speech parameters; after mapping, the modifying information is applied to the coded speech parameters in order to provide a more natural-sounding speech signal. For example, a special character (such as an ASCII symbol) can be used to indicate the accent or inflection of a word. For instance, the word "manual" can be represented as "ma'nual" in text. The audio server software can then tune the phonetics to make the speech closer to a naturally inflected voice. This option requires the text messaging service or audio server to provide such special characters.
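One possible handling of such a stress marker, sketched with an invented helper (the patent does not specify how the marker is parsed):

```python
# Hypothetical sketch: strip the stress mark from "ma'nual" and record
# where it sat, so the mapped speech parameters can later be tuned.
def parse_stress(word: str):
    """Return (plain_word, index of the character the mark preceded)."""
    idx = word.find("'")
    if idx == -1:
        return word, None
    return word.replace("'", "", 1), idx

print(parse_stress("ma'nual"))   # ('manual', 2)
print(parse_stress("weather"))   # ('weather', None)
```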
After linguistic analysis, a next step 108 includes mapping each of the phonics from the audio server, by a mapping unit 206, against the code table 202 to find the coded speech parameters corresponding to each of the phonics. In particular, each phonic is mapped into a corresponding digitized voice waveform that is compressed in the format native to a particular cellular system. For instance, in the GSM communication system, the native format can be the half-rate vocoder format, as is known in the art. More particularly, each phonic has a predetermined digitized waveform, in the communication system's native format, pre-stored in the memory. The audio server 204 determines a phonic, and the mapping unit 206 matches each distinct phonic with a memory location index of predefined phonics in a look-up table 212, pointing to a digitized wave file defining the equivalent native coded speech parameters from the code table 202. Preferably, the look-up table 212 is used to map individual phonics into the memory location of the compressed and digitized audio in the existing code table of the vocoder of the cellular phone. For the English language, the look-up table size is slightly less than one megabyte with the GSM voice compression algorithm.
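The mapping step can be pictured as a dictionary whose entries point into the vocoder's code table. The offsets and entries here are invented; only the ~240 bytes per compressed phonic follows from the GSM compression figures in the text:

```python
# Sketch of step 108: a look-up table maps each phonic to the memory
# location of its compressed waveform in the native code table.
CODE_TABLE = bytes(989 * 1024)     # compressed audio store, ~989 kbytes
LOOKUP_TABLE = {                   # phonic -> (offset, length); invented
    "ma": (0, 240),
    "nu": (240, 240),
    "al": (480, 240),
}

def map_phonic(phonic: str) -> bytes:
    """Fetch the native-format compressed frame for one phonic."""
    offset, length = LOOKUP_TABLE[phonic]
    return CODE_TABLE[offset:offset + length]

print(len(map_phonic("nu")))   # 240
```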
For example, there are about 4119 possible phonic combinations in English or a similar language. On average, the speed of speech is about 200 words/min (about 6.7 phonics per second), thus each phonic lasts about 0.15 s. With an 8 kHz sample rate and 16-bit resolution, there are about 2400 bytes/phonic (0.15 s x 8 kHz x 2 bytes). With the 10:1 vocoder compression used in GSM, the compressed digitized voice will be around 240 bytes/phonic. Thus, with about 4119 phonics, the total size of the look-up table is about 989 kbytes for each language.
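The sizing arithmetic above can be checked directly; all of the figures below are the text's own:

```python
# Reproduce the look-up table sizing: 0.15 s per phonic, 8 kHz sample
# rate, 16-bit samples, ~10:1 GSM vocoder compression, ~4119 phonics.
raw_bytes = 0.15 * 8000 * 2     # 2400 bytes per uncompressed phonic
compressed = raw_bytes / 10     # 240 bytes per phonic after coding
table_bytes = 4119 * compressed
print(raw_bytes, compressed, table_bytes)
# 2400.0 240.0 988560.0  -> about 989 kbytes, just under one megabyte
```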
The mapping unit (which can also be the audio server) can then assemble the digitized representations of the phonics, along with white noise for spaces between words, into a string of data using the knowledge of the word and sentence structure learned from breaking the text into phonics.
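A minimal sketch of that assembly step; the frame contents and the noise generator are placeholders, while the 15 ms gap is the figure given earlier:

```python
import random

# Sketch: concatenate each word's compressed phonic frames, inserting
# a 15 ms white-noise segment between words to emulate the pause.
SAMPLE_RATE = 8000
GAP = int(0.015 * SAMPLE_RATE)        # 15 ms -> 120 samples

def assemble(word_frames: list[list[bytes]]) -> bytes:
    out = bytearray()
    for i, frames in enumerate(word_frames):
        if i:                          # noise only between words
            out += bytes(random.randrange(256) for _ in range(GAP))
        for frame in frames:
            out += frame
    return bytes(out)

data = assemble([[b"\x01" * 240, b"\x02" * 240], [b"\x03" * 240]])
print(len(data))   # 3*240 + 120 = 840
```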
In a next step 110, the native coded speech parameters, corresponding to each of the phonics from the previous step and along with suitable spaces, are subsequently processed in a signal processor 208 (such as a DSP, for example) to provide a decompressed speech signal to an audio circuit 210 of the cellular phone handset, which includes an audio transducer. Inasmuch as the phonics are already coded in native parameters, the DSP needs no modification to properly provide a speech signal. To take advantage of the existing DSP capability, the coding system used for speech synthesis should be native to a particular cellular phone standard, since the DSP and its software are designed to decompress that particular coding format in an existing vocoder. For instance, in GSM-based handsets, digitized audio should be stored in the full-rate vocoder coding format, and can be stored in the half-rate vocoder coding format. If the interface between the DSP and the microprocessor is shared memory, the audio file can be placed directly into the shared memory. Once the sentence is assembled, an interrupt is generated to trigger a read by the DSP, which in turn decompresses and plays the audio. If the interface is a serial or parallel bus, the compressed audio is stored in a RAM buffer until the sentence is complete; after that, the microprocessor transfers the data to the DSP for decompression and play. Preferably, the above steps are repeated for each sentence in the inputted text.
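The two hand-off paths (shared memory with an interrupt versus buffering for a bus transfer) might be sketched as follows; all class and method names are invented and the DSP side is only counted, not modeled:

```python
# Hypothetical sketch of delivering an assembled sentence to the DSP.
class SharedMemoryLink:
    """Shared-memory interface: place audio, then interrupt the DSP."""
    def __init__(self):
        self.shared = bytearray()
        self.interrupts = 0
    def send_sentence(self, audio: bytes):
        self.shared[:] = audio        # audio goes straight into shared RAM
        self.interrupts += 1          # interrupt triggers the DSP read

class BusLink:
    """Serial/parallel bus: buffer in RAM until the sentence is done."""
    def __init__(self):
        self.ram = bytearray()
        self.sent = []
    def append(self, frame: bytes):
        self.ram += frame             # hold until the sentence is complete
    def flush(self):
        self.sent.append(bytes(self.ram))   # microprocessor -> DSP transfer
        self.ram.clear()

bus = BusLink()
bus.append(b"\x01" * 240)
bus.append(b"\x02" * 240)
bus.flush()
print(len(bus.sent[0]))   # 480
```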
However, they can be repeated for each phonic, or for text up to the length of the available memory. For example, a paragraph, a page or the entire text can be inputted before being divided into phonics. In one embodiment, a transmitting step is included after the mapping step 108. This transmitting step includes transmitting the coded speech parameters from a network server to a wireless communication device, wherein the processing step is performed in the wireless communication device and all the previous steps 102-108 are performed in the network server. However, in a preferred embodiment, all the steps 102-110 are performed within a wireless communication device. The text message itself can be provided by a network server or by another communication device.
Unlike desktop and laptop computers, a cellular radiotelephone is a hand-held device that is very sensitive to size, weight and cost. Thus, the hardware to realize the text-to-speech conversion of the present invention should use a minimal number of parts at low cost. The look-up table of the phonics should be stored in flash memory for its non-volatility and high density. Because the flash memory cannot be addressed randomly, the digital data of the phonics need to be loaded into the random memory before being sent to the DSP. The simplest way is to map the whole look-up table into the random memory, but this requires at least one megabyte of memory even for a very simple look-up table. Another option is to load one sector from the flash memory into the random memory at a time, but this still requires 64 kbytes of extra random memory. To minimize the memory requirement, the following approach can be used: (a) find the starting and the ending addresses of the phonic in the look-up table; (b) save the starting and the ending addresses in microprocessor registers; (c) use one microprocessor register as a counter, set to zero before reading the look-up table from the flash memory, and add one count to the counter for each read cycle; (d) read the look-up table from the flash memory in a non-synchronized mode, or in a synchronized mode at a low clock frequency, so that the microprocessor has enough time to perform the necessary operations between read cycles; and (e) use a microprocessor register to store one byte/word of data, comparing the counter value with the starting address. If the counter value is less than the starting address, go back to the previous step and read the next byte/word from the flash memory. If the counter value is equal to or greater than the starting address, compare the counter value with the ending address. If the counter value is less than the ending address, move the data from the microprocessor register into the random memory. If the counter value is greater than the ending address, go back to the previous step and finish the reading to the end of the current flash memory sector. In this way, the random memory requirement can be limited to about 200 bytes, so no additional random memory is required for even the simplest cellular phone handsets.

In the above example, the phonics' digitized audio files are stored in a flash memory that is accessible on a sector-by-sector basis. However, loading an entire page for one phonic file is both time consuming and inefficient. One way to improve efficiency is to match all the phonic audio files stored on the same memory sector once it is loaded into the RAM. Instead of loading one memory page for one phonic and then loading another page for the next phonic, an intermediate array can be assembled that contains the memory locations of all the phonics in a sentence. Table 1 shows a simple phonic-to-memory-location look-up table.
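The scan described in steps (a) through (e) can be sketched as follows. This is an illustrative simulation only: the flash sector is just an array, the name extract_phonic is hypothetical, and a real driver would clock bytes out of the part in non-synchronized (or slow synchronized) mode rather than index an array.

```c
#include <assert.h>

#define SECTOR_SIZE  64    /* shrunk for the demo; real sectors are larger */
#define RAM_BUF_SIZE 200   /* the ~200-byte RAM buffer described above */

static unsigned char flash_sector[SECTOR_SIZE];  /* simulated flash sector */
static unsigned char ram_buf[RAM_BUF_SIZE];

/* Scan the whole sector byte by byte, keeping only the bytes of the phonic
 * stored at [start, end); returns the number of bytes copied to RAM.
 * The counter variable plays the role of the microprocessor register in
 * step (c), and each array access stands in for one read cycle, step (d). */
int extract_phonic(int start, int end) {
    int copied = 0;
    for (int counter = 0; counter < SECTOR_SIZE; counter++) {
        unsigned char byte = flash_sector[counter];  /* one read cycle */
        if (counter < start)
            continue;           /* step (e): before the phonic, keep reading */
        if (counter >= end)
            continue;           /* past the phonic: finish out the sector */
        if (copied < RAM_BUF_SIZE)
            ram_buf[copied++] = byte;  /* inside the phonic: move to RAM */
    }
    return copied;
}
```

The ending address is treated here as exclusive; whether the byte at the ending address itself is kept is an interpretation, since the description leaves the equal case unspecified.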
Table 1
Look-up table structure
Phonic   Page   Start location   Size (bytes)
A        3      210              200
B        4      1500             180
C        3      1000             150
Consider a sentence, "AB C", with a space between B and C. In a direct method, page 3 is loaded into RAM and 200 bytes starting at location 210 are copied to a memory buffer. Page 4 is then loaded and 180 bytes starting at location 1500 are copied into the buffer. A digitized white-noise segment is then copied into the buffer, after which page 3 is loaded again and 150 bytes starting at location 1000 are copied into the buffer. The text string is then converted to audio. An indirect method can also be used. The difference between the direct and the indirect methods is that in the direct method the software does not look ahead. Therefore, in the above example ("AB C"), the software loads page 3 to locate and copy A, then loads page 4 to locate and copy B, then reloads page 3 to locate and copy C, while in the indirect method the software loads page 3 and copies both A and C into a pre-allocated memory buffer, then loads page 4 and copies B into the buffer. In this way, only two page loads are required, which saves time and processor power.
With an intermediate mapping method, "AB C" is translated to a memory location array, {3:210:200, 4:1500:180, 3:1000:150}. A memory buffer to store the digitized audio is created based upon the total size required, in this case the sum of the three phonics (200+180+150 bytes) plus a white-noise segment for the space. Once page 3 is loaded into memory, the memory location array is searched to locate all the audio files stored on that page, in this case A and C, which are then copied to their respective locations in the memory buffer. With this method, memory access time is cut down significantly and efficiency is improved.
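One way to see the saving is to count page loads for the memory-location array {3:210:200, 4:1500:180, 3:1000:150}. A minimal sketch follows; PhonicRef and page_loads are illustrative names, not part of the described system:

```c
#include <assert.h>

/* One entry of the intermediate memory-location array: which flash page a
 * phonic lives on, its start location on that page, and its size in bytes. */
typedef struct { int page; int loc; int size; } PhonicRef;

#define MAX_PAGE 16   /* assumed upper bound on page numbers for the demo */

/* Count how many flash page loads are needed when every phonic stored on a
 * page is copied out while that page is resident (the intermediate method).
 * The direct method, by contrast, performs one load per phonic. */
int page_loads(const PhonicRef *refs, int n) {
    int loaded[MAX_PAGE] = {0};
    int loads = 0;
    for (int i = 0; i < n; i++) {
        if (refs[i].page < MAX_PAGE && !loaded[refs[i].page]) {
            loaded[refs[i].page] = 1;  /* first touch: load the page once */
            loads++;
        }
    }
    return loads;
}
```

For "AB C" the array touches pages 3, 4 and 3, so only two loads are needed instead of the three performed by the direct method.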
In practice, the present invention uses existing text-based messaging services in a communication system. SMS (short message service) is a popular text-based messaging service for GSM systems. In certain situations, e.g., while driving or when it is too dark to read, converting a text message into speech is very desirable. In addition, all menu, phone book and operational prompts in current cellular handsets are in text format, and it is not possible for the visually impaired to navigate through these visual prompts. The text-to-speech (TTS) system described above solves this problem. Instead of sending data in a bandwidth-intensive voice format (although this can also be used), the present invention allows the use of the many communication services having a low-data-rate text format, such as SMS. This can be used to advantage for real-time driving directions, audio news, weather, location services, and real-time sports or breaking newscasts in text. TTS technology also opens a door for voice game applications in cellular phones at very low cost.
Moreover, TTS can use much lower bandwidth with text-based messaging. It will not load the network or worsen the capacity strain on existing or future cellular networks. Further, the present invention allows incumbent network operators to offer a wide range of value-added services with the text messaging capabilities that already exist in their networks, instead of having to purchase licenses for new bandwidth and invest in new equipment. This also applies to third-party service providers, which, under today's and proposed technologies, face even higher obstacles than network operators in providing any kind of data services to cellular phone users. Since TTS can be used with any standard text messaging service, anyone with access to text-messaging gateways can provide a variety of services to millions of cellular phone users. With the technology and equipment barriers removed, many new business opportunities will open up to independent third-party application providers.
Like existing mobile web applications, the mobile TTS application also requires network server support. The server should be optimized based on data traffic and cost per user, the major daily cost of a local server being its data traffic; low data traffic lowers the daily cost but also the server's return on investment. The present invention can make better use of periods of low and moderate data traffic, since text does not need to be sent "on demand" when data traffic bandwidth may be unavailable, but can instead wait for a period of lower, available data traffic.
Although the invention has been described and illustrated in the above description and drawings, it is understood that this description is by way of example only and that numerous changes and modifications can be made by those skilled in the art without departing from the broad scope of the invention. Although the present invention finds particular use in portable cellular radiotelephones, the invention could be applied to any communication device, including pagers, electronic organizers, and computers. The present invention should be limited only by the following claims.

Claims

What is claimed is:
1. A method of converting text to speech in a communication system, the method comprising the steps of:
providing a code table containing coded speech parameters;
inputting a text message;
dividing the text message into phonics;
mapping each of the phonics against the code table to find the coded speech parameters corresponding to each of the phonics; and
subsequently processing the coded speech parameters corresponding to each of the phonics from the previous step to provide a speech signal.
2. The method of claim 1, wherein the dividing step includes dividing the text messages into phonics, spaces, and special characters.
3. The method of claim 2, wherein the special characters of the dividing step include modifying information for the coded speech parameters, the method further comprising, after the mapping step, a step of applying the modifying information to the coded speech parameters in order to provide a more natural-sounding speech signal from the processing step.
4. The method of claim 1, wherein in the providing step the code table includes one of code excited linear prediction parameters or vector sum excited linear prediction parameters.
5. The method of claim 1, wherein in the providing step the code table is an existing code table used in a vocoder in the communication system.
6. The method of claim 1, wherein the steps are performed in a wireless communication device.
7. The method of claim 1, further comprising, after the mapping step, the step of transmitting the coded speech parameters from a network server to a wireless communication device, wherein the processing step is performed in the wireless communication device and all the previous steps are performed in the network server.
PCT/US2002/026901 2001-09-25 2002-08-23 Text-to-speech native coding in a communication system WO2003028010A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP02750495A EP1479067A4 (en) 2001-09-25 2002-08-23 Text-to-speech native coding in a communication system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/962,747 US6681208B2 (en) 2001-09-25 2001-09-25 Text-to-speech native coding in a communication system
US09/962,747 2001-09-25

Publications (1)

Publication Number Publication Date
WO2003028010A1 true WO2003028010A1 (en) 2003-04-03

Family

ID=25506298

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/026901 WO2003028010A1 (en) 2001-09-25 2002-08-23 Text-to-speech native coding in a communication system

Country Status (5)

Country Link
US (1) US6681208B2 (en)
EP (1) EP1479067A4 (en)
CN (1) CN1559068A (en)
RU (1) RU2004112536A (en)
WO (1) WO2003028010A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005119652A1 (en) * 2004-06-02 2005-12-15 Nokia Corporation Mobile station and method for transmitting and receiving messages

Families Citing this family (23)

Publication number Priority date Publication date Assignee Title
US20020111974A1 (en) * 2001-02-15 2002-08-15 International Business Machines Corporation Method and apparatus for early presentation of emphasized regions in a web page
US20060069567A1 (en) * 2001-12-10 2006-03-30 Tischer Steven N Methods, systems, and products for translating text to speech
US7483832B2 (en) * 2001-12-10 2009-01-27 At&T Intellectual Property I, L.P. Method and system for customizing voice translation of text to speech
US8073930B2 (en) * 2002-06-14 2011-12-06 Oracle International Corporation Screen reader remote access system
US20040049389A1 (en) * 2002-09-10 2004-03-11 Paul Marko Method and apparatus for streaming text to speech in a radio communication system
US20040098266A1 (en) * 2002-11-14 2004-05-20 International Business Machines Corporation Personal speech font
US20050131698A1 (en) * 2003-12-15 2005-06-16 Steven Tischer System, method, and storage medium for generating speech generation commands associated with computer readable information
US20070055526A1 (en) * 2005-08-25 2007-03-08 International Business Machines Corporation Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis
US8700404B1 (en) * 2005-08-27 2014-04-15 At&T Intellectual Property Ii, L.P. System and method for using semantic and syntactic graphs for utterance classification
US20070083367A1 (en) * 2005-10-11 2007-04-12 Motorola, Inc. Method and system for bandwidth efficient and enhanced concatenative synthesis based communication
US7786994B2 (en) * 2006-10-26 2010-08-31 Microsoft Corporation Determination of unicode points from glyph elements
TW200836571A (en) * 2007-02-16 2008-09-01 Inventec Appliances Corp System and method for transforming and transmitting data between terminal
RU2324296C1 (en) * 2007-03-26 2008-05-10 Закрытое акционерное общество "Ай-Ти Мобайл" Method for message exchanging and devices for implementation of this method
US8645140B2 (en) * 2009-02-25 2014-02-04 Blackberry Limited Electronic device and method of associating a voice font with a contact for text-to-speech conversion at the electronic device
CN101894547A (en) * 2010-06-30 2010-11-24 北京捷通华声语音技术有限公司 Speech synthesis method and system
GB2481992A (en) * 2010-07-13 2012-01-18 Sony Europe Ltd Updating text-to-speech converter for broadcast signal receiver
US9164983B2 (en) 2011-05-27 2015-10-20 Robert Bosch Gmbh Broad-coverage normalization system for social media language
RU2460154C1 (en) * 2011-06-15 2012-08-27 Александр Юрьевич Бредихин Method for automated text processing computer device realising said method
US9471901B2 (en) 2011-09-12 2016-10-18 International Business Machines Corporation Accessible white space in graphical representations of information
CH710280A1 (en) * 2014-10-24 2016-04-29 Elesta Gmbh Method and evaluation device for evaluating signals of an LED status indicator.
CN104992704B (en) * 2015-07-15 2017-06-20 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
US10708725B2 (en) * 2017-02-03 2020-07-07 T-Mobile Usa, Inc. Automated text-to-speech conversion, such as driving mode voice memo
WO2021102193A1 (en) * 2019-11-19 2021-05-27 Apptek, Llc Method and apparatus for forced duration in neural speech synthesis

Citations (7)

Publication number Priority date Publication date Assignee Title
JPS62165267A (en) * 1986-01-17 1987-07-21 Ricoh Co Ltd Voice word processor device
JPH05173586A (en) * 1991-12-25 1993-07-13 Matsushita Electric Ind Co Ltd Speech synthesizer
JPH05181492A (en) * 1991-12-27 1993-07-23 Oki Electric Ind Co Ltd Speech information output system
US5463715A (en) * 1992-12-30 1995-10-31 Innovation Technologies Method and apparatus for speech generation from phonetic codes
JPH08160990A (en) * 1994-12-09 1996-06-21 Oki Electric Ind Co Ltd Speech synthesizing device
JPH08335096A (en) * 1995-06-07 1996-12-17 Oki Electric Ind Co Ltd Text voice synthesizer
JP2000148175A (en) * 1998-09-10 2000-05-26 Ricoh Co Ltd Text voice converting device

Family Cites Families (23)

Publication number Priority date Publication date Assignee Title
US4405983A (en) * 1980-12-17 1983-09-20 Bell Telephone Laboratories, Incorporated Auxiliary memory for microprocessor stack overflow
US4817157A (en) 1988-01-07 1989-03-28 Motorola, Inc. Digital speech coder having improved vector excitation source
US4893197A (en) * 1988-12-29 1990-01-09 Dictaphone Corporation Pause compression and reconstitution for recording/playback apparatus
US4979216A (en) * 1989-02-17 1990-12-18 Malsheen Bathsheba J Text to speech synthesis system and method using context dependent vowel allophones
US5119425A (en) * 1990-01-02 1992-06-02 Raytheon Company Sound synthesizer
US5673362A (en) * 1991-11-12 1997-09-30 Fujitsu Limited Speech synthesis system in which a plurality of clients and at least one voice synthesizing server are connected to a local area network
JP3548230B2 (en) 1994-05-30 2004-07-28 キヤノン株式会社 Speech synthesis method and apparatus
US5864812A (en) * 1994-12-06 1999-01-26 Matsushita Electric Industrial Co., Ltd. Speech synthesizing method and apparatus for combining natural speech segments and synthesized speech segments
US5696879A (en) * 1995-05-31 1997-12-09 International Business Machines Corporation Method and apparatus for improved voice transmission
US5625687A (en) * 1995-08-31 1997-04-29 Lucent Technologies Inc. Arrangement for enhancing the processing of speech signals in digital speech interpolation equipment
IL116103A0 (en) * 1995-11-23 1996-01-31 Wireless Links International L Mobile data terminals with text to speech capability
JPH09179719A (en) * 1995-12-26 1997-07-11 Nec Corp Voice synthesizer
US5896393A (en) * 1996-05-23 1999-04-20 Advanced Micro Devices, Inc. Simplified file management scheme for flash memory
EP0834812A1 (en) * 1996-09-30 1998-04-08 Cummins Engine Company, Inc. A method for accessing flash memory and an automotive electronic control system
JP3349905B2 (en) 1996-12-10 2002-11-25 松下電器産業株式会社 Voice synthesis method and apparatus
JP3402100B2 (en) * 1996-12-27 2003-04-28 カシオ計算機株式会社 Voice control host device
US5924068A (en) * 1997-02-04 1999-07-13 Matsushita Electric Industrial Co. Ltd. Electronic news reception apparatus that selectively retains sections and searches by keyword or index for text to speech conversion
US5940791A (en) * 1997-05-09 1999-08-17 Washington University Method and apparatus for speech analysis and synthesis using lattice ladder notch filters
US6081780A (en) * 1998-04-28 2000-06-27 International Business Machines Corporation TTS and prosody based authoring system
US6246983B1 (en) * 1998-08-05 2001-06-12 Matsushita Electric Corporation Of America Text-to-speech e-mail reader with multi-modal reply processor
EP1045372A3 (en) * 1999-04-16 2001-08-29 Matsushita Electric Industrial Co., Ltd. Speech sound communication system
US6178402B1 (en) 1999-04-29 2001-01-23 Motorola, Inc. Method, apparatus and system for generating acoustic parameters in a text-to-speech system using a neural network
US20020147882A1 (en) * 2001-04-10 2002-10-10 Pua Khein Seng Universal serial bus flash memory storage device


Non-Patent Citations (4)

Title
MOBIUS ET AL.: "Modeling segmental duration in German text-to-speech synthesis", ICSLP 4TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE, vol. 4, October 1996 (1996-10-01), pages 2395 - 2398, XP010238148 *
O'MALLEY: "Text-to-speech conversion technology", COMPUTER MAGAZINE, vol. 23, no. 8, August 1990 (1990-08-01), pages 17 - 23, XP000150959 *
See also references of EP1479067A4 *
SPROAT ET AL.: "EMU: an E-mail preprocessor for text-to-speech", IEEE SECOND WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, December 1998 (1998-12-01), pages 239 - 244, XP010318317 *


Also Published As

Publication number Publication date
CN1559068A (en) 2004-12-29
EP1479067A1 (en) 2004-11-24
EP1479067A4 (en) 2006-10-25
US20030061048A1 (en) 2003-03-27
US6681208B2 (en) 2004-01-20
RU2004112536A (en) 2005-03-27

Similar Documents

Publication Publication Date Title
US6681208B2 (en) Text-to-speech native coding in a communication system
US6625576B2 (en) Method and apparatus for performing text-to-speech conversion in a client/server environment
US7395078B2 (en) Voice over short message service
US20070106513A1 (en) Method for facilitating text to speech synthesis using a differential vocoder
US7035794B2 (en) Compressing and using a concatenative speech database in text-to-speech systems
US9761219B2 (en) System and method for distributed text-to-speech synthesis and intelligibility
US7013282B2 (en) System and method for text-to-speech processing in a portable device
US20040073428A1 (en) Apparatus, methods, and programming for speech synthesis via bit manipulations of compressed database
JP3446764B2 (en) Speech synthesis system and speech synthesis server
WO2005034082A1 (en) Method for synthesizing speech
CN1212601C (en) Imbedded voice synthesis method and system
US6502073B1 (en) Low data transmission rate and intelligible speech communication
EP2224426B1 (en) Electronic Device and Method of Associating a Voice Font with a Contact for Text-To-Speech Conversion at the Electronic Device
CN109065016B (en) Speech synthesis method, speech synthesis device, electronic equipment and non-transient computer storage medium
KR102548618B1 (en) Wireless communication apparatus using speech recognition and speech synthesis
Sarathy et al. Text to speech synthesis system for mobile applications
US20120236914A1 (en) In-Band Modem Signals for Use on a Cellular Telephone Voice Channel
Bharthi et al. Unit selection based speech synthesis for converting short text message into voice message in mobile phones
Burileanu et al. An optimized TTS system implementation using a Motorola StarCore SC140-based processor
JPH03288898A (en) Voice synthesizer
JP2003323191A (en) Access system to internet homepage adaptive to voice
JP2003202884A (en) Speech synthesis system
Chen et al. The system implementation of I-phone hardware by using low bit rate speech coding
JP2002140086A (en) Device for conversion from short message for portable telephone set into voice output
JP2004085786A (en) Text speech synthesizer, language processing server device, and program recording medium

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BY BZ CA CH CN CO CR CU CZ DE DM DZ EC EE ES FI GB GD GE GH HR HU ID IL IN IS JP KE KG KP KR LC LK LR LS LT LU LV MA MD MG MN MW MX MZ NO NZ OM PH PL PT RU SD SE SG SI SK SL TJ TM TN TR TZ UA UG UZ VN YU ZA ZM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ UG ZM ZW AM AZ BY KG KZ RU TJ TM AT BE BG CH CY CZ DK EE ES FI FR GB GR IE IT LU MC PT SE SK TR BF BJ CF CG CI GA GN GQ GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 689/DELNP/2004

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 2002750495

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 20028187822

Country of ref document: CN

122 Ep: pct application non-entry in european phase
WWP Wipo information: published in national office

Ref document number: 2002750495

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Ref document number: JP

WWW Wipo information: withdrawn in national office

Ref document number: 2002750495

Country of ref document: EP