US20120089399A1 - Voice Over Short Messaging Service - Google Patents
- Publication number
- US20120089399A1 (Application No. US 13/329,444)
- Authority
- US
- United States
- Prior art keywords
- text
- utterance
- representation
- message
- mobile communication
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W88/00—Devices specially adapted for wireless communication networks, e.g. terminals, base stations or access point devices
- H04W88/02—Terminal devices
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04B—TRANSMISSION
- H04B1/00—Details of transmission systems, not covered by a single one of groups H04B3/00 - H04B13/00; Details of transmission systems not characterised by the medium used for transmission
- H04B1/38—Transceivers, i.e. devices in which transmitter and receiver form a structural unit and in which at least one part is used for functions of transmitting and receiving
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/724—User interfaces specially adapted for cordless or mobile telephones
- H04M1/72403—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
- H04M1/7243—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
- H04M1/72433—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/724—User interfaces specially adapted for cordless or mobile telephones
- H04M1/72403—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
- H04M1/7243—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
- H04M1/72436—User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. SMS or e-mail
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/72—Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
- H04M1/725—Cordless telephones
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M1/00—Substation equipment, e.g. for use by subscribers
- H04M1/26—Devices for calling a subscriber
- H04M1/27—Devices whereby a plurality of signals may be stored simultaneously
- H04M1/271—Devices whereby a plurality of signals may be stored simultaneously controlled by voice recognition
Definitions
- the invention features a mobile communication device for sending a voice message.
- the mobile communication device includes: a processor system; a microphone for receiving an utterance from a user of the mobile communication device; a transceiver; and memory storing code which when executed on the processor system causes the mobile communication device to: generate a non-text representation of the received utterance; insert the non-text representation into a body of a text message; and send the text message via the transceiver over a wireless messaging channel from the mobile communication device to a recipient's device.
- the invention features a mobile communication device for receiving a voice message.
- the mobile communication device includes: a processor system; a transceiver for receiving a text message that contains a non-text representation of an utterance; an audio output device; and memory storing code which when executed on the processor system causes the mobile communication device to: extract the non-text representation from the received text message; synthesize an audio representation of the spoken utterance from the non-text representation; and play the synthesized audio representation through the audio output device.
- FIG. 1 shows a block diagram of the phonetic recognition system.
- FIG. 2 shows a block diagram of the phonetic synthesis system.
- FIG. 3 shows a high-level block diagram of a smartphone incorporating phonetic recognition and synthesis systems.
- the described embodiment is a method of sending and receiving spoken or audio information over the SMS network available in cellular phones.
- a user speaks a desired message, or utterance, into a cellular phone.
- a phonetic recognition algorithm in the phone then generates a non-text representation of the utterance.
- An SMS application in the phone sends this non-text representation in the body of an SMS message via the SMS network to the recipient's phone.
- another SMS application extracts the non-text representation from the body of the SMS message.
- a synthesizer synthesizes an audio message from the non-text representation and plays the synthesized message to the recipient.
- FIG. 1 shows a high-level block diagram illustrating in greater detail the functionality that is implemented.
- a user speaks an utterance 110 into the cellular phone 100 and a feature extractor 130 in the front end of a recognition engine 120 within the phone processes that utterance to extract its acoustic features.
- feature extractor 130 includes a digitizer 102 that converts the received analog signal into a digital representation.
- Digitizer 102 divides the input signal into a sequence of overlapping frames and then outputs a digital representation of the signal within each of the frames.
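The framing step performed by the digitizer can be illustrated with a minimal sketch. The frame length and hop size below are assumptions chosen for illustration (25 ms frames every 10 ms at a 16 kHz sampling rate); the source does not state the actual values.

```python
def split_into_frames(samples, frame_len=400, hop=160):
    """Split a sequence of samples into overlapping frames.

    frame_len and hop are assumed values; consecutive frames
    share frame_len - hop samples.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    # Only full frames are kept; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


# Stand-in for digitized audio: 1000 consecutive sample values.
frames = split_into_frames(list(range(1000)))
print(len(frames), "frames, overlap of", 400 - 160, "samples")
```

Each frame would then be passed on to the filter and the analyzer to produce one feature vector per frame.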
- a filter 104 filters the spectrum of the signal to, among other things, reduce the influence of non-speech noise on the speech signal and to correct for various impairments caused by the spectral characteristics of the channel over which the utterance was received.
- the filtering process preserves the main verbal content of the utterance while eliminating various frequencies, e.g. the very high and very low frequencies that likely do not carry significant usable information.
- An analyzer 106 analyzes the filtered digital signal to extract the relevant acoustic features of the frames, i.e., the feature vector.
- the output of feature extractor 130 is an acoustic representation 140 of the received utterance.
- feature extractor 130 uses the MEL cepstrum coding technique to extract the relevant features.
- the phone stores a set of phonemes, which are basic phonetic units from which the sounds of the spoken language are constructed. It also stores an acoustic model for each of the phonemes and an index or pointer which identifies the phoneme.
- the acoustic model is statistical in nature and indicates the probability that a particular phoneme was spoken given the occurrence of a particular set of acoustic features.
- the recognition engine 120 employs an unconstrained phoneme recognizer 150 to determine the sequence of phonemes (i.e., phoneme string) that is most likely given the sequence of feature vectors which characterizes the user's utterance.
- the recognizer 150 is unconstrained in that it considers each candidate phoneme with equal weight, without presumption as to the order or to the language spoken by the user.
- phoneme recognizer 150 is a relatively crude recognizer; it does not use a language model that would enable it to identify the spoken words.
- Recognizer 150 statistically compares the acoustic representation of the utterance to acoustic representations of phonemes stored in a phoneme database 160 on the cell phone.
- Phoneme database 160 contains a sufficiently large set of phonemes, with their acoustic representation, to effectively describe the sounds that are found in the language of the user.
- the phoneme recognizer 150 performs a statistical comparison of the acoustic representations of the received utterance with the acoustic representations of the phonemes to identify the best match. It does this using a well-known technique referred to as hidden Markov model (HMM), though other statistical or non-statistical techniques or models that compare features of speech to stored phonetic units could also be used.
- HMM hidden Markov model
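As a rough illustration of HMM-based phoneme recognition, the sketch below runs Viterbi decoding over a toy model in which each state is a phoneme index. Everything numeric here is an assumption: the three-phoneme database, the quantized observation codes `a`, `b`, `c` standing in for feature vectors, and the emission probabilities. Uniform start and transition probabilities are used to mirror the "unconstrained" recognizer, which gives every candidate phoneme equal weight.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for the observations."""
    if not observations:
        return []
    # best[t][s]: best log-probability of any path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def collapse_repeats(path):
    """Merge runs of the same phoneme index into a single symbol."""
    out = []
    for s in path:
        if not out or out[-1] != s:
            out.append(s)
    return out


states = [0, 1, 2]  # hypothetical phoneme database indices
uniform = math.log(1 / len(states))
log_start = {s: uniform for s in states}
log_trans = {p: {s: uniform for s in states} for p in states}
emit = {
    0: {"a": 0.8, "b": 0.1, "c": 0.1},
    1: {"a": 0.1, "b": 0.8, "c": 0.1},
    2: {"a": 0.1, "b": 0.1, "c": 0.8},
}
log_emit = {s: {o: math.log(p) for o, p in probs.items()} for s, probs in emit.items()}

path = viterbi(["a", "a", "b", "b", "b", "c"], states, log_start, log_trans, log_emit)
phoneme_string = collapse_repeats(path)
print(path, phoneme_string)
```

Collapsing repeated indices is a simplification: it discards per-phoneme duration, which a real system would need to record separately.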
- Phoneme recognizer 150 outputs the recognized sequence of phonemes as a sequence of indices or pointers into its database of phonemes. That is, for each phoneme in the recognized string of phonemes, phoneme recognizer 150 looks up in its database of phonemes the particular index or pointer that identifies that phoneme and it outputs that index or pointer. The output is a non-text representation of the spoken utterance, in this case, a phoneme string.
- the value of this string is that a synthesizer on the receiving end of the communication link can recreate the sequence of sounds that made up the utterance, i.e., it can recreate the utterance so that it would generally be recognizable to the user on the other end.
- the phoneme string will not be easily readable as text since word recognition is not performed.
- Phoneme recognizer 150 stores the phoneme string in a buffer 175 for an SMS application 180 that is also running on the cell phone.
- SMS application 180 generates a text message shell for receiving the non-text representation and populates its address field with the address of the recipient's phone.
- SMS application 180 inserts the stored phoneme string into the body of an SMS message, along with a flag identifying the message as containing a non-text phoneme string that is intended for a synthesizer on the receiving end. In effect, the flag alerts the SMS application on the other end to not treat the contents of the SMS as a text message that would normally be displayed to the user. SMS application 180 then wirelessly transmits the SMS message over the SMS channel to the recipient's cell phone.
- Phoneme recognizer 150 also stores other information in SMS buffer 175 which is useful in improving the quality and/or understandability of the sounds that are synthesized by the recipient's cell phone. For example, it also specifies the temporal length of each phoneme, its volume, and possibly other parameters which can be used to control the quality of the sounds generated by the synthesizer in the receiving phone. In the described embodiment, since phoneme recognizer 150 also recognizes pauses, it truncates those recognized pauses to conserve the space required to represent the utterance.
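One way the buffered phoneme string, its per-phoneme parameters, and the identifying flag might be packed into a single SMS body is sketched below. The layout is entirely hypothetical: the marker `#VOS1#`, the 64-symbol alphabet, the two-characters-per-phoneme encoding, the 8 duration and 8 volume levels, and the use of index 0 for a pause are all assumptions made for illustration, not details from the source.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
FLAG = "#VOS1#"        # hypothetical marker identifying a non-text payload
PAUSE_INDEX = 0        # assumed database index for a recognized pause
MAX_PAUSE_LEVEL = 1    # pauses are truncated to this duration level
SMS_MAX_CHARS = 160


def pack_message(phonemes):
    """Build an SMS body from (index, duration_level, volume_level) tuples.

    index must be in 0..63 and both levels in 0..7 (assumed ranges).
    """
    body = [FLAG]
    for index, duration, volume in phonemes:
        if not (0 <= index < 64 and 0 <= duration < 8 and 0 <= volume < 8):
            raise ValueError("value out of range")
        if index == PAUSE_INDEX:
            # Truncate long pauses to conserve space.
            duration = min(duration, MAX_PAUSE_LEVEL)
        body.append(ALPHABET[index])                  # which phoneme
        body.append(ALPHABET[duration * 8 + volume])  # how long, how loud
    text = "".join(body)
    if len(text) > SMS_MAX_CHARS:
        raise ValueError("payload does not fit in one SMS message")
    return text


message = pack_message([(5, 2, 3), (0, 7, 0), (12, 1, 4)])
print(message)
```

With this layout a single message holds at most 77 phonemes, since the marker uses 6 of the 160 characters.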
- utterance 110 would typically be compressed into a non-text representation 170 at a rate of approximately 200-700 bits per second or less.
- this corresponds to an utterance that is about 10 seconds long upon playback on the receiving cell phone.
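The relation between bit rate and utterance length can be checked with simple arithmetic. The sketch assumes that the whole 160-character body is available for the payload and that each character carries 7 bits (the GSM default alphabet), giving 1120 bits per message; any overhead for a flag or synthesis parameters is ignored. Under those assumptions, a 10-second utterance in one message implies a rate of roughly 112 bits per second, which falls under the "or less" part of the quoted range.

```python
SMS_CHARS = 160
BITS_PER_CHAR = 7  # assumption: GSM 7-bit default alphabet


def seconds_per_message(bits_per_second):
    """Utterance duration that fits in one SMS body at the given bit rate."""
    payload_bits = SMS_CHARS * BITS_PER_CHAR  # 1120 bits
    return payload_bits / bits_per_second


for rate in (700, 200, 112):
    print(f"{rate} bit/s -> {seconds_per_message(rate):.1f} s per message")
```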
- an alternative to the approach described above would be to perform full speech recognition on the utterance and send the recognized text message in the body of the SMS message.
- This requires that a full capability recognizer be present on the phone with a lexicon containing a dictionary of words of the type that would be spoken by the consumers to whom such a phone would be sold. That might not be practical, especially if the phone is intended for sale in a market like India, where there are over 350 different languages.
- the algorithms required to perform speech recognition in such an environment would be very sophisticated and complex; moreover, the resources required to perform that speech recognition would typically be beyond those that would be available on the inexpensive cell phones intended to be sold to the general population.
- a recognizer that needs to only generate a phonetic string representation of what was spoken, as opposed to the recognized text, is much less complex to build and requires significantly less onboard computational and memory resources.
- that set of phonemes required to support phoneme recognition is small, especially in comparison to the lexicon of words that would be necessary to perform full speech recognition. Indeed, using the universal phoneme set would enable the recognizer to handle most languages for the purposes described herein.
- when the phonetic recognizer 150 statistically matches segments of the acoustic representation 140 of the utterance to acoustic representations of the phonemes, the best-match phonemes might occasionally incorrectly match the utterance. For example, the recognizer might interpret a “d” sound to be a “t,” because the features obtained by the feature extractor 130 are similar for both sounds, making neither sound a significantly better match than the other in the phonetic recognizer 150. Such errors would have a more detrimental effect on speech-to-text recognition but would typically have far less detrimental effect in the applications described herein. To someone listening to the synthesized audio message, the presence of such errors in the phonetic string that is being synthesized is not likely to render the playback unintelligible. Indeed, they might not even be noticed.
- FIG. 2 shows a high-level block diagram illustrating the functionality implemented on the receiver side of the SMS channel.
- a cellular phone 200 operated by the recipient receives the SMS message containing the non-text representation of the utterance and an SMS application 280 processes the message for presentation to the user.
- a flag within the received SMS message identifies the contents of the SMS message as a phonetic string that must be processed by the synthesizer to generate an audio signal. In other words, the flag causes the SMS application to process the message differently from a normal text message for which the contents of the message would simply be displayed to the user.
- SMS application 280 passes the phonetic string to a synthesizer 220 within the cell phone, along with any stored parameters which were supplied to control synthesizer 220 and the way it generates the sounds.
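The flag-based dispatch on the receiving side might look like the following sketch. It assumes a hypothetical body layout in which a marker `#VOS1#` is followed by two characters per phoneme: the first selects the phoneme index and the second packs an assumed duration level and volume level (8 levels each). None of these details come from the source.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
FLAG = "#VOS1#"  # hypothetical marker identifying a non-text payload


def handle_incoming(body):
    """Decide how to present a received SMS body.

    Returns ("display", text) for an ordinary text message, or
    ("synthesize", [(index, duration_level, volume_level), ...])
    when the flag marks the body as a phonetic string.
    """
    if not body.startswith(FLAG):
        return ("display", body)
    payload = body[len(FLAG):]
    if len(payload) % 2 != 0:
        raise ValueError("truncated phoneme payload")
    phonemes = []
    for i in range(0, len(payload), 2):
        index = ALPHABET.index(payload[i])       # raises ValueError if invalid
        packed = ALPHABET.index(payload[i + 1])
        phonemes.append((index, packed // 8, packed % 8))
    return ("synthesize", phonemes)


print(handle_incoming("See you at 5"))
print(handle_incoming("#VOS1#FTAIMM"))
```

The decoded tuples would then be handed to the synthesizer, which looks up each index in the local phoneme database.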
- the recipient's cell phone, like the sender's cell phone, also contains a database of phonemes along with their acoustic representations.
- the indices or pointers that make up the received phonetic string identify which phonemes from that database are to be synthesized to render the phonetic string into an audio message.
- the synthesizer plays through the cell phone speaker the sequence of sounds that represent the phonetic string. In this way, the spoken utterance is transmitted to the recipient via the SMS message facility in non-real time.
- the SMS application could be programmed to recognize that the received non-text representation is to be constructed by concatenating the contents of more than one SMS message.
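Concatenating the contents of several SMS messages could be handled as in the sketch below. The part header is a made-up convention for illustration only: a hypothetical marker `#VOSM#` followed by a two-digit part number and a two-digit part count. The source does not describe how parts are labeled.

```python
FLAG = "#VOSM#"                 # hypothetical marker for a multi-part payload
SMS_MAX_CHARS = 160
HEADER_LEN = len(FLAG) + 4      # marker + 2-digit part number + 2-digit total
CHUNK = SMS_MAX_CHARS - HEADER_LEN


def split_payload(payload):
    """Split a long non-text representation into SMS-sized bodies."""
    chunks = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)] or [""]
    total = len(chunks)
    if total > 99:
        raise ValueError("payload needs more than 99 messages")
    return [f"{FLAG}{n:02d}{total:02d}{c}" for n, c in enumerate(chunks, 1)]


def reassemble(messages):
    """Rebuild the payload; return None until every part has arrived."""
    parts = {}
    total = None
    for body in messages:
        if not body.startswith(FLAG):
            continue  # ordinary text message, not part of the payload
        number = int(body[len(FLAG):len(FLAG) + 2])
        total = int(body[len(FLAG) + 2:HEADER_LEN])
        parts[number] = body[HEADER_LEN:]
    if total is None or set(parts) != set(range(1, total + 1)):
        return None
    return "".join(parts[n] for n in range(1, total + 1))


payload = "AB" * 200                      # 400 characters of stand-in data
bodies = split_payload(payload)
print(len(bodies), [len(b) for b in bodies])
print(reassemble(list(reversed(bodies))) == payload)
```

Because each part carries its own number, the receiver can rebuild the payload even when the messages arrive out of order.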
- the units of speech used to represent the utterance are phonemes.
- any one of a variety of other symbol sets, other than phonemes, could be used.
- the symbols could be diphones, triphones, syllables, demisyllables, or any other set that serves to effectively represent the sounds contained within the spoken utterances of the intended users.
- a “tailored” dictionary of phonetic units selected to optimally represent the sounds of the language used in that market could be incorporated in the device.
- a universal set of phonemes could be used which would enable the phone to recognize and represent most languages.
- the phoneme recognizer does not include a full language model and indeed might not include any language model at all. Its function is to simply recognize the sounds within the utterance. For this purpose, it is not necessary to recognize their meaning. And thus the resulting representation that is produced by the recognizer is a non-text representation which likely is not readable by the user. That does not mean, however, that the selected symbol set might not include words among the set of symbols. Short, single syllabic words might, for example, appear as symbols or units among the selected set of symbols.
- the phonetic recognition algorithm generates a compressed version of the spoken utterance.
- that compressed version is a non-text representation (i.e., a phonetic string).
- a vocoder could be used to generate the compressed representation and then that compressed representation would be inserted into the body of the SMS message.
- any algorithm that produces a non-text representation suitable for sending over SMS or another non-voice channel could be employed. It would be desirable that the selected algorithm be able to compress speech sufficiently so that it is possible to send an utterance that is long enough to convey meaningful information. On the receiving end, the appropriate decompression algorithm would need to be implemented to reconstruct the audio version of the spoken utterance.
- one such feature is to give the sending user the option of choosing a “voice” in which the receiving phone will replay the audio message to the receiving user.
- This feature is implemented by adding an additional string of characters representing “voice” parameters to the non-text representation of the utterance, which gives instructions to the synthesis algorithm. The user can select and/or adjust these parameters through a menu driven interface on the phone. These parameters would be used to tailor the synthesizer algorithm to produce the desired effect. In the same way, parameters can be included for playback speed, or other modifications to the audio message that make it sound more natural, or more representative of the sending user.
- both phones store a number of prerecorded messages such as “please record after the beep,” “enter the phone number of the person you want to send this to,” and so on.
- the phone audibly plays an appropriate message in response to user input.
- the phone would store multiple algorithms that allow for varying length and quality of the non-text representation of the utterance. Before the user records the utterance, the phone offers a length/quality choice to the user. The user inputs his response either verbally or via the phone keypad; then the phone uses the algorithm corresponding to the user instruction to process the utterance. The phone then adds a series of characters giving instruction to the receiving phone on how to synthesize the message from the non-text representation of the utterance.
- the cellular phone is a smartphone 300 , such as is illustrated by the high-level functional block diagram of FIG. 3 .
- Smartphone 300 is a Microsoft PocketPC-powered phone which includes at its core a baseband DSP 302 (digital signal processor) for handling the cellular communication functions (including for example voiceband and channel coding functions) and an applications processor 304 (e.g. Intel StrongArm SA-1110) on which the PocketPC operating system runs.
- the phone supports GSM voice calls, SMS (Short Messaging Service) text messaging, wireless email, and desktop-like web browsing along with more traditional PDA features.
- SMS Short Messaging Service
- the transmit and receive functions are implemented by an RF synthesizer 306 and an RF radio transceiver 308 followed by a power amplifier module 310 that handles the final-stage RF transmit duties through an antenna 312 .
- An interface ASIC (application-specific integrated circuit) 314 and an audio CODEC 316 provide interfaces to a speaker, a microphone, and other input/output devices provided in the phone such as a numeric or alphanumeric keypad (not shown) for entering commands and information.
- DSP 302 uses a flash memory 318 for code store.
- a Li-Ion (lithium-ion) battery 320 powers the phone and a power management module 322 coupled to DSP 302 manages power consumption within the phone.
- Volatile and non-volatile memory for applications processor 304 is provided in the form of SDRAM 324 and flash memory 326, respectively.
- This arrangement of memory is used to hold the code for the operating system, the code for customizable features such as the phone directory, and the code for any applications software that might be included in the smartphone, including the phonetic recognition, synthesizer, and SMS application code mentioned above. It also stores the phoneme database, which includes the phonemes, acoustic representations of the phonemes, and symbols representing the phonemes.
- the visual display device for the smartphone includes an LCD driver chip 328 that drives an LCD display 330 .
- the device would not have to be a cellular phone at all, but would possess the functionality of receiving an utterance, converting it to a non-text representation of the utterance, and sending it over SMS or another non-voice channel.
- a laptop computer having a microphone, appropriate software to generate a non-text representation of an utterance, and a wireless transmitter that utilizes the SMS protocol and frequencies, or any other device with similar functionality, could also be implemented.
- Although the SMS network is presented in the above example, any network over which one might send text, data and/or media other than voice could be used.
- MMS Multi-Media Service
- the MEL cepstrum coding technique mentioned above is just one example of many known alternatives for extracting and representing features of the received utterance. Any of the other known techniques, such as LPC cepstral coefficients for example, could be employed instead of the MEL cepstrum coding technique.
- Two examples of coding techniques that could be used to generate the non-text representations are: (1) Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, “A Very Low Bit Rate Speech Coder Using HMM with Speaker Adaptations,” paper presented at the 1998 ICASSP and a version also appearing in Systems and Computers in Japan, Volume 32, Issue 12, 2001. Pages 38-46; and (2) M.
Abstract
A method of operating a mobile communication device is described. A text message is received over a wireless messaging channel, wherein the text message contains a non-text representation of an utterance. The non-text representation is extracted from the text message, and an audio representation of the spoken utterance is synthesized from the non-text representation.
Description
- This application is a continuation of co-pending U.S. patent application Ser. No. 12/146,892, filed Jun. 26, 2008, which is a divisional of U.S. patent application Ser. No. 11/110,371, filed Apr. 20, 2005 and issued as U.S. Pat. No. 7,395,078, which in turn claimed priority from U.S. Provisional Patent Application 60/563,754, filed Apr. 20, 2004, all of which are incorporated herein by reference.
- This invention generally relates to conveying voice messages over communications channels that are available on mobile communication devices, e.g. cellular phones.
- To minimize the amount of voice information transmitted over a wireless communication network, and thus maximize the number of phone calls supportable on the network at any one time, cellular phones utilize voice coders/decoders, or codecs. Codecs remove much of the redundant or unnecessary information from a speech signal. Then the fundamental elements of the speech are transmitted over the network to a receiving cellular phone where they are decoded, or recombined with data that resembles the previously removed information. This results in reconstituted speech that can be recognized by the end user. The codecs must balance the need for minimal data transmission with the need to retain enough of the original speech information to sound natural when decoded on the receiving end. In general, voice codecs today can compress speech signals to between 4.5 and 8 kbit/s, with 2.4 kbit/s being roughly the minimal rate required to maintain natural-sounding speech. Despite the ability to compress speech to these low bit rates, the network infrastructure for handling large volumes of voice calls is limited in many markets, particularly in emerging markets in developing countries. This can make the cost of a wireless phone call there significant.
- An alternate and increasingly popular method of communicating via cellular phones is text messaging. In response to the high costs of voice calls, text-based mobile-to-mobile messaging called SMS, or Short Message Service, has become heavily used in some markets, particularly amongst younger demographics. SMS enables a user to transmit and receive short text messages at any time, regardless of whether a voice call is in progress. The user typically types in the message text through the small keyboard that is provided on the device. The messages are hardware-limited to 160 characters, and are sent as packets through a low-bandwidth, out-of-band message transfer channel. This allows for facile communication with minimal burden on the wireless network.
- Most legacy wireless network systems such as GSM, TDMA, and CDMA have a text/data channel capable of sending and receiving SMS, so the infrastructure for this service already exists even in emerging markets in developing countries. Some estimates now place the global number of SMS messages at nearly 40 billion messages per month. It is thought that SMS is now the most significant source of non-voice-based revenue to wireless network operators worldwide. As a result, carriers are very interested in promoting the use of SMS. Indeed, network operators in developing markets may limit the implementation of more advanced voice network infrastructures due to the large revenues associated with text messaging.
- In some markets cell phone calls are relatively expensive, making text messaging (e.g. via SMS) a desirable communication alternative. However, in a portion of those markets other barriers may exist to using text. Both the sender and the receiver must be able to read and/or write. But in emerging markets, such as India, which has a very large population, the adult literacy rate is roughly 60%, and thus a large number of people are not sufficiently literate to type text messages into the cell phone. Thus, for many consumers in such markets who can neither compose nor read a message, SMS text messaging as a communication mode is not an effective alternative. At least some of the embodiments described herein provide a mechanism by which such consumers can nevertheless use the lower-cost, non-voice wireless communication channels for verbal communications instead of text messaging.
- In general, in one aspect, the invention features a method of sending a voice message via a mobile communication device. The method involves: receiving an utterance from a user of the mobile communication device; generating a non-text representation of the received utterance; inserting the non-text representation into a body of a text message; and sending the text message over a wireless messaging channel from the mobile communication device to a recipient's device.
- Embodiments include one or more of the following features. The mobile communication device is a cellular phone. Generating the non-text representation of the received utterance involves performing recognition on a signal derived from the received utterance to generate a string of symbols, wherein the string of symbols is the non-text representation. The symbols in the string of symbols are selected from the group consisting of phonemes, diphones, and triphones (more specifically, the symbols are phonemes). The wireless messaging channel is an SMS channel and the text message is an SMS message. The method also involves including an indicator with the text message identifying the text message as containing a non-text representation of the utterance. The non-text representation is a compressed version of the received utterance.
- In general, in another aspect, the invention features a method of receiving on a mobile communication device a message representing an utterance. The method involves: over a wireless messaging channel receiving a text message, wherein the text message contains a non-text representation of the utterance; extracting the non-text representation from the text message; synthesizing an audio representation of the spoken utterance from the non-text representation; and playing the synthesized audio representation through an audio output device on the mobile communication device.
- Embodiments include one or more of the following features. The mobile communication device is a cellular phone. The non-text representation of the utterance is a string of symbols representing sounds of the utterance. The symbols in the string of symbols are selected from the group consisting of phonemes, diphones, and triphones (more specifically, they are phonemes). The wireless messaging channel is an SMS channel and the text message is an SMS message. The received text message includes an indicator identifying the text message as containing a non-text representation of the utterance. The non-text representation is a compressed version of the utterance.
- In general, in still another aspect, the invention features a mobile communication device for sending a voice message. The mobile communication device includes: a processor system; a microphone for receiving an utterance from a user of the mobile communication device; a transceiver; and memory storing code which when executed on the processor system causes the mobile communication device to: generate a non-text representation of the received utterance; insert the non-text representation into a body of a text message; and send the text message via the transceiver over a wireless messaging channel from the mobile communication device to a recipient's device.
- In general, in still another aspect, the invention features a mobile communication device for receiving a voice message. The mobile communication device includes: a processor system; a transceiver for receiving a text message that contains a non-text representation of an utterance; an audio output device; and memory storing code which when executed on the processor system causes the mobile communication device to: extract the non-text representation from the received text message; synthesize an audio representation of the spoken utterance from the non-text representation; and play the synthesized audio representation through the audio output device.
- The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
-
FIG. 1 shows a block diagram of the phonetic recognition system.
- FIG. 2 shows a block diagram of the phonetic synthesis system.
- FIG. 3 shows a high-level block diagram of a smartphone incorporating phonetic recognition and synthesis systems.
- The described embodiment is a method of sending and receiving spoken or audio information over the SMS network available in cellular phones. A user speaks a desired message, or utterance, into a cellular phone. A phonetic recognition algorithm in the phone then generates a non-text representation of the utterance. An SMS application in the phone sends this non-text representation in the body of an SMS message via the SMS network to the recipient's phone. At the recipient's phone, another SMS application extracts the non-text representation from the body of the SMS message. Then, a synthesizer synthesizes an audio message from the non-text representation and plays the synthesized message to the recipient.
-
FIG. 1 shows a high-level block diagram illustrating in greater detail the functionality that is implemented. A user speaks an utterance 110 into the cellular phone 100 and a feature extractor 130 in the front end of a recognition engine 120 within the phone processes that utterance to extract its acoustic features. Typically, feature extractor 130 includes a digitizer 102 that converts the received analog signal into a digital representation. Digitizer 102 divides the input signal into a sequence of overlapping frames and then outputs a digital representation of the signal within each of the frames. A filter 104 then filters the spectrum of the signal to, among other things, reduce the influence of non-speech noise on the speech signal and to correct for various impairments caused by the spectral characteristics of the channel over which the utterance was received. The filtering process preserves the main verbal content of the utterance while eliminating various frequencies, e.g. the very high and very low frequencies that likely do not carry significant usable information. An analyzer 106 analyzes the filtered digital signal to extract the relevant acoustic features of the frames, i.e., the feature vector. The output of feature extractor 130 is an acoustic representation 140 of the received utterance. In the described embodiment, feature extractor 130 uses the MEL cepstrum coding technique to extract the relevant features.
- In a database 160 in memory, the phone stores a set of phonemes, which are basic phonetic units from which the sounds of the spoken language are constructed. It also stores an acoustic model for each of the phonemes and an index or pointer which identifies the phoneme. The acoustic model is statistical in nature and indicates the probability that a particular phoneme was spoken given the occurrence of a particular set of acoustic features.
- In the described embodiment, the recognition engine 120 employs an unconstrained phoneme recognizer 150 to determine the sequence of phonemes (i.e., phoneme string) that is most likely given the sequence of feature vectors which characterizes the user's utterance. The recognizer 150 is unconstrained in that it considers each candidate phoneme with equal weight, without presumption as to the order or to the language spoken by the user. In other words, phoneme recognizer 150 is a relatively crude recognizer that does not use a language model which enables it to identify the spoken words.
-
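The framing step attributed to digitizer 102 can be illustrated with a minimal sketch. The source says only that the frames overlap; the frame length, the hop size, and the function name `split_into_frames` are assumptions chosen for illustration.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a sampled signal into frames of frame_len samples.

    Consecutive frames start hop samples apart, so they overlap when
    hop < frame_len. Trailing samples that do not fill a whole frame
    are dropped. Both parameters are illustrative assumptions.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames: List[List[float]] = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


# Ten samples, frames of four samples, each starting two samples after the last.
signal = [float(n) for n in range(10)]
frames = split_into_frames(signal, frame_len=4, hop=2)
```

With a hop of half the frame length, every sample except those at the edges appears in two frames, which is one common way to keep features from changing abruptly at frame boundaries.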
Recognizer 150 statistically compares the acoustic representation of the utterance to acoustic representations of phonemes stored in a phoneme database 160 on the cell phone. Phoneme database 160 contains a sufficiently large set of phonemes, with their acoustic representation, to effectively describe the sounds that are found in the language of the user. The phoneme recognizer 150 performs a statistical comparison of the acoustic representations of the received utterance with the acoustic representations of the phonemes to identify the best match. It does this using a well-known technique referred to as hidden Markov model (HMM), though other statistical or non-statistical techniques or models that compare features of speech to stored phonetic units could also be used.
-
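The idea of an unconstrained best-match comparison can be sketched as follows. This is not the HMM technique named above: it is a deliberately simplified stand-in that scores each frame independently against a diagonal Gaussian per phoneme, gives every phoneme equal weight, and merges repeated labels. The model table `MODELS`, its values, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical acoustic models: phoneme -> (mean vector, variance vector).
MODELS: Dict[str, Tuple[List[float], List[float]]] = {
    "a": ([1.0, 0.0], [0.5, 0.5]),
    "t": ([0.0, 1.0], [0.5, 0.5]),
    "s": ([-1.0, -1.0], [0.5, 0.5]),
}


def log_likelihood(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of feature vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def recognize(frames: Sequence[Sequence[float]]) -> List[str]:
    """Label each frame with its best-scoring phoneme, then merge repeats.

    No language model and no prior over phonemes is applied, which is the
    sense in which this toy recognizer is unconstrained.
    """
    result: List[str] = []
    for frame in frames:
        best = max(MODELS, key=lambda p: log_likelihood(frame, *MODELS[p]))
        if not result or result[-1] != best:
            result.append(best)
    return result


# Four made-up feature vectors: two near "a", one near "t", one near "s".
features = [[0.9, 0.1], [1.1, -0.1], [0.1, 0.8], [-0.9, -1.2]]
phonemes = recognize(features)
```

A real recognizer would model how features evolve over time within a phoneme, which is what the HMM provides; the sketch only shows the equal-weight, best-match selection.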
Phoneme recognizer 150 outputs the recognized sequence of phonemes as a sequence of indices or pointers into its database of phonemes. That is, for each phoneme in the recognized string of phonemes, phoneme recognizer 150 looks up in its database of phonemes the particular index or pointer that identifies that phoneme and it outputs that index or pointer. The output is a non-text representation of the spoken utterance, in this case, a phoneme string. The value of this string is that a synthesizer on the receiving end of the communication link can recreate the sequence of sounds that made up the utterance, i.e., it can recreate the utterance so that it would generally be recognizable to the user on the other end. Typically, however, the phoneme string will not be easily readable as text since word recognition is not performed.
-
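The index-based output described above can be sketched with a small lookup table. The inventory `PHONEMES` below is hypothetical and far smaller than a real phoneme database; the helper names are likewise assumptions.

```python
from typing import List

# Hypothetical phoneme inventory; position in the list serves as the index.
PHONEMES = ["sil", "a", "e", "i", "o", "u", "k", "t", "d", "m", "n", "s"]
INDEX_OF = {phoneme: index for index, phoneme in enumerate(PHONEMES)}


def to_indices(phoneme_string: List[str]) -> List[int]:
    """Replace each recognized phoneme by its index in the inventory."""
    return [INDEX_OF[p] for p in phoneme_string]


def from_indices(indices: List[int]) -> List[str]:
    """Recover the phoneme labels on the receiving side."""
    return [PHONEMES[i] for i in indices]


def bits_per_index(inventory_size: int) -> int:
    """Smallest number of bits that can distinguish every index."""
    return max(1, (inventory_size - 1).bit_length())


recognized = ["n", "a", "m", "a", "s", "t", "e"]
indices = to_indices(recognized)
width = bits_per_index(len(PHONEMES))
```

Because both phones hold the same table, only the indices need to travel; with twelve entries each index fits in four bits.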
Phoneme recognizer 150 stores the phoneme string in a buffer 175 for an SMS application 180 that is also running on the cell phone. SMS application 180 generates a text message shell for receiving the non-text representation and populates its address field with the address of the recipient's phone. When buffer 175 is full or the utterance is complete, SMS application 180 inserts the stored phoneme string into the body of an SMS message, along with a flag identifying the message as containing a non-text phoneme string that is intended for a synthesizer on the receiving end. In effect, the flag alerts the SMS application on the other end to not treat the contents of the SMS as a text message that would normally be displayed to the user. SMS application 180 then wirelessly transmits the SMS message over the SMS channel to the recipient's cell phone.
-
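One way the flag and the phoneme indices might share a message body is sketched below. The source does not specify a flag format or a character encoding, so the marker `FLAG`, the 64-symbol `ALPHABET`, and both function names are assumptions made for illustration.

```python
import string
from typing import List, Optional

FLAG = "#VOX#"  # hypothetical marker identifying a non-text phoneme payload
# Hypothetical 64-symbol alphabet: one printable character per phoneme index.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
MAX_CHARS = 160  # single-message character limit cited in the background section


def build_body(indices: List[int]) -> str:
    """Build a message body: the flag followed by one character per index."""
    if any(i < 0 or i >= len(ALPHABET) for i in indices):
        raise ValueError("index outside the encodable range")
    body = FLAG + "".join(ALPHABET[i] for i in indices)
    if len(body) > MAX_CHARS:
        raise ValueError("payload does not fit in a single message")
    return body


def parse_body(body: str) -> Optional[List[int]]:
    """Return the indices if the body is flagged, or None for ordinary text."""
    if not body.startswith(FLAG):
        return None
    return [ALPHABET.index(c) for c in body[len(FLAG):]]


body = build_body([10, 1, 9, 1, 11, 7, 2])
decoded = parse_body(body)
ordinary = parse_body("See you at noon")
```

A receiving application that gets `None` would display the message as usual, while a list of indices would be handed to the synthesizer instead.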
Phoneme recognizer 150 also stores other information in SMS buffer 175 which is useful in improving the quality and/or understandability of the sounds that are synthesized by the recipient's cell phone. For example, it also specifies the temporal length of each phoneme, its volume, and possibly other parameters which can be used to control the quality of the sounds generated by the synthesizer in the receiving phone. In the described embodiment, since phoneme recognizer 150 also recognizes pauses, it truncates those recognized pauses to conserve the space required to represent the utterance.
- With the phonetic recognition algorithm, utterance 110 would typically be compressed into a non-text representation 170 at a rate of approximately 200-700 bits per second or less. When sent over the SMS network, which in many areas has a single-message information limit of 1200 bits, this corresponds to an utterance that is about 10 seconds long upon playback on the receiving cell phone.
- Note that an alternative to the approach described above would be to perform full speech recognition on the utterance and send the recognized text message in the body of the SMS message. This, however, requires that a full capability recognizer be present on the phone with a lexicon containing a dictionary of words of the type that would be spoken by the consumers to whom such a phone would be sold. That might not be practical, especially if the phone is intended for sale in a market like India, where there are over 350 different languages. The algorithms required to perform speech recognition in such an environment would be very sophisticated and complex; moreover, the resources required to perform that speech recognition would typically be beyond those that would be available on the inexpensive cell phones intended to be sold to the general population. On the other hand, a recognizer that needs to only generate a phonetic string representation of what was spoken, as opposed to the recognized text, is much less complex to build and requires significantly less onboard computational and memory resources. In addition, the set of phonemes required to support phoneme recognition is small, especially in comparison to the lexicon of words that would be necessary to perform full speech recognition. Indeed, using the universal phoneme set would enable the recognizer to handle most languages for the purposes described herein.
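The capacity figures quoted earlier in this passage (200-700 bits per second against a 1200-bit message) relate bit rate to playback length by simple division. The sketch below works through that arithmetic; it ignores any space taken by a flag or control parameters, and the function names are assumptions.

```python
def playable_seconds(message_bits: int, bits_per_second: float) -> float:
    """Length of speech that fits in one message at a given bit rate."""
    if bits_per_second <= 0:
        raise ValueError("bits_per_second must be positive")
    return message_bits / bits_per_second


def required_rate(message_bits: int, seconds: float) -> float:
    """Bit rate needed to fit the given duration into one message."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return message_bits / seconds


LIMIT_BITS = 1200  # single-message limit cited in the text

at_200 = playable_seconds(LIMIT_BITS, 200)      # lower end of the quoted range
at_700 = playable_seconds(LIMIT_BITS, 700)      # upper end of the quoted range
rate_for_10s = required_rate(LIMIT_BITS, 10.0)  # rate implied by a 10-second utterance
```

Under these numbers a single message holds 6 seconds at 200 bits per second and under 2 seconds at 700, so the 10-second figure corresponds to about 120 bits per second, which falls under the "or less" end of the stated range.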
- It should also be noted that when the phonetic recognizer 150 statistically matches segments of the acoustic representation of the utterance 110 to acoustic representations of the phonemes, the best-match phonemes might occasionally incorrectly match the utterance. For example, the recognizer might interpret a “d” sound to be a “t,” because the features obtained by the feature extractor 130 are similar for both sounds, making neither sound a significantly better match than the other in the phonetic recognizer 150. Such errors would have a more detrimental effect on speech-to-text recognition but would typically have far less detrimental effect in the applications described herein. To someone listening to the synthesized audio message, the presence of such errors in the phonetic string that is being synthesized is not likely to render the playback unintelligible. Indeed, they might not even be noticed.
-
FIG. 2 shows a high-level block diagram illustrating the functionality implemented on the receiver side of the SMS channel. A cellular phone 200 operated by the recipient receives the SMS message containing the non-text representation of the utterance and an SMS application 280 processes the message for presentation to the user. A flag within the received SMS message identifies the contents of the SMS message as a phonetic string that must be processed by the synthesizer to generate an audio signal. In other words, the flag causes the SMS application to process the message differently from a normal text message for which the contents of the message would simply be displayed to the user. SMS application 280 passes the phonetic string to a synthesizer 220 within the cell phone, along with any stored parameters which were supplied to control synthesizer 220 and the way it generates the sounds. The recipient's cell phone, like the sender's cell phone, also contains a database of phonemes along with their acoustic representations. The indices or pointers that make up the received phonetic string identify which phonemes from that database are to be synthesized to render the phonetic string into an audio message. The synthesizer plays through the cell phone speaker the sequence of sounds that represent the phonetic string. In this way, the spoken utterance is transmitted to the recipient via the SMS message facility in non-real time.
- If appropriate, it is possible to program the SMS application to generate a sequence of multiple SMS messages to handle longer utterances for which the non-text representation will not fit into the body of a single message. In essence, the SMS application would “packetize” the phonetic string and send multiple SMS messages (or packets) to the recipient's cell phone, each message containing a part of the total utterance. Each message would be indexed or tagged so that the SMS application on the recipient's side could accurately reconstruct the complete representation of the utterance. The SMS application on the recipient's side of the connection would also need to be programmed to recognize that the received non-text representation is to be constructed by concatenating the contents of more than one SMS message.
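The packetizing and reassembly described above can be sketched as follows. The source says only that each message is indexed or tagged; the `sequence/total:` prefix used here, the chunk size, and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def packetize(payload: str, chunk_size: int) -> List[str]:
    """Split a phonetic string into parts tagged 'sequence/total:chunk'."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)] or [""]
    total = len(chunks)
    return [f"{n}/{total}:{chunk}" for n, chunk in enumerate(chunks, start=1)]


def reassemble(messages: List[str]) -> str:
    """Rebuild the phonetic string from tagged parts received in any order."""
    parts: Dict[int, str] = {}
    total: Optional[int] = None
    for message in messages:
        header, _, chunk = message.partition(":")
        seq_text, _, total_text = header.partition("/")
        seq, message_total = int(seq_text), int(total_text)
        if total is None:
            total = message_total
        elif total != message_total:
            raise ValueError("parts disagree on the total count")
        parts[seq] = chunk
    if total is None or set(parts) != set(range(1, total + 1)):
        raise ValueError("one or more parts are missing")
    return "".join(parts[n] for n in range(1, total + 1))


payload = "KBJBLHCKBJ"
packets = packetize(payload, chunk_size=4)
restored = reassemble(list(reversed(packets)))
```

Carrying the total in every part lets the receiver detect a missing message before attempting synthesis, and the sequence number makes the result independent of arrival order.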
- In the embodiments described above, the units of speech used to represent the utterance are phonemes. However, any one of a variety of other symbol sets, other than phonemes, could be used. For example, the symbols could be diphones, triphones, syllables, demisyllables, or any other set that serves to effectively represent the sounds contained within the spoken utterances of the intended users.
- For an implementation that is targeted for a specific market, a “tailored” dictionary of phonetic units selected to optimally represent the sounds of the language used in that market could be incorporated in the device. Alternatively, a universal set of phonemes could be used which would enable the phone to recognize and represent most languages.
- As noted above, the phoneme recognizer does not include a full language model and indeed might not include any language model at all. Its function is to simply recognize the sounds within the utterance. For this purpose, it is not necessary to recognize their meaning. And thus the resulting representation that is produced by the recognizer is a non-text representation which likely is not readable by the user. That does not mean, however, that the selected symbol set might not include words among the set of symbols. Short, single syllabic words might, for example, appear as symbols or units among the selected set of symbols.
- In effect, the phonetic recognition algorithm generates a compressed version of the spoken utterance. In the described embodiment, that compressed version is a non-text representation (i.e., a phonetic string). In fact, other algorithms could be used which simply perform compression without performing any recognition. For example, instead of using a phoneme recognizer, a vocoder could be used to generate the compressed representation and then that compressed representation would be inserted into the body of the SMS message. In other words, any algorithm that produces a non-text representation suitable for sending over SMS or another non-voice channel could be employed. It would be desirable that the selected algorithm be able to compress speech sufficiently so that it is possible to send an utterance that is long enough to convey meaningful information. On the receiving end, the appropriate decompression algorithm would need to be implemented to reconstruct the audio version of the spoken utterance.
- Various features can be added to the system to enhance usability. As indicated above, one such feature is to give the sending user the option of choosing a “voice” in which the receiving phone will replay the audio message to the receiving user. This feature is implemented by adding an additional string of characters representing “voice” parameters to the non-text representation of the utterance, which gives instructions to the synthesis algorithm. The user can select and/or adjust these parameters through a menu driven interface on the phone. These parameters would be used to tailor the synthesizer algorithm to produce the desired effect. In the same way, parameters can be included for playback speed, or other modifications to the audio message that make it sound more natural, or more representative of the sending user. Another feature that can be implemented in the system is audio prompted guidance to both the sending and receiving users, which can better enable non-literate users to operate the system. In this case, both phones store a number of prerecorded messages such as “please record after the beep,” “enter the phone number of the person you want to send this to,” and so on. The phone audibly plays an appropriate message in response to user input.
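The "voice" parameter string described above could be carried as a short header ahead of the phonetic string. The sketch below assumes a hypothetical `key=value;key=value|` layout with integer values and assumes the separator character never occurs in the phonetic string; none of these details come from the source.

```python
from typing import Dict, Tuple

SEPARATOR = "|"  # hypothetical delimiter between parameters and phonetic string


def add_voice_parameters(phonetic: str, params: Dict[str, int]) -> str:
    """Prefix the phonetic string with a header of synthesizer parameters."""
    header = ";".join(f"{key}={value}" for key, value in sorted(params.items()))
    return header + SEPARATOR + phonetic


def split_voice_parameters(body: str) -> Tuple[Dict[str, int], str]:
    """Separate the parameter header from the phonetic string.

    A body without the separator is treated as having no parameters.
    """
    header, found, phonetic = body.partition(SEPARATOR)
    if not found:
        return {}, body
    params: Dict[str, int] = {}
    for item in header.split(";"):
        if item:
            key, _, value = item.partition("=")
            params[key] = int(value)
    return params, phonetic


tagged = add_voice_parameters("KBJBLHC", {"voice": 2, "speed": 110})
params, phonetic = split_voice_parameters(tagged)
```

The receiving phone would pass `params` to its synthesizer and fall back to default settings when the header is absent.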
- Another feature that can be implemented in the system is to allow the user to prioritize between the utterance length and quality of reproduction. In this case, the phone would store multiple algorithms that allow for varying length and quality of the non-text representation of the utterance. Before the user records the utterance, the phone offers a length/quality choice to the user. The user inputs his response either verbally or via the phone keypad; then the phone uses the algorithm corresponding to the user instruction to process the utterance. The phone then adds a series of characters giving instruction to the receiving phone on how to synthesize the message from the non-text representation of the utterance.
- In the described embodiment, the cellular phone is a smartphone 300, such as is illustrated by the high-level functional block diagram of FIG. 3. Smartphone 300 is a Microsoft PocketPC-powered phone which includes at its core a baseband DSP 302 (digital signal processor) for handling the cellular communication functions (including for example voiceband and channel coding functions) and an applications processor 304 (e.g. Intel StrongArm SA-1110) on which the PocketPC operating system runs. The phone supports GSM voice calls, SMS (Short Messaging Service) text messaging, wireless email, and desktop-like web browsing along with more traditional PDA features.
- The transmit and receive functions are implemented by an RF synthesizer 306 and an RF radio transceiver 308 followed by a power amplifier module 310 that handles the final-stage RF transmit duties through an antenna 312. An interface ASIC (application-specific integrated circuit) 314 and an audio CODEC 316 provide interfaces to a speaker, a microphone, and other input/output devices provided in the phone such as a numeric or alphanumeric keypad (not shown) for entering commands and information. DSP 302 uses a flash memory 318 for code store. A Li-Ion (lithium-ion) battery 320 powers the phone and a power management module 322 coupled to DSP 302 manages power consumption within the phone. Volatile and non-volatile memory for applications processor 304 is provided in the form of SDRAM 324 and flash memory 326, respectively. This arrangement of memory is used to hold the code for the operating system, the code for customizable features such as the phone directory, and the code for any applications software that might be included in the smartphone, including the phonetic recognition, synthesizer, and SMS application code mentioned above. It also stores the phoneme database, which includes the phonemes, acoustic representations of the phonemes, and symbols representing the phonemes.
- The visual display device for the smartphone includes an LCD driver chip 328 that drives an LCD display 330. There is also a clock module 332 that provides the clock signals for the other devices within the phone and provides an indicator of real time.
- All of the above-described components are packaged within an appropriately designed housing 334. Since the smartphone described above is representative of the general internal structure of a number of different commercially available phones and since the internal circuit design of those phones is generally known to persons of ordinary skill in this art, further details about the components shown in FIG. 3 and their operation are not being provided and are not necessary to understanding the invention.
- In general, the device would not have to be a cellular phone at all, but would possess the functionality of receiving an utterance, converting it to a non-text representation of the utterance, and sending it over SMS or another non-voice channel. For example a laptop computer having a microphone, appropriate software to generate a non-text representation of an utterance, and a wireless transmitter that utilizes the SMS protocol and frequencies, or any other device with similar functionality, could also be implemented.
- While the SMS network is presented in the above example, any network over which one might send text, data and/or media other than voice could be used. As an example, one could also use an MMS (Multimedia Messaging Service) messaging channel.
- Also, the MEL cepstrum coding technique mentioned above is just one example of many known alternatives for extracting and representing features of the received utterance. Any of the other known techniques, such as LPC cepstral coefficients for example, could be employed instead of the MEL cepstrum coding technique. Two examples of coding techniques that could be used to generate the non-text representations are: (1) Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, “A Very Low Bit Rate Speech Coder Using HMM with Speaker Adaptations,” paper presented at the 1998 ICASSP and a version also appearing in Systems and Computers in Japan, Volume 32, Issue 12, 2001, pages 38-46; and (2) M. Habibullah Pagarkar, Lakshmi Gopalakrishnan, Nimish Sheth, Rizwana Shaikh, Virag Shah, “Language Independent Speech Compression Using Devanagari Phonetics,” found on the web at the following URL: http://www.geocities.com/virag81/docs.html, both of which are incorporated herein by reference.
- Other aspects, modifications, and embodiments are within the scope of the following claims.
Claims (18)
1. A method of operating a mobile communication device, said method comprising:
over a wireless messaging channel receiving a text message, wherein the text message contains a non-text representation of an utterance;
extracting the non-text representation from the text message; and
synthesizing an audio representation of the spoken utterance from the non-text representation.
2. The method of claim 1, wherein the mobile communication device is a cellular phone.
3. The method of claim 1, wherein the non-text representation of the utterance is a string of symbols representing sounds of the utterance.
4. The method of claim 3, wherein the symbols in the string of symbols are selected from the group consisting of phonemes, diphones, triphones, syllables, and demisyllables.
5. The method of claim 3, wherein the symbols in the string of symbols are phonemes.
6. The method of claim 1, wherein the wireless messaging channel is an SMS channel and the text message is an SMS message.
7. The method of claim 1, wherein the received text message includes an indicator identifying the text message as containing a non-text representation of the utterance.
8. The method of claim 1, wherein the non-text representation is a compressed version of the utterance.
9. The method of claim 1 further comprising:
over the wireless messaging channel receiving a plurality of text messages in addition to the first-mentioned text message, said first-mentioned text message and said plurality of text messages forming a set of text messages, wherein each text message of the set of text messages contains a non-text representation of a different portion of the utterance; and
extracting the non-text representations from the plurality of text messages.
10. A mobile communication device for receiving a voice message, said mobile communication device comprising:
a processor system;
a transceiver for receiving a text message that contains a non-text representation of an utterance;
an audio output device; and
memory storing code which when executed on the processor system causes the mobile communication device to extract the non-text representation from the received text message; synthesize an audio representation of the spoken utterance from the non-text representation.
11. The mobile communication device of claim 10, wherein the mobile communication device includes a cellular phone.
12. The mobile communication device of claim 10, wherein the non-text representation of the utterance is a string of symbols representing sounds of the utterance.
13. The mobile communication device of claim 12, wherein the symbols in the string of symbols are selected from the group consisting of phonemes, diphones, triphones, syllables, and demisyllables.
14. The mobile communication device of claim 12, wherein the symbols in the string of symbols are phonemes.
15. The mobile communication device of claim 10, wherein the wireless messaging channel is an SMS channel and the text message is an SMS message.
16. The mobile communication device of claim 10, wherein the received text message includes an indicator identifying the text message as containing a non-text representation of the utterance.
17. The mobile communication device of claim 10, wherein the non-text representation is a compressed version of the utterance.
18. The mobile communication device of claim 10, wherein the code when executed on the processor further causes the mobile communication device to:
over the wireless messaging channel, receive a plurality of text messages in addition to the first-mentioned text message, said first-mentioned text message and said plurality of text messages forming a set of text messages, wherein each text message of the set of text messages contains a non-text representation of a different portion of the utterance; and
extract the non-text representations from the plurality of text messages.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/329,444 US20120089399A1 (en) | 2004-04-20 | 2011-12-19 | Voice Over Short Messaging Service |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US56375404P | 2004-04-20 | 2004-04-20 | |
US11/110,371 US7395078B2 (en) | 2004-04-20 | 2005-04-20 | Voice over short message service |
US12/146,892 US8081993B2 (en) | 2004-04-20 | 2008-06-26 | Voice over short message service |
US13/329,444 US20120089399A1 (en) | 2004-04-20 | 2011-12-19 | Voice Over Short Messaging Service |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/146,892 Continuation US8081993B2 (en) | 2004-04-20 | 2008-06-26 | Voice over short message service |
Publications (1)
Publication Number | Publication Date |
---|---|
US20120089399A1 (en) | 2012-04-12 |
Family
ID=35197620
Family Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/110,371 Active US7395078B2 (en) | 2004-04-20 | 2005-04-20 | Voice over short message service |
US12/146,892 Active 2027-05-20 US8081993B2 (en) | 2004-04-20 | 2008-06-26 | Voice over short message service |
US13/329,444 Abandoned US20120089399A1 (en) | 2004-04-20 | 2011-12-19 | Voice Over Short Messaging Service |
Family Applications Before (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/110,371 Active US7395078B2 (en) | 2004-04-20 | 2005-04-20 | Voice over short message service |
US12/146,892 Active 2027-05-20 US8081993B2 (en) | 2004-04-20 | 2008-06-26 | Voice over short message service |
Country Status (7)
Country | Link |
---|---|
US (3) | US7395078B2 (en) |
JP (1) | JP2007534278A (en) |
KR (1) | KR20070007882A (en) |
CN (1) | CN101095287B (en) |
DE (1) | DE112005000924T5 (en) |
GB (1) | GB2429137B (en) |
WO (1) | WO2005104092A2 (en) |
Families Citing this family (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7184786B2 (en) * | 2003-12-23 | 2007-02-27 | Kirusa, Inc. | Techniques for combining voice with wireless text short message services |
US7817606B2 (en) * | 2004-04-05 | 2010-10-19 | Daniel J. LIN | Method for establishing network connections between stationary terminals and remote devices through mobile devices |
MX2007001740A (en) * | 2004-08-14 | 2007-04-20 | Kirusa Inc | Methods for identifying messages and communicating with users of a multimodal message service. |
US8054950B1 (en) * | 2005-06-17 | 2011-11-08 | Sprint Spectrum L.P. | Network initiation and pull of media from mobile devices |
US7769142B2 (en) * | 2005-07-14 | 2010-08-03 | Microsoft Corporation | Asynchronous discrete manageable instant voice messages |
US20070083367A1 (en) * | 2005-10-11 | 2007-04-12 | Motorola, Inc. | Method and system for bandwidth efficient and enhanced concatenative synthesis based communication |
US7761293B2 (en) * | 2006-03-06 | 2010-07-20 | Tran Bao Q | Spoken mobile engine |
US20070208564A1 (en) * | 2006-03-06 | 2007-09-06 | Available For Licensing | Telephone based search system |
US20070207782A1 (en) * | 2006-03-06 | 2007-09-06 | Tran Bao Q | Multimedia telephone |
US8917716B2 (en) | 2006-04-17 | 2014-12-23 | Muse Green Investments LLC | Mesh network telephone system |
US8229479B1 (en) | 2006-05-23 | 2012-07-24 | Nextel Communications, Inc. | Systems and methods for multimedia messaging |
US20070287477A1 (en) * | 2006-06-12 | 2007-12-13 | Available For Licensing | Mobile device with shakeable snow rendering |
US7701331B2 (en) * | 2006-06-12 | 2010-04-20 | Tran Bao Q | Mesh network door lock |
US7976386B2 (en) * | 2006-06-12 | 2011-07-12 | Tran Bao Q | Mesh network game controller with voice transmission, search capability, motion detection, and/or position detection |
EP1879000A1 (en) * | 2006-07-10 | 2008-01-16 | Harman Becker Automotive Systems GmbH | Transmission of text messages by navigation systems |
US8050693B2 (en) * | 2007-04-02 | 2011-11-01 | Yahoo! Inc. | Employing the SMS protocol as a transport layer protocol |
US20080318679A1 (en) * | 2007-06-21 | 2008-12-25 | Alexander Bach Tran | Foot game controller with motion detection and/or position detection |
US20090005022A1 (en) * | 2007-06-29 | 2009-01-01 | Nokia Corporation | Methods, Apparatuses and Computer Program Products for Providing a Party Defined Theme |
US8213580B2 (en) * | 2007-10-25 | 2012-07-03 | International Business Machines Corporation | Automated message conversion based on availability of bandwidth |
CN101212721B (en) * | 2007-12-25 | 2011-01-19 | 华为软件技术有限公司 | Information processing method, system, and information consolidation device |
US8364486B2 (en) * | 2008-03-12 | 2013-01-29 | Intelligent Mechatronic Systems Inc. | Speech understanding method and system |
US8639505B2 (en) * | 2008-04-23 | 2014-01-28 | Nvoq Incorporated | Method and systems for simplifying copying and pasting transcriptions generated from a dictation based speech-to-text system |
US8639512B2 (en) | 2008-04-23 | 2014-01-28 | Nvoq Incorporated | Method and systems for measuring user performance with speech-to-text conversion for dictation systems |
CN102655006A (en) * | 2011-03-03 | 2012-09-05 | 富泰华工业(深圳)有限公司 | Voice transmission device and voice transmission method |
US20120259633A1 (en) * | 2011-04-07 | 2012-10-11 | Microsoft Corporation | Audio-interactive message exchange |
US9111457B2 (en) * | 2011-09-20 | 2015-08-18 | International Business Machines Corporation | Voice pronunciation for text communication |
US9992021B1 (en) | 2013-03-14 | 2018-06-05 | GoTenna, Inc. | System and method for private and point-to-point communication between computing devices |
DE102013005844B3 (en) * | 2013-03-28 | 2014-08-28 | Technische Universität Braunschweig | Method for measuring quality of speech signal transmitted through e.g. voice over internet protocol, involves weighing partial deviations of each frames of time lengths of reference, and measuring speech signals by weighting factor |
CN106469041A (en) * | 2016-08-30 | 2017-03-01 | 北京小米移动软件有限公司 | The method and device of PUSH message, terminal unit |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6452947B1 (en) * | 1998-02-16 | 2002-09-17 | Fujitsu Limited | Information retrieval system and information terminal used in the same, and recording medium |
US6574598B1 (en) * | 1998-01-19 | 2003-06-03 | Sony Corporation | Transmitter and receiver, apparatus and method, all for delivery of information |
US20060161426A1 (en) * | 2005-01-19 | 2006-07-20 | Kyocera Corporation | Mobile terminal and text-to-speech method of same |
US20060258378A1 (en) * | 2003-06-20 | 2006-11-16 | Terho Kaikuranata | Mobile device for mapping sms characters to e.g. sound, vibration, or graphical effects |
US7184049B2 (en) * | 2002-05-24 | 2007-02-27 | British Telecommunications Public Limited Company | Image processing method and system |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05188990A (en) * | 1992-01-13 | 1993-07-30 | Oki Electric Ind Co Ltd | Speech recognizing method |
US6240384B1 (en) * | 1995-12-04 | 2001-05-29 | Kabushiki Kaisha Toshiba | Speech synthesis method |
JPH10260692A (en) * | 1997-03-18 | 1998-09-29 | Toshiba Corp | Method and system for recognition synthesis encoding and decoding of speech |
US6163765A (en) * | 1998-03-30 | 2000-12-19 | Motorola, Inc. | Subband normalization, transformation, and voiceness to recognize phonemes for text messaging in a radio communication system |
US6977921B1 (en) * | 1998-08-19 | 2005-12-20 | Lucent Technologies Inc. | Using discrete message-oriented services to deliver short audio communications |
US20030014253A1 (en) * | 1999-11-24 | 2003-01-16 | Conal P. Walsh | Application of speed reading techiques in text-to-speech generation |
US20020102966A1 (en) * | 2000-11-06 | 2002-08-01 | Lev Tsvi H. | Object identification method for portable devices |
EP1215659A1 (en) * | 2000-12-14 | 2002-06-19 | Nokia Corporation | Locally distibuted speech recognition system and method of its operation |
US6625576B2 (en) * | 2001-01-29 | 2003-09-23 | Lucent Technologies Inc. | Method and apparatus for performing text-to-speech conversion in a client/server environment |
US7088723B2 (en) * | 2001-02-23 | 2006-08-08 | Samsung Electronics Co., Ltd. | System and method for enhancing a voice channel in voice over internet protocol |
US7076738B2 (en) * | 2001-03-02 | 2006-07-11 | Semantic Compaction Systems | Computer device, method and article of manufacture for utilizing sequenced symbols to enable programmed application and commands |
WO2002077975A1 (en) * | 2001-03-27 | 2002-10-03 | Koninklijke Philips Electronics N.V. | Method to select and send text messages with a mobile |
US6990180B2 (en) * | 2001-04-05 | 2006-01-24 | Nokia Mobile Phones Limited | Short voice message (SVM) service method, apparatus and system |
ES2228739T3 (en) * | 2001-12-12 | 2005-04-16 | Siemens Aktiengesellschaft | PROCEDURE FOR LANGUAGE RECOGNITION SYSTEM AND PROCEDURE FOR THE OPERATION OF AN ASI SYSTEM. |
KR100450319B1 (en) * | 2001-12-24 | 2004-10-01 | 한국전자통신연구원 | Apparatus and Method for Communication with Reality in Virtual Environments |
2005
- 2005-04-20 GB GB0620538A: GB2429137B (en), Expired - Fee Related (not active)
- 2005-04-20 US US11/110,371: US7395078B2 (en), Active
- 2005-04-20 CN CN2005800163690A: CN101095287B (en), Expired - Fee Related (not active)
- 2005-04-20 DE DE112005000924T: DE112005000924T5 (en), Ceased (not active)
- 2005-04-20 KR KR1020067022829A: KR20070007882A (en), Application Discontinuation (not active)
- 2005-04-20 WO PCT/US2005/013478: WO2005104092A2 (en), Application Filing (active)
- 2005-04-20 JP JP2007509601A: JP2007534278A (en), Pending (active)

2008
- 2008-06-26 US US12/146,892: US8081993B2 (en), Active

2011
- 2011-12-19 US US13/329,444: US20120089399A1 (en), Abandoned (not active)
Also Published As
Publication number | Publication date |
---|---|
GB2429137A (en) | 2007-02-14 |
CN101095287B (en) | 2011-05-18 |
KR20070007882A (en) | 2007-01-16 |
US8081993B2 (en) | 2011-12-20 |
DE112005000924T5 (en) | 2008-07-17 |
US20090017849A1 (en) | 2009-01-15 |
GB0620538D0 (en) | 2006-11-29 |
WO2005104092A2 (en) | 2005-11-03 |
US7395078B2 (en) | 2008-07-01 |
US20050266831A1 (en) | 2005-12-01 |
JP2007534278A (en) | 2007-11-22 |
GB2429137B (en) | 2009-03-18 |
CN101095287A (en) | 2007-12-26 |
WO2005104092A3 (en) | 2007-05-18 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US8081993B2 (en) | Voice over short message service | |
JP5247062B2 (en) | Method and system for providing a text display of a voice message to a communication device | |
US8019604B2 (en) | Method and apparatus for uniterm discovery and voice-to-voice search on mobile device | |
JP4607334B2 (en) | Distributed speech recognition system | |
US7124082B2 (en) | Phonetic speech-to-text-to-speech system and method | |
US6681208B2 (en) | Text-to-speech native coding in a communication system | |
US20100217591A1 (en) | Vowel recognition system and method in speech to text applictions | |
JPH09507105A (en) | Distributed speech recognition system | |
Husnjak et al. | Possibilities of using speech recognition systems of smart terminal devices in traffic environment | |
US9830903B2 (en) | Method and apparatus for using a vocal sample to customize text to speech applications | |
WO2007067837A2 (en) | Voice quality control for high quality speech reconstruction | |
JP3914612B2 (en) | Communications system | |
CN112908361B (en) | Spoken language pronunciation evaluation system based on small granularity | |
EP1298647B1 (en) | A communication device and a method for transmitting and receiving of natural speech, comprising a speech recognition module coupled to an encoder | |
Wutiwiwatchai et al. | Thai ASR development for network-based speech translation | |
JP3552200B2 (en) | Audio signal transmission device and audio signal transmission method | |
US20020116180A1 (en) | Method for transmission and storage of speech | |
JP2003323191A (en) | Access system to internet homepage adaptive to voice | |
CN116848581A (en) | Wireless communication device using speech recognition and speech synthesis | |
EP1103954A1 (en) | Digital speech acquisition, transmission, storage and search system and method | |
Lopes et al. | A 40 bps speech coding scheme | |
JP2003058177A (en) | Document read-aloud system | |
JPH10289092A (en) | Information processing system and information management method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: VOICE SIGNAL TECHNOLOGIES, INC., MASSACHUSETTS. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: ROTH, DANIEL L.; REEL/FRAME: 027487/0012. Effective date: 20050614 |
 | AS | Assignment | Owner name: NUANCE COMMUNICATIONS, INC., MASSACHUSETTS. Free format text: MERGER; ASSIGNOR: VOICE SIGNAL TECHNOLOGIES, INC.; REEL/FRAME: 028952/0277. Effective date: 20070514 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |