EP1103954A1 - Digital speech acquisition, transmission, storage and search system and method - Google Patents

Digital speech acquisition, transmission, storage and search system and method Download PDF

Info

Publication number
EP1103954A1
EP1103954A1 EP99203979A EP99203979A EP1103954A1 EP 1103954 A1 EP1103954 A1 EP 1103954A1 EP 99203979 A EP99203979 A EP 99203979A EP 99203979 A EP99203979 A EP 99203979A EP 1103954 A1 EP1103954 A1 EP 1103954A1
Authority
EP
European Patent Office
Prior art keywords
speech
information
determined
prosody
digital
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99203979A
Other languages
German (de)
French (fr)
Inventor
Marc Lutz
Mark Vange
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lancaster Equities Ltd
Original Assignee
Lancaster Equities Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to CA002251502A priority Critical patent/CA2251502A1/en
Application filed by Lancaster Equities Ltd filed Critical Lancaster Equities Ltd
Priority to EP99203979A priority patent/EP1103954A1/en
Priority to JP33648199A priority patent/JP2001154686A/en
Priority to AU61769/99A priority patent/AU6176999A/en
Publication of EP1103954A1 publication Critical patent/EP1103954A1/en
Withdrawn legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • the present invention relates to digital speech systems and methods. More particularly, the present invention relates to methods of acquiring, transmitting, storing and searching speech in a digital format.
  • HMM Hidden Markov Modeling
  • HMM-based systems suffer from several disadvantages. Specifically, HMM-based systems employ pattern matching and statistical models to assign words to the digitized speech. This can require significant amounts of processing power and such systems do not offer good accuracy when the vocabulary they must analyze is not very restricted. Further, these systems typically require configuration (training) to be performed by each user to learn the user's speech characteristics before a user can utilize the system. Even when trained, HMM-based systems can make errors. For example, in one test of an HMM-based system after a full training session, the phrase, "I am based in Toronto” was transcribed as "I embraced Toronto” and "Ian blazed Toronto". Further specific training was required before "based" could be recognized by the system.
  • a compact disc can store a speech and telephone systems now commonly employ digital links over which the conversations are carried. More recently, voice conversations have been transmitted over digital packet networks and, it has recently become possible to conduct telephone conversations over the internet.
  • the user speaks into a microphone at a personal computer and the output of the microphone is digitized, packetized and transmitted through the internet by the personal computer to a remote computer where it is received, converted back into an analog signal and output to the receiving user.
  • digital voice storage and transmission systems also suffer from disadvantages.
  • One of the primary disadvantages is that, to obtain reasonable quality voice signals, the digitization process must have a relatively high sampling rate (e.g. a minimum of 6,500 samples per second and preferably 8,000 or 10,000) and thus digitized voice requires a relatively high amount of bandwidth (e.g. about 1,500 bits per second) for transmission.
  • Voxware, Inc. 305 College Road East, Princeton, New Jersey, USA sells a variety of digital voice codecs. These codecs employ 8kHz sampling of a 4kHz speech frequency and require between 1260 and 6400 bps (bits per second) for transmission, depending upon the desired quality and robustness of the transmission.
  • An additional disadvantage is that, if such digital voice transmissions are to be stored for later access, which can be required for call logging and other purposes, a large amount of storage space can be required. These bandwidth and/or storage requirements can prevent voice transmission from being provided as a feature for intemet multi-player games, etc.
  • a digital speech system comprising:
  • the system further comprises speech acquisition means to acquire an analog electronic representation of the speech and digitization means to convert the analog representation of the speech to the digital format.
  • the system further comprises an output means to output the facsimile to a user in an audible manner.
  • a method of acquiring and constructing digital speech comprising the steps of:
  • a digital speech system in accordance with an embodiment of the present invention, is indicated generally at 10 in Figure 1.
  • System 10 includes a transducer 12, such as a microphone or other suitable means to acquire speech in electronic form, a speech acquisition portion 14, as described below, a speech generation portion 16, also described below, and a speech output 18, such as a loudspeaker or other suitable means for presenting the constructed speech facsimile to a user or subsequent system.
  • Connection 20 between speech acquisition portion 14 and speech generation portion 16 can be a connection, such as a data network, or can be a storage and retrieval system, such as a digital mass storage device which allows acquired speech to be stored and to subsequently be provided to speech generation portion 16.
  • speech acquisition portion 14 includes five principal functional blocks.
  • portion 14 acquires, via transducer 12, speech to be processed and in block 28 the acquired speech is digitized at an appropriate sampling rate.
  • the digitized speech is analyzed to determine speech element information from the acquired speech and in block 36 prosody information is determined from the acquired speech.
  • the functions of blocks 32 and 36 can be performed in a serial manner, as shown in the Figure, or can be performed in parallel if desired. In either case, the time information relating the determined speech elements to the determined prosody is part of the determined information.
  • the acquisition and digitization of the speech at blocks 24 and 28 can be accomplished in any suitable manner as will occur to those of skill in the art and many conventional techniques to accomplish this are known. For example, in the environment of personal computers, it is not uncommon that speech can be acquired and digitized with an inexpensive microphone connected to an input port on a sound card, such as a SoundBlaster or compatible card. It is also contemplated that the functions of blocks 24 and 28 can be performed in separately and/or at a different time. For example, the acquired and digitized speech can be provided to speech system 10 from a previously recorded CD ROM, etc.
  • the speech elements referred to with respect to block 32 can be any representative elements, such as allophones, diphones, phonemes, etc., as will occur to those of skill in the art.
  • phonemes are determined from the acquired speech.
  • Phonemes are a formal representation of phones (the distinct sounds in speech) and include vowels, semivowels, dipthongs and constants. Different languages and/or accents have different phonemes and, depending upon the phoneme definition standard employed and its implementation, there are between about forty and about sixty phonemes in English.
  • Prosody which is the information determined in block 36, represents pitch, timing and other information associated with speech.
  • An example of prosody is the rising pitch at the end of a sentence to represent a question.
  • Other examples of prosody include pauses to represent the ends of sentences, etc.
  • the determination of prosody information from speech is known.
  • Conventional speech element determination systems for example phoneme determination systems, return a matrix for all possible phonemes indicating the probability that each particular phoneme was present.
  • Word recognizer systems search for the matrix for the letter combinations that have the highest probability of being a word.
  • the present invention can further employ these probabilities as a reference to tables of allophones, looking for sound similarities, overlap in sounds, and such, to assist in finding the most probable sound match. rather than the most probable word match. Matching on sounds, rather than words, results in a much better approximation of the way speech works, especially the way people mix and blend sounds in real speech.
  • the speech information comprising the speech element and prosody information
  • the information is encoded so that it can be effectively represented in a digital form. For example, if system 20 has sixty phonemes defined for it, for the English language, a set of sixty unique symbols can be defined to represent these phonemes and this symbol set can be represented by six bits of infonnation. As will be apparent, six bits are much less than the many bytes of information which would be required for a digitized sample of the spoken phoneme.
  • One encoding scheme that can be employed is to use the Unicode values for the International Phonetic Alphabet to represent the phonetic information and this is the presently preferred encoding scheme for the determined speech elements.
  • Prosody information is also encoded in a digital form, for example as numbers representing the duration of speech elements and pitch changes.
  • the encoding of the determined prosody information can be accomplished with a relatively small symbol set and/or in a size efficient manner, compared to digitized samples of speech.
  • the time relationship between the speech elements and prosody information is maintained.
  • the information encoded at block 40 is stored or transmitted, through connection 20, at block 44.
  • the present invention allows a relatively low bandwidth method of transmitting speech information and/or a relatively efficient storage method for speech.
  • prior art systems of which the inventor is aware either digitize speech for transmission and/or storage and reconvert that speech to an analog form for use, or perform a speech to text conversion.
  • the present invention determines information about the speech, the information being stored and/or transmitted in a manner that maintains the information without maintaining the actual speech.
  • the encoded determined information can be employed to construct a facsimile representation of the original speech or to otherwise interact and access the information of the original speech.
  • speech generation portion 16 includes four principal functional blocks.
  • functional block 52 the encoded information is received at or retrieved by the speech generation process.
  • the encoded information is decoded to obtain the speech element information and prosody information.
  • the speech information and related prosody information is compared to a database of speech sound elements available to the speech generator. Statistical modeling, or any other suitable selection process, is then employed to select appropriate available speech sound elements.
  • block 64 the selected speech sound elements are assembled in accordance with the prosody information and output to the user as a constructed facsimile of the original speech.
  • the speech sound elements can comprise digitized speech samples and/or synthetic speech elements, such as those produced by a vocoder or by FM synthesis.
  • the selected speech sound elements comprise appropriate instructions for the vocoder or FM synthesis system to create the desired sounds.
  • the determination of speech elements and prosody information and the selection of speech sound elements for construction of facsimiles of the original speech can be less than completely accurate.
  • most individuals do not always speak in a perfectly correct manner and yet their speech is still understood by others.
  • This understanding of 'flawed' speech is possible due to the redundancy of information in speech, including the context of the speech and the relationship between the speech elements and the prosody.
  • the present invention takes advantage of the human ability to understand even flawed speech without undue effort and it has been determined by the present inventor that relatively many errors in the overall process of constructing a facsimile of the original speech are easily tolerated by most humans.
  • Phoneme determination, and speech determination in general, are statistical in nature and prone to errors and 'almost right' recognition.
  • the present invention preferably takes advantage of the relationship in speech between phonemes and allophones to provide acceptable playback of speech.
  • English has between about forty and about sixty phonemes but there are many ways of pronouncing these phonemes, since they tend to be modified by dialect, preceding and following sounds, emotion, non-word speech components such as the pitch rise for questions, etc.
  • the present invention can use the speech relationship between phonemes and allophones to select the most probable phoneme matches by also considering the list of allophones.
  • One use of the present invention allows transmission of the encoded speech information to a remote location via a relatively low bandwidth data network.
  • a conversation can be conducted by two or more individuals over a packet network 100, such as the internet, with each party employing a computer system 104a, 104b such as a general purpose IBM compatible personal computer or a dedicated special purpose computer-based system.
  • Computer system 104 operates to acquire the user's speech via a microphone 108a, 108b and to determine and encode the speech elements and prosody information as acquisition portion 14 of system 10.
  • the encoded information is then transmitted through the intemet 100 to the other user.
  • each user's computer system 104 receives the encoded speech information from the other user or users and constructs a facsimile of the original speech as the speech generation portion 16 of system 10, as described above, which is then output to the user via a loudspeaker 112, headset or other suitable output device.
  • the relatively low bandwidth required to transmit the encoded speech information allows the present invention to be employed to permit players of online games to have spoken communication with other players over the limited bandwidth of their connections to the internet over which the other game-related information must also be transmitted.
  • the speech generation portion 16 at each user's computer system can employ digitized samples of'celebrity' voices as the speech sound elements. It is contemplated that a user can select a Darth Vader-like voice to be assigned to a first user and a HAL 9000-like voice to be assigned to a second user. Further, it is also contemplated that the speech generation portion 16 can include a predefined selection of celebrity of voices, and each first user can specify which of the available voices is to be used to reconstruct the first user's speech in the second user's speech generation portion 16.
  • the present invention can also be employed to make more efficient use of systems wherein bandwidth is expensive.
  • many corporations now operate call centers which, for labor cost and other reasons, are often located in locations which are distant from the customers who access the call center. While the final financial analysis still supports such distant call centers, much of the otherwise potential savings can be absorbed by the necessity that the corporation provide and pay for toll free access to the call center for the customers. While a call center would still have to fund toll free access, with the present invention it is possible to significantly reduce the total required bandwidth to reduce the corporation's costs.
  • the corporation can provide sub-centers 160 which can be accessed locally by customers 164.
  • Sub-centers 160 which can be personal computer systems, minicomputers, special purpose devices or any other suitable system as will occur to those of skill in the art, comprise the above-described speech acquisition portion 14 and speech generation portion 16 of system 10. Sub-centers 160 acquire, determine, encode and transmit speech information from customers 164 calling the sub-center 160 to call center 168 over a suitable communications link 166.
  • Communications link 166 can be a leased data connection, a packet network, satellite link or any other suitable data communications link, as will occur to those of skill in the art.
  • the transmitted speech information is received at a suitable call center computer system 168 at the call center.
  • Call center computer system 168 can be personal computer systems, minicomputers, special purpose devices or any other suitable system as will occur to those of skill in the art, and also comprises the above-described speech acquisition portion 14 and speech generation portion 16 of system 10.
  • speech generation portion 16 constructs a facsimile of the customer's original speech.
  • the attendant 172 at the call center then responds to the inquiry and the attendant's speech is processed by the speech acquisition portion 14 of call center computer system 168 to determine, encode and transmit, over link 166, speech information to the appropriate sub-center 160 where a construction of a facsimile of the attendant's speech is created and provided to the customer 164 by sub-center 160.
  • the present invention also provides a process whereby space efficient storage of speech can be accomplished.
  • every conversation between customers 164 and attendants 174 can be stored, in encoded speech information form, and requires less storage capacity than other conventional methods.
  • speech stored with the present invention can be effectively searched for the occurrence of words or phrases of interest.
  • speech generation portions 16 employed at sub-centers 160 can employ a preferred voice.
  • the corporation for which the call center is operated can have a celebrity corporate spokesperson and the celebrity's voice can be employed in the sub-center speech generation portions 16.
  • An additional feature of the present invention is that it can be used to alter or remove cultural or regional accents and speech features of the speakers.
  • One of the disadvantages with operating call centers distal customers is that accents and other speech features of the call center attendants can emphasize or otherwise interfere with the relationship between the attendant and the customer. For example, a customer from the US mid-west may be uncomfortable in speaking with an attendant in a call center in Ireland, who has a noticeable Irish accent. In the past, this has necessitated that call centers attempt to hire or train attendants without strong accents or other undesired identifying speech characteristics.
  • Figure 6 shows the principal functional blocks of a speech generation process in accordance with another embodiment of the present invention.
  • the speech generation process constructs a facsimile of the original speech from determined speech elements and prosody, it is possible to identify and reduce, or even eliminate, accents and other undesired identifying speech characteristics.
  • the speech generation process of Figure 6 is similar to that of Figure 3, discussed above.
  • encoded speech information is received via a communications link or is retrieved from a storage device.
  • the encoded information is decoded to obtain the speech elements and prosody information.
  • the decoded information is examined, using statistical models, pattern matching or other appropriate techniques, to recognize undesired speech characteristics and to substitute information into the speech element information and prosody information to reduce or remove those identified undesired characteristics to form modified speech information.
  • the modified speech information is used to select speech sound elements and in block 216 a constructed modified facsimile of the original speech is output to the user.
  • the recognition and substitution process to remove undesired speech aspects can be performed in the speech acquisition portion 14, prior to transmission or storage of the speech information, or in the speech generation portion 16, as discussed above.
  • Another feature of the present invention is the ability to facilitate the location of desired phrases or words in speech.
  • an attorney may wish to locate a phrase in an audio recording of the testimony of a witness, which can be several hours in length.
  • a search can be conducted only by either manually or automatically transcribing the testimony to text and then performing conventional text searching operations or by listening to the testimony recording. Neither of these approaches is satisfactory in many circumstances, with manual transcription being expensive, automatic transcription being subject to errors and listening to the recording being time consuming.
  • the speech to be searched can be processed, with a speech acquisition process similar to that of Figure 3, to determine encoded speech information.
  • Speech can either be stored only in encoded speech element form, such as for the call center call logging discussed above, or can be stored in both an encoded speech element form and an analog or digitized form. In either case, the determined speech elements are searched to locate words or phrases of interest.
  • the word or phrase is processed by the speech acquisition process to determine encoded speech information for the word or phrase.
  • the search is then performed on the speech by comparing the stored determined speech element and prosody information with the speech element and prosody information for the search word or phrase.
  • a suitable pattern matching or statistical algorithm can be employed to locate high probability matches between the speech information and the search information.
  • the speech has only been stored as encoded speech information, such as the above-mentioned call logging for a call center.
  • the identified matches are provided to a speech generation process, as described above, to let the searcher hear a construction of the appropriate portions of the speech.
  • the time information related to the located matches can be provided so that the user can directly access the appropriate times in the original speech.
  • the above-mentioned processes for removing accents and undesired speech characteristics can also be employed with this search process to augment the ability to locate phrases or words when the searcher and the searchee have different speech characteristics.
  • the stored determined speech information can be indexed, in any suitable manner as will occur to those of skill in the art, to enhance the speed at which searches of large datasets can be performed.
  • the present invention can also be employed in voice transcription systems.
  • a user can provide dictation which is processed to obtain encoded speech information, as described above, which can be stored in an efficient manner.
  • the encoded speech can be used to construct a facsimile of the original speech to the user, thus providing a spoken memo pad-like application, or can be subsequently transcribed to text.
  • digitized samples of the voice of the user can be user to reconstruct the speech, if desired.
  • the transcription can be performed by humans, in a conventional manner, or can be performed by speech to text transcriptions systems, such as the above-mentioned Dragonsoft and other systems.
  • One of the perceived advantages of the present invention when used with speech to text transcription systems, is that as the speech information is stored, it can be examined in a non-real time manner. This allows multiple passes to be performed by the speech to text transcription system. Also, as mentioned above, accents or other speech characteristics which can confuse a speech to text transcription system, can be removed or mitigated by the present invention and this can reduce or eliminate lengthy training of the transcription system by the user.
  • encoded speech information can be emailed to a recipient, who will then provide the speech information to a speech generation process, which can be a plug-in component for the email program, to hear a constructed facsimile of the original speech.
  • a speech generation process which can be a plug-in component for the email program, to hear a constructed facsimile of the original speech.
  • the present invention can provide voice recognition of users.
  • several base samples of the user's speech are processed to obtain speech element and prosody base information for that user and this information is stored in a database.
  • the user is requested to speak to provide a test sample of their speech to the system.
  • the test sample is processed to obtain speech element and prosody information from the test sample and the obtained information is compared to the base information previously determined for the user.
  • a statistical comparison is then performed between the information for the base samples and the test sample to determine the degree of correspondence therebetween.
  • the present invention is less sensitive to differences in the microphone and environment with which the base and the test samples are acquired and does not require that the user repeat the particular base samples originally provided to the system.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Facsimiles In General (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A digital speech system processes acquired speech to determine speech element information, such as the phonemes in the speech, and associated the prosody information. This determined information is then encoded for transmission and/or storage. The encoded information is decoded to recover the speech element information and the prosody information which can be provided to a speech generator to construct a facsimile of the original speech. Also, the speech element and prosody information can be searched to locate words or phrases of interest in the speech.

Description

    Field Of The Invention
  • The present invention relates to digital speech systems and methods. More particularly, the present invention relates to methods of acquiring, transmitting, storing and searching speech in a digital format.
  • Background Of The Invention
  • Much research and effort has been directed to speech recognition by computer systems and/or to transmission of speech in digital formats. For example, IBM, Dragonsofl and others sell products for personal computers which allow a user to speak computer commands into a microphone and have the computer respond accordingly to the spoken commands. More advanced systems which have recently become available allow a user to dictate into the microphone and have the computer transcribe the spoken words into written text.
  • These systems operate by converting the analog signal produced by the microphone into a digital signal. The computer system then analyses the digital signals, typically employing a technique called Hidden Markov Modeling (HMM), in an attempt to extract content and meaning from digitized speech. When employed with telephony and restricted ranges of vocabulary, recent HMM systems have been able to achieve rates of accuracy approaching 90%, or better.
  • However, HMM-based systems suffer from several disadvantages. Specifically, HMM-based systems employ pattern matching and statistical models to assign words to the digitized speech. This can require significant amounts of processing power and such systems do not offer good accuracy when the vocabulary they must analyze is not very restricted. Further, these systems typically require configuration (training) to be performed by each user to learn the user's speech characteristics before a user can utilize the system. Even when trained, HMM-based systems can make errors. For example, in one test of an HMM-based system after a full training session, the phrase, "I am based in Toronto" was transcribed as "I embraced Toronto" and "Ian blazed Toronto". Further specific training was required before "based" could be recognized by the system.
  • It is also known to transmit or store voice data in digital form. For example, a compact disc can store a speech and telephone systems now commonly employ digital links over which the conversations are carried. More recently, voice conversations have been transmitted over digital packet networks and, it has recently become possible to conduct telephone conversations over the internet. In the intemet systems, the user speaks into a microphone at a personal computer and the output of the microphone is digitized, packetized and transmitted through the internet by the personal computer to a remote computer where it is received, converted back into an analog signal and output to the receiving user.
  • However, digital voice storage and transmission systems also suffer from disadvantages. One of the primary disadvantages is that, to obtain reasonable quality voice signals, the digitization process must have a relatively high sampling rate (e.g. a minimum of 6,500 samples per second and preferably 8,000 or 10,000) and thus digitized voice requires a relatively high amount of bandwidth (e.g. about 1,500 bits per second) for transmission. For example, Voxware, Inc., 305 College Road East, Princeton, New Jersey, USA sells a variety of digital voice codecs. These codecs employ 8kHz sampling of a 4kHz speech frequency and require between 1260 and 6400 bps (bits per second) for transmission, depending upon the desired quality and robustness of the transmission.
  • An additional disadvantage is that, if such digital voice transmissions are to be stored for later access, which can be required for call logging and other purposes, a large amount of storage space can be required. These bandwidth and/or storage requirements can prevent voice transmission from being provided as a feature for intemet multi-player games, etc.
  • Yet another disadvantage with prior art digital speech systems is that they do not provide an acceptable method of searching the speech for words or phrases of interest. At best, the prior art systems allow "overspeed" playback of the speech to allow a user to listen to the speech at a fast playback rate to attempt to locate portions of interest in the speech which can then be played back at the correct playback rate. At worst, these systems require that the user listen to the speech at correct playback rates to locate portions of interest. These search techniques leave much to be desired, especially if a long portion of speech is to be searched.
  • It is desired to have a system and method which allows relatively efficient acquisition, transmission, searching and storage of speech information in a digital form.
  • Summary Of The Invention
  • It is an object of the present invention to provide a novel system and method of acquiring, transmitting, searching and/or storing speech information which obviates or mitigates at least one of the above-mentioned disadvantages of the prior art.
  • According to a first aspect of the present invention, there is provided a digital speech system, comprising:
  • speech element determination means operable on speech in a digital format to determine speech element information from said speech;
  • speech prosody determination means operable on said speech to determine prosody information from said speech;
  • encoding means to encode speech information comprising said determined speech element information, said determined speech prosody information and timing information relating thereto in a digital form;
  • decoding means to decode said encoded speech information to obtain said determined speech element information, said determined speech prosody information and said timing information;
  • comparison means to compare said determined speech element information and said determined speech prosody information to a database of speech elements to select speech sound elements which correspond thereto; and
  • speech generating means to assemble said selected speech sound elements to construct a facsimile of said speech.
  • Preferably, the system further comprises speech acquisition means to acquire an analog electronic representation of the speech and digitization means to convert the analog representation of the speech to the digital format. Also preferably, the system further comprises an output means to output the facsimile to a user in an audible manner.
  • According to another aspect of the present invention, there is provided a method of acquiring and constructing digital speech, comprising the steps of:
  • examining digitized speech to determine speech element information relating to said speech;
  • examining said digitized speech to determine prosody information relating to said speech;
  • encoding speech information corresponding to said determined speech element information, said determined prosody information and timing information relating thereto;
  • receiving and decoding said encoded speech information to obtain said timing information, said determined speech element information and said determined prosody information;
  • comparing said decoded speech element information and prosody information to a database to select corresponding speech sound elements; and
  • assembling said selected speech sound elements to construct a facsimile of said speech.
  • Brief Description Of The Drawings
  • Preferred embodiments of the present invention will now be described, by way of example only, with reference to the attached Figures, wherein:
  • Figure 1 shows a block diagram representation of a digital speech system in accordance with an embodiment of the present invention;
  • Figure 2 shows a flowchart of a digital speech acquisition process of Figure 1;
  • Figure 3 shows a flowchart of the speech generation process of Figure 1;
  • Figure 4 shows a schematic representation of a telephone-like communication system operating between two users connected to the internet in accordance with the present invention;
  • Figure 5 shows a schematic representation of a call center employing an embodiment of the present invention; and
  • Figure 6 shows a flowchart of a speech generation process in accordance with another embodiment of the present invention.
  • Detailed Description Of The Invention
  • A digital speech system, in accordance with an embodiment of the present invention, is indicated generally at 10 in Figure 1. System 10 includes a transducer 12, such as a microphone or other suitable means to acquire speech in electronic form, a speech acquisition portion 14, as described below, a speech generation portion 16, also described below, and a speech output 18, such as a loudspeaker or other suitable means for presenting the constructed speech facsimile to a user or subsequent system. Connection 20 between speech acquisition portion 14 and speech generation portion 16 can be a connection, such as a data network, or can be a storage and retrieval system, such as a digital mass storage device which allows acquired speech to be stored and to subsequently be provided to speech generation portion 16.
  • As shown is Figure 2, speech acquisition portion 14 includes five principal functional blocks. In block 24, portion 14 acquires, via transducer 12, speech to be processed and in block 28 the acquired speech is digitized at an appropriate sampling rate. In block 32, the digitized speech is analyzed to determine speech element information from the acquired speech and in block 36 prosody information is determined from the acquired speech. As will be apparent to those of skill in the art, the functions of blocks 32 and 36 can be performed in a serial manner, as shown in the Figure, or can be performed in parallel if desired. In either case, the time information relating the determined speech elements to the determined prosody is part of the determined information. Once the speech element and prosody information is determined in blocks 32 and 36, this information is encoded by block 40 and the resulting encoded speech information can be transmitted and/or stored by connection 20, as indicated at block 44.
  • The acquisition and digitization of the speech at blocks 24 and 28 can be accomplished in any suitable manner as will occur to those of skill in the art and many conventional techniques to accomplish this are known. For example, in the environment of personal computers, it is not uncommon that speech can be acquired and digitized with an inexpensive microphone connected to an input port on a sound card, such as a SoundBlaster or compatible card. It is also contemplated that the functions of blocks 24 and 28 can be performed in separately and/or at a different time. For example, the acquired and digitized speech can be provided to speech system 10 from a previously recorded CD ROM, etc.
  • The speech elements referred to with respect to block 32 can be any representative elements, such as allophones, diphones, phonemes, etc., as will occur to those of skill in the art. In a presently preferred embodiment of the invention, phonemes are determined from the acquired speech. Phonemes are a formal representation of phones (the distinct sounds in speech) and include vowels, semivowels, dipthongs and constants. Different languages and/or accents have different phonemes and, depending upon the phoneme definition standard employed and its implementation, there are between about forty and about sixty phonemes in English.
  • The analysis of speech, via digital processing, to determine phonemes is known and is, for example, discussed in, "Fundamentals of Speech Recognition: A Short Course", by Dr. Joseph Picone, Institute for Signal and Information Processing, Mississippi State University. This reference, the contents of which are incorporated herein by reference, is available at http://www.isip.msstate.edu/resources/index.html#journals.
  • Prosody, which is the information determined in block 36, represents pitch, timing and other information associated with speech. An example of prosody is the rising pitch at the end of a sentence to represent a question. Other examples of prosody include pauses to represent the ends of sentences, etc. Again, the determination of prosody information from speech is known. The paper, "Automatic Prosodic Analysis for Computer Aided Pronunciation Teaching", Paul Christopher Bagshaw, available at http://www.cstr.ed.ac.uk/pubs_by_year.html#theses provides a good description of the determination of prosody and the contents of this reference are also incorporated herein by reference.
  • Conventional speech element determination systems, for example phoneme determination systems, return a matrix for all possible phonemes indicating the probability that each particular phoneme was present. Word recognizer systems search for the matrix for the letter combinations that have the highest probability of being a word. Unlike these conventional speech element determination systems, the present invention can further employ these probabilities as a reference to tables of allophones, looking for sound similarities, overlap in sounds, and such, to assist in finding the most probable sound match. rather than the most probable word match. Matching on sounds, rather than words, results in a much better approximation of the way speech works, especially the way people mix and blend sounds in real speech.
  • Once the speech information, comprising the speech element and prosody information, has been determined in blocks 32 and 36, it is encoded at block 40. The information is encoded so that it can be effectively represented in a digital form. For example, if system 20 has sixty phonemes defined for it, for the English language, a set of sixty unique symbols can be defined to represent these phonemes and this symbol set can be represented by six bits of infonnation. As will be apparent, six bits are much less than the many bytes of information which would be required for a digitized sample of the spoken phoneme. One encoding scheme that can be employed is to use the Unicode values for the International Phonetic Alphabet to represent the phonetic information and this is the presently preferred encoding scheme for the determined speech elements.
  • Prosody information is also encoded in a digital form, for example as numbers representing the duration of speech elements and pitch changes. Again, the encoding of the determined prosody information can be accomplished with a relatively small symbol set and/or in a size efficient manner, compared to digitized samples of speech. As mentioned above, in the encoding, the time relationship between the speech elements and prosody information is maintained.
  • The information encoded at block 40 is stored or transmitted, through connection 20, at block 44.
  • The present invention allows a relatively low bandwidth method of transmitting speech information and/or a relatively efficient storage method for speech. As discussed above, prior art systems of which the inventor is aware either digitize speech for transmission and/or storage and reconvert that speech to an analog form for use, or perform a speech to text conversion. In contrast, the present invention determines information about the speech, the information being stored and/or transmitted in a manner that maintains the information without maintaining the actual speech. Specifically, the encoded determined information can be employed to construct a facsimile representation of the original speech or to otherwise interact and access the information of the original speech.
  • Construction of a facsimile representation of the original speech from the stored or transmitted encoded information can be accomplished in speech generation portion 16, as shown in Figure 3. As illustrated, speech generation portion 16 includes four principal functional blocks. In functional block 52 the encoded information is received at or retrieved by the speech generation process. In block 56, the encoded information is decoded to obtain the speech element information and prosody information. Next, in block 60, the speech information and related prosody information is compared to a database of speech sound elements available to the speech generator. Statistical modeling, or any other suitable selection process, is then employed to select appropriate available speech sound elements. In block 64, the selected speech sound elements are assembled in accordance with the prosody information and output to the user as a constructed facsimile of the original speech. As will be apparent to those of skill in the art, the speech sound elements can comprise digitized speech samples and/or synthetic speech elements, such as those produced by a vocoder or by FM synthesis. In the latter case, the selected speech sound elements comprise appropriate instructions for the vocoder or FM synthesis system to create the desired sounds.
  • As will be apparent to those of skill in the art, the determination of speech elements and prosody information and the selection of speech sound elements for construction of facsimiles of the original speech can be less than completely accurate. However, as is well known, most individuals do not always speak in a perfectly correct manner and yet their speech is still understood by others. This understanding of 'flawed' speech is possible due to the redundancy of information in speech, including the context of the speech and the relationship between the speech elements and the prosody. The present invention takes advantage of the human ability to understand even flawed speech without undue effort and it has been determined by the present inventor that relatively many errors in the overall process of constructing a facsimile of the original speech are easily tolerated by most humans.
  • Phoneme determination, and speech determination in general, are statistical in nature and prone to errors and 'almost right' recognition. The present invention preferably takes advantage of the relationship in speech between phonemes and allophones to provide acceptable playback of speech. As mentioned above, English has between about forty and about sixty phonemes but there are many ways of pronouncing these phonemes, since they tend to be modified by dialect, preceding and following sounds, emotion, non-word speech components such as the pitch rise for questions, etc. In constructing a facsimile of the original speech, the present invention can use the speech relationship between phonemes and allophones to select the most probable phoneme matches by also considering the list of allophones. As the top matches tend to be similar, for example "f" and "v" might be the top recognition for the first phone in "fire", an allophone which is a close match to both can be found by examining the allophone list against the phoneme probabilities. Selecting the best phoneme match based on both the phoneme probability and the allophone probability allows the present invention to obtain matches very close to the original sound.
  • One use of the present invention. which is illustrated schematically in Figure 4, allows transmission of the encoded speech information to a remote location via a relatively low bandwidth data network. For example, a conversation can be conducted by two or more individuals over a packet network 100, such as the internet, with each party employing a computer system 104a, 104b such as a general purpose IBM compatible personal computer or a dedicated special purpose computer-based system. Computer system 104 operates to acquire the user's speech via a microphone 108a, 108b and to determine and encode the speech elements and prosody information as acquisition portion 14 of system 10. The encoded information is then transmitted through the intemet 100 to the other user. Further, each user's computer system 104 receives the encoded speech information from the other user or users and constructs a facsimile of the original speech as the speech generation portion 16 of system 10, as described above, which is then output to the user via a loudspeaker 112, headset or other suitable output device.
  • The relatively low bandwidth required to transmit the encoded speech information, which can also be compressed prior to transmission and de-compressed after reception if desired, allows the present invention to be employed to permit players of online games to have spoken communication with other players over the limited bandwidth of their connections to the internet over which the other game-related information must also be transmitted.
  • In addition to use for bandwidth efficient communications, the present invention can provide a number of other features, advantages and uses. For example, in the above-mentioned online gaming uses, the speech generation portion 16 at each user's computer system can employ digitized samples of'celebrity' voices as the speech sound elements. It is contemplated that a user can select a Darth Vader-like voice to be assigned to a first user and a HAL 9000-like voice to be assigned to a second user. Further, it is also contemplated that the speech generation portion 16 can include a predefined selection of celebrity of voices, and each first user can specify which of the available voices is to be used to reconstruct the first user's speech in the second user's speech generation portion 16.
  • The present invention can also be employed to make more efficient use of systems wherein bandwidth is expensive. For example, many corporations now operate call centers which, for labor cost and other reasons, are often located in locations which are distant from the customers who access the call center. While the final financial analysis still supports such distant call centers, much of the otherwise potential savings can be absorbed by the necessity that the corporation provide and pay for toll free access to the call center for the customers. While a call center would still have to fund toll free access, with the present invention it is possible to significantly reduce the total required bandwidth to reduce the corporation's costs. In particular, as shown in Figure 5, the corporation can provide sub-centers 160 which can be accessed locally by customers 164. Sub-centers 160, which can be personal computer systems, minicomputers, special purpose devices or any other suitable system as will occur to those of skill in the art, comprise the above-described speech acquisition portion 14 and speech generation portion 16 of system 10. Sub-centers 160 acquire, determine, encode and transmit speech information from customers 164 calling the sub-center 160 to call center 168 over a suitable communications link 166. Communications link 166 can be a leased data connection, a packet network, satellite link or any other suitable data communications link, as will occur to those of skill in the art.
  • The transmitted speech information is received at a suitable call center computer system 168 at the call center. Call center computer system 168 can be personal computer systems, minicomputers, special purpose devices or any other suitable system as will occur to those of skill in the art, and also comprises the above-described speech acquisition portion 14 and speech generation portion 16 of system 10. When call center computer system 168 receives the transmitted speech information, speech generation portion 16 constructs a facsimile of the customer's original speech. The attendant 172 at the call center then responds to the inquiry and the attendant's speech is processed by the speech acquisition portion 14 of call center computer system 168 to determine, encode and transmit, over link 166, speech information to the appropriate sub-center 160 where a construction of a facsimile of the attendant's speech is created and provided to the customer 164 by sub-center 160.
  • In this manner, less bandwidth is required for communications link 166 between the sub-center 160 and call center computer system 168, thus reducing the corporation's costs. As is also indicated in the Figure, if desired and if sufficient processing capacity exists at sub-center 160, multiple calls can be simultaneously processed by sub-center 160 and speech information between sub-center 160 and call center computer system 168 can be appropriately multiplexed over the communications link 166.
  • In addition to reducing bandwidth requirements or making more efficient use of available bandwidth, the present invention also provides a process whereby space efficient storage of speech can be accomplished. In this specific case of a call center, every conversation between customers 164 and attendants 174 can be stored, in encoded speech information form, and requires less storage capacity than other conventional methods. Further, as discussed below, speech stored with the present invention can be effectively searched for the occurrence of words or phrases of interest.
  • As an additional advantage of the present invention, speech generation portions 16 employed at sub-centers 160 can employ a preferred voice. For example, the corporation for which the call center is operated can have a celebrity corporate spokesperson and the celebrity's voice can be employed in the sub-center speech generation portions 16.
  • An additional feature of the present invention is that it can be used to alter or remove cultural or regional accents and speech features of the speakers. One of the disadvantages with operating call centers distal customers is that accents and other speech features of the call center attendants can emphasize or otherwise interfere with the relationship between the attendant and the customer. For example, a customer from the US mid-west may be uncomfortable in speaking with an attendant in a call center in Ireland, who has a noticeable Irish accent. In the past, this has necessitated that call centers attempt to hire or train attendants without strong accents or other undesired identifying speech characteristics.
  • Figure 6 shows the principal functional blocks of a speech generation process in accordance with another embodiment of the present invention. In this embodiment, as the speech generation process constructs a facsimile of the original speech from determined speech elements and prosody, it is possible to identify and reduce, or even eliminate, accents and other undesired identifying speech characteristics. The speech generation process of Figure 6 is similar to that of Figure 3, discussed above. In block 200, encoded speech information is received via a communications link or is retrieved from a storage device. In block 204, the encoded information is decoded to obtain the speech elements and prosody information. In block 208, the decoded information is examined, using statistical models, pattern matching or other appropriate techniques, to recognize undesired speech characteristics and to substitute information into the speech element information and prosody information to reduce or remove those identified undesired characteristics to form modified speech information. In block 212, the modified speech information is used to select speech sound elements and in block 216 a constructed modified facsimile of the original speech is output to the user.
  • As will be apparent to those of skill in the art, the recognition and substitution process to remove undesired speech aspects can be performed in the speech acquisition portion 14, prior to transmission or storage of the speech information, or in the speech generation portion 16, as discussed above.
  • Another feature of the present invention is the ability to facilitate the location of desired phrases or words in speech. In many applications it is desired to have the ability to search stored speech for the occurrence of particular words or phrases. For example, an attorney may wish to locate a phrase in an audio recording of the testimony of a witness, which can be several hours in length. Presently, such a search can be conducted only by either manually or automatically transcribing the testimony to text and then performing conventional text searching operations or by listening to the testimony recording. Neither of these approaches is satisfactory in many circumstances, with manual transcription being expensive, automatic transcription being subject to errors and listening to the recording being time consuming.
  • With the present invention, the speech to be searched can be processed, with a speech acquisition process similar to that of Figure 3, to determine encoded speech information. Speech can either be stored only in encoded speech element form, such as for the call center call logging discussed above, or can be stored in both an encoded speech element form and an analog or digitized form. In either case, the determined speech elements are searched to locate words or phrases of interest.
  • When it is desired to locate the occurrence of a word or phrase of interest, the word or phrase is processed by the speech acquisition process to determine encoded speech information for the word or phrase. The search is then performed on the speech by comparing the stored determined speech element and prosody information with the speech element and prosody information for the search word or phrase. A suitable pattern matching or statistical algorithm can be employed to locate high probability matches between the speech information and the search information. For applications where the speech has only been stored as encoded speech information, such as the above-mentioned call logging for a call center. the identified matches are provided to a speech generation process, as described above, to let the searcher hear a construction of the appropriate portions of the speech. For applications where the original speech, in analog or digitized form, has also been stored, the time information related to the located matches can be provided so that the user can directly access the appropriate times in the original speech. As will be apparent to those of skill in the art, the above-mentioned processes for removing accents and undesired speech characteristics can also be employed with this search process to augment the ability to locate phrases or words when the searcher and the searchee have different speech characteristics. As will also be apparent, the stored determined speech information can be indexed, in any suitable manner as will occur to those of skill in the art, to enhance the speed at which searches of large datasets can be performed.
  • The present invention can also be employed in voice transcription systems. For example, a user can provide dictation which is processed to obtain encoded speech information, as described above, which can be stored in an efficient manner. The encoded speech can be used to construct a facsimile of the original speech to the user, thus providing a spoken memo pad-like application, or can be subsequently transcribed to text. In the former case, digitized samples of the voice of the user can be user to reconstruct the speech, if desired. In the latter case, the transcription can be performed by humans, in a conventional manner, or can be performed by speech to text transcriptions systems, such as the above-mentioned Dragonsoft and other systems.
  • One of the perceived advantages of the present invention, when used with speech to text transcription systems, is that as the speech information is stored, it can be examined in a non-real time manner. This allows multiple passes to be performed by the speech to text transcription system. Also, as mentioned above, accents or other speech characteristics which can confuse a speech to text transcription system, can be removed or mitigated by the present invention and this can reduce or eliminate lengthy training of the transcription system by the user.
  • As will be apparent to those of skill in the art, the present invention can also be employed with voice messaging type applications. For example, encoded speech information can be emailed to a recipient, who will then provide the speech information to a speech generation process, which can be a plug-in component for the email program, to hear a constructed facsimile of the original speech.
  • In another aspect, the present invention can provide voice recognition of users. In this embodiment, several base samples of the user's speech are processed to obtain speech element and prosody base information for that user and this information is stored in a database. To subsequently identify that user, the user is requested to speak to provide a test sample of their speech to the system. The test sample is processed to obtain speech element and prosody information from the test sample and the obtained information is compared to the base information previously determined for the user. A statistical comparison is then performed between the information for the base samples and the test sample to determine the degree of correspondence therebetween. Unlike prior art voice printing systems, the present invention is less sensitive to differences in the microphone and environment with which the base and the test samples are acquired and does not require that the user repeat the particular base samples originally provided to the system.
  • The above-described embodiments of the invention are intended to be examples of the present invention and alterations and modifications may be effected thereto, by those of skill in the art, without departing from the scope of the invention which is defined solely by the claims appended hereto.

Claims (9)

  1. A digital speech system, comprising:
    speech element determination means operable on speech in a digital format to determine speech element information from said speech;
    speech prosody determination means operable on said speech to determine prosody information from said speech;
    encoding means to encode speech information comprising said determined speech element information, said determined speech prosody information and timing information relating thereto in a digital form;
    decoding means to decode said encoded speech information to obtain said determined speech element information, said determined speech prosody information and said timing information;
    comparison means to compare said determined speech element information and said determined speech prosody information to a database of speech elements to select speech sound elements which correspond thereto; and
    speech generating means to assemble said selected speech sound elements to construct a facsimile of said speech.
  2. The digital speech system of claim 1 further comprising:
    speech acquisition means to acquire an analog electronic representation of said speech; and
    digitization means to convert said analog representation of said speech to said digital format.
  3. The digital speech system of claim 1 further comprising:
       an output means to produce an output of said facsimile in a manner audible to a user.
  4. The digital speech system of claim 1 wherein said comparison means further comprises recognition means to identify undesired speech characteristics, said selection of speech sound elements being performed to reduce the presence of said identified undesired speech characteristics in said facsimile.
  5. The digital speech system of claim 1 further comprising search means operable to receive an input representing a word or phrase of interest and to examine said determined speech element information to locate occurrences of said word or phrase therein.
  6. A method of acquiring and constructing digital speech, comprising the steps of:
    (i) examining digitized speech to determine speech element information relating to said speech;
    (ii) examining said digitized speech to determine prosody information relating to said speech;
    (iii) encoding speech information corresponding to said determined speech element information, said determined prosody information and timing information relating thereto;
    (iv) receiving and decoding said encoded speech information to obtain said timing information, said determined speech element information and said determined prosody information;
    (v) comparing said decoded speech element information and prosody information to a database to select corresponding speech sound elements; and
    (vi) assembling said selected speech sound elements to construct a facsimile of said speech.
  7. The method of claim 6 further comprising the steps of:
    acquiring an electronic representation of speech in an analog form; and
    digitizing said analog representation of speech to obtain said digitized speech for step (i),
  8. The method of claim 6 wherein step (v) further comprises comparing said decoded speech element information to a predefined database of undesired speech characteristics and selecting speech sound elements which reduce said undesired speech characteristics is said facsimile.
  9. The method of claim 6 further comprising the step of receiving from a user a word or phrase of interest and search said determined speech element information and said determined prosody information to locate occurrences of said received word or phrase and to identify said locations to said user.
EP99203979A 1998-10-26 1999-11-26 Digital speech acquisition, transmission, storage and search system and method Withdrawn EP1103954A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CA002251502A CA2251502A1 (en) 1998-10-26 1998-10-26 Digital speech acquisition, transmission, storage and search system and method
EP99203979A EP1103954A1 (en) 1998-10-26 1999-11-26 Digital speech acquisition, transmission, storage and search system and method
JP33648199A JP2001154686A (en) 1998-10-26 1999-11-26 Digital audio system, method of acquisition and construction of digital audio signal
AU61769/99A AU6176999A (en) 1998-10-26 1999-11-29 digital speech acquisition, transmission, storage and search system and method

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CA002251502A CA2251502A1 (en) 1998-10-26 1998-10-26 Digital speech acquisition, transmission, storage and search system and method
EP99203979A EP1103954A1 (en) 1998-10-26 1999-11-26 Digital speech acquisition, transmission, storage and search system and method
JP33648199A JP2001154686A (en) 1998-10-26 1999-11-26 Digital audio system, method of acquisition and construction of digital audio signal
AU61769/99A AU6176999A (en) 1998-10-26 1999-11-29 digital speech acquisition, transmission, storage and search system and method

Publications (1)

Publication Number Publication Date
EP1103954A1 true EP1103954A1 (en) 2001-05-30

Family

ID=31999270

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99203979A Withdrawn EP1103954A1 (en) 1998-10-26 1999-11-26 Digital speech acquisition, transmission, storage and search system and method

Country Status (4)

Country Link
EP (1) EP1103954A1 (en)
JP (1) JP2001154686A (en)
AU (1) AU6176999A (en)
CA (1) CA2251502A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5649060A (en) * 1993-10-18 1997-07-15 International Business Machines Corporation Automatic indexing and aligning of audio and text using speech recognition
GB2332841A (en) * 1997-12-24 1999-06-30 Motorola Ltd Speech communication systems
US5933805A (en) * 1996-12-13 1999-08-03 Intel Corporation Retaining prosody during speech analysis for later playback

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5649060A (en) * 1993-10-18 1997-07-15 International Business Machines Corporation Automatic indexing and aligning of audio and text using speech recognition
US5933805A (en) * 1996-12-13 1999-08-03 Intel Corporation Retaining prosody during speech analysis for later playback
GB2332841A (en) * 1997-12-24 1999-06-30 Motorola Ltd Speech communication systems

Also Published As

Publication number Publication date
AU6176999A (en) 2001-05-31
CA2251502A1 (en) 2000-04-26
JP2001154686A (en) 2001-06-08

Similar Documents

Publication Publication Date Title
US7124082B2 (en) Phonetic speech-to-text-to-speech system and method
Rabiner Applications of voice processing to telecommunications
US8447606B2 (en) Method and system for creating or updating entries in a speech recognition lexicon
US8706488B2 (en) Methods and apparatus for formant-based voice synthesis
US20100217591A1 (en) Vowel recognition system and method in speech to text applictions
US7143033B2 (en) Automatic multi-language phonetic transcribing system
US5911129A (en) Audio font used for capture and rendering
US20070088547A1 (en) Phonetic speech-to-text-to-speech system and method
US8386265B2 (en) Language translation with emotion metadata
US6119086A (en) Speech coding via speech recognition and synthesis based on pre-enrolled phonetic tokens
US8812314B2 (en) Method of and system for improving accuracy in a speech recognition system
JPH10507536A (en) Language recognition
US20080140398A1 (en) System and a Method For Representing Unrecognized Words in Speech to Text Conversions as Syllables
JPH11513144A (en) Interactive language training device
Qian et al. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS
US20030158734A1 (en) Text to speech conversion using word concatenation
KR19980070329A (en) Method and system for speaker independent recognition of user defined phrases
US6148285A (en) Allophonic text-to-speech generator
CN101111885A (en) Audio recognition system for generating response audio by using audio data extracted
JP2020071676A (en) Speech summary generation apparatus, speech summary generation method, and program
KR100901640B1 (en) Method of selecting the training data based on non-uniform sampling for the speech recognition vector quantization
Atal et al. Speech research directions
EP1103954A1 (en) Digital speech acquisition, transmission, storage and search system and method
Mai et al. Design of Vietnamese speech corpus and current status
Oyo et al. A preliminary speech learning tool for improvement of African English accents

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

17P Request for examination filed

Effective date: 20011130

D17P Request for examination filed (deleted)
REG Reference to a national code

Ref country code: DE

Ref legal event code: 8566

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20011201