US20020184024A1 - Speech recognition for recognizing speaker-independent, continuous speech - Google Patents

Speech recognition for recognizing speaker-independent, continuous speech

Info

Publication number
US20020184024A1
Authority
US
United States
Prior art keywords
voice stream
frequency spectrum
transneme
transnemes
frequency
Legal status
Granted
Application number
US09/813,965
Other versions
US7089184B2
Inventor
Phillip Rorex
Current Assignee
NURV CENTER TECHNOLOGIES Inc
Original Assignee
NURV CENTER TECHNOLOGIES Inc
Application filed by NURV CENTER TECHNOLOGIES Inc
Priority to US09/813,965
Assigned to NURV CENTER TECHNOLOGIES, INC. (Assignor: ROREX, PHILLIP G.)
Publication of US20020184024A1
Application granted
Publication of US7089184B2
Status: Expired - Lifetime

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems

Definitions

  • Another prior art approach is speaker-dependent speech recognition, wherein the speech recognition device is trained to a particular person's voice. Only that particular speaker is recognized, and the speaker must go through a training or “enrollment” process of reading or inputting a particular speech into the speech recognition device. A higher accuracy is achieved without increased cost or increased computational time.
  • The drawback is that speaker-dependent voice recognition is limited to one person, requires lengthy training periods, may require many computation cycles, and is limited to applications where the speaker's identity is known a priori.
  • a speech recognition device comprises an I/O device for accepting a voice stream and a frequency domain converter communicating with the I/O device.
  • the frequency domain converter converts the voice stream from a time domain to a frequency domain and generates a plurality of frequency domain outputs.
  • the speech recognition device further comprises a frequency domain output storage communicating with the frequency domain converter.
  • the frequency domain output storage comprises at least two frequency spectrum frame storages for storing at least a current frequency spectrum frame and a previous frequency spectrum frame.
  • a frequency spectrum frame storage of the at least two frequency spectrum frame storages comprises a plurality of frequency bins storing the plurality of frequency domain outputs.
  • the speech recognition device further comprises a processor communicating with the plurality of frequency bins and a memory communicating with the processor.
  • a frequency spectrum difference storage in the memory stores one or more frequency spectrum differences calculated as a difference between the current frequency spectrum frame and the previous frequency spectrum frame.
  • At least one feature storage is included in the memory for storing at least one feature extracted from the voice stream.
  • At least one transneme table is included in the memory, with the at least one transneme table including a plurality of transneme table entries and with a transneme table entry of the plurality of transneme table entries mapping a predetermined frequency spectrum difference to at least one predetermined transneme of a predetermined verbal language.
  • At least one mappings storage is included in the memory, with the at least one mappings storage storing one or more found transnemes.
  • At least one transneme-to-vocabulary database is included in the memory, with the at least one transneme-to-vocabulary database mapping a set of one or more found transnemes to at least one speech unit of the predetermined verbal language.
  • At least one voice stream representation storage is included in the memory, with the at least one voice stream representation storage storing a voice stream representation created from the one or more found transnemes.
  • the speech recognition device calculates a frequency spectrum difference between a current frequency spectrum frame and a previous frequency spectrum frame, maps the frequency spectrum difference to a transneme table, and converts the frequency spectrum difference to a transneme if the frequency spectrum difference is greater than a predetermined difference threshold.
  • the speech recognition device creates a digital voice stream representation of the voice stream from one or more transnemes thus produced.
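As a point of reference only, the following is a minimal sketch, in Python with NumPy, of the processing loop the bullets above describe. The function names, data layouts, and the use of a Euclidean distance for table matching are illustrative assumptions; the patent does not prescribe them.

      import numpy as np

      DIFF_THRESHOLD = 0.05  # assumed threshold (see the 3%-7% figures later)

      def recognize(frames, transneme_table, vocab_db):
          """frames: iterable of frequency spectrum frames (NumPy arrays).
          transneme_table: list of (difference_vector, transneme) pairs.
          vocab_db: dict mapping tuples of transnemes to speech units."""
          previous = None
          found = []  # accumulated found transnemes (the mappings storage)
          for current in frames:
              if previous is not None:
                  diff = current - previous  # frequency spectrum difference
                  if np.abs(diff).max() > DIFF_THRESHOLD:
                      # candidate transneme: map the difference to the
                      # closest transneme table entry (assumed metric)
                      _, transneme = min(
                          transneme_table,
                          key=lambda e: np.linalg.norm(e[0] - diff))
                      found.append(transneme)
              previous = current
          # groupings of found transnemes map to speech units of the
          # predetermined verbal language via the vocabulary database
          return vocab_db.get(tuple(found), found)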
  • a method for performing speech recognition on a voice stream comprises the steps of determining one or more candidate transnemes in the voice stream, mapping the one or more candidate transnemes to a transneme table to convert the one or more candidate transnemes to one or more found transnemes, and mapping the one or more found transnemes to a transneme-to-vocabulary database to convert the one or more found transnemes to one or more speech units.
  • a method for performing speech recognition on a voice stream comprises the step of calculating a frequency spectrum difference between a current frequency spectrum frame and a previous frequency spectrum frame.
  • the current frequency spectrum frame and the previous frequency spectrum frame are in a frequency domain and are separated by a predetermined time interval.
  • the method further comprises the step of mapping the frequency spectrum difference to a transneme table to convert the frequency spectrum difference to at least one transneme if the frequency spectrum difference is greater than a predetermined difference threshold.
  • a digital voice stream representation of the voice stream is created from one or more transnemes thus produced.
  • a method for performing speech recognition on a voice stream comprises the step of performing a frequency domain transformation on the voice stream upon a predetermined time interval to create a current frequency spectrum frame.
  • the method further comprises the step of normalizing the current frequency spectrum frame.
  • the method further comprises the step of calculating a frequency spectrum difference between the current frequency spectrum frame and a previous frequency spectrum frame.
  • the method further comprises the step of mapping the frequency spectrum difference to a transneme table to convert the frequency spectrum difference to at least one found transneme if the frequency spectrum difference is greater than a predetermined difference threshold. The method therefore creates a digital voice stream representation of the voice stream from one or more found transnemes thus produced.
  • FIG. 1 shows a representative audio signal in a time domain
  • FIG. 2 shows the voice stream after it has been converted from the time domain into the frequency domain
  • FIG. 3 shows how the digitized frequency domain response may be digitally represented and stored
  • FIG. 4 shows a speech recognition device according to one embodiment of the invention
  • FIG. 5 shows detail of a frequency domain output storage
  • FIG. 6 is a flowchart of a first speech recognition method embodiment according to the invention.
  • FIG. 7 shows a frequency spectrum frame according to a first embodiment of a frequency domain conversion
  • FIG. 8 shows a frequency spectrum frame according to a second embodiment of the frequency domain conversion
  • FIG. 9 is a flowchart of a second speech recognition method embodiment
  • FIG. 10 shows a first frequency spectrum frame obtained at a first point in time
  • FIG. 11 shows a second frequency spectrum frame obtained at a second point in time
  • FIG. 12 shows how the frequency domain transformation may be processed using overlapping frequency domain conversion windows
  • FIG. 13 is a flowchart of a third speech recognition method embodiment
  • FIGS. 14 - 16 show a frequency normalization operation on a current frequency spectrum frame
  • FIGS. 17 - 18 show an amplitude normalization operation on a current frequency spectrum frame.
  • FIG. 4 shows a speech recognition device 400 according to one embodiment of the invention.
  • the speech recognition device 400 includes an input/output (I/O) device 401 , a frequency domain converter 406 , frequency domain output storage 410 , a processor 414 , and a memory 420 .
  • the speech recognition device 400 of the invention performs speech recognition and converts a voice stream input into a digital voice stream representation.
  • the voice stream representation comprises a series of symbols that digitally represents the voice stream and may be used to recreate the voice stream.
  • the speech recognition is accomplished by finding transnemes in the voice stream and converting the found transnemes into speech units of a predetermined verbal language.
  • a speech unit may be a word, a portion of a word, a phrase, an expression, or any other type of verbal utterance that has an understood meaning in the predetermined verbal language.
  • a phoneme is generally described as being the smallest unit of sound in any particular language. Each vocalized phoneme is a distinct sound and therefore may be characterized by a substantially unique frequency domain response, substantially over the duration of the vocalization of the phoneme. In the English language, it is generally accepted that there are about 34 phonemes that are used to create all parts of the spoken language. There are fewer than 100 identified phonemes in all languages combined.
  • a transneme is a transition between the phoneme (or allophone) components of human speech. There are approximately 10,000 transnemes. Transnemes are therefore smaller subunits or components of speech, and are used by the invention to produce a speech recognition that is speaker-independent and that operates on connected speech (i.e., the speaker can talk normally and does not have to take care to voice each word separately and distinctly).
  • the I/O device 401 may be any type of device that is capable of accepting an audio signal input that includes a voice stream.
  • the I/O device 401 provides the audio signal input to the speech recognition device 400 .
  • the I/O device 401 may accept a digitized voice stream or may include a digitizer, such as an analog to digital converter (not shown) if the incoming audio signal is in analog form.
  • the I/O device 401 may accept a voice stream that is already compressed or that has already been converted into the frequency domain.
  • the I/O device 401 may be any type of input device, including a microphone or sound transducer, or some other form of interface device.
  • the I/O device 401 may additionally be a radio frequency receiver or transceiver, such as a radio frequency front-end, including an antenna, amplifiers, filters, down converters, etc., that produces an audio signal. This may include any type of radio receiver, such as a cell phone, satellite phone, pager, one- or two-way radio, etc.
  • the I/O device 401 may be a TV receiver, an infrared receiver, an ultrasonic receiver, etc.
  • the I/O device 401 may be an interface to a digital computer network, such as the Internet or a local area network (LAN), or an interface to an analog network, such as an analog telephone network.
  • the I/O device 401 may be capable of outputting a voice stream representation.
  • the speech recognition may be used to compress the voice stream for transmission to another device.
  • the speech recognition of the invention may be used in a cell phone.
  • the voice stream (or other sound input) may be received by the I/O device 401 and may be converted into text or speech representation symbols by the speech recognition device 400 .
  • the speech representation symbols may then be transmitted to a remote location or device by the I/O device 401 .
  • the speech representation symbols may be converted back into an audio output. This use of the invention for audio compression greatly reduces bandwidth requirements of the speech transmission.
  • the frequency domain converter 406 communicates with the I/O device 401 and converts the incoming voice stream signal into a frequency domain signal (See FIGS. 2 and 3).
  • the frequency domain converter 406 may be, for example, a discrete Fourier transform device or a fast Fourier transform device (FFT) that performs a Fourier transform on the time domain voice stream.
  • the frequency domain converter 406 may be a filter bank employing a plurality of filters, such as band pass filters that provide a frequency spectrum output, or may be a predictive coder.
  • the frequency domain converter 406 generates a plurality of outputs, with each output representing a predetermined frequency or frequency band.
  • the plurality of outputs of the frequency domain conversion are referred to herein as a frequency spectrum frame, with a frequency spectrum frame comprising a plurality of amplitude values that represent the frequency components of the voice stream over the predetermined frequency conversion window period (see FIG. 10, for example, and also see the discussion accompanying FIG. 13).
  • the number of outputs may be chosen according to a desired frequency band size and according to a range of audible frequencies desired to be analyzed.
  • the frequency domain converter 406 generates 128 outputs for a predetermined frequency domain conversion window.
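For illustration, such a frame could be produced with a fast Fourier transform as sketched below. The 20 kHz sampling rate and 10 ms window are assumptions chosen to cover the 1 kHz to 10 kHz speech band discussed elsewhere in the text, not values fixed by the patent.

      import numpy as np

      SAMPLE_RATE = 20000           # assumed: covers the 1-10 kHz speech band
      WINDOW = SAMPLE_RATE // 100   # 10 ms conversion window = 200 samples
      N_BINS = 128                  # outputs per conversion window

      def spectrum_frame(samples):
          """Convert one 10 ms window of time-domain samples into a
          128-bin frequency spectrum frame of band amplitudes."""
          # zero-padded real FFT gives N_BINS + 1 magnitudes; keep 128 bands
          spectrum = np.abs(np.fft.rfft(samples, n=2 * N_BINS))
          return spectrum[:N_BINS]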
  • FIG. 5 shows detail of the frequency domain output storage 410 .
  • the frequency domain output storage 410 communicates with the frequency domain converter 406 .
  • the frequency domain output storage 410 is a memory that comprises at least two frequency spectrum frames 410 a and 410 b. Alternatively, the frequency domain output storage 410 may be eliminated and the plurality of outputs from the frequency domain converter 406 may be stored in the memory 420 .
  • Each frequency spectrum frame 410 a, 410 b, etc. stores a set of digital values V 1 -V N that represents the amplitudes or quantities of frequency band components present in the voice stream over the predetermined frequency conversion window.
  • Each frame contains N bins, with N corresponding to the number of frequency domain transformation outputs generated by the frequency domain converter 406 .
  • Each bin therefore contains a frequency domain conversion output value V.
  • the sets of digital values V 1 -V N in successive frequency spectrum frames, such as those in frames 410 a and 410 b may be analyzed in order to process the input voice stream.
  • the processor 414 may be any type of processor, and communicates with the frequency domain output storage 410 in order to receive the frequency spectrum frame or frames contained within the frequency domain output storage 410 .
  • the processor 414 is also connected to the memory 420 and is optionally connected to the I/O device 401 whereby the processor may receive a digitized time domain signal.
  • the processor 414 may be connected to the I/O device 401 in order to extract features from the time domain signal, such as pitch and volume features. These speech features may be used for adding punctuation, emphasis, etc., to the voice stream representation output, for example, and may also be used to normalize the frequency spectrum frames (the normalization is discussed below in the text accompanying FIGS. 14 - 18 ).
  • the memory 420 communicates with the processor 414 and may be any type of storage device, including types of random access memory (RAM), types of read-only memory (ROM), magnetic tape or disc, bubble memory, optical memory, etc.
  • the memory 420 may be used to store an operating program that includes a speech recognition algorithm according to various aspects of the invention.
  • the memory 420 may include variables and data used to process the speech and may hold temporary values during processing.
  • the memory 420 may include a spectrum difference storage 421 , a feature storage 422 , at least one transneme table 423 , at least one mappings storage 424 , at least one transneme-to-vocabulary database 427 , and at least one voice stream representation 428 .
  • the memory 420 may optionally include a frequency spectrum frame storage (not shown) for storing one or more frequency spectrum frames, such as the output of the frequency domain converter 406 .
  • the spectrum difference storage 421 stores at least one frequency spectrum difference calculated from current and previous frequency spectrum frames.
  • the frequency spectrum difference therefore, is a set of values that represents a change in spectral properties of the voice stream.
  • the feature storage 422 contains at least one feature extracted from the voice stream, such as a volume or pitch feature, for example.
  • the feature may have been obtained from the voice stream in either the time domain or the frequency domain.
  • Data stored in the feature storage 422 may be used to normalize a frequency spectrum frame and to aid in grammar interpretation such as by adding punctuation.
  • the features may be used to provide context when matching a found transneme to speech units, words, or phrases.
  • the at least one transneme table 423 contains a plurality of transnemes.
  • Transnemes are used to analyze the input voice stream and are used to determine the parts of speech therein in order to create a voice stream representation in digital form. For example, the phrase “Hello, World” is made up of 11 transnemes (“[ ]” represents silence). If smaller frame sizes are utilized, transnemes can be smaller parts of phonemes, with multiple transnemes per phoneme.
  • Transnemes are dictated by the sounds produced by the human vocal apparatus. Therefore, transnemes are essentially independent of the verbal language of the speaker. However, a particular language may only contain a subset of all of the existing transnemes. As a result, the transneme table 423 may contain only the transnemes necessary for a predetermined verbal language. This may be done in order to conserve memory space. In applications where language translation or speech recognition of multiple verbal languages is required, the transneme table may likely contain the entire set of transnemes.
  • the mappings storage 424 stores one or more mappings produced by comparing one or more frequency spectrum differences to at least one transneme table 423 . In other words, when transnemes are found through the matching process, they are accumulated in the mappings storage 424 .
  • the transneme-to-vocabulary database 427 maps found transnemes to one or more speech units, with a speech unit being a word, a portion of a word, a phrase, or any utterance that has a defined meaning.
  • the speech recognition device 400 may compare groupings of one or more found transnemes to the transneme-to-vocabulary database 427 in order to find speech units and create words and phrases.
  • the transneme-to-vocabulary database 427 may contain entries that map transnemes to one or more verbal languages, and may therefore convert transnemes into speech units of one or more predetermined verbal languages.
  • the speech recognition device 400 may include multiple transneme-to-vocabulary databases 427 , with additional transneme-to-vocabulary databases 427 capable of being added to give additional speech recognition capability in other languages.
  • the voice stream representation storage 428 is used to store found words and phrases as part of the creation of a voice stream representation.
  • the voice stream representation 428 may accumulate a series of symbols, such as text, that have been constructed from the voice stream input.
  • the speech recognition device 400 is speaker-independent and reliably processes and converts connected speech (connected speech is verbalized without any concern or effort to separate the words or talk in any specific manner in order to aid in speech recognition).
  • the speech recognition device 400 therefore operates on connected speech in that it can discern breaks between words and/or phrases without excessive computational requirements. A reduced computational workload may translate to simpler and less expensive hardware.
  • frequency spectrum frame differencing eliminates the need for the complex and inefficient dynamic time warping procedure of the prior art, and therefore generally requires only about one cycle per comparison of a spectrum difference to a database or table. Therefore, unlike the prior art dynamic time warping and statistical modeling, the speech recognition device 400 of the present invention does not need to perform a large number of comparisons in order to account for time variations in the voice stream.
  • the speech recognition device 400 may be a specialized device constructed for speech recognition or may be integrated into another electronic device.
  • the speech recognition device 400 may be integrated into any manner of electronic device, including cell phones, satellite phones, conventional land-line telephones (digital and analog), radios (one and two-way), pagers, personal digital assistants (PDAs), laptop and desktop computers, mainframes, digital network appliances and workstations, etc.
  • the speech recognition device 400 may be integrated into specialized controllers or general purpose computers by implementing software to perform speech recognition according to the present invention. Speech recognition and control may therefore be added to personal computers, automobiles and other vehicles, factory equipment and automation, robotics, security devices, etc.
  • FIG. 6 is a flowchart 600 of a first speech recognition method embodiment according to the invention.
  • the method determines one or more candidate transnemes in a voice stream. This is accomplished by analyzing the voice stream in a frequency domain and comparing frequency spectrum frames (captured over predetermined time periods) in order to determine one or more candidate transnemes.
  • the frequency spectrum frames may be obtained for overlapping time periods (windows).
  • a transneme is a transition between phonemes or allophones, and a candidate transneme may be determined by finding a significant spectrum variation in the frequency domain. Therefore, a comparison in the frequency domain may be performed between a current frequency spectrum frame (containing frequency components of predetermined frequencies or frequency bands) to a previous frequency spectrum frame in order to determine voice stream frequency changes over time. Periods of silence or periods of substantially no frequency change are ignored.
  • the I/O device 401 passes a digitized voice stream to the frequency domain converter 406 and to the processor 414 .
  • the processor 414 may extract predetermined speech features from the digitized time domain voice stream. These predetermined speech features may be used to add punctuation and may also be used to normalize the frequency spectrum frames based on a detected volume and pitch of the speaker. For example, a tonality rise by the speaker may signify a question or query, signifying a question mark or other appropriate punctuation in the completed voice stream representation.
  • the frequency domain converter 406 converts the digitized voice stream into a plurality of frequency domain signal outputs and stores them in the frequency domain output storage 410 .
  • the frequency domain converter 406 is preferably a Fourier transform device that performs Fourier transforms on the digitized input voice stream.
  • the outputs are preferably in the form of an array of values representing the different frequency bands present in the voice stream.
  • the frequency domain converter 406 may be any other device that is capable of converting the time domain signal into the frequency domain, such as a filter bank comprising a plurality of filters, such as band pass filters, for example.
  • FIG. 7 shows a frequency spectrum frame 700 according to a first embodiment of the frequency domain conversion.
  • the frequency spectrum frame 700 comprises a plurality of contiguous frequency bands that substantially covers a predetermined portion of the audible frequency spectrum.
  • FIG. 8 shows a frequency spectrum frame 800 according to a second embodiment of the frequency domain conversion.
  • the frequency spectrum frame 800 comprises a plurality of substantially individual frequencies or a plurality of non-contiguous frequency bands. Although some of the frequencies in the predetermined portion of the audible frequency spectrum are ignored, the frame 800 may still adequately reflect and characterize the various frequency components of the voice stream. By using only portions of the predetermined portion of the audible frequency spectrum and not using contiguous frequency bands, the amount of data processed in comparing frequency spectrum characteristics and finding transnemes may be further reduced. The result is a decrease in computational processing requirements and storage requirements.
  • the processor 414 accesses the frequency domain outputs in the frequency domain output storage 410 and creates a frequency spectrum difference.
  • if a frequency spectrum difference evaluates to be non-zero, a transition between phonemes has occurred and the frequency spectrum difference may be used to determine a candidate transneme in the input voice stream.
  • a candidate transneme is mapped to the transneme table 423 (or other data conversion device) in order to determine whether it is a valid transneme.
  • if the mapping is successful, the candidate transneme (i.e., a valid frequency spectrum difference) is converted into a found transneme.
  • the found transneme is mapped to at least one transneme-to-vocabulary database 427 (or other data conversion device) and is converted to one or more speech units.
  • the speech units may comprise words, portions of words, phrases, or any other utterance that has a recognized meaning.
  • the method 600 therefore converts the voice stream (or other audio input) into a digital voice stream representation.
  • the digital voice stream representation produced by the speech recognition comprises a series of digital symbols, such as text, for example, that may be used to represent the voice stream.
  • the voice stream representation may be stored or may be used or processed in some manner.
  • Speech recognition has many uses.
  • the speech recognition of the invention converts a voice stream input into a series of digital symbols that may be used for speech-to-text conversion, to generate commands and inputs for voice control, etc. This may encompass a broad range of applications, such as dictation, transcription, messaging, etc.
  • the speech recognition of the invention may also encompass voice control, and may be incorporated into any type of electronic device or electronically controlled device.
  • the speech recognition of the invention may analyze any type of non-speech audio input and convert it to a digital representation if an equivalent set of transneme definitions is known. For example, the speech recognition method could find possible applications in music.
  • the speech recognition of the invention may additionally be used to perform a highly effective compression on the voice stream. This is accomplished by converting the voice stream into a voice stream representation comprising a series of symbols. Due to the highly efficient and relatively simple conversion of the voice stream into digital symbols (such as numerical codes, including, for example, ASCII symbols for English letters and punctuation), the speech recognition of the invention may provide a highly effective audio signal compression.
  • Digitally captured and transmitted speech typically contains only frequencies in the 4 kHz to 10 kHz range and requires a data transmission rate of about 12 kbits per second.
  • speech may be transmitted at a data rate of about 120 bits per second to about 60 bits per second.
  • the result is a data compression of about 100 to 1 to about 200 to 1. This allows a device to decrease its data rate and hardware requirements while simultaneously allowing greater transmission quality (i.e., by capturing more of the 20 Hz to 20 kHz audible sound spectrum).
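For reference, these ratios follow directly from the quoted rates: 12,000 bit/s divided by 120 bit/s gives 100 to 1, and 12,000 bit/s divided by 60 bit/s gives 200 to 1.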
  • the voice stream compression may be advantageously used in a variety of ways.
  • One application may be to compress a voice stream for transmission.
  • This may include wireless transmission of the voice stream representation, such as a radio transmission, an infrared (IR) or optical transmission, an ultrasonic transmission, etc.
  • the voice stream compression may therefore be highly useful in communication devices such as cellular phones, satellite phones, pagers, radios, etc.
  • the transmission may be performed over some form of transmission media, such as wire, cable, optical fiber, etc. Therefore, the voice stream representation may be used to more efficiently communicate data over conventional analog telephone networks, digital telephone networks, digital packet networks, computer networks, etc.
  • Another compression application may be the use of speech recognition to compress data for manipulation and storage.
  • a voice stream may be converted into a series of digital symbols in order to drastically reduce storage space requirements. This may find advantageous application in areas such as answering machines, voice messaging, voice control of devices, etc.
  • the speech recognition may perform a language translation function.
  • the speech recognition device 400 may receive a voice stream of a first verbal language and, through use of an appropriate transneme-to-vocabulary table or tables 427 , may convert that voice stream into a voice stream representation of a second language.
  • FIG. 9 is a flowchart 900 of a second speech recognition method embodiment according to the invention.
  • a frequency spectrum difference is calculated.
  • the frequency spectrum difference is calculated between two frequency spectrum frames in order to determine whether a transneme has occurred.
  • the voice stream must have already been processed in some manner in order to create a frequency domain signal or representation. This may also include pre-processing such as amplification, filtering, and digitization.
  • FIG. 10 shows a first frequency spectrum frame 1000 obtained at a first point in time T 1 .
  • FIG. 11 shows a second frequency spectrum frame 1100 obtained at a second point in time T 2 . From these two frames, it can be seen that the frequency components of the voice stream have changed over the time period T 2 -T 1 (see dashed lines). Therefore, the frequency spectrum difference will reflect these spectral changes and may represent a transneme.
  • the difference may be compared to a predetermined difference threshold to see if a transneme has occurred. If the frequency spectrum difference is less than or equal to the predetermined difference threshold, the difference may be judged to be essentially zero and the current frequency spectrum frame may be ignored. A next frequency spectrum frame may then be obtained and processed.
  • the predetermined difference threshold takes into account noise effects and imposes a requirement that the frames must change by at least the predetermined difference threshold in order to be a valid transneme.
  • the predetermined difference threshold is about 5% of the average amplitude of the base frequency spectrum bin over a frame size of less than 100 milliseconds, although the predetermined difference threshold may range from about 3% to about 7%.
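A sketch of this test follows; expressing the threshold relative to the amplitude of the base (largest) bin of the previous frame is an assumption about how the 5% figure would be applied.

      import numpy as np

      def is_candidate_transneme(current, previous, threshold=0.05):
          """True if the frames differ by more than the predetermined
          difference threshold (assumed 3%-7%, default 5%, of the base bin)."""
          base = max(float(previous.max()), 1e-12)  # base frequency bin amplitude
          diff = np.abs(current - previous) / base  # relative spectrum difference
          return float(diff.max()) > threshold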
  • a frequency spectrum difference is preferably calculated about every 10 milliseconds due to the average time duration of a phoneme, but may be varied according to conditions or a desired resolution. For example, a frequency spectrum difference may be calculated more often in order to increase the resolution of the transneme differentiation and to potentially increase accuracy. However, the computational workload will increase as a consequence.
  • FIG. 12 shows how the frequency domain transformation may be processed using overlapping frequency domain conversion windows F 1 , F 2 , F 3 , etc.
  • the method ensures that no voice stream data is missed.
  • each data point in the voice stream is preferably processed twice due to the overlap.
  • the overlap ensures that the analysis of each transneme is more accurate by comparing each voice stream feature to the transneme table 423 more than once.
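A half-window overlap (as discussed later, the window is advanced by about half its length) could be generated as follows; the 200-sample window again assumes 10 ms at a 20 kHz sampling rate.

      def overlapping_windows(samples, window=200, hop=100):
          """Yield conversion windows that overlap by half a window, so each
          data point in the voice stream is processed twice."""
          for start in range(0, len(samples) - window + 1, hop):
              yield samples[start:start + window]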
  • the frequency spectrum difference is mapped to some form of reference in order to determine the transneme that has occurred.
  • the reference will typically be one or more transneme tables or databases, with a predetermined frequency difference mapping to a predetermined transneme.
  • the predetermined frequency spectrum difference may map to more than one predetermined transneme. In use, there may be up to 5 or 10 transnemes in the transneme table 423 that may substantially match the predetermined frequency spectrum difference.
  • An optional size optimization function can be performed to compress this data for memory-sensitive applications (at a small cost in processing cycles).
  • a final found transneme may be determined through use of the feature data and by comparison of transneme groupings to the transneme-to-vocabulary database 427 .
  • the found transneme or transnemes may be stored and accumulated in the mappings storage 424 .
  • the mapping of the frequency spectrum differences may be done using a hash-like vector-distance metric, which finds the best fit difference-to-transneme equivalence.
  • the transneme table 423 may be constructed from experimentally derived transneme data that comprises manually evaluated speech waveforms that are broken down into transnemes. Alternatively, the transneme table 423 may be created by converting existing phoneme data into transnemes.
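The patent does not define the “hash-like vector-distance metric” precisely; one plausible reading is a nearest-neighbor search over stored difference vectors, as in this sketch (the table layout and Euclidean distance are assumptions):

      import numpy as np

      def lookup_transnemes(diff, table, k=5):
          """Return up to k transnemes whose stored difference vectors best
          fit `diff`; the text allows 5 to 10 substantially matching entries.
          table: list of (difference_vector, transneme_code) pairs."""
          ranked = sorted(table, key=lambda e: np.linalg.norm(e[0] - diff))
          return [code for _, code in ranked[:k]]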
  • the found transnemes are used to create a voice stream representation in digital form.
  • the found transnemes are preferably processed as groupings, such as a grouping of 10-20 transnemes, for example.
  • a free-text-search-like lookup is preferably performed against the transneme-to-vocabulary database 427 , using an inverted-index technique to find the best-fit mappings of found transnemes to a word or phrase.
  • Many duplications of words may exist in the transneme-to-vocabulary database 427 in order to accommodate various groupings and usages of words, homonym-decoding, speaker-restarts, etc.
  • the inverted index is an index into a plurality of text entries, such as a database, for example, that is produced during a text search of the database.
  • the inverted index search result indicates the database entry where the search term may be found and also indicates a location within the text of the matched entry.
  • a search is performed for each of the words in the database.
  • the search result indexes returned from the database search might appear as:
      blind: (3, 8); (4, 0)
      god: (2, 0)
      I: (1, 0)
      is: (2, 4); (3, 5)
      justice: (4, 6)
      love: (1, 2); (2, 7); (3, 0)
      you: (1, 7)
  • blind is therefore in database entry 3 (“love is blind”), starting at character 8 , and is also in entry 4 (“blind justice”), starting at the first character.
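The example postings above can be reproduced with a few lines of Python; the helper below is illustrative, not part of the patent.

      def build_inverted_index(entries):
          """entries: {entry_number: clear_text}.
          Returns {word: [(entry_number, character_offset), ...]}."""
          index = {}
          for doc, text in entries.items():
              for word in text.split():
                  index.setdefault(word, []).append((doc, text.find(word)))
          return index

      phrases = {1: "I love you", 2: "god is love",
                 3: "love is blind", 4: "blind justice"}
      index = build_inverted_index(phrases)
      # index["blind"] == [(3, 8), (4, 0)]
      # index["love"] == [(1, 2), (2, 7), (3, 0)]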
  • the inverted index search technique may be applied to the vocabulary lookup of the transnemes by adding an additional transneme field to each database entry of the transneme-to-vocabulary database 427 .
  • the transneme-to-vocabulary database might then have the form:
      Fld1 (Doc#)   Fld2 (Clear Text)    Fld3 (Transneme Version)
      1             “I love you”         00Ah AhEE EE00 00LL LLUh UhVv VvYY YYoo ooww ww00
      2             “god is love”        00Gg GgAh Ahdd ddih ihZZ ZZLL LLUh UhVv Vv00
      3             “love is blind”      00LL LLUh UhVv Vvih ihZZ ZZ00 00Bb BbLl LlAh AhEE EENn NnDd Dd00
      4             “blind justice”      00Bb BbLl LlAh AhEE EENn NnDd Dd00 …
  • Field 1 contains the entry (document) number, Field 2 contains the clear text (in the predetermined verbal language), and Field 3 contains the transneme version of the text.
  • once candidate transnemes are identified from frequency spectrum differences, they may be used as inverted index query arguments and may be queried against the Field3 transneme version. Therefore, a transneme code or representation obtained from the transneme table 423 may be compared to the transneme versions in the transneme-to-vocabulary database 427 until a match is found. Any database entries which match the query are returned, along with a relative relevance ranking for each result.
  • the particular word that matched the search term may also be identified in the result field.
  • the word match may be determined from a separate “child” table, or may be determined from an additional field of the same table. For example, the transneme-to-vocabulary database 427 might therefore also contain the entries:
      5   “I”      00Ah AhEE EE00
      6   “love”   00LL LLUh UhVv
      7   “you”    YYoo ooww ww00
  • This secondary mapping may be done efficiently as a post-processing scan of the returned transnemes to identify the word boundaries.
  • the efficiency of the secondary mapping is linked to the number of words in the returned clear text phrases.
  • the frequency domain conversion window may be advanced.
  • the frequency domain conversion window may be advanced by only a portion of its length, so that a current frequency domain conversion overlaps a previous conversion.
  • the frequency conversion window is about 10 milliseconds in duration and is advanced by about half a window length.
  • An overlapping lookup is performed against the transneme-to-vocabulary database 427 . This may be done in order to ensure that no voice stream data is overlooked and to increase reliability of the analysis through multiple comparisons.
  • the overlapping conversions may also be used to clarify a match, using a context, and may be used to vote for best matches.
  • the overlapping frequency domain conversions may be used to create a listing for a transneme grouping of the search lookups of the transneme-to-vocabulary database 427 , with the transneme matches contained in the listing being organized and ranked according to relevance.
  • FIG. 13 is a flowchart 1300 of a third speech recognition method embodiment according to the invention.
  • predetermined features are extracted from the voice stream.
  • the predetermined features may include the volume or amplitude of the voice stream, the pitch (frequency), etc.
  • This feature extraction may be done on a time domain version of the voice stream, before the input voice stream is converted into the frequency domain.
  • the feature extraction may be mathematically extracted from the input voice stream in the frequency domain. For example, in a cellular phone application, a data stream received by the cellular phone may already be in the frequency domain, and the feature extraction may be mathematically performed on the received digital data in order to calculate the volume and the pitch of the speaker.
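As an illustration only, volume and pitch could be estimated from a frequency spectrum frame as below; the estimators (RMS amplitude and dominant band) are assumptions, and the band spacing assumes the 1 kHz to 10 kHz, 128-band example given earlier.

      import numpy as np

      def extract_features(frame, f_low=1000.0, band_hz=70.3):
          """Estimate volume and pitch features from one spectrum frame."""
          volume = float(np.sqrt(np.mean(frame ** 2)))     # RMS amplitude
          pitch = f_low + band_hz * int(np.argmax(frame))  # dominant band, Hz
          return {"volume": volume, "pitch": pitch}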
  • In step 1303, a frequency domain transformation is performed on the digitized voice stream input to produce the frequency domain output.
  • the output comprises a plurality of frequency band values that represent various frequency components within the voice stream.
  • the frequency domain transformation may be performed by an FFT device.
  • the frequency domain transformation may be performed by a filter bank comprising a plurality of filters, such as band pass filters.
  • the output of the filter bank is a plurality of frequency band outputs having amplitudes that represent the frequency components within each band.
  • the frequency domain conversion may be performed for a predetermined time period or window, capturing frequency domain characteristics of the voice stream for a window of time.
  • the frequency domain conversion window is preferably about 10 milliseconds in size. However, this size may be varied, with the size being chosen to accommodate factors such as a desired resolution of the speech. In addition, the language of the speaker may be a factor in choosing the conversion window size.
  • the size of the conversion window must be chosen so as to balance the desired resolution against computational cycles, hardware complexity, etc., because an increase in speech recognition resolution may be accompanied by an increase in computational cycles and/or hardware/memory size.
  • In step 1305, the frequency domain transformation output is stored. This is preferably done in a plurality of frequency bins, with each bin corresponding to a particular predetermined frequency or frequency band.
  • the frequency domain outputs are normalized.
  • This may include a frequency normalization and an amplitude normalization, for example.
  • the normalization is performed using speech features that have been extracted from the time domain digitized voice signal.
  • the voice features may include pitch, volume, speech rate, etc.
  • the normalization therefore, adjusts the values in the frequency bins in order to maintain frequency spectrum frames that are essentially constant (except for frequency components of the voice stream). This accommodates variations in volume and frequency. For example, even if the speaker is speaking very loudly, the overall volume should not matter and should not affect the speech recognition result. Likewise, a change in frequency/pitch by the speaker should not affect the speech recognition.
  • FIGS. 14 - 16 show a frequency normalization operation on a current frequency spectrum frame.
  • a base frequency may be used for a frequency normalization.
  • the base frequency is the frequency band (or bin) containing the greatest amplitude (i.e., the largest frequency component of the speech at that time).
  • the contents of the frequency bins may be shifted up or down in order to maintain the base frequency in a substantially constant frequency bin location.
  • FIG. 14 shows a previous frequency spectrum frame 1400 and a frequency bin set containing the corresponding values.
  • FIG. 15 shows a current frequency spectrum frame 1500 and a frequency bin set containing the corresponding values. It can be seen that the base frequency of the current frequency spectrum frame 1500 is higher in frequency (tonality) than the base frequency of the previous frequency spectrum frame 1400 . This may be due to various factors, such as emotion or emphasis on the part of the speaker, etc.
  • FIG. 16 shows a frequency normalization comprising a frequency shift of the current frequency spectrum frame 1500 , forming a new current frame 1600 .
  • This may be done merely by shifting the values (V 1 -V N , for example) in the frequency bin set. This may entail dropping one or more values in order to accommodate the shift.
  • a predetermined frequency shift threshold may be used to prevent excessive frequency normalization, allowing only a limited amount of shifting.
  • the normalization value may be saved and may be used in a word lookup to determine context and punctuation.
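A sketch of this bin-shifting normalization follows; the cap on the shift stands in for the predetermined frequency shift threshold, and its value here is an arbitrary assumption.

      import numpy as np

      def normalize_frequency(current, previous, max_shift=8):
          """Shift the current frame's bins so its base (largest) bin lines
          up with the previous frame's; values shifted off the end are dropped.
          Returns the shifted frame and the saved normalization value."""
          shift = int(np.argmax(previous)) - int(np.argmax(current))
          shift = int(np.clip(shift, -max_shift, max_shift))
          shifted = np.roll(current, shift)
          if shift > 0:
              shifted[:shift] = 0.0   # drop values wrapped in by the shift
          elif shift < 0:
              shifted[shift:] = 0.0
          return shifted, shift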
  • FIGS. 17 and 18 show an amplitude normalization operation on a current frequency spectrum frame.
  • the figures show two successive frequency spectrum frames 1700 and 1800 , where the only difference between the frames is an amplitude difference “a”.
  • This amplitude difference may be due to a change in volume of the speaker. However, this change in volume could potentially be seen as a transition between phonemes. Therefore, a normalization is desirable so that the speech recognition is essentially independent of the volume of the speaker.
  • the normalization may be accomplished by uniformly increasing or decreasing the values in all of the frequency bins in order to substantially match the amplitude of the previous frequency spectrum frame. In this example, the value “a” is subtracted from all of the frequency spectrum frame values V 1 -V N in all of the frequency bins.
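The corresponding amplitude normalization can be sketched as follows; estimating the uniform offset “a” as the mean bin-wise difference between the frames is an assumption.

      import numpy as np

      def normalize_amplitude(current, previous):
          """Remove a uniform volume offset so that a change in speaker
          volume is not mistaken for a transition between phonemes."""
          a = float(np.mean(current - previous))  # uniform amplitude difference
          return current - a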
  • a frequency spectrum difference is calculated.
  • the frequency spectrum difference is calculated between two frequency spectrum frames in order to determine whether a transneme has occurred.
  • the frequency spectrum difference therefore is a set of values that show the difference in frequency components from the current frequency spectrum frame and a previous frequency spectrum frame.
  • In step 1320, the frequency spectrum difference is mapped to the transneme table 423 (or other reference or database) in order to determine a found transneme.
  • the transnemes are used to create a voice stream representation in digital form stored within the voice recognition device or computer.
  • the voice stream representation output is a data stream composed of text or digital representations, such as a series of symbols, complete with some suggested punctuation.

Abstract

A speech recognition method and apparatus are provided for converting a voice stream into a digital voice stream representation. A method for performing speech recognition on a voice stream according to a first method embodiment includes the steps of determining one or more candidate transnemes in the voice stream, mapping the one or more candidate transnemes to a transneme table to convert the one or more candidate transnemes to one or more found transnemes, and mapping the one or more found transnemes to a transneme-to-vocabulary database to convert the one or more found transnemes to one or more speech units.

Description

    BACKGROUND OF THE INVENTION
  • 1. Field of the Invention [0001]
  • The present invention relates generally to speech recognition, and more particularly to real-time speech recognition for recognizing speaker-independent, connected or continuous speech. [0002]
  • 2. Description of the Background Art [0003]
  • Speech recognition refers to the ability of a machine or device to receive, analyze and recognize human speech. Speech recognition is also often referred to as voice recognition. Speech recognition may potentially allow humans to interface with machines and devices in an easy, quick, productive and reliable manner. Accurate and reliable speech recognition is therefore highly sought after. [0004]
  • Speech recognition gives humans the capability of verbally generating documents, recording or transcribing speech, and audibly controlling devices. Speech recognition is desirable because speech occurs at a much faster rate than manual operations, such as typing on a keyboard or operating controls. A good typist can type about 80 words per minute, while typical speech can be in the range of about 200 or more words per minute. [0005]
  • In addition, speech recognition can allow remote control of electronic devices. Many applications exist for impaired persons who cannot operate conventional devices, such as persons who are at least partially paralyzed, blind, or medicated. For example, a computer or computer operated appliances could be speech controlled. [0006]
  • Moreover, speech recognition may be used for hands-free operation of conventional devices. For example, one current application is the use of speech recognition for operating a cellular phone, such as in a vehicle. This may be desirable because the driver's attention should stay on the road. [0007]
  • Speech recognition processes and speech recognition devices currently exist. However, there are several difficulties that have prevented speech recognition from becoming practical and widely available. The main obstacle has been the wide variations in speech between persons. Different speakers have different speech characteristics, making speech recognition difficult or at best not satisfactorily reliable. For example, useful speech recognition must be able to identify not only words but also small word variations. Speech recognition must be able to differentiate between homonyms by using context. Speech recognition must be able to recognize silence, such as gaps between words. This may be difficult if the speaker is speaking rapidly and running words together. Speech recognition systems may have difficulty adjusting to changes in the pace of speech, changes in speech volume, and may be frustrated by accents or brogues that affect the speech. [0008]
  • Speech recognition technology has existed for some time in the prior art, and has become fairly reasonable in price. However, it has not yet achieved satisfactory reliability and is not therefore widely used. For example, as previously mentioned, devices and methods currently exist that capture and convert the speech into text, but generally require extensive training and make too many mistakes. [0009]
  • FIG. 1 shows a representative audio signal in a time domain. The audio signal is generated by capture and conversion of an audio stream into an electronic voice stream signal, usually through a microphone or other sound transducer. Generally, audible sound exists in the range of about 20 hertz (cycles) to about 20 kilohertz (kHz). Speech is a smaller subset of frequencies. The electronic voice stream signal may be filtered and amplified and is generally digitized for processing. [0010]
  • FIG. 2 shows the voice stream after it has been converted from the time domain into the frequency domain. Conversion to the frequency domain offers advantages over the time domain. Human speech is generated by the mouth and the throat, and contains many different harmonics (it is generally not composed of a single component frequency). The audible speech signal of FIG. 2, therefore, is composed of many different frequency components at different amplitude levels. In the frequency domain, the speech recognition device may be able to more easily analyze the voice stream and detect meaning based on the frequency components of the voice stream. [0011]
  • FIG. 3 shows how the digitized frequency domain response may be digitally represented and stored. Each digital level may represent a frequency or band of frequencies. For example, if the input voice stream is in the range of 1 kilohertz (kHz) to 10 kHz, and is separated into 128 frequency spectrum bands, each band (and corresponding frequency bin) would contain a digital value or amplitude for about 70 Hz of the speech frequency spectrum. The number of bands may be varied in order to accommodate different portions of the audible sound spectrum. Speech does not typically employ all of the frequencies in the audible frequency range of 20 Hz to 20 kHz. Therefore, a speech recognition device may analyze only the frequencies from 1 kHz to 10 kHz, for example. [0012]
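  • As a minimal illustrative sketch (not part of the original disclosure; the variable names are assumptions), the bin width in the example above follows from simple arithmetic:

    # Width of each frequency bin when a 1 kHz-10 kHz range is split
    # into 128 frequency spectrum bands.
    low_hz = 1_000       # assumed lower edge of the analyzed band
    high_hz = 10_000     # assumed upper edge
    num_bins = 128       # number of frequency spectrum bands
    bin_width_hz = (high_hz - low_hz) / num_bins
    print(f"each bin covers about {bin_width_hz:.1f} Hz")   # ~70.3 Hz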
  • Once the voice stream has been converted to the frequency domain, an iterative statistical look-up may be performed to determine the parts of speech in a vocalization. The parts are called phonemes, the smallest units of sound in any particular language. Various languages use phonemes that are not utilized in any other language. The English language designates about 34 different phonemes. The iterative statistical look-up employed by the prior art usually uses hidden Markov modeling (HMM) to statistically compare and determine the phonemes. The iterative statistical look-up compares multiple portions of the voice stream to stored phonemes in order to try to find a match. This generally requires multiple comparisons between a digitized sample and a phoneme database, and a high computational workload. By finding these phonemes, the speech recognition device can create a digital voice stream representation that represents the original vocalizations in a digital, machine-usable form. [0013]
  • The main difficulty encountered in the prior art is that different speakers speak at different rates and therefore the phonemes may be stretched out or compressed. There is no standard length phoneme that a speech recognition device can look for. Therefore, during the comparison process, these time-scale differences must be compensated for. [0014]
  • In the prior art, the time-scale differences are compensated for by using one of two approaches. In the first, the dynamic time warping process, statistical modeling stretches or compresses the waveform in order to find a best fit match of a digitized voice stream segment to a set of stored spectral patterns or templates. The dynamic time warping process uses a procedure that dynamically alters the time dimension to minimize the accumulated distance score for each template. [0015]
  • In the second prior art approach, the hidden Markov model (HMM) method characterizes speech as a plurality of statistical chains. The HMM method creates a statistical, finite-state Markov chain for each vocabulary word during training. The HMM method then computes the probability of generating the state sequence for each vocabulary word. The word with the highest accumulated probability is selected as the correct identification. Under the HMM method, time alignment is obtained indirectly through the sequence of states. [0016]
  • The prior art speech recognition approaches have drawbacks. One drawback is that the prior art approaches are not sufficiently accurate due to the many variations between speakers. The prior art speech recognition suffers from mistakes and may produce an output that does not match the spoken input. [0017]
  • Another drawback is that the prior art method is computationally intensive. Both the dynamic time warping approach and the HMM statistical approach require many comparisons in order to find a match and many iterations in order to temporally stretch or compress the digitized voice stream sample to fit samples in the phoneme database. [0018]
  • There have been many attempts in the prior art to increase speech recognition accuracy and/or to decrease computational time. One way of somewhat reducing computational requirements and increasing accuracy is to limit the library of phonemes and/or words to a small set and ignore all utterances not in the library. This is acceptable for applications requiring only limited speech recognition capability, such as operating a phone where only a limited number of vocal commands are needed. However, it is not acceptable for general uses that require a large vocabulary (i.e., normal conversational speech). [0019]
  • Another prior art approach is speaker-dependent speech recognition, wherein the speech recognition device is trained to a particular person's voice. Therefore, only the particular speaker is recognized, and that speaker must go through a training or “enrollment” process of reading or inputting a particular speech into the speech recognition device. A higher accuracy is achieved without increased cost or increased computational time. The drawbacks are that use of speaker-dependent voice recognition is limited to one person, requires lengthy training periods, may require many computation cycles, and is limited to applications where the speaker's identity is known a priori. [0020]
  • What is needed, therefore, are improvements in speech recognition technology. [0021]
  • SUMMARY OF THE INVENTION
  • A speech recognition device is provided according to one embodiment of the invention. The speech recognition device comprises an I/O device for accepting a voice stream and a frequency domain converter communicating with the I/O device. The frequency domain converter converts the voice stream from a time domain to a frequency domain and generates a plurality of frequency domain outputs. The speech recognition device further comprises a frequency domain output storage communicating with the frequency domain converter. The frequency domain output storage comprises at least two frequency spectrum frame storages for storing at least a current frequency spectrum frame and a previous frequency spectrum frame. A frequency spectrum frame storage of the at least two frequency spectrum frame storages comprises a plurality of frequency bins storing the plurality of frequency domain outputs. The speech recognition device further comprises a processor communicating with the plurality of frequency bins and a memory communicating with the processor. A frequency spectrum difference storage in the memory stores one or more frequency spectrum differences calculated as a difference between the current frequency spectrum frame and the previous frequency spectrum frame. At least one feature storage is included in the memory for storing at least one feature extracted from the voice stream. At least one transneme table is included in the memory, with the at least one transneme table including a plurality of transneme table entries and with a transneme table entry of the plurality of transneme table entries mapping a predetermined frequency spectrum difference to at least one predetermined transneme of a predetermined verbal language. At least one mappings storage is included in the memory, with the at least one mappings storage storing one or more found transnemes. At least one transneme-to-vocabulary database is included in the memory, with the at least one transneme-to-vocabulary database mapping a set of one or more found transnemes to at least one speech unit of the predetermined verbal language. At least one voice stream representation storage is included in the memory, with the at least one voice stream representation storage storing a voice stream representation created from the one or more found transnemes. The speech recognition device calculates a frequency spectrum difference between a current frequency spectrum frame and a previous frequency spectrum frame, maps the frequency spectrum difference to a transneme table, and converts the frequency spectrum difference to a transneme if the frequency spectrum difference is greater than a predetermined difference threshold. The speech recognition device creates a digital voice stream representation of the voice stream from one or more transnemes thus produced. [0022]
  • A method for performing speech recognition on a voice stream is provided according to a first method embodiment of the invention. The method comprises the steps of determining one or more candidate transnemes in the voice stream, mapping the one or more candidate transnemes to a transneme table to convert the one or more candidate transnemes to one or more found transnemes, and mapping the one or more found transnemes to a transneme-to-vocabulary database to convert the one or more found transnemes to one or more speech units. [0023]
  • A method for performing speech recognition on a voice stream is provided according to a second method embodiment of the invention. The method comprises the step of calculating a frequency spectrum difference between a current frequency spectrum frame and a previous frequency spectrum frame. The current frequency spectrum frame and the previous frequency spectrum frame are in a frequency domain and are separated by a predetermined time interval. The method further comprises the step of mapping the frequency spectrum difference to a transneme table to convert the frequency spectrum difference to at least one transneme if the frequency spectrum difference is greater than a predetermined difference threshold. A digital voice stream representation of the voice stream is created from one or more transnemes thus produced. [0024]
  • A method for performing speech recognition on a voice stream is provided according to a third method embodiment of the invention. The method comprises the step of performing a frequency domain transformation on the voice stream upon a predetermined time interval to create a current frequency spectrum frame. The method further comprises the step of normalizing the current frequency spectrum frame. The method further comprises the step of calculating a frequency spectrum difference between the current frequency spectrum frame and a previous frequency spectrum frame. The method further comprises the step of mapping the frequency spectrum difference to a transneme table to convert the frequency spectrum difference to at least one found transneme if the frequency spectrum difference is greater than a predetermined difference threshold. The method therefore creates a digital voice stream representation of the voice stream from one or more found transnemes thus produced. [0025]
  • The above and other features and advantages of the present invention will be further understood from the following description of the preferred embodiments thereof, taken in conjunction with the accompanying drawings.[0026]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows a representative audio signal in a time domain; [0027]
  • FIG. 2 shows the voice stream after it has been converted from the time domain into the frequency domain; [0028]
  • FIG. 3 shows how the digitized frequency domain response may be digitally represented and stored; [0029]
  • FIG. 4 shows a speech recognition device according to one embodiment of the invention; [0030]
  • FIG. 5 shows detail of a frequency domain output storage; [0031]
  • FIG. 6 is a flowchart of a first speech recognition method embodiment according to the invention; [0032]
  • FIG. 7 shows a frequency spectrum frame according to a first embodiment of a frequency domain conversion; [0033]
  • FIG. 8 shows a frequency spectrum frame according to a second embodiment of the frequency domain conversion; [0034]
  • FIG. 9 is a flowchart of a second speech recognition method embodiment; [0035]
  • FIG. 10 shows a first frequency spectrum frame obtained at a first point in time; [0036]
  • FIG. 11 shows a second frequency spectrum frame obtained at a second point in time; [0037]
  • FIG. 12 shows how the frequency domain transformation may be processed using overlapping frequency domain conversion windows; [0038]
  • FIG. 13 is a flowchart of a third speech recognition method embodiment; [0039]
  • FIGS. 14-16 show a frequency normalization operation on a current frequency spectrum frame; and [0040]
  • FIGS. 17-18 show an amplitude normalization operation on a current frequency spectrum frame. [0041]
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • FIG. 4 shows a speech recognition device 400 according to one embodiment of the invention. The speech recognition device 400 includes an input/output (I/O) device 401, a frequency domain converter 406, frequency domain output storage 410, a processor 414, and a memory 420. [0042]
  • The speech recognition device 400 of the invention performs speech recognition and converts a voice stream input into a digital voice stream representation. The voice stream representation comprises a series of symbols that digitally represents the voice stream and may be used to recreate the voice stream. [0043]
  • The speech recognition is accomplished by finding transnemes in the voice stream and converting the found transnemes into speech units of a predetermined verbal language. A speech unit may be a word, a portion of a word, a phrase, an expression, or any other type of verbal utterance that has an understood meaning in the predetermined verbal language. [0044]
  • A phoneme is generally described as being the smallest unit of sound in any particular language. Each vocalized phoneme is a distinct sound and therefore may be characterized by a substantially unique frequency domain response, substantially over the duration of the vocalization of the phoneme. In the English language, it is generally accepted that there are about 34 phonemes that are used to create all parts of the spoken language. There are less than 100 identified phonemes in all languages combined. [0045]
  • A transneme is a transition between the phoneme (or allophone) components of human speech. There are approximately 10,000 transnemes. Transnemes are therefore smaller subunits or components of speech, and are used by the invention to produce a speech recognition that is speaker-independent and that operates on connected speech (i.e., the speaker can talk normally and does not have to take care to voice each word separately and distinctly). [0046]
  • The speech recognition of the invention does not attempt to find and identify phonemes, as is done in the prior art. Instead, the speech recognition device 400 searches for transitions between and within phonemes (i.e., transnemes), with such transitions generally being shorter in duration than phonemes. Moreover, because a transneme is defined by two temporally adjacent phonemes or parts of phonemes, the number of transnemes is approximately equal to the square of the number of phonemes (i.e., 100×100=10,000). Identification of the transneme components of speech therefore achieves a greater efficiency, accuracy, and resolution than the various speech recognition techniques of the prior art. [0047]
  • The I/O device 401 may be any type of device that is capable of accepting an audio signal input that includes a voice stream. The I/O device 401 provides the audio signal input to the speech recognition device 400. The I/O device 401 may accept a digitized voice stream or may include a digitizer, such as an analog to digital converter (not shown), if the incoming audio signal is in analog form. The I/O device 401 may accept a voice stream that is already compressed or that has already been converted into the frequency domain. [0048]
  • The I/O device 401 may be any type of input device, including a microphone or sound transducer, or some other form of interface device. The I/O device 401 may additionally be a radio frequency receiver or transceiver, such as a radio frequency front-end, including an antenna, amplifiers, filters, down converters, etc., that produces an audio signal. This may include any type of radio receiver, such as a cell phone, satellite phone, pager, one or two-way radio, etc. Furthermore, the I/O device 401 may be a TV receiver, an infrared receiver, an ultrasonic receiver, etc. Alternatively, the I/O device 401 may be an interface to a digital computer network, such as the Internet or a local area network (LAN), or may be an interface to an analog network, such as an analog telephone network. [0049]
  • In addition to accepting a voice stream, the I/O device 401 may be capable of outputting a voice stream representation. For example, the speech recognition may be used to compress the voice stream for transmission to another device, such as in a cell phone. The voice stream (or other sound input) may be received by the I/O device 401 and may be converted into text or speech representation symbols by the speech recognition device 400. The speech representation symbols may then be transmitted to a remote location or device by the I/O device 401. At the receiving device, the speech representation symbols may be converted back into an audio output. This use of the invention for audio compression greatly reduces bandwidth requirements of the speech transmission. [0050]
  • The frequency domain converter 406 communicates with the I/O device 401 and converts the incoming voice stream signal into a frequency domain signal (see FIGS. 2 and 3). The frequency domain converter 406 may be, for example, a discrete Fourier transform device or a fast Fourier transform (FFT) device that performs a Fourier transform on the time domain voice stream. Alternatively, the frequency domain converter 406 may be a filter bank employing a plurality of filters, such as band pass filters that provide a frequency spectrum output, or may be a predictive coder. [0051]
  • The frequency domain converter 406 generates a plurality of outputs, with each output representing a predetermined frequency or frequency band. The plurality of outputs of the frequency domain conversion are referred to herein as a frequency spectrum frame, with a frequency spectrum frame comprising a plurality of amplitude values that represent the frequency components of the voice stream over the predetermined frequency conversion window period (see FIG. 10, for example, and also see the discussion accompanying FIG. 13). The number of outputs may be chosen according to a desired frequency band size and according to a range of audible frequencies desired to be analyzed. In one embodiment, the frequency domain converter 406 generates 128 outputs for a predetermined frequency domain conversion window. [0052]
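  • A minimal sketch of such a conversion, assuming a NumPy-based FFT; none of this code is from the patent, and the function and variable names are illustrative:

    import numpy as np

    def spectrum_frame(samples: np.ndarray, num_outputs: int = 128) -> np.ndarray:
        """Convert one time-domain window of the voice stream into a
        frequency spectrum frame of num_outputs amplitude values."""
        spectrum = np.abs(np.fft.rfft(samples))      # magnitude spectrum
        # Collapse the raw FFT outputs into the desired number of bands.
        bands = np.array_split(spectrum, num_outputs)
        return np.array([band.mean() for band in bands])

    # Example: a 10 ms conversion window sampled at 44,100 Hz is 441 samples.
    window = np.random.randn(441)                    # stand-in for real audio
    frame = spectrum_frame(window)
    assert frame.shape == (128,)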
  • FIG. 5 shows detail of the frequency domain output storage 410. The frequency domain output storage 410 communicates with the frequency domain converter 406. The frequency domain output storage 410 is a memory that comprises at least two frequency spectrum frames 410 a and 410 b. Alternatively, the frequency domain output storage 410 may be eliminated and the plurality of outputs from the frequency domain converter 406 may be stored in the memory 420. [0053]
  • Each frequency spectrum frame 410 a, 410 b, etc., stores a set of digital values V1-VN that represents the amplitudes or quantities of frequency band components present in the voice stream over the predetermined frequency conversion window. Each frame contains N bins, with N corresponding to the number of frequency domain transformation outputs generated by the frequency domain converter 406. Each bin therefore contains a frequency domain conversion output value V. The sets of digital values V1-VN in successive frequency spectrum frames, such as those in frames 410 a and 410 b, may be analyzed in order to process the input voice stream. [0054]
  • The processor 414 may be any type of processor, and communicates with the frequency domain output storage 410 in order to receive the frequency spectrum frame or frames contained within the frequency domain output storage 410. The processor 414 is also connected to the memory 420 and is optionally connected to the I/O device 401, whereby the processor may receive a digitized time domain signal. The processor 414 may be connected to the I/O device 401 in order to extract features from the time domain signal, such as pitch and volume features. These speech features may be used for adding punctuation, emphasis, etc., to the voice stream representation output, for example, and may also be used to normalize the frequency spectrum frames (the normalization is discussed below in the text accompanying FIGS. 14-18). [0055]
  • The memory 420 communicates with the processor 414 and may be any type of storage device, including types of random access memory (RAM), types of read-only memory (ROM), magnetic tape or disc, bubble memory, optical memory, etc. The memory 420 may be used to store an operating program that includes a speech recognition algorithm according to various aspects of the invention. In addition, the memory 420 may include variables and data used to process the speech and may hold temporary values during processing. The memory 420, therefore, may include a spectrum difference storage 421, a feature storage 422, at least one transneme table 423, at least one mappings storage 424, at least one transneme-to-vocabulary database 427, and at least one voice stream representation 428. The memory 420 may optionally include a frequency spectrum frame storage (not shown) for storing one or more frequency spectrum frames, such as the output of the frequency domain converter 406. [0056]
  • The spectrum difference storage 421 stores at least one frequency spectrum difference calculated from current and previous frequency spectrum frames. The frequency spectrum difference, therefore, is a set of values that represents a change in spectral properties of the voice stream. [0057]
  • The feature storage 422 contains at least one feature extracted from the voice stream, such as a volume or pitch feature, for example. The feature may have been obtained from the voice stream in either the time domain or the frequency domain. Data stored in the feature storage 422 may be used to normalize a frequency spectrum frame and to aid in grammar interpretation, such as by adding punctuation. In addition, the features may be used to provide context when matching a found transneme to speech units, words, or phrases. [0058]
  • The at least one transneme table 423 contains a plurality of transnemes. Transnemes are used to analyze the input voice stream and are used to determine the parts of speech therein in order to create a voice stream representation in digital form. For example, the phrase “Hello, World” is made up of the 11 transnemes listed below, where “[ ]” represents silence (a sketch of the pairing follows the list). If smaller frame sizes are utilized, transnemes can be smaller parts of phonemes, with multiple transnemes per phoneme. [0059]
  • [ ]-H [0060]
  • H-E [0061]
  • E-L [0062]
  • L-OW [0063]
  • OW-[ ] [0064]
  • [ ]-W [0065]
  • W-EH [0066]
  • EH-R [0067]
  • R-L [0068]
  • L-D [0069]
  • D-[ ] [0070]
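  • A minimal sketch (not from the patent; the string phoneme labels are illustrative) of deriving such transneme pairs from a phoneme sequence:

    def to_transnemes(phonemes):
        # Bracket the utterance with silence and pair adjacent units.
        seq = ["[ ]"] + phonemes + ["[ ]"]
        return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]

    print(to_transnemes(["H", "E", "L", "OW"]))
    # ['[ ]-H', 'H-E', 'E-L', 'L-OW', 'OW-[ ]']  (the first word above)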
  • Transnemes are dictated by the sounds produced by the human vocal apparatus. Therefore, transnemes are essentially independent of the verbal language of the speaker. However, a particular language may only contain a subset of all of the existing transnemes. As a result, the transneme table 423 may contain only the transnemes necessary for a predetermined verbal language. This may be done in order to conserve memory space. In applications where language translation or speech recognition of multiple verbal languages is required, the transneme table may likely contain the entire set of transnemes. [0071]
  • The mappings storage 424 stores one or more mappings produced by comparing one or more frequency spectrum differences to at least one transneme table 423. In other words, when transnemes are found through the matching process, they are accumulated in the mappings storage 424. [0072]
  • The transneme-to-vocabulary database 427 maps found transnemes to one or more speech units, with a speech unit being a word, a portion of a word, a phrase, or any utterance that has a defined meaning. By using the transneme-to-vocabulary database 427, the speech recognition device 400 may compare groupings of one or more found transnemes to the transneme-to-vocabulary database 427 in order to find speech units and create words and phrases. [0073]
  • The transneme-to-vocabulary database 427 may contain entries that map transnemes to one or more verbal languages, and may therefore convert transnemes into speech units of one or more predetermined verbal languages. In addition, the speech recognition device 400 may include multiple transneme-to-vocabulary databases 427, with additional transneme-to-vocabulary databases 427 capable of being added to give additional speech recognition capability in other languages. [0074]
  • The voice stream representation storage 428 is used to store found words and phrases as part of the creation of a voice stream representation. For example, the voice stream representation 428 may accumulate a series of symbols, such as text, that have been constructed from the voice stream input. [0075]
  • The speech recognition device 400 is speaker-independent and reliably processes and converts connected speech (connected speech is verbalized without any concern or effort to separate the words or talk in any specific manner in order to aid in speech recognition). The speech recognition device 400 therefore operates on connected speech, in that it can discern breaks between words and/or phrases, and does so without excessive computational requirements. A reduced computational workload may translate to simpler and less expensive hardware. [0076]
  • Another difference between the invention and the prior art is that gaps or silences between phonemes are detected and removed by means of the frequency spectrum differences. If a spectrum difference evaluates to be approximately zero, the spectrum frame has not changed appreciably since the last frequency domain transformation. Therefore, a transition from one phoneme to another (i.e., a transneme) has not occurred, and as a result the particular time sample may be ignored. [0077]
  • The use of frequency spectrum frame differencing eliminates the need for the complex and inefficient dynamic time warping procedure of the prior art, and therefore generally requires only about one cycle per comparison of a spectrum difference to a database or table. Therefore, unlike the prior art dynamic time warping and statistical modeling, the speech recognition device 400 of the present invention does not need to perform a large number of comparisons in order to account for time variations in the voice stream. [0078]
  • The speech recognition device 400 may be a specialized device constructed for speech recognition or may be integrated into another electronic device. The speech recognition device 400 may be integrated into any manner of electronic device, including cell phones, satellite phones, conventional land-line telephones (digital and analog), radios (one and two-way), pagers, personal digital assistants (PDAs), laptop and desktop computers, mainframes, digital network appliances and workstations, etc. Alternatively, the speech recognition device 400 may be integrated into specialized controllers or general purpose computers by implementing software to perform speech recognition according to the present invention. Speech recognition and control may therefore be added to personal computers, automobiles and other vehicles, factory equipment and automation, robotics, security devices, etc. [0079]
  • FIG. 6 is a flowchart 600 of a first speech recognition method embodiment according to the invention. In step 606, the method determines one or more candidate transnemes in a voice stream. This is accomplished by analyzing the voice stream in a frequency domain and comparing frequency spectrum frames (captured over predetermined time periods) in order to determine one or more candidate transnemes. The frequency spectrum frames may be obtained for overlapping time periods (windows). [0080]
  • As previously described, a transneme is a transition between phonemes or allophones, and a candidate transneme may be determined by finding a significant spectrum variation in the frequency domain. Therefore, a comparison in the frequency domain may be performed between a current frequency spectrum frame (containing frequency components of predetermined frequencies or frequency bands) and a previous frequency spectrum frame in order to determine voice stream frequency changes over time. Periods of silence or periods of substantially no frequency change are ignored. [0081]
  • The I/O device 401 passes a digitized voice stream to the frequency domain converter 406 and to the processor 414. The processor 414 may extract predetermined speech features from the digitized time domain voice stream. These predetermined speech features may be used to add punctuation and may also be used to normalize the frequency spectrum frames based on a detected volume and pitch of the speaker. For example, a tonality rise by the speaker may signify a question or query, calling for a question mark or other appropriate punctuation in the completed voice stream representation. [0082]
  • The frequency domain converter 406 converts the digitized voice stream into a plurality of frequency domain signal outputs and stores them in the frequency domain output storage 410. The frequency domain converter 406 is preferably a Fourier transform device that performs Fourier transforms on the digitized input voice stream. The outputs are preferably in the form of an array of values representing the different frequency bands present in the voice stream. Alternatively, the frequency domain converter 406 may be any other device that is capable of converting the time domain signal into the frequency domain, such as a filter bank comprising a plurality of filters, such as band pass filters, for example. [0083]
  • FIG. 7 shows a frequency spectrum frame 700 according to a first embodiment of the frequency domain conversion. In the first embodiment, the frequency spectrum frame 700 comprises a plurality of contiguous frequency bands that substantially covers a predetermined portion of the audible frequency spectrum. [0084]
  • FIG. 8 shows a frequency spectrum frame 800 according to a second embodiment of the frequency domain conversion. In the second embodiment, the frequency spectrum frame 800 comprises a plurality of substantially individual frequencies or a plurality of non-contiguous frequency bands. Although some of the frequencies in the predetermined portion of the audible frequency spectrum are ignored, the frame 800 may still adequately reflect and characterize the various frequency components of the voice stream. By using only portions of the predetermined portion of the audible frequency spectrum and not using contiguous frequency bands, the amount of data processed in comparing frequency spectrum characteristics and finding transnemes may be further reduced. The result is a decrease in computational processing requirements and storage requirements. [0085]
  • As a further part of determining a transneme, the processor 414 accesses the frequency domain outputs in the frequency domain output storage 410 and creates a frequency spectrum difference. When a frequency spectrum difference evaluates to be non-zero, a transition between phonemes has occurred and the frequency spectrum difference is used to identify a candidate transneme in the input voice stream. [0086]
  • In step 612, a candidate transneme is mapped to the transneme table 423 (or other data conversion device) in order to determine whether it is a valid transneme. By using one or more transneme tables 423, a candidate transneme (i.e., a valid frequency spectrum difference) is converted into a found transneme. [0087]
  • In step 615, the found transnemes are mapped to at least one transneme-to-vocabulary database 427 (or other data conversion device) and are converted to one or more speech units. The speech units may comprise words, portions of words, phrases, or any other utterance that has a recognized meaning. [0088]
  • The method 600 therefore converts the voice stream (or other audio input) into a digital voice stream representation. The digital voice stream representation produced by the speech recognition comprises a series of digital symbols, such as text, for example, that may be used to represent the voice stream. The voice stream representation may be stored or may be used or processed in some manner. [0089]
  • Speech recognition according to the invention has many uses. The speech recognition of the invention converts a voice stream input into a series of digital symbols that may be used for speech-to-text conversion, to generate commands and inputs for voice control, etc. This may encompass a broad range of applications, such as dictation, transcription, messaging, etc. The speech recognition of the invention may also encompass voice control, and may be incorporated into any type of electronic device or electronically controlled device. Furthermore, the speech recognition of the invention may analyze any type of non-speech audio input and convert it to a digital representation if an equivalent set of transneme definitions is known. For example, the speech recognition method could find possible applications in music. [0090]
  • The speech recognition of the invention may additionally be used to perform a highly effective compression on the voice stream. This is accomplished by converting the voice stream into a voice stream representation comprising a series of symbols. Due to the highly efficient and relatively simple conversion of the voice stream into digital symbols (such as numerical codes, including, for example, ASCII symbols for English letters and punctuation), the speech recognition of the invention may provide a highly effective audio signal compression. [0091]
  • Digitally captured and transmitted speech typically contains only frequencies in the 4 kHz to 10 kHz range and requires a data transmission rate of about 12 kbits per second. However, employing the speech recognition of the invention, speech may be transmitted at a data rate of about 60 bits per second to about 120 bits per second. The result is a data compression ratio of about 100:1 to about 200:1. This allows a device to decrease its data rate and hardware requirements while simultaneously allowing greater transmission quality (i.e., by capturing more of the 20 Hz to 20 kHz audible sound spectrum). [0092]
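  • The claimed compression ratio follows directly from the figures above; this is a sketch using only the rates stated in the text:

    digitized_rate = 12_000            # bits per second, conventional digitized speech
    low_rate, high_rate = 60, 120      # bits per second, transneme representation
    print(digitized_rate / high_rate)  # 100.0 -> about 100:1
    print(digitized_rate / low_rate)   # 200.0 -> about 200:1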
  • The voice stream compression may be advantageously used in a variety of ways. One application may be to compress a voice stream for transmission. This may include wireless transmission of the voice stream representation, such as a radio transmission, an infrared (IR) or optical transmission, an ultrasonic transmission, etc. The voice stream compression may therefore be highly useful in communication devices such as cellular phones, satellite phones, pagers, radios, etc. Alternatively, the transmission may be performed over some form of transmission media, such as wire, cable, optical fiber, etc. Therefore, the voice stream representation may be used to more efficiently communicate data over conventional analog telephone networks, digital telephone networks, digital packet networks, computer networks, etc. [0093]
  • Another compression application may be the use of speech recognition to compress data for manipulation and storage. A voice stream may be converted into a series of digital symbols in order to drastically reduce storage space requirements. This may find advantageous application in areas such as answering machines, voice messaging, voice control of devices, etc. [0094]
  • In yet another application, by including an appropriate transneme-to-vocabulary table or tables 427, the speech recognition may perform a language translation function. The speech recognition device 400 may receive a voice stream of a first verbal language and, through use of an appropriate transneme-to-vocabulary table or tables 427, may convert that voice stream into a voice stream representation of a second language. [0095]
  • FIG. 9 is a flowchart 900 of a second speech recognition method embodiment according to the invention. In step 903, a frequency spectrum difference is calculated. The frequency spectrum difference is calculated between two frequency spectrum frames in order to determine whether a transneme has occurred. The voice stream must have already been processed in some manner in order to create a frequency domain signal or representation. This may also include pre-processing such as amplification, filtering, and digitization. [0096]
  • FIG. 10 shows a first frequency spectrum frame 1000 obtained at a first point in time T1. FIG. 11 shows a second frequency spectrum frame 1100 obtained at a second point in time T2. From these two frames, it can be seen that the frequency components of the voice stream have changed over the time period T2-T1 (see dashed lines). Therefore, the frequency spectrum difference will reflect these spectral changes and may represent a transneme. [0097]
  • As part of the frequency spectrum difference calculation, the difference may be compared to a predetermined difference threshold to see if a transneme has occurred. If the frequency spectrum difference is less than or equal to the predetermined difference threshold, the difference may be judged to be essentially zero and the current frequency spectrum frame may be ignored. A next frequency spectrum frame may then be obtained and processed. [0098]
  • The predetermined difference threshold takes into account noise effects and imposes a requirement that the frames must change by at least the predetermined difference threshold in order to represent a valid transneme. In one embodiment, the predetermined difference threshold is about 5% of the average amplitude of the base frequency spectrum bin over a frame size of less than 100 milliseconds, although the predetermined difference threshold may range from about 3% to about 7%. [0099]
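  • A hedged sketch of this differencing-and-threshold step, assuming NumPy arrays of bin amplitudes; the helper name and the exact threshold formula are illustrative assumptions, not taken from the patent:

    import numpy as np

    DIFFERENCE_THRESHOLD = 0.05   # ~5% of the base-bin amplitude

    def candidate_transneme(current: np.ndarray, previous: np.ndarray):
        # The base bin is the bin with the greatest amplitude.
        base_amplitude = previous.max()
        diff = current - previous
        if np.mean(np.abs(diff)) <= DIFFERENCE_THRESHOLD * base_amplitude:
            return None           # silence or a sustained phoneme: ignore frame
        return diff               # a candidate transneme for table lookup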
  • A frequency spectrum difference is preferably calculated about every 10 milliseconds due to the average time duration of a phoneme, but may be varied according to conditions or a desired resolution. For example, a frequency spectrum difference may be calculated more often in order to increase the resolution of the transneme differentiation and to potentially increase accuracy. However, the computational workload will increase as a consequence. [0100]
  • FIG. 12 shows how the frequency domain transformation may be processed using overlapping frequency domain conversion windows F1, F2, F3, etc. By using overlapping windows, the method ensures that no voice stream data is missed. Indeed, each data point in the voice stream is preferably processed twice due to the overlap. In addition, the overlap ensures that the analysis of each transneme is more accurate by comparing each voice stream feature to the transneme table 423 more than once. [0101]
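  • A minimal sketch of such overlapping windows, assuming a 50% overlap so that every sample is processed twice; the window length is an illustrative assumption:

    def overlapping_windows(samples, window_len=441, hop=None):
        # Advance by half a window length by default (50% overlap).
        hop = hop or window_len // 2
        for start in range(0, len(samples) - window_len + 1, hop):
            yield samples[start:start + window_len]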
  • In step 906, the frequency spectrum difference is mapped to some form of reference in order to determine the transneme that has occurred. The reference will typically be one or more transneme tables or databases, with a predetermined frequency spectrum difference mapping to a predetermined transneme. The predetermined frequency spectrum difference may map to more than one predetermined transneme. In use, there may be up to 5 or 10 transnemes in the transneme table 423 that substantially match the predetermined frequency spectrum difference. An optional size optimization function can be performed to compress the table for memory-sensitive applications (at a small expense of processing cycles). A final found transneme may be determined through use of the feature data and by comparison of transneme groupings to the transneme-to-vocabulary database 427. The found transneme or transnemes may be stored and accumulated in the mappings storage 424. [0102]
  • Preferably, the mapping of the frequency spectrum differences may be done using a hash-like vector-distance metric, which finds the best fit difference-to-transneme equivalence. The transneme table 423 may be constructed from experimentally derived transneme data that comprises manually evaluated speech waveforms that are broken down into transnemes. Alternatively, the transneme table 423 may be created by converting existing phoneme data into transnemes. [0103]
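  • A minimal nearest-neighbor sketch of the vector-distance lookup; the table layout (label, vector pairs) and the use of Euclidean distance are assumptions for illustration:

    import numpy as np

    def best_fit_transnemes(diff, transneme_table, top_k=5):
        # Rank table entries by distance to the spectrum difference and
        # return the closest matches (cf. the 5-10 candidates noted above).
        scored = [(np.linalg.norm(diff - entry_vec), label)
                  for label, entry_vec in transneme_table]
        scored.sort(key=lambda pair: pair[0])
        return [label for _, label in scored[:top_k]]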
  • In step 908, the found transnemes are used to create a voice stream representation in digital form. The found transnemes are preferably processed as groupings, such as a grouping of 10-20 transnemes, for example. After a grouping of transnemes has been accumulated, a free-text-search-like lookup is preferably performed against the transneme-to-vocabulary database 427, using an inverted-index technique to find the best-fit mappings of found transnemes to a word or phrase. Many duplications of words may exist in the transneme-to-vocabulary database 427 in order to accommodate various groupings and usages of words, homonym-decoding, speaker-restarts, etc. [0104]
  • The inverted index is an index into a plurality of text entries, such as a database, for example, that is used during a text search of the database. The inverted index search result indicates the database entry where the search term may be found and also indicates a location within the text of the matched entry. [0105]
  • For example, consider a simple database (shown in lower case for simplicity) containing four text entries: [0106]
  • I love you [0107]
  • god is love [0108]
  • love is blind [0109]
  • blind justice [0110]
  • As an example, a search is performed for each of the words in the database. Using an inverted index search to index the search results by (entry, offset within the entry), the search result indexes returned from the database search might appear as: [0111]
    blind (3, 8); (4, 0)
    god (2, 0)
    I (1, 0)
    is (2, 4); (3, 5)
    justice (4, 6)
    love (1, 2); (2, 7); (3, 0)
    you (1, 7)
  • The word “blind” is therefore in database entry 3 (“love is blind”), starting at character 8, and is also in entry 4 (“blind justice”), starting at the first character. [0112]
  • To find documents containing both “is” and “love,” the search result indexes are examined to find intersections between the entries. In this case, both entries 2 and 3 contain the two words, so the database entries 2 and 3 both contain the two search terms. Therefore, each entry number of the “is” result may be compared to the “love” result, with an intersection occurring when a database entry number is present in both indexes. In an additional capability, an inverted index search can quickly find documents where the words are physically close by comparing the character offsets of a result having at least one intersecting entry. [0113]
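  • A minimal sketch of the example above in code; the (entry, offset) postings match the listing given earlier, and all names are illustrative:

    entries = ["i love you", "god is love", "love is blind", "blind justice"]

    index = {}
    for doc_num, text in enumerate(entries, start=1):
        offset = 0
        for word in text.split():
            index.setdefault(word, []).append((doc_num, offset))
            offset += len(word) + 1          # +1 for the space separator

    print(index["blind"])                    # [(3, 8), (4, 0)]

    # Intersection query: entries containing both "is" and "love".
    def docs(word):
        return {doc for doc, _ in index[word]}
    print(sorted(docs("is") & docs("love"))) # [2, 3]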
  • The inverted index search technique may be applied to the vocabulary lookup of the transnemes by adding an additional transneme field to each database entry of the transneme-to-vocabulary database 427. The transneme-to-vocabulary database might then have the form: [0114]
    Fld1  Fld2              Fld3
    Doc#  Clear Text        Transneme Version
    1     "I love you"      00Ah AhEE EE00 00LL LLUh UhVv VvYY YYoo ooww ww00
    2     "god is love"     00Gg GgAh Ahdd ddih ihZZ ZZLL LLUh UhVv Vv00
    3     "love is blind"   00LL LLUh UhVv Vvih ihZZ ZZ00 00Bb BbLl LlAh AhEE EENn NnDd Dd00
    4     "blind justice"   00Bb BbLl LlAh AhEE EENn NnDd Dd00 00Dj DjUh UhSs Sstt ttih ihss ss00
  • Where [0115]
  • Field 1: Record # (==row number, ==document number) [0116]
  • Field 2: Clear text (in the predetermined verbal language) [0117]
  • Field 3: Transneme version of text [0118]
  • Field 4: Sub-index of each word-to-transneme mapping (optional) [0119]
  • Once candidate transnemes are identified from frequency spectrum differences, they may be used as inverted index query arguments and may be queried against the Field3 transneme version. Therefore, a transneme code or representation obtained from the transneme table 423 may be compared to the transneme versions in the transneme-to-vocabulary database 427 until a match is found. Any database entries which match the query are returned, along with a relative relevance ranking for each result. [0120]
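  • A hedged sketch of such a query with relevance ranking; the scoring scheme (count of matching transnemes) is an assumption, and the transneme strings are copied from the example table above:

    db = [
        (1, "I love you",  "00Ah AhEE EE00 00LL LLUh UhVv VvYY YYoo ooww ww00"),
        (2, "god is love", "00Gg GgAh Ahdd ddih ihZZ ZZLL LLUh UhVv Vv00"),
    ]

    def lookup(found_transnemes, database):
        # Rank entries by how many of the found transnemes they contain.
        results = []
        for doc_num, clear_text, transneme_version in database:
            hits = sum(t in transneme_version.split() for t in found_transnemes)
            if hits:
                results.append((hits, doc_num, clear_text))
        return sorted(results, reverse=True)  # most relevant first

    print(lookup(["00LL", "LLUh", "UhVv"], db))
    # [(3, 1, 'I love you'), (2, 2, 'god is love')]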
  • The particular word that matched the search term may also be identified in the result field. The word match may be determined from a separate “child” table, or may be determined from an additional field of the same table. For example, the database entry: [0121]
  • 1 “I love you” 00Ah AhEE EE00 00LL LLUh UhVv VvYY YYoo ooww ww00 [0122]
  • may also be indexed for each transneme-to-word mapping. The transneme-to-vocabulary database 427 might therefore also contain the entries: [0123]
    5 “I” 00Ah AhEE EE00
    6 “love” 00LL LLUh UhVv
    7 “you” YYoo ooww ww00
  • This secondary mapping may be efficiently done as a post-processing scan of the returned transnemes to identify the word boundaries. The efficiency of the secondary mapping is linked to the number of words in the returned clear text phrases. [0124]
  • After the current grouping of found transnemes has been identified, successive additional iterations may be performed to continue the speech recognition process. At the start of each iteration, the frequency domain conversion window may be advanced. The frequency domain conversion window may be advanced by only a portion of its length, so that a current frequency domain conversion overlaps a previous conversion. In a preferred embodiment, the frequency conversion window is about 10 milliseconds in duration and it is advanced about half a window length. An overlapping lookup is performed against the transneme-to-vocabulary database 427. This may be done in order to ensure that no voice stream data is overlooked and to increase reliability of the analysis through multiple comparisons. The overlapping conversions may also be used to clarify a match, using context, and may be used to vote for best matches. The overlapping frequency domain conversions may be used to create a listing for a transneme grouping of the search lookups of the transneme-to-vocabulary database 427, with the transneme matches contained in the listing being organized and ranked according to relevance. [0125]
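  • A small sketch of the voting idea across overlapping conversions; the weighting scheme is an assumption for illustration only:

    from collections import Counter

    def vote(rankings):
        # Each overlapping window yields a ranked list of matches;
        # higher-ranked matches receive more votes.
        votes = Counter()
        for ranked in rankings:
            for rank, match in enumerate(ranked):
                votes[match] += len(ranked) - rank
        return votes.most_common()

    print(vote([["world", "word"], ["world"], ["word", "world"]]))
    # [('world', 4), ('word', 3)]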
  • FIG. 13 is a flowchart 1300 of a third speech recognition method embodiment according to the invention. In step 1301, predetermined features are extracted from the voice stream. The predetermined features may include the volume or amplitude of the voice stream, the pitch (frequency), etc. This feature extraction may be done on a time domain version of the voice stream, before the input voice stream is converted into the frequency domain. Alternatively, the feature extraction may be mathematically extracted from the input voice stream in the frequency domain. For example, in a cellular phone application, a data stream received by the cellular phone may already be in the frequency domain, and the feature extraction may be mathematically performed on the received digital data in order to calculate the volume and the pitch of the speaker. [0126]
  • In step 1303, a frequency domain transformation is performed on the digitized voice stream input to produce the frequency domain output. The output comprises a plurality of frequency band values that represent various frequency components within the voice stream. [0127]
  • The frequency domain transformation may be performed by an FFT device. Alternatively, the frequency domain transformation may be performed by a filter bank comprising a plurality of filters, such as band pass filters. The output of the filter bank is a plurality of frequency band outputs having amplitudes that represent the frequency components within each band. [0128]
  • The frequency domain conversion may be performed for a predetermined time period or window, capturing frequency domain characteristics of the voice stream for a window of time. The frequency domain conversion window is preferably about 10 milliseconds in size. However, this size may be varied, with the size being chosen to accommodate factors such as a desired resolution of the speech. In addition, the language of the speaker may be a factor in choosing the conversion window size. The size of the conversion window must be chosen so as to balance the desired resolution against computational cycles, hardware complexity, etc., because an increase in speech recognition resolution may be accompanied by an increase in computational cycles and/or hardware/memory size. [0129]
  • In step 1305, the frequency domain transformation output is stored. This is preferably done in a plurality of frequency bins, with each bin corresponding to a particular predetermined frequency or frequency band. [0130]
  • In step 1309, the frequency domain outputs are normalized. This may include a frequency normalization and an amplitude normalization, for example. The normalization is performed using speech features that have been extracted from the time domain digitized voice signal. The voice features may include pitch, volume, speech rate, etc. The normalization, therefore, adjusts the values in the frequency bins in order to maintain frequency spectrum frames that are essentially constant (except for frequency components of the voice stream). This accommodates variations in volume and frequency. For example, even if the speaker is speaking very loudly, the overall volume should not matter and should not affect the speech recognition result. Likewise, a change in frequency/pitch by the speaker should not affect the speech recognition. [0131]
  • FIGS. 14-16 show a frequency normalization operation on a current frequency spectrum frame. In order to prevent changes in tonality (frequency) from affecting the transneme determination, a base frequency may be used for a frequency normalization. The base frequency is the frequency band (or bin) containing the greatest amplitude (i.e., the largest frequency component of the speech at that time). During frequency normalization, the contents of the frequency bins may be shifted up or down in order to maintain the base frequency in a substantially constant frequency bin location. [0132]
  • FIG. 14 shows a previous frequency spectrum frame 1400 and a frequency bin set containing the corresponding values. FIG. 15 shows a current frequency spectrum frame 1500 and a frequency bin set containing the corresponding values. It can be seen that the base frequency of the current frequency spectrum frame 1500 is higher in frequency (tonality) than the base frequency of the previous frequency spectrum frame 1400. This may be due to various factors, such as emotion or emphasis on the part of the speaker, etc. [0133]
  • FIG. 16 shows a frequency normalization comprising a frequency shift of the current frequency spectrum frame 1500, forming a new current frame 1600. This may be done merely by shifting the values (V1-VN, for example) in the frequency bin set. This may entail dropping one or more values in order to accommodate the shift. A predetermined frequency shift threshold may be used to prevent excessive frequency normalization, allowing only a limited amount of shifting. In addition to normalizing the current frame, the normalization value may be saved and may be used in a word lookup to determine context and punctuation. [0134]
  • FIGS. 17 and 18 show an amplitude normalization operation on a current frequency spectrum frame. The figures show two successive frequency spectrum frames 1700 and 1800, where the only difference between the frames is an amplitude difference “a”. This amplitude difference may be due to a change in volume of the speaker. However, this change in volume could potentially be seen as a transition between phonemes. Therefore, a normalization is desirable so that the speech recognition is essentially independent of the volume of the speaker. The normalization may be accomplished by uniformly increasing or decreasing the values in all of the frequency bins in order to substantially match the amplitude of the previous frequency spectrum frame. In this example, the value “a” is subtracted from all of the frequency spectrum frame values V1-VN in all of the frequency bins. [0135]
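  • A hedged sketch of both normalizations, assuming NumPy arrays of bin values; the shift limit and the use of mean amplitudes are illustrative assumptions, not taken from the patent:

    import numpy as np

    def frequency_normalize(current, previous, max_shift=8):
        # Shift the current frame so its base (largest) bin lines up with
        # the previous frame's base bin, within a limited shift range.
        shift = int(np.argmax(previous) - np.argmax(current))
        shift = max(-max_shift, min(max_shift, shift))
        shifted = np.zeros_like(current)
        if shift >= 0:
            shifted[shift:] = current[:len(current) - shift]   # drop top values
        else:
            shifted[:shift] = current[-shift:]                 # drop bottom values
        return shifted, shift    # the shift may be saved for context/punctuation

    def amplitude_normalize(current, previous):
        # Remove a uniform volume offset "a" between successive frames.
        a = current.mean() - previous.mean()
        return current - a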
  • In step 1316, a frequency spectrum difference is calculated. As previously discussed, the frequency spectrum difference is calculated between two frequency spectrum frames in order to determine whether a transneme has occurred. The frequency spectrum difference therefore is a set of values that shows the difference in frequency components between the current frequency spectrum frame and a previous frequency spectrum frame. [0136]
  • In step 1320, the frequency spectrum difference is mapped to the transneme table 423 (or other reference or database) in order to determine a found transneme. [0137]
  • In step 1325, the transnemes are used to create a voice stream representation in digital form stored within the voice recognition device or computer. The voice stream representation output is a data stream composed of text or digital representations, such as a series of symbols, complete with some suggested punctuation. [0138]
  • While the invention has been described in detail above, the invention is not intended to be limited to the specific embodiments as described. It is evident that those skilled in the art may now make numerous uses and modifications of and departures from the specific embodiments described herein without departing from the inventive concepts. [0139]

Claims (72)

What is claimed is:
1. A speech recognition device, comprising:
an I/O device for accepting a voice stream;
a frequency domain converter communicating with said I/O device, said frequency domain converter converting said voice stream from a time domain to a frequency domain and generating a plurality of frequency domain outputs;
a frequency domain output storage communicating with said frequency domain converter, said frequency domain output storage comprising at least two frequency spectrum frame storages for storing at least a current frequency spectrum frame and a previous frequency spectrum frame, with a frequency spectrum frame storage of said at least two frequency spectrum frame storages comprising a plurality of frequency bins storing said plurality of frequency domain outputs;
a processor communicating with said plurality of frequency bins;
a memory communicating with said processor;
a frequency spectrum difference storage in said memory, with said frequency spectrum difference storage storing one or more frequency spectrum differences calculated as a difference between said current frequency spectrum frame and said previous frequency spectrum frame;
at least one feature storage in said memory for storing at least one feature extracted from said voice stream;
at least one transneme table in said memory, with said at least one transneme table including a plurality of transneme table entries and with a transneme table entry of said plurality of transneme table entries mapping a predetermined frequency spectrum difference to at least one predetermined transneme of a predetermined verbal language;
at least one mappings storage in said memory, with said at least one mappings storage storing one or more found transnemes;
at least one transneme-to-vocabulary database in said memory, with said at least one transneme-to-vocabulary database mapping a set of one or more found transnemes to at least one speech unit of said predetermined verbal language; and
at least one voice stream representation storage in said memory, with said at least one voice stream representation storage storing a voice stream representation created from said one or more found transnemes;
wherein said speech recognition device calculates a frequency spectrum difference between a current frequency spectrum frame and a previous frequency spectrum frame, maps said frequency spectrum difference to a transneme table, and converts said frequency spectrum difference to a transneme if said frequency spectrum difference is greater than a predetermined difference threshold, and creates a digital voice stream representation of said voice stream from one or more transnemes thus produced.
2. The speech recognition device of claim 1, wherein said voice stream is accepted as a digital voice stream.
3. The speech recognition device of claim 1, wherein said voice stream is compressed.
4. The speech recognition device of claim 1, wherein said I/O device comprises a microphone.
5. The speech recognition device of claim 1, wherein said I/O device comprises a wireless receiver.
6. The speech recognition device of claim 1, wherein said I/O device comprises a digital network interface.
7. The speech recognition device of claim 1, wherein said I/O device comprises an analog network interface.
8. The speech recognition device of claim 1, wherein said frequency domain converter is a Fourier transform device.
9. The speech recognition device of claim 1, wherein said frequency domain converter is a filter bank comprising a plurality of predetermined filters.
10. The speech recognition device of claim 1, wherein said frequency domain output storage is in said memory.
11. The speech recognition device of claim 1, wherein said memory further comprises a feature storage and said processor communicates with said frequency domain output storage and extracts at least one feature from said voice stream in a frequency domain and stores said at least one feature in said feature storage.
12. The speech recognition device of claim 1, wherein said memory further comprises a feature storage and said processor communicates with said I/O device and extracts at least one feature from said voice stream in a time domain and stores said at least one feature in said feature storage.
13. The speech recognition device of claim 1, wherein said frequency domain converter, said frequency domain output storage, said processor, and said memory are included on a digital signal processing (DSP) chip.
14. The speech recognition device of claim 1, wherein said digital voice stream representation comprises a series of symbols.
15. The speech recognition device of claim 1, wherein said digital voice stream representation comprises a series of text symbols.
16. The speech recognition device of claim 1, wherein said speech recognition device converts and compresses said voice stream into a compressed digital voice stream representation comprising a series of symbols.
17. The speech recognition device of claim 1, wherein said speech recognition device converts and compresses said voice stream into a compressed digital voice stream representation and transmits said compressed digital voice stream representation as a series of symbols.
18. The speech recognition device of claim 1, wherein said speech recognition device converts and compresses said voice stream into a compressed digital voice stream representation and stores said compressed digital voice stream representation as a series of symbols.
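Purely as an illustrative note, the storages recited in claim 1 can be pictured as plain data structures. The following Python sketch is one assumed arrangement; the field names, the bin count, and the frame-rotation logic are inventions of the sketch, not limitations of the claim.

from dataclasses import dataclass, field
import numpy as np

N_BINS = 64  # assumed number of frequency bins per frequency spectrum frame

@dataclass
class RecognizerStorages:
    current_frame: np.ndarray = field(default_factory=lambda: np.zeros(N_BINS))
    previous_frame: np.ndarray = field(default_factory=lambda: np.zeros(N_BINS))
    spectrum_differences: list = field(default_factory=list)  # difference storage
    features: list = field(default_factory=list)              # feature storage
    found_transnemes: list = field(default_factory=list)      # mappings storage
    representation: list = field(default_factory=list)        # voice stream repr.

    def advance(self, new_frame):
        # Rotate the two frequency spectrum frame storages and record
        # the frequency spectrum difference between them.
        self.previous_frame = self.current_frame
        self.current_frame = new_frame
        diff = self.current_frame - self.previous_frame
        self.spectrum_differences.append(diff)
        return diff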
19. A method for performing speech recognition on a voice stream, comprising the steps of:
determining one or more candidate transnemes in said voice stream;
mapping said one or more candidate transnemes to a transneme table to convert said one or more candidate transnemes to one or more found transnemes; and
mapping said one or more found transnemes to a transneme-to-vocabulary database to convert said one or more found transnemes to one or more speech units.
20. The method of claim 19, wherein said one or more speech units are combined to create a digital voice stream representation of said voice stream.
21. The method of claim 19, wherein said one or more speech units are combined to create a digital voice stream representation of said voice stream, with said digital voice stream representation comprising a series of symbols.
22. The method of claim 19, wherein said one or more speech units are combined to create a digital voice stream representation of said voice stream, with said digital voice stream representation comprising a series of text symbols.
23. The method of claim 19, with said determining step further comprising comparing at least two frequency spectrum frames in a frequency domain in order to determine said one or more candidate transnemes.
24. The method of claim 19, wherein said voice stream is compressed by said method into a compressed digital voice stream representation comprising a series of symbols.
25. The method of claim 19, wherein said voice stream is compressed by said method into a compressed digital voice stream representation and wherein said method further comprises a step of transmitting said compressed digital voice stream representation as a series of symbols.
26. The method of claim 19, wherein said voice stream is compressed by said method into a compressed digital voice stream representation and wherein said method further comprises a step of storing said compressed digital voice stream representation as a series of symbols.
27. The method of claim 19, wherein a voice stream in a first verbal language is converted into a voice stream representation in a second verbal language.
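As an illustrative note on the three mapping steps of claim 19, the following Python sketch runs candidate transnemes through a transneme table and then through a transneme-to-vocabulary database. Both tables and the greedy segment matching are assumptions of the sketch; the claim itself does not dictate a matching strategy.

# Assumed transneme table and transneme-to-vocabulary database; the
# entries below are invented solely for this sketch.
TRANSNEME_TABLE = {"c1": "t01", "c2": "t02", "c3": "t03"}
VOCAB_DB = {("t01", "t02"): "hello", ("t03",): "world"}

def recognize(candidates, max_span=2):
    # First mapping: candidate transnemes -> found transnemes.
    found = [TRANSNEME_TABLE[c] for c in candidates if c in TRANSNEME_TABLE]
    # Second mapping: segments of found transnemes -> speech units.
    units, i = [], 0
    while i < len(found):
        # Greedily try the longest transneme segment that maps to a unit.
        for span in range(min(max_span, len(found) - i), 0, -1):
            segment = tuple(found[i:i + span])
            if segment in VOCAB_DB:
                units.append(VOCAB_DB[segment])
                i += span
                break
        else:
            i += 1  # no mapping found; skip this transneme
    return units

print(recognize(["c1", "c2", "c3"]))  # -> ['hello', 'world']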
28. A method for performing speech recognition on a voice stream, comprising the steps of:
calculating a frequency spectrum difference between a current frequency spectrum frame and a previous frequency spectrum frame, with said current frequency spectrum frame and said previous frequency spectrum frame being in a frequency domain and being separated by a predetermined time interval; and
mapping said frequency spectrum difference to a transneme table to convert said frequency spectrum difference to at least one transneme if said frequency spectrum difference is greater than a predetermined difference threshold;
wherein a digital voice stream representation of said voice stream is created from one or more transnemes thus produced.
29. The method of claim 28, further including the steps of:
saving tonality level changes of said voice stream; and
using said tonality level changes to add punctuation to said voice stream representation.
30. The method of claim 28, wherein at least one feature is extracted from said voice stream in a time domain.
31. The method of claim 28, wherein at least one feature is mathematically extracted from said voice stream in a frequency domain.
32. The method of claim 28, wherein at least one feature is mathematically extracted from said voice stream in a frequency domain, and wherein said voice stream is a compressed voice stream already in said frequency domain.
33. The method of claim 28, further comprising the steps of:
performing a frequency domain transformation on said voice stream upon a predetermined time interval to create said current frequency spectrum frame;
storing said current frequency spectrum frame in a plurality of frequency bins; and
amplitude shifting and frequency shifting said current frequency spectrum frame based on a comparison of a current base frequency of said current frequency spectrum frame to a previous base frequency of a previous frequency spectrum frame.
34. The method of claim 28, wherein said predetermined time interval is less than a phoneme in length.
35. The method of claim 28, wherein said predetermined time interval is about ten milliseconds.
36. The method of claim 28, wherein said predetermined difference threshold is about 5% of average amplitude of a base frequency bin over a window of less than 100 milliseconds.
37. The method of claim 28, further comprising the steps of:
accumulating a predetermined number of transnemes;
performing a lookup of said predetermined number of transnemes against a transneme-to-vocabulary database; and
matching at least one transneme in said predetermined number of transnemes to at least one speech unit in said transneme-to-vocabulary database.
38. The method of claim 37, wherein about ten to about twenty transnemes are accumulated in said predetermined number of transnemes for performing said lookup against said transneme-to-vocabulary database.
39. The method of claim 37, with the step of performing a lookup against a transneme-to-vocabulary database further comprising performing a free-text-search lookup of said predetermined number of transnemes against said transneme-to-vocabulary database using inverted-index techniques in order to find one or more best-fit mappings of a segment of transnemes in said predetermined number of transnemes to at least one speech unit in said transneme-to-vocabulary database.
40. The method of claim 28, wherein said digital voice stream representation comprises a series of symbols.
41. The method of claim 28, wherein said digital voice stream representation comprises a series of text symbols.
42. The method of claim 28, wherein said voice stream is compressed into a compressed digital voice stream representation comprising a series of symbols.
43. The method of claim 28, wherein said voice stream is compressed by said method into a compressed digital voice stream representation and wherein said method further comprises a step of transmitting said compressed digital voice stream representation as a series of symbols.
44. The method of claim 28, wherein said voice stream is compressed by said method into a compressed digital voice stream representation and wherein said method further comprises a step of storing said compressed digital voice stream representation as a series of symbols.
45. The method of claim 28, wherein a voice stream in a first verbal language is converted into a voice stream representation in a second verbal language.
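As an illustrative note on claims 28 and 33 through 36, the following Python sketch computes overlapping frequency spectrum frames, takes the difference between the current and previous frames, and tests it against an assumed threshold of 5% of the average base-bin amplitude. The sample rate, window function, choice of base bin, and max-difference test are all assumptions of the sketch.

import numpy as np

RATE = 8000                    # assumed sampling rate (samples per second)
FRAME = int(0.010 * RATE)      # one 10 ms frequency spectrum frame
HOP = FRAME // 2               # time-overlapping transforms about every 5 ms

def spectrum(frame):
    # Frequency domain transformation of one windowed frame.
    return np.abs(np.fft.rfft(frame * np.hanning(len(frame))))

def scan(samples):
    # Yield frequency spectrum differences exceeding the assumed claim-36
    # threshold: 5% of the average base-bin amplitude over a window of
    # less than 100 milliseconds (ten 5 ms hops here, i.e. 50 ms).
    prev, base_history = None, []
    for start in range(0, len(samples) - FRAME + 1, HOP):
        curr = spectrum(samples[start:start + FRAME])
        base_history = (base_history + [curr[1]])[-10:]  # assumed base bin
        if prev is not None:
            threshold = 0.05 * np.mean(base_history)
            diff = curr - prev
            if np.max(np.abs(diff)) > threshold:
                yield diff     # candidate for the transneme table lookup
        prev = curr

# Example: list(scan(np.random.randn(RATE))) scans one second of noise.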
46. A method for performing speech recognition on a voice stream, comprising the steps of:
performing a frequency domain transformation on said voice stream upon a predetermined time interval to create a current frequency spectrum frame;
normalizing said current frequency spectrum frame;
calculating a frequency spectrum difference between said current frequency spectrum frame and a previous frequency spectrum frame;
mapping said frequency spectrum difference to a transneme table to convert said frequency spectrum difference to at least one found transneme if said frequency spectrum difference is greater than a predetermined difference threshold; and
creating a digital voice stream representation of said voice stream from one or more found transnemes thus produced.
47. The method of claim 46, further including the steps of:
saving tonality level changes of said voice stream; and
using said tonality level changes to add punctuation to said voice stream representation.
48. The method of claim 46, wherein at least one feature is extracted from said voice stream in a time domain.
49. The method of claim 46, wherein at least one feature is mathematically extracted from said voice stream in a frequency domain.
50. The method of claim 46, wherein at least one feature is mathematically extracted from said voice stream in a frequency domain, and wherein said voice stream is a compressed voice stream already in said frequency domain.
51. The method of claim 46, with said step of performing a frequency domain transformation comprising performing time-overlapping frequency domain transformations.
52. The method of claim 46, with said step of performing a frequency domain transformation comprising performing a Fourier transformation.
53. The method of claim 46, with said step of performing a frequency domain transformation comprising performing time-overlapping frequency domain transformations of a predetermined transformation window about every 5 milliseconds.
54. The method of claim 46, with said step of performing a frequency domain transformation comprising performing time-overlapping frequency domain transformations of an about 10 millisecond transformation window about every 5 milliseconds.
55. The method of claim 46, further comprising the step of storing said current frequency spectrum frame in a plurality of current frequency bins.
56. The method of claim 46, with said step of normalizing comprising normalizing a base frequency of said current frequency spectrum frame to a base frequency of said previous frequency spectrum frame.
57. The method of claim 46, with said step of normalizing comprising frequency shifting said current frequency spectrum frame using an extracted pitch feature.
58. The method of claim 46, with said step of normalizing comprising amplitude shifting said current frequency spectrum frame using an extracted volume feature.
59. The method of claim 46, with said step of normalizing comprising amplitude shifting and frequency shifting said current frequency spectrum frame based on a comparison of a current base frequency of said current frequency spectrum frame to a previous base frequency of said previous frequency spectrum frame.
60. The method of claim 46, further comprising the step of storing said current frequency spectrum frame in a plurality of current frequency bins and with said step of calculating said frequency spectrum difference comprising calculating a plurality of difference values between a plurality of current frequency spectrum frame bin values in said plurality of current frequency bins and a plurality of previous frequency spectrum frame bin values.
61. The method of claim 46, wherein said predetermined time interval is less than a phoneme in length.
62. The method of claim 46, wherein said predetermined time interval is about ten milliseconds.
63. The method of claim 46, wherein said predetermined difference threshold is about 5% of average amplitude of a base frequency bin over a window of less than 100 milliseconds.
64. The method of claim 46, further comprising the steps of:
accumulating a predetermined number of transnemes;
performing a lookup of said predetermined number of transnemes against a transneme-to-vocabulary database; and
matching at least one transneme in said predetermined number of transnemes to at least one speech unit in said transneme-to-vocabulary database.
65. The method of claim 64, wherein about ten to about twenty transnemes are accumulated in said predetermined number of transnemes for performing said lookup against said transneme-to-vocabulary database.
66. The method of claim 64, with the step of performing a lookup against a transneme-to-vocabulary database further comprising performing a free-text-search lookup of said predetermined number of transnemes against said transneme-to-vocabulary database using inverted-index techniques in order to find one or more best-fit mappings of a segment of transnemes in said predetermined number of transnemes to at least one speech unit in said transneme-to-vocabulary database.
67. The method of claim 46, wherein said digital voice stream representation comprises a series of symbols.
68. The method of claim 46, wherein said digital voice stream representation comprises a series of text symbols.
69. The method of claim 46, wherein said voice stream is compressed into a compressed digital voice stream representation comprising a series of symbols.
70. The method of claim 46, wherein said voice stream is compressed by said method into a compressed digital voice stream representation and wherein said method further comprises a step of transmitting said compressed digital voice stream representation as a series of symbols.
71. The method of claim 46, wherein said voice stream is compressed by said method into a compressed digital voice stream representation and wherein said method further comprises a step of storing said compressed digital voice stream representation as a series of symbols.
72. The method of claim 46, wherein a voice stream in a first verbal language is converted into a voice stream representation in a second verbal language.
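As a final illustrative note, claims 39 and 66 recite a free-text-search lookup using inverted-index techniques. The following Python sketch builds an inverted index from an assumed transneme-to-vocabulary database and scores best-fit speech units by their overlap with an accumulated run of transnemes; the database contents and the voting and normalization scheme are assumptions of the sketch.

from collections import defaultdict

# Assumed transneme-to-vocabulary database; entries are illustrative only.
VOCAB = {"hello": ["t01", "t02", "t03"], "world": ["t04", "t05"]}

# Build the inverted index once: transneme -> speech units containing it.
INDEX = defaultdict(set)
for unit, transnemes in VOCAB.items():
    for t in transnemes:
        INDEX[t].add(unit)

def best_fit(accumulated, top=3):
    # Score each speech unit by how many accumulated transnemes it
    # contains, normalized by entry length so short entries are not
    # unfairly favored over long ones.
    votes = defaultdict(int)
    for t in accumulated:
        for unit in INDEX[t]:
            votes[unit] += 1
    scored = sorted(((v / len(VOCAB[u]), u) for u, v in votes.items()),
                    reverse=True)
    return [u for _, u in scored[:top]]

print(best_fit(["t01", "t02", "t04"]))  # -> ['hello', 'world']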
US09/813,965 2001-03-22 2001-03-22 Speech recognition for recognizing speaker-independent, continuous speech Expired - Lifetime US7089184B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/813,965 US7089184B2 (en) 2001-03-22 2001-03-22 Speech recognition for recognizing speaker-independent, continuous speech

Publications (2)

Publication Number Publication Date
US20020184024A1 true US20020184024A1 (en) 2002-12-05
US7089184B2 US7089184B2 (en) 2006-08-08

Family

ID=25213874

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/813,965 Expired - Lifetime US7089184B2 (en) 2001-03-22 2001-03-22 Speech recognition for recognizing speaker-independent, continuous speech

Country Status (1)

Country Link
US (1) US7089184B2 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7580838B2 (en) * 2002-11-22 2009-08-25 Scansoft, Inc. Automatic insertion of non-verbalized punctuation
US7877255B2 (en) * 2006-03-31 2011-01-25 Voice Signal Technologies, Inc. Speech recognition using channel verification
US8510109B2 (en) 2007-08-22 2013-08-13 Canyon Ip Holdings Llc Continuous speech transcription performance indication
EP2008193B1 (en) 2006-04-05 2012-11-28 Canyon IP Holdings LLC Hosted voice recognition system for wireless devices
US9436951B1 (en) 2007-08-22 2016-09-06 Amazon Technologies, Inc. Facilitating presentation by mobile device of additional content for a word or phrase upon utterance thereof
US20090124272A1 (en) * 2006-04-05 2009-05-14 Marc White Filtering transcriptions of utterances
US8005671B2 (en) * 2006-12-04 2011-08-23 Qualcomm Incorporated Systems and methods for dynamic normalization to reduce loss in precision for low-level signals
US9973450B2 (en) 2007-09-17 2018-05-15 Amazon Technologies, Inc. Methods and systems for dynamically updating web service profile information by parsing transcribed message strings
US8140632B1 (en) * 2007-08-22 2012-03-20 Victor Roditis Jablokov Facilitating presentation by mobile device of additional content for a word or phrase upon utterance thereof
US9053489B2 (en) 2007-08-22 2015-06-09 Canyon Ip Holdings Llc Facilitating presentation of ads relating to words of a message
JP5322208B2 (en) * 2008-06-30 2013-10-23 株式会社東芝 Speech recognition apparatus and method
US20100332224A1 (en) * 2009-06-30 2010-12-30 Nokia Corporation Method and apparatus for converting text to audio and tactile output
CN108461081B (en) * 2018-03-21 2020-07-31 北京金山安全软件有限公司 Voice control method, device, equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4780906A (en) 1984-02-17 1988-10-25 Texas Instruments Incorporated Speaker-independent word recognition method and system based upon zero-crossing rate and energy measurement of analog speech signal
US4718087A (en) 1984-05-11 1988-01-05 Texas Instruments Incorporated Method and system for encoding digital speech information
US4908865A (en) 1984-12-27 1990-03-13 Texas Instruments Incorporated Speaker independent speech recognition method and system
US4977599A (en) 1985-05-29 1990-12-11 International Business Machines Corporation Speech recognition employing a set of Markov models that includes Markov models representing transitions to and from silence
US4852180A (en) 1987-04-03 1989-07-25 American Telephone And Telegraph Company, At&T Bell Laboratories Speech recognition by acoustic/phonetic system and technique
DE3931638A1 (en) 1989-09-22 1991-04-04 Standard Elektrik Lorenz Ag METHOD FOR SPEAKER ADAPTIVE RECOGNITION OF LANGUAGE
DE4131387A1 (en) 1991-09-20 1993-03-25 Siemens Ag METHOD FOR RECOGNIZING PATTERNS IN TIME VARIANTS OF MEASURING SIGNALS
US5333236A (en) 1992-09-10 1994-07-26 International Business Machines Corporation Speech recognizer having a speech coder for an acoustic match based on context-dependent speech-transition acoustic models
GB2272554A (en) 1992-11-13 1994-05-18 Creative Tech Ltd Recognizing speech by using wavelet transform and transient response therefrom
US5627939A (en) 1993-09-03 1997-05-06 Microsoft Corporation Speech recognition system and method employing data compression

Patent Citations (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3592969A (en) * 1968-07-24 1971-07-13 Matsushita Electric Ind Co Ltd Speech analyzing apparatus
US3621150A (en) * 1969-09-17 1971-11-16 Sanders Associates Inc Speech processor for changing voice pitch
US4181813A (en) * 1978-05-08 1980-01-01 John Marley System and method for speech recognition
US4284846A (en) * 1978-05-08 1981-08-18 John Marley System and method for sound recognition
US4313197A (en) * 1980-04-09 1982-01-26 Bell Telephone Laboratories, Incorporated Spread spectrum arrangement for (de)multiplexing speech signals and nonspeech signals
US4528688A (en) * 1980-11-12 1985-07-09 Hitachi, Ltd. Continuous speech recognition method
US4435831A (en) * 1981-12-28 1984-03-06 Mozer Forrest Shrago Method and apparatus for time domain compression and synthesis of unvoiced audible signals
US4716592A (en) * 1982-12-24 1987-12-29 Nec Corporation Method and apparatus for encoding voice signals
US4829574A (en) * 1983-06-17 1989-05-09 The University Of Melbourne Signal processing
US4817158A (en) * 1984-10-19 1989-03-28 International Business Machines Corporation Normalization of speech signals
US5058166A (en) * 1987-04-03 1991-10-15 U.S. Philips Corp. Method of recognizing coherently spoken words
US5097510A (en) * 1989-11-07 1992-03-17 Gs Systems, Inc. Artificial intelligence pattern-recognition-based noise reduction system for speech processing
US5745873A (en) * 1992-05-01 1998-04-28 Massachusetts Institute Of Technology Speech recognition using final decision based on tentative decisions
US5544049A (en) * 1992-09-29 1996-08-06 Xerox Corporation Method for performing a search of a plurality of documents for similarity to a plurality of query words
US5455888A (en) * 1992-12-04 1995-10-03 Northern Telecom Limited Speech bandwidth extension method and apparatus
US5550949A (en) * 1992-12-25 1996-08-27 Yozan Inc. Method for compressing voice data by dividing extracted voice frequency domain parameters by weighting values
US5692103A (en) * 1993-04-23 1997-11-25 Matra Communication Method of speech recognition with learning
US5729657A (en) * 1993-11-25 1998-03-17 Telia Ab Time compression/expansion of phonemes based on the information carrying elements of the phonemes
US5956685A (en) * 1994-09-12 1999-09-21 Arcadia, Inc. Sound characteristic converter, sound-label association apparatus and method therefor
US5666466A (en) * 1994-12-27 1997-09-09 Rutgers, The State University Of New Jersey Method and apparatus for speaker recognition using selected spectral information
US5970453A (en) * 1995-01-07 1999-10-19 International Business Machines Corporation Method and system for synthesizing speech
US5978764A (en) * 1995-03-07 1999-11-02 British Telecommunications Public Limited Company Speech synthesis
US5689617A (en) * 1995-03-14 1997-11-18 Apple Computer, Inc. Speech recognition system which returns recognition results as a reconstructed language model with attached data values
US5751905A (en) * 1995-03-15 1998-05-12 International Business Machines Corporation Statistical acoustic processing method and apparatus for speech recognition using a toned phoneme system
US5729741A (en) * 1995-04-10 1998-03-17 Golden Enterprises, Inc. System for storage and retrieval of diverse types of information obtained from different media sources which includes video, audio, and text transcriptions
US5696879A (en) * 1995-05-31 1997-12-09 International Business Machines Corporation Method and apparatus for improved voice transmission
US5799276A (en) * 1995-11-07 1998-08-25 Accent Incorporated Knowledge-based speech recognition system and methods having frame length computed based upon estimated pitch period of vocalic intervals
US5729694A (en) * 1996-02-06 1998-03-17 The Regents Of The University Of California Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
US5839099A (en) * 1996-06-11 1998-11-17 Guvolt, Inc. Signal conditioning apparatus
US6073100A (en) * 1997-03-31 2000-06-06 Goodridge, Jr.; Alan G Method and apparatus for synthesizing signals using transform-domain match-output extension
US6032116A (en) * 1997-06-27 2000-02-29 Advanced Micro Devices, Inc. Distance measure in a speech recognition system for speech recognition using frequency shifting factors to compensate for input signal frequency shifts
US6003004A (en) * 1998-01-08 1999-12-14 Advanced Recognition Technologies, Inc. Speech recognition method and system using compressed speech data
US6304845B1 (en) * 1998-02-03 2001-10-16 Siemens Aktiengesellschaft Method of transmitting voice data
US6138089A (en) * 1999-03-10 2000-10-24 Infolio, Inc. Apparatus system and method for speech compression and decompression
US20020029139A1 (en) * 2000-06-30 2002-03-07 Peter Buth Method of composing messages for speech output
US6510410B1 (en) * 2000-07-28 2003-01-21 International Business Machines Corporation Method and apparatus for recognizing tone languages using pitch information
US20020103646A1 (en) * 2001-01-29 2002-08-01 Kochanski Gregory P. Method and apparatus for performing text-to-speech conversion in a client/server environment
US6625576B2 (en) * 2001-01-29 2003-09-23 Lucent Technologies Inc. Method and apparatus for performing text-to-speech conversion in a client/server environment

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030130843A1 (en) * 2001-12-17 2003-07-10 Ky Dung H. System and method for speech recognition and transcription
US20030220788A1 (en) * 2001-12-17 2003-11-27 Xl8 Systems, Inc. System and method for speech recognition and transcription
US6990445B2 (en) 2001-12-17 2006-01-24 Xl8 Systems, Inc. System and method for speech recognition and transcription
US20030115169A1 (en) * 2001-12-17 2003-06-19 Hongzhuan Ye System and method for management of transcribed documents
US7653540B2 (en) * 2003-03-28 2010-01-26 Kabushiki Kaisha Kenwood Speech signal compression device, speech signal compression method, and program
US20060167690A1 (en) * 2003-03-28 2006-07-27 Kabushiki Kaisha Kenwood Speech signal compression device, speech signal compression method, and program
US7783628B2 (en) 2003-05-15 2010-08-24 Targit A/S Method and user interface for making a presentation of data using meta-morphing
US20070174262A1 (en) * 2003-05-15 2007-07-26 Morten Middelfart Presentation of data using meta-morphing
US7779018B2 (en) * 2003-05-15 2010-08-17 Targit A/S Presentation of data using meta-morphing
US8468444B2 (en) 2004-03-17 2013-06-18 Targit A/S Hyper related OLAP
US20050210389A1 (en) * 2004-03-17 2005-09-22 Targit A/S Hyper related OLAP
US8345830B2 (en) 2004-03-18 2013-01-01 Sony Corporation Method and apparatus for voice interactive messaging
US20050207543A1 (en) * 2004-03-18 2005-09-22 Sony Corporation, A Japanese Corporation Method and apparatus for voice interactive messaging
US20100020948A1 (en) * 2004-03-18 2010-01-28 Kyoko Takeda Method and Apparatus For Voice Interactive Messaging
US7570746B2 (en) 2004-03-18 2009-08-04 Sony Corporation Method and apparatus for voice interactive messaging
US8755494B2 (en) 2004-03-18 2014-06-17 Sony Corporation Method and apparatus for voice interactive messaging
US7774295B2 (en) 2004-11-17 2010-08-10 Targit A/S Database track history
US20060106843A1 (en) * 2004-11-17 2006-05-18 Targit A/S Database track history
US20060122824A1 (en) * 2004-12-07 2006-06-08 Nec Corporation Sound data providing system, method thereof, exchange and program
US8059794B2 (en) * 2004-12-07 2011-11-15 Nec Corporation Sound data providing system, method thereof, exchange and program
US20090187845A1 (en) * 2006-05-16 2009-07-23 Targit A/S Method of preparing an intelligent dashboard for data monitoring
US7949674B2 (en) 2006-07-17 2011-05-24 Targit A/S Integration of documents with OLAP using search
US20080301539A1 (en) * 2007-04-30 2008-12-04 Targit A/S Computer-implemented method and a computer system and a computer readable medium for creating videos, podcasts or slide presentations from a business intelligence application
US8725766B2 (en) * 2010-03-25 2014-05-13 Rovi Technologies Corporation Searching text and other types of content by using a frequency domain
US20110238698A1 (en) * 2010-03-25 2011-09-29 Rovi Technologies Corporation Searching text and other types of content by using a frequency domain
US9542939B1 (en) * 2012-08-31 2017-01-10 Amazon Technologies, Inc. Duration ratio modeling for improved speech recognition
US20130238311A1 (en) * 2013-04-21 2013-09-12 Sierra JY Lou Method and Implementation of Providing a Communication User Terminal with Adapting Language Translation
WO2016206644A1 (en) * 2015-06-26 2016-12-29 北京贝虎机器人技术有限公司 Robot control engine and system
US20210132253A1 (en) * 2019-11-01 2021-05-06 Saudi Arabian Oil Company Automatic geological formations tops picking using dynamic time warping (dtw)
US11914099B2 (en) * 2019-11-01 2024-02-27 Saudi Arabian Oil Company Automatic geological formations tops picking using dynamic time warping (DTW)

Also Published As

Publication number Publication date
US7089184B2 (en) 2006-08-08

Similar Documents

Publication Publication Date Title
US7089184B2 (en) Speech recognition for recognizing speaker-independent, continuous speech
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US11594215B2 (en) Contextual voice user interface
US11056097B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
EP1301922B1 (en) System and method for voice recognition with a plurality of voice recognition engines
US6751595B2 (en) Multi-stage large vocabulary speech recognition system and method
EP1936606B1 (en) Multi-stage speech recognition
US10176809B1 (en) Customized compression and decompression of audio data
US6553342B1 (en) Tone based speech recognition
EP2048655A1 (en) Context sensitive multi-stage speech recognition
JP2003316386A (en) Method, device, and program for speech recognition
EP2104935A1 (en) Method and system for providing speech recognition
US11798559B2 (en) Voice-controlled communication requests and responses
US20080243504A1 (en) System and method of speech recognition training based on confirmed speaker utterances
US11715472B2 (en) Speech-processing system
US20030220792A1 (en) Speech recognition apparatus, speech recognition method, and computer-readable recording medium in which speech recognition program is recorded
JP3535292B2 (en) Speech recognition system
CN113744722A (en) Off-line speech recognition matching device and method for limited sentence library
US20080243499A1 (en) System and method of speech recognition training based on confirmed speaker utterances
US20070129945A1 (en) Voice quality control for high quality speech reconstruction
US11735178B1 (en) Speech-processing system
KR100480506B1 (en) Speech recognition method
KR100622019B1 (en) Voice interface system and method
US20080243498A1 (en) Method and system for providing interactive speech recognition using speaker data
KR100304788B1 (en) Method for telephone number information using continuous speech recognition

Legal Events

Date Code Title Description
AS Assignment

Owner name: NURV CENTER TECHNOLOGIES, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ROREX, PHILLIP G.;REEL/FRAME:011625/0911

Effective date: 20010305

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2553)

Year of fee payment: 12