WO1996003741A9 - System and method for facilitating speech transcription - Google Patents

System and method for facilitating speech transcription

Info

Publication number
WO1996003741A9
WO1996003741A9 (PCT/US1995/009130)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
word choice
word
machine
possible word
Prior art date
Application number
PCT/US1995/009130
Other languages
French (fr)
Other versions
WO1996003741A1 (en)
Filing date
Publication date
Priority to AU31368/95A priority Critical patent/AU3136895A/en
Publication of WO1996003741A1 publication Critical patent/WO1996003741A1/en
Publication of WO1996003741A9 publication Critical patent/WO1996003741A9/en

Links

Definitions

  • the field of the invention is speech recognition and transcription systems.
  • Conventional speech recognition systems have used several independent approaches with limited success.
  • One approach models the vocal tract, articulation, and acoustic production of speech.
  • a second approach models the acoustic waveform and the spectrum of human speech and uses signal processing methods.
  • a third approach models the human ear and its detection mechanisms through neural network methods.
  • a fourth approach analyzes phonetic features, perception, and linguistic models of human speech.
  • word boundary issues can arise when any subset of sequential syllables within a multi-syllable word is itself a word.
  • Word boundary problems also arise when the ending syllable or syllables of one word can be joined with subsequent syllables to form another word.
  • Homonyms and multiple spellings of one word also give rise to ambiguities. Therefore, fully automated, accurate end-to-end continuous speech recognition involves intensive data analyses, speech understanding and logical reasoning capabilities.
  • Currently available continuous speech processors thus suffer from transcription errors and other limitations as the requisite technology is currently prohibitively expensive and uneconomical for commercial purposes.
  • the system and method according to the present invention facilitates speech transcription of normally spoken continuous speech without sacrificing accuracy in transcription.
  • Speech is digitally processed to extract its spectral features. These features are then used to distinguish individual phonemes and to generate a string of equivalent machine-readable phonetic symbols.
  • the string of machine-readable phonetic symbols is processed to identify possible word boundaries for each spoken word within the string. In one embodiment, this process is performed one spoken word at a time.
  • the possible word choices for each spoken word, as delineated by the identified word boundaries are visually displayed.
  • The term "word choice" as used throughout this application represents both singular and multiple word choices, as both possibilities may exist and optionally may be displayed for each spoken word.
  • the representation format can vary depending on the particular language used.
  • word choices may be represented as a string of Hiragana and Katakana symbols, each corresponding to a machine-readable phonetic symbol, or as Kanji characters, or as a combination of Kanji, Hiragana and Katakana letters.
  • Word choices in English, German, French, or other languages may be represented as alphanumeric text or other appropriate representations.
  • the proper word choice corresponding to each spoken word can be readily selected, thereby ensuring accurate transcription.
  • the possible word choices can be displayed in various orders, for example, by alphabetical order, by order of probability, or by order of syllabic length, thereby facilitating word selection.
  • Other display orders including those discussed later, would be apparent to those skilled in the art and are within the scope of the invention.
  • editing tasks such as punctuation, margins, paragraphs, and capitalization, can be performed during this process.
  • Other embodiments process the input speech in larger blocks of speech. For instance, an embodiment can process the input speech for any block of speech which is bracketed by silence intervals, such as a phrase or sentence at one time. Word choices are then presented for each word within the block of speech at one time.
  • the invention recognizes that the limitations of currently available conventional technology render this task commercially impractical. Accordingly, in one embodiment, the system according to the invention truncates the conventional process and presents possible solutions to the ambiguities in an environment wherein the correct alternative may be readily selected. Thus, according to the invention, ambiguities may be readily and accurately resolved while repetitive tasks, i.e. collecting, converting and analyzing massive speech data and searching for words in a vocabulary set, are performed automatically.
  • the invention has the further advantages of being both language and speaker independent, in that the invention accommodates any speaker and any spoken language which can be transcribed. Other and further objects and advantages will appear hereinafter.
  • FIG. 1 is a logical diagram of a preferred embodiment of the invention.
  • FIG. 2 is a block diagram of speech acceptance and analog-digital hardware suitable for practicing the invention.
  • FIG. 3 is a block diagram of a spectral processor suitable for practicing the invention.
  • FIG. 4 is a block diagram of a phoneme labeller suitable for practicing the invention.
  • FIG. 5 is a block diagram of a speech re-synthesizer suitable for practicing the invention.
  • FIG. 6 is a graph of the performance of a pre-emphasis filter which is suitable for practicing the invention.
  • FIG. 7 is a graph and equations showing computation of acoustic power.
  • FIG. 8 is a graph and equation showing peak-to-peak pitch estimation.
  • FIGs. 9A-9B are graphs and equations showing frequency domain processing.
  • FIGs. 10A-10B are an example of forward pass labelling.
  • FIG. 11 is an example of backward labelling. Appendix A describes spectral distortion techniques.
  • Appendix B describes Mel-scale filters.
  • Appendix C describes weighting techniques.
  • Appendix D describes forward pass and backward labelling.
  • Appendix E describes ranking word candidates.
  • a preferred embodiment of the system has a language, user, and mode selector 2, analog-to-digital convertor 4 interacting with an input device 6, a spectral processor 8, phoneme labeller 10 interfacing with phoneme models 14 and preferably, training data 16, a software pre-processor 11, and a word processor 12.
  • the system may also include a speech re-synthesizer 13.
  • the speaker dictates speech into the system through the input device 6, which may be a microphone, telephone, other line input, or any input device which converts speech into electromagnetic signals.
  • the term, 'electromagnetic signal' includes any form of electrical signals, including microwaves, radio signals, and television signals, as well as optical and other signals such as infrared and laser.
  • the speaker identifies him or herself and selects a language and an operational mode.
  • Three of the possible operational modes are training, dictation, and display and editing. These three modes are those known to the inventors at the present time. However, other operational modes may become apparent to one skilled in the art as the technology develops and would also be within the scope of the invention. As explained in detail later, the dictation and display and editing modes may be operated concurrently.
  • the speaker dictates speech into the system.
  • the system analyzes the speech to generate data representing the speaker's distinctive voice characteristics.
  • This data is stored as training data and is later used by the system to identify the speaker's words when the system is operated in dictation mode. While not required to practice the broad scope of the invention, completion of the training mode in a preferred embodiment greatly facilitates the efficient practice of the invention.
  • the speaker dictates speech into the system, using the input device 6.
  • the speaker may dictate the speech at a prior time using conventional means such as a tape recorder and the audio or electromagnetic output of the conventional means can be transmitted to the input device 6.
  • the input speech is passed through the analog-digital converter 4, then through the spectral processor 8, and then through the phoneme labeller 10.
  • the output of the phoneme labeller 10 can either be stored for later retrieval and processing, or can be sent to the software pre-processor 11. It will be appreciated that elements of the system may be combined or rendered unnecessary as technological capabilities improve, while remaining within the scope of the invention.
  • editing and other functions may be performed through the software pre-processor 11 and word processor 12, using keyboard, mouse, digitizing pen, voice commands, or any other input-positioning device activated contemporaneously or at a later time by the speaker, by a subsequent user, or by a suitable software/computer combination.
  • the software pre-processor 11 receives the output of the phoneme labeller 10 and works in conjunction with the word processor 12 to display possible word candidates for each spoken word.
  • the displayed word choice is selected by the speaker, a subsequent user, or by an appropriate software/computer combination.
  • the speaker or a subsequent user may also manually input an alternative word choice. This process can then be repeated for the next and subsequent words within the input speech, until the entire input speech has been transcribed.
  • the system can include a speech re-synthesizer 13 which can reproduce the spoken words.
  • the speech input may be stored for later processing.
  • the speech re-synthesizer can facilitate accurate transcription. Preferred embodiments of the invention are now described in detail.
  • Preliminary training is preferably performed to initialize the system for each particular speaker or for each time that a particular speaker uses a new language.
  • the speaker-specific training data 16 is established to tune the system to the speaker's individual speech characteristics for the particular chosen language.
  • the system thus adapts to and accommodates a diverse set of speakers regardless of gender, age, accent, dialect, language or any other factor which can contribute to a difference in the pronunciation of any particular word. Initialization only needs to be performed once for any particular speaker for each particular language.
  • the speaker-specific training data 16 can be established independent of the invention or a system according to the invention may be used. In embodiments where the speaker generates training data by using the system, the speaker dictates a pre-established reference set of words into the system. Various reference sets of words can be used for any particular language.
  • the phoneme labeller 10 is then used to generate training data 16 by using a spectral distortion measure to compare the speaker's speech against reference speech samples of the pre-determined set of words.
  • Spectral distortion techniques are well-known to those skilled in the art and are also documented in Discrete-Time Processing of Speech Signals by Deller, Proakis, and Hansen, which is incorporated herein by reference and is also attached as Appendix A.
  • Training data can also be generated independent of the invention through other techniques well known to those skilled in the art.
  • the training data 16 is stored into a data base.
  • the data base may be organized in a variety of suitable structures; however, a preferred embodiment uses the same structure as in DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM NISTIR 4930, NTIS, February, 1993, which structure is shown in Example 1:
  • Example 1 database hierarchy:
  • the system of this preferred embodiment is used in dictation mode.
  • speech is received into the system via any of a variety of conventional devices, including but not limited to static or dynamic microphones 18, telephones 22, or other line input 20.
  • speech input could also be transmitted and received by other technology via electromagnetic signals.
  • If microphones are used, they preferably have noise limiting capability.
  • appropriate amplifiers 24 and interfaces 26 are used to connect the receiving devices with a low-pass analog filter 28 and to ensure a fixed level output compatible with digital equipment. These devices are readily available. Moreover, one skilled in the art would know how to choose appropriate amplifiers and interfaces. The text, Digital Signal Analysis, by Stearns, Hayden 1975 can also be referenced to determine the compatibility of these devices.
  • the signal representing the input speech is passed through a low pass analog filter 28 which eliminates frequencies that can alias into the acoustic signal during subsequent processing.
  • One preferred embodiment uses a 3rd order Butterworth analog filter with 10 kHz and 6 dB/octave as the low-pass filter of choice.
  • the signal is then passed through an analog-to-digital converter 30. While a variety of analog-to-digital converters may be used for this purpose, a preferred embodiment uses a Motorola 56 ADC 16 with sixteen bit resolution at 20 kHz and a continuously adjustable gain over a 20 dB range.
  • the digitized signal is passed through a spectral processor, as shown in FIG. 3, to extract spectral speech features.
  • Two classes of problems contribute to background noise.
  • One class is attributed to the speaker and consists of sound artifacts such as lip smacks, heavy breathing, mouth clicks, and nasal pops. Although such artifacts are generated inadvertently, they often have an energy level comparable to speech.
  • the invention preferably models speaker artifacts and detects them along with phoneme recognition so that they may be removed during subsequent processing.
  • a second class of noise problems arises from the ambient noise environment.
  • non-speaker generated background noise such as a bell, whistle or other sound
  • the system can use a noise limiting microphone for receiving the dictated speech.
  • Other means of eliminating background noise will be apparent to those skilled in the art.
  • a running average of the ambient background noise during periods of silence can be kept so that sudden non-vocal tract background noises can be easily classified and eliminated.
  • a preferred embodiment uses both a noise limiting microphone and continuous modelling of background noise.
  • the digitized signal is first passed through a pre-emphasis filter 32 which works in connection with a noise model 34 to eliminate background noise.
  • the noise model 34 is initially set to a default noise level and is adaptively updated during subsequent processing of the input speech. This adaptive feature is later discussed in connection with frequency domain processing.
  • the pre-emphasis filter 32 is a Finite Impulse Response (FIR) bandpass digital filter with zero phase shift and unity gain. The derivation of the bandpass coefficients for such a filter is shown in Example 2.
  • a 20 khz sampling rate with a bandpass from 100 hz to 6000 hz is assumed.
  • the sample signal has a 60 hz hum component, a 7000 hz whistle, random noise, and an acoustic signal at 1000 hz.
  • the sample signal is given by input_i = Σ_j a_j · sin(i · dt · 2π · f_j) + rnd(noise), summed over the signal components f_j.
  • the performance of the pre-emphasis filter 32 is shown in FIG. 6.
  • the upper trace (a) is the acoustic input signal.
  • the middle trace (b) is the result of the pre-emphasis filter.
  • the third trace (c) is the desired signal that is embedded in the signal.
  • the lower trace (d) is the absolute difference of the acoustic signal and the filter output.
  • the signal is then sampled into blocks of data as indicated by the sampler 36. Since the human vocal mechanism modulates the slowly changing speech signal onto a higher frequency sound wave, it would be necessary to sample the acoustic wave form at over 10,000 times per second to capture this speech wave. In contrast, the movement of the tongue, jaw, lips and other vocal articulators changes at the far slower rate of less than 100 times per second. This physical situation is exploited by grouping acoustic data into approximately one hundredth of a second blocks to isolate phonetic features and to identify noise sources.
  • a preferred embodiment samples the signal into blocks of 512 samples. Oversampling can be used to increase accuracy during subsequent processing. Thus, a preferred embodiment uses a 25%, or 128-sample overlap between adjacent blocks. As can be readily appreciated by one skilled in the art, numerous other data block sizes and overlap ranges can be used. For example, blocks of 256, 1024 or 2048 samples may be used. Moreover, any overlap range between 0 and 50% may be used between adjacent blocks.
  • sampling may be accomplished with hardware, microcode, or other methods known to those skilled in the art, a preferred embodiment uses microcode. As technology advances, other sampling methods and devices suitable for practicing the invention may become available.
  • the sampled signal is then processed in the time domain, as indicated by element 38, to extract the spectral speech features of acoustic power and the peak-to-peak pitch.
  • Acoustic power indicates the presence of speech or silence intervals that naturally occur at the end of sentences, phrases, or words.
  • a short time energy estimate is made for each sample block by averaging either the magnitude or square of the signal within the block, otherwise referred to as absolute power and squared power, respectively.
  • the peak-to-peak pitch is estimated by summing the signal zero crossings.
  • Equations (b) and (c) are used to compute the absolute power and squared power, respectively.
  • the peak-to-peak pitch is estimated by summing the signal zero crossings of the estimated signal shown in plot (d), as shown by equation (e).
  • the signal is also passed through acoustic band pass filters 40 to identify the relative signal power in each acoustic band, otherwise referred to as the spectral pattern. This information can be used later to assist in phoneme identification.
  • the band pass filters are selected based on the Mel scale.
  • the Mel scale is a logarithmic-based scale which is more fully described in the publication Advanced Algorithms and Architectures for Speech Understanding, ESPRIT Research Report Project 26, Vol. 1, Pirani, 1990, and is herein incorporated by reference, and is also attached as Appendix B.
  • the estimated formant frequencies are determined by processing the signal in the frequency domain as shown by element 42.
  • An example of frequency domain processing is given in Figures 9A-9B.
  • plot (a) represents the same sample block as plot (a) in Figure 7.
  • the sample block can be first tapered using a tapering function.
  • Numerous tapering functions, including a Hanning Window as shown in Equation (b), can be used to taper the 512 elements of the sample block, and are well-known to those skilled in the art. Tapering reduces high frequency noise in the subsequent Fast Fourier Transform (FFT). Since the signal was previously oversampled by 256 elements (128 elements at each end), tapering promotes accuracy, rather than loss of data.
  • Plot (c) represents the tapered sample block. As would be appreciated by one skilled in the art, numerous other tapering functions can be used.
  • during silence intervals, the data from frequency domain processing necessarily represents only ambient background noise.
  • the system and method according to the invention update the noise model 34 with this data.
  • the noise model 34 reflects a semi-continuous model of background ambient noise.
  • This adaptive feedback enhances subsequent processing of speech through the pre-emphasis filter 32.
  • the spectral processor extracts the following spectral speech features: acoustic power, spectral pattern, and formant frequencies.
  • the extracted spectral speech features are passed to a phoneme labeller 10 as shown in FIG. 4.
  • the previously extracted spectral feature of relative signal power can be used to classify phonemes into one of the following phonetic categories: vowels, diphthongs, liquid or glide semivowels, nasal consonants, voiced or unvoiced fricatives, affricates, whisper and voiced or unvoiced stops.
  • Table 1 describes a possible phoneme classification scheme for forty standard phonemes used in Western/European languages:
  • Table 1:
    vowels, divided into:
      front -- IY, IH, EH, AE
      mid -- AA, ER, AH, AX, AO
      back -- UW, UH, OW
    diphthongs -- AY, OY, AW, EY
    semivowels, including:
      liquids -- WW, LL
      glides -- RR, YY
    nasal consonants -- MM, NN, NG
    stops:
      voiced -- BB, DD, GG
      unvoiced -- PP, TT, KK
    fricatives:
      voiced -- W, TH, ZZ, ZH
      unvoiced -- FF, TH, SS, SH
    affricates -- JH, CH
    whisper -- HH
  • each phonetic category contains at most five elements out of a possible forty. Moreover, these elements are usually well separated by physical features of the vocal tract so that subsequent identification of an individual phoneme within a phonetic category is enhanced.
  • a 'forward pass' 44 is made through each signal block and phoneme candidates are identified by comparing against existing features in the phoneme models 14.
  • the phoneme models 14 are pre-determined reference sets of data for each language. Techniques for generating these sets are well-known.
  • TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM NISTIR 4930, NTIS, February, 1993 is one such set which is readily available to one skilled in the art.
  • a score is kept for each phoneme candidate based on its probability of being the correct phoneme spoken.
  • the function is described as a four-point trapezoid (vw) and disjunction.
  • each phoneme candidate feature and sample feature are used to form a normalized membership function of the sample in the phoneme data base.
  • all phonemes have a possibility of assignment in each sample block.
  • the features can be weighted to maximize the probability of phoneme identification. The weighting process can be accomplished through a variety of techniques which are well-known to those skilled in the art, some of which are described in Discrete-Time Processing of Speech Signals, which are incorporated herein by reference and are also attached as Appendix C.
  • the determination of the weighting function is based on experimental scoring derived from the training data 16 for the speaker.
  • the weighting function can be updated through adaptive feedback from the user. An example of forward pass labelling is given in FIGs. 10A-10B.
  • speaker artifacts and sudden, loud background noises can interfere with the input speech so that the system is unable to find a reasonable phoneme match for the received sound.
  • such interference is identified and the speaker is requested to repeat the masked speech.
  • spectral processing 8 and forward pass labelling 44 continue for the next and subsequent speech segments within the speech signal.
  • a silence interval could be defined as any time interval longer than 1/10 (one-tenth) of a second in which the ambient noise level does not exceed 10 dB over the average ambient background noise level. Shorter time intervals and smaller decibel increases may also be used. For instance, a shorter time period may be used for speakers who dictate too rapidly to pause for more than 1/10 of a second, or for speakers who speak too softly to develop a 10 dB increase over the ambient background noise level. Thus, it is also within the broad scope of the invention for embodiments to define silence intervals as short as 1/15 of a second, or to require only a 9 dB increase. Other embodiments may require over 12 dB increases for speakers who are particularly loud. Other combinations are readily apparent to those skilled in the art.
  • Backward labelling identifies the N most possible phoneme sequences within the speech segment.
  • Each phoneme candidate is ranked by its possibility of assignment in each sample block, based on conventionally known methods of phoneme identification. These methods are discussed in Discrete-Time Processing of Speech Signals, and are incorporated herein by reference, and are also attached as Appendix D. As shown in FIG. 11, the candidates are sorted to maximize the possibility of phoneme identification.
  • the machine-readable phonetic symbols are directly output to a software pre-processor 11 as shown in FIG. 1.
  • the software pre-processor 11 can be separately joined to or integrated with a word processor 12 running, for example, on a personal computer, on stand-alone equipment or as part of a network or central computer system.
  • the software pre-processor 11 may be connected to the phoneme labeller 10 via cable connection or any conventionally available wireless communication device such as infrared transmittal, laser, radio, microwaves, or television. The process then continues with the editing mode described later.
  • the machine-readable phonetic symbols can be stored on any conventionally available device which stores machine-readable data, such as, for example, computer data disks, hard drives, flash, static or dynamic memory, tape, or CD-ROM.
  • Once the machine-readable phonetic symbols are stored, the process can be re-started at a later time by loading the machine-readable phonetic symbols into the software pre-processor 11 and continuing with the display and editing mode described below.
  • Another preferred embodiment automatically stores the machine-readable phonetic symbols while outputting the machine-readable phonetic symbols to the software pre-processor 11.
  • the system can be operated in the display and editing mode, utilizing the software pre-processor 11, word processor 12, and speech re-synthesizer 13, as described below.
  • the display and editing mode occurs after the speaker has finished dictating, as the machine-readable phonetic symbols which represent the dictated speech can be stored for later processing in the display and editing mode.
  • the speaker may continue dictating while words spoken earlier are being processed through the display and editing mode. For example, a secretary may operate the display and editing mode while the speaker continues to dictate. Alternatively, the speaker may pause during dictation to edit prior dictation.
  • the software pre-processor 11 provides an option to edit the machine-readable phonetic symbol string. This option is particularly useful for written languages which can be readily comprehensible as phonetic symbols, such as Japanese Hiragana and Katakana.
  • the software pre-processor 11 outputs the machine-readable phonetic symbol string to the word processor 12, where it is displayed. If a wrong machine-readable phonetic symbol or set of symbols is detected by the speaker or subsequent user, the symbol or symbols can be manually overridden by keyboard, mouse, digitizing pen, voice command or any other input-positioning device such as a track ball, joystick or touchscreen.
  • the user manually selects the subject symbol or symbols using either a keyboard, mouse, digitizing pen, or other input-positioning device, and then re-dictates the input speech as desired.
  • This speech is then processed as before, i.e. through the analog-digital convertor 4, spectral processor 8, phoneme labeller 10 and software pre-processor 11 to update the machine-readable phonetic symbol string.
  • each possibility can be displayed so that the speaker or subsequent user can readily select the correct machine-readable phonetic symbol.
  • a preferred embodiment only displays the two most likely machine-readable phonetic symbols.
  • the machine-readable phonetic symbol string may be edited using the software pre-processor 11 and word processor 12.
  • Certain languages such as English, German, and French, differ from Japanese in that no corresponding syllabic symbols, such as Hiragana and Katakana, exist for the former.
  • a universal phonetic symbol representation can be used.
  • the speaker or subsequent user can opt to skip machine-readable phonetic symbol string editing altogether and proceed directly to locating word boundaries.
  • the software pre-processor 11 identifies word boundaries within the machine-readable phonetic symbol string. Each substring of machine-readable phonetic symbols delineated by the identified word boundaries is combined into a possible word choice, which is represented in machine-readable code. While this process can be performed for the entire machine-readable phonetic symbol string or any segment thereof at one time, a preferred embodiment identifies word boundaries on a word-by-word basis. In this preferred embodiment, the software pre-processor 11 starts at the beginning of the machine-readable phonetic symbol string and identifies possible word boundaries within the string for the first spoken word. Identification of possible word boundaries for subsequent spoken words occurs later, and can be accomplished, in part, by using adaptive feedback to determine the next starting word boundary. This embodiment is described in detail later.
  • the software pre-processor 11 identifies every Kanji character which can be represented by any substring of machine-readable phonetic symbols with a beginning word boundary at the start of the machine-readable phonetic symbol string. These Kanji characters are each considered a possible word choice. Techniques for combining machine-readable phonetic symbols into words are commonly known. For instance, commercially available Japanese word processors operate on these principles to convert Hiragana and Katakana symbols into Kanji characters.
  • the possible word choices delineated by the identified word boundaries are then ranked in order of probability by reference to linguistic usage, including such factors as grammatical and contextual syntax.
  • the possible word choices may be ranked by alphabetical or reverse alphabetical order, or by increasing or decreasing syllable length, where syllable length is measured by the number of syllables within a word.
  • An alternative method of ranking the word choices is with reference to the prior usage of words in the dictated speech.
  • previously dictated and selected words, as later described, are accessed and if matches are found, the word choices presented reflect the previous usage.
  • the word choices can then be ranked, for example, by their most recent usage, by the time elapsed between the current words and their prior usage (i.e. age), or by frequency of usage.
  • the possible word choices are then output, in machine-readable form, to the word processor 12, or are stored for future use.
  • the word processor 12 possesses all the features commonly available on commercial word processors.
  • the word processor 12 displays the possible word choices in representations readily comprehensible to the speaker or a subsequent user.
  • the representation format may vary depending on the language used. For instance, in Japanese, the word processor can represent each possible word choice as a Kanji character, Hiragana and Katakana representations, or a combination of the three. For English, German, French and other similar languages, the possible word choices can be represented as alphanumeric text or other appropriate representations.
  • the possible word choices can be displayed in order of probability, alphabetical order, reverse alphabetical order, by increasing or decreasing syllable length, where syllable length is measured by the number of syllables in a word, by the most recent usage, by the time elapsed between the current word and the word choice's prior usage (i.e. age), or by frequency of usage.
  • the most probable word choice is displayed separate from the remaining word choices and the remaining possible word choices are then displayed as two independent sequences. One of these sequences ranks the remaining word choices by increasing syllable length and the other sequence is ranked by alphabetical order.
  • the most probable word choice can be displayed separate from the remaining words. For instance, the most probable word choice can be displayed in the first line and the two independent sequences of remaining words as separate columns below the first line. Alternatively, the most probable word choice can be displayed in bold format as compared to the remaining words. Other variations are readily apparent to one skilled in the art.
  • the word processor 12 further provides the speaker, a subsequent user, or a suitable software/computer combination with functionality to readily select any displayed word choice or to manually input a word. These options may be similar to spell checking features commonly available on conventional word processing software packages.
  • the selection process may be performed by any one of numerous means, including but not limited to voice command, mouse, digitizing pen, keyboard or any other input-positioning device. Insertion of punctuation and similar tasks can be accomplished at this time either manually, or through macros or voice commands.
  • a preferred embodiment further provides an option whereby the speaker or a subsequent user may call up the machine-readable phonetic symbol substring associated with any displayed word.
  • the string may then be edited in the same manner as before, after which the software pre-processor 11 re-translates the modified machine-readable phonetic symbol substring into a word and displays the word for further editing.
  • the user may manually select word boundaries within the recalled machine-readable phonetic symbol substring, which the software pre-processor 11 then uses to re-combine the machine-readable phonetic symbols into words which are displayed for acceptance or further editing.
  • a preferred embodiment of the system has the additional capability of re-synthesizing the sound represented by any speech segment so that the user can hear what speech was actually dictated.
  • the speech re-synthesizer first uses a digital signal generator 50 which receives machine-readable phonetic symbols from the software pre-processor 11 and utilizes the look-up tables 15, and if desirable, training data 16, to convert the machine-readable phonetic symbols into a digital signal.
  • This digital signal is fed to a digital-to-analog convertor 52, and then to a speaker driver 54 connected to an audio speaker 56.
  • Each of these devices is commercially available.
  • the speaker need not personally perform the editing process since any ambiguities can be understood by any subsequent user, especially in embodiments incorporating re-synthesized speech processes.
  • this particular re-synthesis feature can be implemented for uses other than speech transcription. For instance, storing speech as an analog signal consumes a relatively large amount of storage media. Even a digitized speech signal requires a substantial amount of storage space. However, storing speech as machine-readable phonetic symbols requires relatively little space.
  • the above speech re-synthesis technique constitutes an ideal speech compression method which is also readily adapted to telephone answering machines and other voice-storage applications.
  • the word processor 12 utilizes adaptive feedback so that information as to a selected word is incorporated into the selection of the next starting word boundary. After the correct word choice has been selected from the possibilities presented, this information is fed back to the software pre-processor 11 so that the software pre-processor will begin at the next machine-readable phonetic symbol. Thus, subsequent word boundary identification is enhanced since the beginning word boundary has already been determined.
  • each selection of an ambiguous word necessarily resolves ambiguities in phoneme identification.
  • a preferred embodiment exploits this situation by also using adaptive feedback so that user selection information can be used to update the weighting functions for subsequent phoneme labelling.
  • this feature allows the system and method according to the invention to further adapt to a speaker's particular speech characteristics. It is recognized that this feedback capability may not always be desirable. For instance, a speaker's speech characteristics may vary from time to time, due to general health, stress or other factors. Feedback may be undesirable in these instances, since the weighting functions will be updated with anomalous data. Therefore, unlike feedback for determining starting word boundaries, an option is provided whereby feedback to the weighting functions can be disabled.
  • feedback to the weighting functions is preferably used only for a short period of time, depending on the consistency of an individual speaker's speech. After the weighting functions have been updated during this period, feedback to the weighting function is disabled. Feedback may be subsequently enabled to update the weighting functions if a speaker's speech characteristics have changed over a period of time, for instance, as a speaker grows older.
  • the process is iterated for the next and subsequent words.
  • the software pre-processor 11 begins at the next machine-readable phonetic symbol within the machine-readable phonetic symbol string to identify possible word boundaries for the next word.
  • While a preferred embodiment analyzes the machine- readable phonetic symbol string on a word-by-word basis, it is within the scope of the invention to analyze larger blocks of speech at one time, for instance, sentences or phrases, or any blocks of speech which are bracketed by silence intervals.
  • the software pre-processor 11 determines word boundaries for all the words in the entire phrase or sentence, which is then displayed through the word processor 12. The user is then given the option of validating the entire phrase or sentence as correct, or manually editing any word boundary within the displayed block. In the latter case, the software pre-processor 11 re-identifies word boundaries for the remaining words within the displayed block, since the user's editing may change subsequent word boundaries within the sentence. The updated block is then re-displayed for acceptance or further editing. A minimal sketch of this block-level boundary identification and re-identification follows this list.
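The following Python sketch illustrates the block-level behaviour described in the item above: segment an entire phonetic symbol block into words, then keep a user-confirmed boundary and re-identify boundaries only for the remainder. The lexicon, the greedy longest-match strategy, and all names are hypothetical illustrations, not the patent's implementation.

```python
# Hypothetical sketch: block-level word-boundary identification and
# re-identification after a manual edit. Toy phoneme-string lexicon only.

LEXICON = {"AY": "I", "AY SS": "ice", "SS KK RR IY MM": "scream",
           "KK RR IY MM": "cream"}

def best_segmentation(symbols, lexicon):
    """Greedy longest-match segmentation of a phonetic symbol block (illustrative)."""
    words, start = [], 0
    while start < len(symbols):
        for end in range(len(symbols), start, -1):      # prefer the longest word
            key = " ".join(symbols[start:end])
            if key in lexicon:
                words.append((lexicon[key], start, end))
                start = end
                break
        else:
            start += 1                                   # skip an unmatched symbol
    return words

def re_identify_after_edit(symbols, lexicon, fixed_boundary):
    """Keep the user-confirmed boundary, then re-segment only the remainder."""
    head = best_segmentation(symbols[:fixed_boundary], lexicon)
    tail = best_segmentation(symbols[fixed_boundary:], lexicon)
    return head + [(w, s + fixed_boundary, e + fixed_boundary) for w, s, e in tail]

block = "AY SS KK RR IY MM".split()
print(best_segmentation(block, LEXICON))          # [('ice', 0, 2), ('cream', 2, 6)]
print(re_identify_after_edit(block, LEXICON, 1))  # [('I', 0, 1), ('scream', 1, 6)]
```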

Abstract

The invention provides a system and method for facilitating speech transcription which accepts continuous speech from any of a variety of conventional devices capable of converting spoken words to electromagnetic signals, including microphones or telephones and, if the input signal is analog, converts the input signal from analog to digital format. The digitized signal is then processed in the time and frequency domains to extract spectral speech features which are used to match the input speech with associated phonemes. According to the invention, possible word choices may be extrapolated from the associated phonemes, and visually displayed in textual representations. The visually displayed text may then be edited and processed into final form. Systems and methods for facilitating the invention are also disclosed.

Description

DESCRIPTION
System and Method for Facilitating Speech Transcription
Background of the Invention
The field of the invention is speech recognition and transcription systems. Conventional speech recognition systems have used several independent approaches with limited success. One approach models the vocal tract, articulation, and acoustic production of speech. A second approach models the acoustic waveform and the spectrum of human speech and uses signal processing methods. A third approach models the human ear and its detection mechanisms through neural network methods. A fourth approach analyzes phonetic features, perception, and linguistic models of human speech.
There is a substantial difference in the performance and capability of the systems which currently exist. Initially, these systems accepted only discrete speech since the then nascent technology was unable to distinguish between individual words in continuous speech. This limitation required users to pause between each word. As a result, discrete speech systems proved impractical to use due to the onerous speech adjustments required.
Efforts to improve upon discrete speech systems resulted in the development of systems which could identify one or more designated words within a continuous stream of words. Although these systems could not transcribe continuous speech, they were adequate for applications with limited functionality, such as remote control programming for audio-visual equipment.
A still higher level of complexity achieved by speech recognition systems was the recognition of every word within a sentence. However, these systems required that sentences be constrained within certain grammatical boundaries. Current efforts in the field of speech recognition have focused on developing speech processors which accept continuous speech while attempting to overcome the above limitations and constraints. These systems perform inadequately when confronted with identifying word boundaries, providing correct spelling, and resolving other word ambiguities.
For example, word boundary issues can arise when any subset of sequential syllables within a multi-syllable word is itself a word. Word boundary problems also arise when the ending syllable or syllables of one word can be joined with subsequent syllables to form another word. Homonyms and multiple spellings of one word also give rise to ambiguities. Therefore, fully automated, accurate end-to-end continuous speech recognition involves intensive data analyses, speech understanding and logical reasoning capabilities. Currently available continuous speech processors thus suffer from transcription errors and other limitations as the requisite technology is currently prohibitively expensive and uneconomical for commercial purposes.
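To make the word-boundary ambiguity concrete, the short Python sketch below enumerates every way a continuous phonetic string can be split into dictionary words; more than one result is exactly the ambiguity described above. The phoneme symbols follow the doubled-letter convention used later in Table 1, but the lexicon and example are hypothetical, not taken from the patent.

```python
# Toy demonstration of word-boundary ambiguity in a continuous phonetic string.

LEXICON = {                      # hypothetical phoneme-string -> spelling lexicon
    "NN AY TT": "night",
    "RR EY TT": "rate",
    "NN AY TT RR EY TT": "nitrate",
}

def segmentations(phonemes, lexicon):
    """Return every way to split a phoneme list into lexicon words."""
    if not phonemes:
        return [[]]
    results = []
    for i in range(1, len(phonemes) + 1):
        prefix = " ".join(phonemes[:i])
        if prefix in lexicon:
            for rest in segmentations(phonemes[i:], lexicon):
                results.append([lexicon[prefix]] + rest)
    return results

spoken = "NN AY TT RR EY TT".split()
print(segmentations(spoken, LEXICON))
# -> [['night', 'rate'], ['nitrate']]  -- two readings of the same sound sequence
```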
Summary of the Invention
The system and method according to the present invention facilitates speech transcription of normally spoken continuous speech without sacrificing accuracy in transcription. Speech is digitally processed to extract its spectral features. These features are then used to distinguish individual phonemes and to generate a string of equivalent machine-readable phonetic symbols. The string of machine-readable phonetic symbols is processed to identify possible word boundaries for each spoken word within the string. In one embodiment, this process is performed one spoken word at a time. In this embodiment, the possible word choices for each spoken word, as delineated by the identified word boundaries, are visually displayed. The term "word choice" as used herein through this application represents both singular and multiple word choices as both possibilities may exist and optionally may be displayed for each spoken word. The representation format can vary depending on the particular language used. For example, if the speech is in the Japanese language, word choices may be represented as a string of Hiragana and Katakana symbols, each corresponding to a machine-readable phonetic symbol, or as Kanji characters, or as a combination of Kanji, Hiragana and Katakana letters. Word choices in English, German, French, or other languages may be represented as alphanumeric text or other appropriate representations. In each case, the proper word choice corresponding to each spoken word can be readily selected, thereby ensuring accurate transcription. As the technology develops, it would also be within the scope of the invention to represent the possible word choices in different languages than that dictated. Moreover, the possible word choices can be displayed in various orders, for example, by alphabetical order, by order of probability, or by order of syllabic length, thereby facilitating word selection. Other display orders, including those discussed later, would be apparent to those skilled in the art and are within the scope of the invention. After the first spoken word within the machine-readable phonetic symbol string is resolved in the above manner, the system proceeds to the next word within the machine-readable phonetic symbol string and iterates the aforementioned process. The process may also use adaptive feedback after each selected word choice to automatically determine the starting word boundary for the next word within the string. Thus, the process always starts at the beginning of the next word, thereby further increasing accurate determination of word boundaries. Moreover, editing tasks such as punctuation, margins, paragraphs, and capitalization, can be performed during this process. Other embodiments process the input speech in larger blocks of speech. For instance, an embodiment can process the input speech for any block of speech which is bracketed by silence intervals, such as a phrase or sentence at one time. Word choices are then presented for each word within the block of speech at one time.
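As one concrete reading of the display orders mentioned above, the sketch below sorts a list of candidate word choices by probability, alphabetically, or by syllable length. The candidate data and the naive syllable counter are hypothetical illustrations, not part of the patent.

```python
# Hypothetical ordering of word choices for display.

def syllable_count(word):
    """Very rough syllable estimate: count groups of vowels (illustrative only)."""
    vowels = "aeiouy"
    count, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)

def order_choices(choices, mode="probability"):
    """choices: list of (word, probability) pairs; returns words in the chosen order."""
    if mode == "probability":
        ranked = sorted(choices, key=lambda c: c[1], reverse=True)
    elif mode == "alphabetical":
        ranked = sorted(choices, key=lambda c: c[0])
    elif mode == "syllables":
        ranked = sorted(choices, key=lambda c: syllable_count(c[0]))
    else:
        raise ValueError(f"unknown display order: {mode}")
    return [word for word, _ in ranked]

candidates = [("transcription", 0.61), ("transition", 0.22), ("trans", 0.17)]
for mode in ("probability", "alphabetical", "syllables"):
    print(mode, order_choices(candidates, mode))
```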
In contrast to conventional methods of speech recognition which attempt to resolve ambiguities, the invention recognizes that the limitations of currently available conventional technology render this task commercially impractical. Accordingly, in one embodiment, the system according to the invention truncates the conventional process and presents possible solutions to the ambiguities in an environment wherein the correct alternative may be readily selected. Thus, according to the invention, ambiguities may be readily and accurately resolved while repetitive tasks, i.e. collecting, converting and analyzing massive speech data and searching for words in a vocabulary set, are performed automatically.
Accordingly, it is an object of the present invention to provide a system and method for facilitating continuous speech transcription. The invention has the further advantages of being both language and speaker independent, in that the invention accommodates any speaker and any spoken language which can be transcribed. Other and further objects and advantages will appear hereinafter.
Brief Description of the Drawings
FIG. 1 is a logical diagram of a preferred embodiment of the invention.
FIG. 2 is a block diagram of speech acceptance and analog-digital hardware suitable for practicing the invention.
FIG. 3 is a block diagram of a spectral processor suitable for practicing the invention.
FIG. 4 is a block diagram of a phoneme labeller suitable for practicing the invention. FIG. 5 is a block diagram of a speech re-synthesizer suitable for practicing the invention.
FIG. 6 is a graph of the performance of a pre-emphasis filter which is suitable for practicing the invention.
FIG. 7 is a graph and equations showing computation of acoustic power.
FIG. 8 is a graph and equation showing peak-to-peak pitch estimation. FIGs. 9A-9B are graphs and equations showing frequency domain processing.
FIGs. 10A-10B are an example of forward pass labelling.
FIG. 11 is an example of backward labelling. Appendix A describes spectral distortion techniques.
Appendix B describes Mel-scale filters.
Appendix C describes weighting techniques.
Appendix D describes forward pass and backward labelling. Appendix E describes ranking word candidates.
Detailed Description of a Preferred Embodiment of the Invention
While the invention will be described in terms of preferred embodiments, it will be appreciated that other embodiments are possible within the scope of the invention.
As shown in FIG. 1, a preferred embodiment of the system according to the invention has a language, user, and mode selector 2, analog-to-digital convertor 4 interacting with an input device 6, a spectral processor 8, phoneme labeller 10 interfacing with phoneme models 14 and preferably, training data 16, a software pre-processor 11, and a word processor 12. In another preferred embodiment, the system may also include a speech re-synthesizer 13. According to the method of the invention, the speaker dictates speech into the system through the input device 6, which may be a microphone, telephone, other line input, or any input device which converts speech into electromagnetic signals. The term, 'electromagnetic signal', as used throughout this application, includes any form of electrical signals, including microwaves, radio signals, and television signals, as well as optical and other signals such as infrared and laser.
At start-up, the speaker identifies him or herself and selects a language and an operational mode. Three of the possible operational modes are training, dictation and display and editing. These three modes are those known to the inventors at the present time. However, other operational modes may become apparent to one skilled in the art as the technology develops and would also be within the scope of the invention. As explained in detail later, the dictation and display and editing modes may be operated concurrently.
In the training mode, the speaker dictates speech into the system. The system then analyzes the speech to generate data representing the speaker's distinctive voice characteristics. This data is stored as training data and is later used by the system to identify the speaker's words when the system is operated in dictation mode. While not required to practice the broad scope of the invention, completion of the training mode in a preferred embodiment greatly facilitates the efficient practice of the invention.
In dictation mode, the speaker dictates speech into the system, using the input device 6. In one embodiment of the invention, the speaker may dictate the speech at a prior time using conventional means such as a tape recorder and the audio or electromagnetic output of the conventional means can be transmitted to the input device 6. The input speech is passed through the analog-digital converter 4, then through the spectral processor 8, and then through the phoneme labeller 10. The output of the phoneme labeller 10 can either be stored for later retrieval and processing, or can be sent to the software pre-processor 11. It will be appreciated that elements of the system may be combined or rendered unnecessary as technological capabilities improve, while remaining within the scope of the invention.
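The dictation-mode flow just described (input device, analog-to-digital conversion, spectral processing, phoneme labelling, software pre-processor) can be pictured as a chain of stages. The Python sketch below is a simplified stand-in, assuming NumPy arrays and a hypothetical list of phoneme models; it illustrates the flow, not the patent's implementation.

```python
import numpy as np

FRAME = 512      # samples per block, per the preferred embodiment
OVERLAP = 128    # 25% overlap between adjacent blocks

def frame_blocks(signal, frame=FRAME, overlap=OVERLAP):
    """Split the digitized signal into overlapping blocks (the sampler stage)."""
    step = frame - overlap
    return [signal[i:i + frame] for i in range(0, len(signal) - frame + 1, step)]

def spectral_features(block):
    """Extract simple spectral features: acoustic power, a pitch cue, and a spectrum."""
    windowed = block * np.hanning(len(block))              # taper before the FFT
    spectrum = np.abs(np.fft.rfft(windowed))
    power = float(np.mean(block ** 2))                     # squared acoustic power
    zero_crossings = int(np.sum(np.diff(np.sign(block)) != 0))
    return {"power": power, "zero_crossings": zero_crossings, "spectrum": spectrum}

def label_phoneme(features, phoneme_models):
    """Hypothetical labeller: choose the model whose reference spectrum is closest."""
    best = min(phoneme_models,
               key=lambda m: np.linalg.norm(features["spectrum"] - m["spectrum"]))
    return best["symbol"]

def dictation_pipeline(digitized, phoneme_models):
    """A/D output -> spectral processor -> phoneme labeller -> phonetic symbol string."""
    return [label_phoneme(spectral_features(b), phoneme_models)
            for b in frame_blocks(digitized)]
```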
In the display and editing mode, editing and other functions may be performed through the software pre-processor 11 and word processor 12, using keyboard, mouse, digitizing pen, voice commands, or any other input-positioning device activated contemporaneously or at a later time by the speaker, by a subsequent user, or by a suitable software/computer combination. The software pre-processor 11 receives the output of the phoneme labeller 10 and works in conjunction with the word processor 12 to display possible word candidates for each spoken word. The displayed word choice is selected by the speaker, a subsequent user, or by an appropriate software/computer combination. The speaker or a subsequent user may also manually input an alternative word choice. This process can then be repeated for the next and subsequent words within the input speech, until the entire input speech has been transcribed. In one preferred embodiment, if a user cannot understand any segment of the spoken input, the system can include a speech re-synthesizer 13 which can reproduce the spoken words. It will be appreciated that within the scope of the invention, the speech input may be stored for later processing. In this embodiment, the speech re-synthesizer can facilitate accurate transcription. Preferred embodiments of the invention are now described in detail.
Training Mode
Preliminary training is preferably performed to initialize the system for each particular speaker or for each time that a particular speaker uses a new language.
During this initialization period, the speaker-specific training data 16 is established to tune the system to the speaker's individual speech characteristics for the particular chosen language. In this preferred embodiment, the system thus adapts to and accommodates a diverse set of speakers regardless of gender, age, accent, dialect, language or any other factor which can contribute to a difference in the pronunciation of any particular word. Initialization only needs to be performed once for any particular speaker for each particular language. The speaker-specific training data 16 can be established independent of the invention or a system according to the invention may be used. In embodiments where the speaker generates training data by using the system, the speaker dictates a pre-established reference set of words into the system. Various reference sets of words can be used for any particular language. Methods for determining these reference sets are well-known to those skilled in the art and are also set forth in the publication, Fundamentals of Speech Processing by Rabiner and Juang, Prentice Hall, 1993, which is incorporated herein by reference. The dictated speech is then processed through an analog-to-digital converter 4, spectral processor 8, and phoneme labeller 10 shown in FIG. 1. The analog-to-digital convertor 4 and spectral processor 8 process the dictated speech in the same manner during a training mode as during a dictation mode and are described in more detail later for the dictation mode.
The phoneme labeller 10 is then used to generate training data 16 by using a spectral distortion measure to compare the speaker's speech against reference speech samples of the pre-determined set of words. Spectral distortion techniques are well-known to those skilled in the art and are also documented in Discrete-Time Processing of Speech Signals by Deller, Proakis, and Hansen, which is incorporated herein by reference and is also attached as Appendix A. Training data can also be generated independent of the invention through other techniques well known to those skilled in the art.
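Appendix A is not reproduced here. As one common example of a spectral distortion measure, the sketch below compares a speaker's magnitude spectrum against a reference with an RMS log-spectral distance, assuming NumPy; it illustrates the idea and is not necessarily the specific measure used in the patent.

```python
import numpy as np

def log_spectral_distortion(speaker_spectrum, reference_spectrum, eps=1e-10):
    """RMS log-spectral distance (in dB) between two magnitude spectra of equal length."""
    s = 20.0 * np.log10(np.abs(speaker_spectrum) + eps)
    r = 20.0 * np.log10(np.abs(reference_spectrum) + eps)
    return float(np.sqrt(np.mean((s - r) ** 2)))

def closest_reference(speaker_spectrum, references):
    """Pick the reference word sample with the smallest spectral distortion."""
    return min(references,
               key=lambda ref: log_spectral_distortion(speaker_spectrum, ref["spectrum"]))
```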
Preferably, the training data 16 is stored into a data base. The data base may be organized in a variety of suitable structures; however, a preferred embodiment uses the same structure as in DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM NISTIR 4930, NTIS, February, 1993, which structure is shown in Example 1:
Example 1: database hierarchy:
/<corpus>/<usage>/<dialect>/<sex>/<speaker>/<word id>.<file>
corpus := language
usage := train/test
dialect := d1 ... dn
sex := m/f
speaker := (initials)(digit)
word id := <text type><word number>
text type := sa/si/sx
word number := 1...n words
file := wav/txt/wrd/phn
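A small helper for composing file names under the Example 1 hierarchy might look like the following; the field values in the usage example and the placement of the corpus at the filesystem root are assumptions for illustration only.

```python
from pathlib import Path

def training_file(corpus, usage, dialect, sex, speaker, text_type,
                  word_number, ext, root=Path("/")):
    """Compose a path following the Example 1 hierarchy:
    /<corpus>/<usage>/<dialect>/<sex>/<speaker>/<word id>.<file>"""
    word_id = f"{text_type}{word_number}"
    return root / corpus / usage / dialect / sex / speaker / f"{word_id}.{ext}"

# e.g. /japanese/train/d1/f/ab1/sa1.wav  (all field values hypothetical)
print(training_file("japanese", "train", "d1", "f", "ab1", "sa", 1, "wav"))
```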
Dictation Mode
Speech Acceptance and Analog-Digital conversion.
In ordinary use, that is, after the system has received and processed training data for the speaker and the chosen language, the system of this preferred embodi¬ ment is used in dictation mode. As indicated in FIG. 2, speech is received into the system via any of a variety of conventional devices, including but not limited to static or dynamic microphones 18, telephones 22, or other line input 20. As will be appreciated by those skilled in the art, speech input could also be transmitted and received by other technology via electromagnetic signals.
If microphones are used, they preferably have noise limiting capability. As shown in FIG. 2, appropriate amplifiers 24 and interfaces 26 are used to connect the receiving devices with a low-pass analog filter 28 and to ensure a fixed level output compatible with digital equipment. These devices are readily available. Moreover, one skilled in the art would know how to choose appropriate amplifiers and interfaces. The text, Digital Signal Analysis, by Stearns, Hayden 1975 can also be referenced to determine the compatibility of these devices.
Once received, the signal representing the input speech is passed through a low pass analog filter 28 which eliminates frequencies that can alias into the acoustic signal during subsequent processing. One preferred embodiment uses a 3rd order Butterworth analog filter with a 10 kHz cutoff and 6 dB/octave roll-off as the low-pass filter of choice.
The signal is then passed through an analog-to-digital converter 30. While a variety of analog-to-digital converters may be used for this purpose, a preferred embodiment uses a Motorola 56ADC16 with sixteen-bit resolution at 20 kHz and a continuously adjustable gain over a 20 dB range.
Phonetic Feature Detection
The digitized signal is passed through a spectral processor, as shown in FIG. 3, to extract spectral speech features. Before describing this process in detail, it is useful to discuss the problem of environmental or "background" noise which is commonly encountered in speech recognition. Two classes of problems contribute to background noise. One class is attributed to the speaker and consists of sound artifacts such as lip smacks, heavy breathing, mouth clicks, and nasal pops. Although such artifacts are generated inadvertently, they often have an energy level comparable to speech. As will be further discussed, the invention preferably models speaker artifacts and detects them along with phoneme recognition so that they may be removed during subsequent processing. A second class of noise problems arises from the ambient noise environment. Occasionally, non-speaker generated background noise, such as a bell, whistle or other sound, may interfere with and mask the speech. To alleviate this problem, the system can use a noise limiting microphone for receiving the dictated speech. Other means of eliminating background noise will be apparent to those skilled in the art.
Additionally, or independently, a running average of the ambient background noise during periods of silence can be kept so that sudden non-vocal tract background noises can be easily classified and eliminated. As will be explained in more detail, a preferred embodiment uses both a noise limiting microphone and continuous modelling of background noise. These techniques avoid the problems commonly experienced by other speech processing systems that pre-process the signal to remove noise, as the latter lose or distort the speech.
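As a minimal sketch of the running-average approach described above, the following assumes the ambient level is tracked as an exponentially weighted mean of block power and is updated only when a block has been classified as silence; the default level and smoothing constant are illustrative assumptions, not values from the specification.

    class NoiseModel:
        """Running average of ambient background power, updated only during silence."""

        def __init__(self, default_level=1e-4, alpha=0.05):
            self.level = default_level   # default noise level used before any silence is observed
            self.alpha = alpha           # smoothing constant for the running average

        def update(self, block_power, is_silence):
            if is_silence:
                # Blend the new observation into the running estimate of background power.
                self.level = (1.0 - self.alpha) * self.level + self.alpha * block_power
            return self.level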
In a preferred embodiment, the digitized signal is first passed through a pre-emphasis filter 32 which works in connection with a noise model 34 to eliminate background noise. The noise model 34 is initially set to a default noise level and is adaptively updated during subsequent processing of the input speech. This adaptive feature is later discussed in connection with frequency domain processing. In one preferred embodiment, the pre-emphasis filter 32 is a Finite Impulse Response (FIR) bandpass digital filter with zero phase shift and unity gain. The derivation of the bandpass coefficients for such a filter is shown in Example 2.
Example 2.
A 20 kHz sampling rate with a bandpass from 100 Hz to 6000 Hz is assumed. The sample signal has a 60 Hz hum component, a 7000 Hz whistle, random noise, and an acoustic signal at 1000 Hz.

Filter design characteristics:
  Lower cutoff frequency: wlo = 100
  Upper cutoff frequency: whi = 6000
  Sample period: dt = 0.00005; rate = 1/dt = 2 x 10^4
  Number of coefficients: nc = 45

Filter coefficients computation:
  whir = whi * 2 * pi = 3.77 x 10^4
  wlor = wlo * 2 * pi = 628.319
  c0 = dt * (whir - wlor) / pi
  c_i = (sin(i * whir * dt) - sin(i * wlor * dt)) / (pi * i), for i = 1 .. nc

Sample signal:
  nn = 511; nt = 2; noise = 0.5
  f0 = 60; f1 = 1000; f2 = 7000
  a0 = 0.5; a1 = 1.0; a2 = 0.5
  input_i = sum over j = 0 .. nt of a_j * sin(i * dt * 2.0 * pi * f_j) + rnd(noise), for i = 0 .. nn
  output_i = 0; error_i = 0
  signal_i = sin(i * dt * 2.0 * pi * f1)
  output_k = c0 * input_k + sum over j = 1 .. nc of [ (c_j * input_(k+j)) + (c_j * input_(k-j)) ], for k = nc .. nn - nc
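The computation of Example 2 can be restated numerically. The sketch below (Python/NumPy) builds the 45-coefficient FIR bandpass from the formulas above and applies it with the symmetric sum used for output_k; it is an illustration of the published example, not production filter-design code, and the random-noise term is only a stand-in for the example's rnd() call.

    import numpy as np

    dt, nc = 0.00005, 45                       # 20 kHz sampling, 45 coefficients
    wlo, whi = 100.0, 6000.0                   # bandpass edges in Hz
    wlor, whir = 2 * np.pi * wlo, 2 * np.pi * whi

    c0 = dt * (whir - wlor) / np.pi
    j = np.arange(1, nc + 1)
    c = (np.sin(j * whir * dt) - np.sin(j * wlor * dt)) / (np.pi * j)

    # Test signal: 60 Hz hum + 1000 Hz tone + 7000 Hz whistle + random noise
    t = np.arange(512) * dt
    x = (0.5 * np.sin(2 * np.pi * 60 * t) + 1.0 * np.sin(2 * np.pi * 1000 * t)
         + 0.5 * np.sin(2 * np.pi * 7000 * t) + 0.5 * np.random.rand(t.size))

    y = np.zeros_like(x)
    for k in range(nc, x.size - nc):           # symmetric (zero phase) FIR sum
        y[k] = c0 * x[k] + np.sum(c * (x[k + 1:k + nc + 1] + x[k - nc:k][::-1]))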
The performance of the pre-emphasis filter 32 is shown in FIG. 6. In FIG. 6, the upper trace (a) is the acoustic input signal. The middle trace (b) is the result of the pre-emphasis filter. The third trace (c) is the desired signal embedded in the acoustic input. The lower trace (d) is the absolute difference of the acoustic signal and the filter output.
The signal is then sampled into blocks of data as indicated by the sampler 36. Since the human vocal mechanism modulates the slowly changing speech signal onto a higher frequency sound wave, it is necessary to sample the acoustic waveform at more than 10,000 times per second to capture this speech wave. In contrast, the movement of the tongue, jaw, lips and other vocal articulators changes at the far slower rate of less than 100 times per second. This physical situation is exploited by grouping acoustic data into blocks of approximately one hundredth of a second to isolate phonetic features and to identify noise sources.
Taking the above considerations into account, a preferred embodiment samples the signal into blocks of 512 samples. Oversampling can be used to increase accuracy during subsequent processing. Thus, a preferred embodiment uses a 25%, or 128-sample, overlap between adjacent blocks. As can be readily appreciated by one skilled in the art, numerous other data block sizes and overlap ranges can be used. For example, blocks of 256, 1024 or 2048 samples may be used. Moreover, any overlap range between 0 and 50% may be used between adjacent blocks.
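A sketch of the framing step, assuming the 512-sample blocks and 128-sample overlap described above; the 384-sample hop follows directly from those two figures, and the function name is illustrative.

    import numpy as np

    def frame_signal(signal, block_size=512, overlap=128):
        """Split a 1-D signal into blocks where adjacent blocks share `overlap` samples."""
        hop = block_size - overlap                        # 384 samples between block starts
        n_blocks = 1 + max(0, (len(signal) - block_size) // hop)
        return np.stack([signal[i * hop:i * hop + block_size] for i in range(n_blocks)])

    blocks = frame_signal(np.random.randn(20000))         # roughly one second at 20 kHz
    print(blocks.shape)                                   # (number_of_blocks, 512)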
While sampling may be accomplished with hardware, microcode, or other methods known to those skilled in the art, a preferred embodiment uses microcode. As technology advances, other sampling methods and devices suitable for practicing the invention may become available.
The sampled signal is then processed in the time domain, as indicated by element 38, to extract the spectral speech features of acoustic power and the peak-to-peak pitch. Acoustic power indicates the presence of speech or silence intervals that naturally occur at the end of sentences, phrases, or words. To determine acoustic power, a short time energy estimate is made for each sample block by averaging either the magnitude or square of the signal within the block, otherwise referred to as absolute power and squared power, respectively. Next, the peak-to-peak pitch is estimated by summing the signal zero crossings. Although this process is well-known to one skilled in the art, an example of time domain processing is given in Figures 7-8.
In FIG. 7, a sample block of 512 samples is shown with a mean value of 3.298 x 10^4. Plot (a) represents the sample block signal. Equations (b) and (c) are used to compute the absolute power and squared power, respectively.

In FIG. 8, the peak-to-peak pitch is estimated by summing the signal zero crossings of the estimated signal shown in plot (d), as shown by equation (e).
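Under the definitions above, the time-domain features for one block reduce to two short computations: a short-time energy estimate (absolute or squared power) and a zero-crossing count as a rough indicator of periodicity. The sketch below assumes the block is a NumPy array and that the DC offset is removed before counting crossings; it is not tied to the specific figures referenced.

    import numpy as np

    def time_domain_features(block):
        """Return (absolute power, squared power, zero-crossing count) for one sample block."""
        x = block - np.mean(block)                # remove the DC offset before counting crossings
        abs_power = np.mean(np.abs(x))            # average magnitude of the block
        sq_power = np.mean(x ** 2)                # average squared signal
        signs = np.signbit(x).astype(np.int8)
        zero_crossings = int(np.sum(np.abs(np.diff(signs))))
        return abs_power, sq_power, zero_crossings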
The signal is also passed through acoustic band pass filters 40 to identify the relative signal power in each acoustic band, otherwise referred to as the spectral pattern. This information can be used later to assist in phoneme identification. In a preferred embodiment, the band pass filters are selected based on the Mel scale. The Mel scale is a logarithmic-based scale which is more fully described in the publication Advanced Algorithms and Architectures for Speech Understanding, ESPRIT Research Report Project 26, Vol. 1, Pirani, 1990, and is herein incorporated by reference, and is also attached as Appendix B.
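As a brief sketch of how Mel-spaced band edges can be laid out, the following assumes the common 2595*log10(1 + f/700) form of the Mel scale; the number of bands and the frequency range are illustrative choices, not values given in the specification.

    import numpy as np

    def mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def inv_mel(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_band_edges(f_low=100.0, f_high=8000.0, n_bands=20):
        """Band edges equally spaced on the Mel scale, returned in Hz."""
        return inv_mel(np.linspace(mel(f_low), mel(f_high), n_bands + 1))

    print(np.round(mel_band_edges(), 1))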
The estimated formant frequencies are determined by processing the signal in the frequency domain as shown by element 42. An example of frequency domain processing is given in Figures 9A-9B. In Figure 9A, plot (a) represents the same sample block as plot (a) in Figure 7. The sample block can be first tapered using a tapering function. Numerous tapering functions, including a Hanning window as shown in Equation (b), can be used to taper the 512 elements of the sample block, and are well-known to those skilled in the art. Tapering reduces high frequency noise in the subsequent Fast Fourier Transform (FFT). Since the signal was previously oversampled by 256 elements (128 elements at each end), tapering promotes accuracy, rather than loss of data. Plot (c) represents the tapered sample block. As would be appreciated by one skilled in the art, numerous other tapering functions can be used.
An FFT is then performed on the tapered sample block and the power spectrum is estimated using the square of the resulting complex frequency coefficients, as shown by equation (d) in FIG. 9B. Plot (e) provides a graph of the resultant power spectrum. The log of each element is then computed, as shown in plot (f). Since human speech usually does not exist at frequencies greater than 8000 Hz, the frequency spectrum in plot (f) can be truncated to reduce high frequency noise. The cut-off range can be as low as 7500 Hz or as high as 8500 Hz. In the embodiment shown in plot (g), the frequency is truncated at 7500 Hz. Next, an inverse FFT is performed to determine the cepstral coefficients, as shown in plot (h). The resonant or formant frequencies of the vocal tract are then extracted from the low order peaks of the cepstrum in plot (h).
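A compact sketch of the frequency-domain chain for one 512-sample block: Hanning taper, FFT power spectrum, log, truncation near 7500 Hz, and an inverse FFT to obtain cepstral coefficients. The final peak-picking line is a simplified stand-in for formant extraction from the low-order cepstrum, and the small floor added before the log is an implementation assumption.

    import numpy as np

    def cepstrum_block(block, fs=20000.0, cutoff_hz=7500.0):
        """Taper a block, estimate its log power spectrum, truncate it, and return the cepstrum."""
        w = np.hanning(len(block))
        power = np.abs(np.fft.rfft(block * w)) ** 2
        log_power = np.log(power + 1e-12)                  # small floor avoids log(0)
        freqs = np.fft.rfftfreq(len(block), d=1.0 / fs)
        log_power = log_power[freqs <= cutoff_hz]          # drop energy above the cut-off
        return np.fft.irfft(log_power)                     # cepstral coefficients

    cep = cepstrum_block(np.random.randn(512))
    # Low-order cepstral peaks (excluding the zeroth coefficient) hint at vocal-tract resonances.
    print(np.argsort(cep[1:30])[-3:] + 1)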
Note that if a silence interval is detected during time domain processing, the data from frequency domain processing necessarily represents only ambient background noise. The system and method according to the invention update the noise model 34 with this data. Thus, the noise model 34 reflects a semi-continuous model of background ambient noise. This adaptive feedback enhances subsequent processing of speech through the pre-emphasis filter 32. Thus, in this embodiment, the spectral processor extracts the following spectral speech features: acoustic power, spectral pattern, and formant frequencies.
Phoneme Identification

The extracted spectral speech features are passed to a phoneme labeller 10 as shown in FIG. 4. As would be known to one skilled in the art, the previously extracted spectral feature of relative signal power can be used to classify phonemes into one of the following phonetic categories: vowels, diphthongs, liquid or glide semivowels, nasal consonants, voiced or unvoiced fricatives, affricates, whisper, and voiced or unvoiced stops. Table 1 describes a possible phoneme classification scheme for forty standard phonemes used in Western/European languages:
Table 1:

vowels, divided into:
  front -- IY, IH, EH, AE
  mid -- AA, ER, AH, AX, AO
  back -- UW, UH, OW
diphthongs -- AY, OY, AW, EY
semivowels, including:
  liquids -- WW, LL
  glides -- RR, YY
nasal consonants -- MM, NN, NG
stops:
  voiced -- BB, DD, GG
  unvoiced -- PP, TT, KK
fricatives:
  voiced -- VV, TH, ZZ, ZH
  unvoiced -- FF, TH, SS, SH
affricates -- JH, CH
whisper -- HH
By classifying phonemes into phonetic categories, the individual phoneme identification problem can be greatly simplified. For instance, under the classification scheme given in Table 1, each phonetic category contains at most five elements out of a possible forty. Moreover, these elements are usually well separated by physical features of the vocal tract so that subsequent identification of an individual phoneme within a phonetic category is enhanced.

To identify individual phonemes for each signal block, a 'forward pass' 44 is made through each signal block and phoneme candidates are identified by comparing against existing features in the phoneme models 14. The phoneme models 14 are pre-determined reference sets of data for each language. Techniques for generating these sets are well-known. For instance, the TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM, NISTIR 4930, NTIS, February 1993, is one such set which is readily available to one skilled in the art. A score is kept for each phoneme candidate based on its probability of being the correct phoneme spoken.
In a preferred embodiment, scoring is accomplished by using the following fuzzy logic LR membership functions:

  Left(z) = 1/(1 + z*z)   (left side function)
  Right(z) = 1/(1 + z*z)   (right side function)

The membership function is described as a four-point trapezoid ww(a, b, c, d, y, x), evaluated piecewise so that the Left function applies below the lower knee (x < b), the Right function applies above the upper knee (x > c), and the membership is full between them, together with the disjunction:

  mx(a, b) = if(a > b, a, b)
  mn(a, b) = if(a < b, a, b)
  wx(a1, b1, c1, d1, a2, b2, c2, d2, y, x) = ww(mx(a1, a2), mx(b1, b2), mn(c1, c2), mn(d1, d2), y, x)
These functions are well-known to those skilled in the art and are also discussed in Fuzzy Set Theory and Its Applications by Zimmermann, 2d, Kluwer, 1991, which is herein incorporated by reference. In this preferred embodiment, each phoneme candidate feature and sample feature are used to form a normalized membership function of the sample in the phoneme database. Thus, all phonemes have a possibility of assignment in each sample block. The features can be weighted to maximize the probability of phoneme identification. The weighting process can be accomplished through a variety of techniques which are well-known to those skilled in the art, some of which are described in Discrete-Time Processing of Speech Signals, which is incorporated herein by reference and is also attached as Appendix C. Preferably, the determination of the weighting function is based on experimental scoring derived from the training data 16 for the speaker. As discussed later, the weighting function can be updated through adaptive feedback from the user. An example of forward pass labelling is given in FIGs. 10A-10B.
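Because the full piecewise definition of the trapezoid appears in the original only as a figure, the sketch below is an assumption: a conventional LR trapezoid with knees at b and c, the stated side function 1/(1 + z*z), and the mx/mn disjunction combining two trapezoids. The feature values at the end are illustrative.

    def side(z):
        """Left/Right side function from the specification: 1 / (1 + z*z)."""
        return 1.0 / (1.0 + z * z)

    def ww(a, b, c, d, y, x):
        """Four-point trapezoid membership (assumed piecewise form): plateau of 1 between b and c."""
        if x < b:
            return side((b - x) / (y * max(b - a, 1e-9)))
        if x > c:
            return side((x - c) / (y * max(d - c, 1e-9)))
        return 1.0

    def wx(a1, b1, c1, d1, a2, b2, c2, d2, y, x):
        """Disjunction of two trapezoids via the mx/mn combination given in the text."""
        return ww(max(a1, a2), max(b1, b2), min(c1, c2), min(d1, d2), y, x)

    # Hypothetical use: score a formant-frequency sample against a phoneme candidate's trapezoid.
    print(ww(200.0, 300.0, 500.0, 650.0, 1.0, 430.0))   # inside the plateau -> 1.0
    print(ww(200.0, 300.0, 500.0, 650.0, 1.0, 700.0))   # beyond the upper knee -> below 1.0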
As previously mentioned, speaker artifacts and sudden, loud background noises can interfere with the input speech so that the system is unable to find a reasonable phoneme match for the received sound. In a preferred embodiment, such interference is identified and the speaker is requested to repeat the masked speech.
When the extracted spectral speech features indicate a silence interval, a "speech segment" bracketed by the silence interval and its initial starting point has been defined. This event triggers backward labelling 46 of the most likely phoneme candidates for the speech segment. Simultaneously, spectral processing 8 and forward pass labelling 44 continue for the next and subsequent speech segments within the speech signal.
A silence interval could be defined as any time interval longer than 1/10 of a second in which the ambient noise level does not exceed 10 dB over the average ambient background noise level. Shorter time intervals and smaller decibel increases may also be used: for instance, a shorter time period for speakers who dictate too rapidly to pause for more than 1/10 of a second, or a smaller increase for speakers who speak too softly to develop a 10 dB increase over the ambient background noise level. Thus, it is also within the broad scope of the invention for embodiments to define silence intervals as short as 1/15 of a second, or to require only a 9 dB increase. Other embodiments may require increases of over 12 dB for speakers who are particularly loud. Other combinations are readily apparent to those skilled in the art.
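A minimal sketch of the silence test defined above, assuming block power is compared in decibels against the running background estimate; the 1/10-second duration and 10 dB margin are the figures given in the text, and both are parameters so the shorter or softer variants can be tried. The duration is approximated from the block count.

    import numpy as np

    def is_silence(blocks, background_power, fs=20000.0, block_size=512,
                   min_duration_s=0.1, margin_db=10.0):
        """True if the run of blocks spans at least min_duration_s and never rises
        more than margin_db above the average ambient background power."""
        if len(blocks) * block_size / fs < min_duration_s:
            return False
        for block in blocks:
            ratio = np.mean(np.asarray(block, dtype=float) ** 2) / background_power
            if 10.0 * np.log10(ratio + 1e-12) > margin_db:
                return False
        return True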
Backward labelling for these subsequent speech segments is initiated once processing for the instant speech segment is complete. Notably, partitioning of the speech into speech segments bracketed by silence intervals promotes accuracy and facilitates parallel processing. Moreover, bracketing further facilitates the display of full phrases in embodiments which process speech in blocks of phrases at one time. These embodiments are described later.
Backward labelling identifies the N most possible phoneme sequences within the speech segment. Each phoneme candidate is ranked by its possibility of assignment in each sample block, based on conventionally known methods of phoneme identification. These methods are discussed in Discrete-Time Processing of Speech Signals, which is incorporated herein by reference and is also attached as Appendix D. As shown in FIG. 11, the candidates are sorted to maximize the possibility of phoneme identification.
Once the phoneme candidates are identified, the system generates an equivalent machine-readable phonetic symbol for each phoneme candidate, as shown by element 48. This process utilizes look-up tables 15 which associate machine-readable phonetic symbols with each phoneme model 14. The machine-readable phonetic symbols are composed of machine-readable codes. Thus, by cross-referencing each phoneme candidate with the equivalent phoneme model and its associated machine-readable phonetic symbol, a corresponding machine-readable phonetic symbol is generated for each phoneme candidate. Although various machine-readable codes can be used, a preferred embodiment uses ASCII codes.
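A sketch of the look-up step only: a table keyed by phoneme label that returns an ASCII machine-readable phonetic symbol. The labels follow Table 1, but the ASCII spellings chosen here are illustrative assumptions rather than codes defined by the specification.

    # Fragment of a phoneme-to-ASCII look-up table (entries and spellings are illustrative).
    PHONEME_TO_ASCII = {
        "IY": "iy", "IH": "ih", "EH": "eh", "AE": "ae",
        "AA": "aa", "UW": "uw", "MM": "m",  "NN": "n",
        "SS": "s",  "SH": "sh", "JH": "jh", "HH": "hh",
    }

    def label_candidates(candidates):
        """Map ranked phoneme candidates to their machine-readable phonetic symbols."""
        return [PHONEME_TO_ASCII.get(p, "?") for p in candidates]

    print(label_candidates(["IY", "SS", "MM"]))   # ['iy', 's', 'm']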
Output to Software Pre-Processor

In one preferred embodiment, the machine-readable phonetic symbols are directly output to a software pre-processor 11 as shown in FIG. 1. The software pre-processor 11 can be separately joined to or integrated with a word processor 12 running, for example, on a personal computer, on stand-alone equipment or as part of a network or central computer system. The software pre-processor 11 may be connected to the phoneme labeller 10 via cable connection or any conventionally available wireless communication device such as infrared transmittal, laser, radio, microwaves, or television. The process then continues with the editing mode described later.

In another preferred embodiment, the machine-readable phonetic symbols can be stored on any conventionally available device which stores machine-readable data, such as, for example, computer data disks, hard drives, flash, static or dynamic memory, tape, or CD-ROM. Once the machine-readable phonetic symbols are stored, the process can be re-started at a later time by loading the machine-readable phonetic symbols into the software pre-processor 11 and continuing with the display and editing mode described below.

Another preferred embodiment automatically stores the machine-readable phonetic symbols while outputting the machine-readable phonetic symbols to the software pre-processor 11.
Display and Editing Mode

Once the software pre-processor 11 starts to receive the machine-readable phonetic symbols corresponding to the last spoken speech, the system can be operated in the display and editing mode, utilizing the software pre-processor 11, word processor 12, and speech re-synthesizer 13, as described below. In one embodiment, the display and editing mode occurs after the speaker has finished dictating, as the machine-readable phonetic symbols which represent the dictated speech can be stored for later processing in the display and editing mode. However, it is important to note that the speaker may continue dictating while words spoken earlier are being processed through the display and editing mode. For example, a secretary may operate the display and editing mode while the speaker continues to dictate. Alternatively, the speaker may pause during dictation to edit prior dictation.
Phonetic Symbol String Editing
In a preferred embodiment, the software pre-processor 11 provides an option to edit the machine-readable phonetic symbol string. This option is particularly useful for written languages which can be readily comprehensible as phonetic symbols, such as Japanese Hiragana and Katakana. To edit the machine-readable phonetic symbol string, the software pre-processor 11 outputs the machine-readable phonetic symbol string to the word processor 12, where it is displayed. If a wrong machine-readable phonetic symbol or set of symbols is detected by the speaker or subsequent user, the symbol or symbols can be manually overridden by keyboard, mouse, digitizing pen, voice command or any other input-positioning device such as a track ball, joystick or touchscreen. For voice input, the user manually selects the subject symbol or symbols using either a keyboard, mouse, digitizing pen, or other input-positioning device, and then re-dictates the input speech as desired. This speech is then processed as before, i.e. through the analog-to-digital converter 4, spectral processor 8, phoneme labeller 10 and software pre-processor 11, to update the machine-readable phonetic symbol string.
If the input speech generates more than one possible machine-readable phonetic symbol match during backward labelling 46, each possibility can be displayed so that the speaker or subsequent user can readily select the correct machine-readable phonetic symbol. A preferred embodiment only displays the two most likely machine-readable phonetic symbols.
Thus, the machine-readable phonetic symbol string may be edited using the software pre-processor 11 and word processor 12. Certain languages, such as English, German, and French, differ from Japanese in that no corresponding syllabic symbols, such as Hiragana and Katakana, exist for the former. In these cases, a universal phonetic symbol representation can be used. Alternatively, the speaker or subsequent user can opt to skip machine-readable phonetic symbol string editing altogether and proceed directly to locating word boundaries.
Identifying word boundaries

The software pre-processor 11 identifies word boundaries within the machine-readable phonetic symbol string. Each substring of machine-readable phonetic symbols delineated by the identified word boundaries is combined into a possible word choice, which is represented in machine-readable code. While this process can be performed for the entire machine-readable phonetic symbol string or any segment thereof at one time, a preferred embodiment identifies word boundaries on a word-by-word basis. In this preferred embodiment, the software pre-processor 11 starts at the beginning of the machine-readable phonetic symbol string and identifies possible word boundaries within the string for the first spoken word. Identification of possible word boundaries for subsequent spoken words occurs later, and can be accomplished, in part, by using adaptive feedback to determine the next starting word boundary. This embodiment is described in detail later.
As an example of identifying word boundaries for Japanese speech, the software pre-processor 11 identifies every Kanji character which can be represented by any substring of machine-readable phonetic symbols with a beginning word boundary at the start of the machine-readable phonetic symbol string. These Kanji characters are each considered a possible word choice. Techniques for combining machine-readable phonetic symbols into words are commonly known. For instance, commercially available Japanese word processors operate on these principles to convert Hiragana and Katakana symbols into Kanji characters.
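One way to picture word-boundary identification is as prefix matching of the phonetic symbol string against a pronunciation dictionary. The sketch below assumes the string is already a list of machine-readable symbols; the tiny dictionary is hypothetical and merely stands in for the conversion data that commercial word processors use.

    # Hypothetical pronunciation dictionary: phonetic-symbol tuples -> written word choices.
    LEXICON = {
        ("k", "a"): ["ka"],
        ("w", "a"): ["wa"],
        ("k", "a", "w", "a"): ["kawa"],
    }

    def word_choices_at(symbols, start, max_len=8):
        """Return (end boundary, choices) pairs for every dictionary entry starting at `start`."""
        matches = []
        for end in range(start + 1, min(start + max_len, len(symbols)) + 1):
            key = tuple(symbols[start:end])
            if key in LEXICON:
                matches.append((end, LEXICON[key]))
        return matches

    print(word_choices_at(["k", "a", "w", "a", "g", "a"], 0))
    # [(2, ['ka']), (4, ['kawa'])] -- two possible end boundaries for the first spoken word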
The possible word choices delineated by the identified word boundaries are then ranked in order of probability by reference to linguistic usage, including such factors as grammatical and contextual syntax.
Techniques for performing this task are well-known to those skilled in the art and are also described in the aforementioned text, Discrete-Time Processing of Speech Signals, which is herein incorporated by reference and is also attached as Appendix E. Alternatively, the possible word choices may be ranked by alphabetical or reverse alphabetical order, or by increasing or decreasing syllable length, where syllable length is measured by the number of syllables within a word.
An alternative method of ranking the word choices is with reference to the prior usage of words in the dictated speech. In this method, previously dictated and selected words, as later described, are accessed and if matches are found, the word choices presented reflect the previous usage. The word choices can then be ranked, for example, by their most recent usage, by the time elapsed between the current words and their prior usage (i.e. age), or by frequency of usage.
The possible word choices are then output, in machine-readable form, to the word processor 12, or are stored for future use.
Visually Representing Possible Word Choices

The possible word choices are then visually represented. In a preferred embodiment, the word processor 12 possesses all the features commonly available on commercial word processors. The word processor 12 displays the possible word choices in representations readily comprehensible to the speaker or a subsequent user. The representation format may vary depending on the language used. For instance, in Japanese, the word processor can represent each possible word choice as a Kanji character, Hiragana and Katakana representations, or a combination of the three. For English, German, French and other similar languages, the possible word choices can be represented as alphanumeric text or other appropriate representations.
Numerous display orders are also possible. For instance, the possible word choices can be displayed in order of probability, alphabetical order, reverse alphabetical order, by increasing or decreasing syllable length, where syllable length is measured by the number of syllables in a word, by the most recent usage, by the time elapsed between the current word and the word choice's prior usage (i.e. age), or by frequency of usage. In a preferred embodiment, the most probable word choice is displayed separate from the remaining word choices and the remaining possible word choices are then displayed as two independent sequences. One of these sequences ranks the remaining word choices by increasing syllable length and the other sequence is ranked by alphabetical order.
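A sketch of the preferred arrangement just described: the most probable choice set apart, with the remaining choices arranged as two independent sequences, one by increasing syllable count and one alphabetical. The word list and syllable counts are arbitrary placeholders; a real system would already know syllable counts from the phonetic string.

    def arrange_choices(ranked_choices, syllables):
        """ranked_choices: word choices ordered by probability; syllables: word -> syllable count."""
        top, rest = ranked_choices[0], ranked_choices[1:]
        by_length = sorted(rest, key=lambda w: syllables.get(w, 0))   # increasing syllable length
        by_alpha = sorted(rest)                                       # alphabetical order
        return top, by_length, by_alpha

    top, col1, col2 = arrange_choices(
        ["animal", "zebra", "antelope"],
        {"animal": 3, "zebra": 2, "antelope": 3},
    )
    print(top)    # most probable choice, displayed on the first line
    print(col1)   # ['zebra', 'antelope'] -- by increasing syllable length
    print(col2)   # ['antelope', 'zebra'] -- alphabetical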
There are numerous methods by which the most probable word choice can be displayed separate from the remaining words. For instance, the most probable word choice can be displayed in the first line and the two independent sequences of remaining words as separate columns below the first line. Alternatively, the most probable word choice can be displayed in bold format as compared to the remaining words. Other variations are readily apparent to one skilled in the art.

Word Selection
The word processor 12 further provides the speaker, a subsequent user, or a suitable software/computer combination with functionality to readily select any displayed word choice or to manually input a word. These options may be similar to spell checking features commonly available on conventional word processing software packages. The selection process may be performed by any one of numerous means, including but not limited to voice command, mouse, digitizing pen, keyboard or any other input-positioning device. Insertion of punctuation and similar tasks can be accomplished at this time either manually, or through macros or voice commands.
A preferred embodiment further provides an option whereby the speaker or a subsequent user may call up the machine-readable phonetic symbol substring associated with any displayed word. The string may then be edited in the same manner as before, after which the software pre-processor 11 re-translates the modified machine-readable phonetic symbol substring into a word and displays the word for further editing. Alternatively, the user may manually select word boundaries within the recalled machine-readable phonetic symbol substring, which the software pre-processor 11 then uses to re-combine the machine-readable phonetic symbols into words which are displayed for acceptance or further editing.
It is recognized that someone other than the original speaker (referred to herein as "subsequent user") may use the system to actually perform the editing process. Thus, a preferred embodiment of the system has the additional capability of re-synthesizing the sound represented by any speech segment so that the user can hear what speech was actually dictated.
As shown in FIG. 5, the speech re-synthesizer first uses a digital signal generator 50 which receives machine-readable phonetic symbols from the software pre-processor 11 and utilizes the look-up tables 15, and if desirable, training data 16, to convert the machine-readable phonetic symbols into a digital signal. This digital signal is fed to a digital-to-analog converter 52, and then to a speaker driver 54 connected to an audio speaker 56. Each of these devices is commercially available.
Accordingly, the speaker need not personally perform the editing process since any ambiguities can be understood by any subsequent user, especially in embodiments incorporating re-synthesized speech processes. It is recognized that this particular re-synthesis feature can be implemented for uses other than speech transcription. For instance, storing speech as an analog signal consumes a relatively large amount of storage media. Even a digitized speech signal requires a substantial amount of storage space. However, storing speech as machine-readable phonetic symbols requires relatively little space. Thus, the above speech re-synthesis technique constitutes an ideal speech compression method which is also readily adapted to telephone answering machines and other voice-storage applications.
Adaptive Feedback
In a preferred embodiment, the word processor 12 utilizes adaptive feedback so that information as to a selected word is incorporated into the selection of the next starting word boundary. After the correct word choice has been selected from the possibilities presented, this information is fed back to the software pre-processor 11 so that the software pre-processor will begin at the next machine-readable phonetic symbol. Thus, subsequent word boundary identification is enhanced since the beginning word boundary has already been determined.
In addition to resolving word boundaries, each selection of an ambiguous word necessarily resolves ambiguities in phoneme identification. A preferred embodiment exploits this situation by also using adaptive feedback so that user selection information can be used to update the weighting functions for subsequent phoneme labelling. Thus, this feature allows the system and method according to the invention to further adapt to a speaker's particular speech characteristics. It is recognized that this feedback capability may not always be desirable. For instance, a speaker's speech characteristics may vary from time to time, due to general health, stress or other factors. Feedback may be undesirable in these instances, since the weighting functions will be updated with anomalous data. Therefore, unlike feedback for determining starting word boundaries, an option is provided whereby feedback to the weighting functions can be disabled.
In practice, feedback to the weighting functions is preferably used only for a short period of time, depending on the consistency of an individual speaker's speech. After the weighting functions have been updated during this period, feedback to the weighting function is disabled. Feedback may be subsequently enabled to update the weighting functions if a speaker's speech characteristics have changed over a period of time, for instance, as a speaker grows older.
Subsequent Word Resolution
Once the first word is resolved in the above manner, the process is iterated for the next and subsequent words. As previously discussed, once the first word has been selected, the software pre-processor 11 begins at the next machine-readable phonetic symbol within the machine-readable phonetic symbol string to identify possible word boundaries for the next word.
Other Embodiments for Longer Speech Blocks
While a preferred embodiment analyzes the machine-readable phonetic symbol string on a word-by-word basis, it is within the scope of the invention to analyze larger blocks of speech at one time, for instance, sentences or phrases, or any blocks of speech which are bracketed by silence intervals.
These embodiments use the same principles as in a preferred embodiment. For instance, if the speech is processed in a phrase or sentence block, the software pre-processor 11 determines word boundaries for all the words in the entire phrase or sentence, which is then displayed through the word processor 12. The user is then given the option of validating the entire phrase or sentence as correct, or manually editing any word boundary within the displayed block. In the latter case, the software pre-processor 11 re-identifies word boundaries for the remaining words within the displayed block, since the user's editing may change subsequent word boundaries within the sentence. The updated block is then re-displayed for acceptance or further editing.
Accordingly, a system and method for facilitating speech transcription has been disclosed. Although preferred embodiments are described to process continuous speech, it will be appreciated that the scope of the invention will also accommodate discrete speech. It is further apparent that a preferred embodiment may be implemented in any spoken language which can be transcribed and is also speaker-independent. Many other embodiments are easily recognized by one skilled in the art and are within the scope of the invention. For instance, the invention may be implemented to process speech on a phrase-by-phrase or sentence-by-sentence basis. A person skilled in the art could readily modify the system to accept digital input, as from an audio stereo system, by omitting analog-to-digital conversion and making appropriate modifications to pre-process the digital signal. Although currently not commercially feasible, another embodiment can use a computer, such as a Sun 10 Workstation, with sufficient computing capability to perform the editing and selecting process. The invention, therefore, is not to be restricted except in the spirit of the appended claims.

Claims
1. A system for facilitating speech transcription comprising:
(a) a device which receives a signal representing speech;
(b) an analog-to-digital convertor in communication with said device for converting said signal into a digital signal;
(c) a spectral processor in communication with said analog-digital convertor for receiving said digital signal and extracting spectral speech features from said digital signal ;
(d) a phoneme labeller in communication with said spectral processor for receiving said spectral speech features, identifying phonemes from said spectral speech features and generating corresponding machine-readable phonetic symbols for said phonemes;
(e) a software pre-processor in communication with said phoneme labeller for receiving said machine-readable phonetic symbols as input and for combining said machine- readable phonetic symbols into words; and
(f) a word processor in communication with said software pre-processor for visually displaying said words.
2. The system as in claim 1 wherein said signal representing speech is an electromagnetic analog signal.
3. A method for facilitating the transcription of speech comprising the steps of:
(a) receiving a digital signal representing human speech;
(b) processing said digital signal to extract spectral speech features;
(c) identifying phonemes from said spectral speech features;
(d) generating corresponding machine-readable phonetic symbols for said phonemes;
(e) outputting said machine-readable phonetic symbols into a software pre-processor;
(f) identifying word boundaries within a substring of said machine-readable phonetic symbols;
(g) combining said substring into a possible word choice; and
(h) visually representing said possible word choice.
4. A method for facilitating speech transcription comprising the steps of:
(a) receiving a human voice speech input as an analog signal;
(b) converting said speech input from said analog signal to a digital signal;
(c) processing said digital signal to extract spectral speech features;
(d) identifying phonemes from said spectral speech features;
(e) generating corresponding machine-readable phonetic symbols for said phonemes;
(f) outputting said machine-readable phonetic symbols into a software pre-processor;
(g) identifying word boundaries within a substring of said machine-readable phonetic symbols;
(h) combining said substring into a possible word choice; and
(i) visually representing said possible word choice.
5. The method as in claim 4 wherein said step (d) of identifying phonemes comprises comparing said spectral speech features with a reference set of phoneme models.
6. The method as in claim 5 wherein said step (d) of identifying phonemes further comprises comparing said spectral speech features with speaker-specific training data generated prior to said step (d).
7. The method as in claim 4, 5, or 6 wherein said step (i) of visually representing said possible word choice comprises displaying said possible word choice as alphanumeric text.
8. The method as in claim 4, 5, or 6 wherein said step (i) of visually representing said possible word choice comprises displaying said possible word choice as every machine-readable phonetic symbol within said substring.
9. The method as in claim 4, 5, or 6 wherein said step (a) of receiving a human voice speech input comprises using at least one of a static microphone, a dynamic microphone, a telephone, and line input.
10. The method as in claim 4, 5, or 6 wherein said step (a) of receiving a human voice speech input occurs in a selected language and said steps (d) - (h) and said step (i) of visually representing said possible word choice occur in said selected language.
11. The method as in claim 4, 5 or 6 further comprising the step of ranking said possible word choice by increasing syllable length, wherein the word choice with the fewest number of syllables is ranked first.
12. The method as in claim 4 further comprising the step of ranking said possible word choice by decreasing syllable length, wherein the word choice with the most number of syllables is ranked first.
13. The method as in claim 4 further comprising the step of ranking said possible word choice by alphabetical order.
14. The method as in claim 4 further comprising the step of ranking said possible word choice by probability in reference to linguistic usage.
15. The method as in claim 14 wherein said step (i) of visually representing said possible word choice comprises displaying the most probable word choice in reference to linguistic usage.
16. The method as in claim 4 further comprising the step of ranking said possible word choice by at least one of alphabetical order, reverse alphabetical order, probability in reference to linguistic usage, increasing syllable length and decreasing syllable length.
17. The method as in claim 16 wherein said step (i) of visually representing said possible word choice comprises displaying the most probable word choice, in reference to linguistic usage, separated in the display from the other choices and displaying the remaining word choices in alphabetical order.
18. The method as in claim 16 wherein said step (i) of visually representing said possible word choice comprises displaying the most probable word choice, in reference to linguistic usage, separated in the display from the other choices and displaying the remaining word choices in reverse alphabetical order.
19. The method as in claim 16 wherein said step (i) of visually representing said possible word choice comprises displaying the most probable word choice, in reference to linguistic usage, separated in the display from the other choices and displaying the remaining word choices in increasing syllable length.
20. The method as in claim 16 wherein said step (i) of visually representing said possible word choice comprises displaying the most probable word choice, in reference to linguistic usage, separated in the display from the other choices and displaying the remaining word choices in decreasing syllable length.
21. The method as in claim 16 wherein said step (i) of visually representing said possible word choice comprises displaying the most probable word choice, in reference to linguistic usage, separated in the display from the other choices and displaying the remaining word choices in order of probability in reference to linguistic usage.
22. The method as in claim 4, 5 or 6 wherein said step (c) of processing said digital signal comprises sampling said digital signal into blocks of samples.
23. The method as in claim 22 wherein said blocks of samples have a 25% overlap between adjacent blocks.
24. The method as in claim 22 wherein said step (c) of processing said digital signal comprises a Fast Fourier Transform after first tapering each block of samples.
25. The method as in claim 4, 5 or 6 wherein said step (c) of processing said digital signal comprises sampling said digital signal into blocks of samples with 0-50% overlap between adjacent blocks.
26. The method as in claim 4, 5 or 6 wherein said step (c) of processing said digital signal comprises sampling said digital signal into blocks from 256 to 2048 samples.
27. The method as in claim 4, 5 or 6 wherein said step (c) of processing said digital signal comprises sampling said digital signal into blocks with 512 samples.
28. The method as in claim 4, 5 or 6 further comprising the step of displaying said machine-readable phonetic symbols after said step (f) of outputting said machine-readable phonetic symbols.
29. The method as in claim 28 further comprising the step of editing said outputted machine-readable phonetic symbols after said step of displaying said machine-readable phonetic symbols.
30. The method as in claim 29 wherein said step of editing said machine-readable phonetic symbols comprises using at least one of a keyboard, a mouse, a digitizing pen, a voice command and an input-positioning device.
31. The method as in claim 28 further comprising the step of editing said word boundaries after said step of displaying said machine-readable phonetic symbols.
32. The method as in claim 4, 5 or 6 wherein said step (i) of visually representing said possible word choice comprises visually representing all the words in any block of speech bracketed by silence intervals at one time.
33. The method as in claim 32 wherein a silence interval is any period of longer than 1/10 of a second in which the ambient noise level does not increase more than 10 dB over the ambient background noise level.
34. The method as in claim 32 wherein a silence interval is any period of longer than 1/15 of a second in which the ambient noise level does not increase more than 9 dB over the average ambient background noise level.
35. The method as in claim 32 wherein a silence interval is any period between 1/15 of a second and 1/10 of a second in which the ambient noise level does not increase more than about 10 dB over the average ambient background noise level.
36. The method as in claim 4, 5 or 6 further comprising the step of selecting said represented possible word choice.
37. The method as in claim 36 wherein said step of selecting one of said represented possible word choice comprises using at least one of a mouse, keyboard, digitizing pen, voice command and an input-positioning device.
38. The method as in claim 36 wherein said step of identifying word boundaries comprises using adaptive feedback from the step of selecting one of said represented possible word choice as performed on the immediately preceding word.
39. The method as in claim 36 wherein said step of selecting one of said represented possible word choice comprises employing a suitable software/computer combination to make such selection.
40. The method as in claim 36 further comprising audio reproduction of said speech input from said machine-readable phonetic symbols.
41. The method as in claim 6 further comprising the step of selecting said represented possible word choice, wherein said speaker-specific training data is updated via adaptive feedback from said step of selecting said represented possible word choice.
42. The method as in claim 4, 5 or 6 further comprising the step of selecting one of a previously represented possible word choice wherein at least one of said steps (a) through (i) occurs simultaneous with said step of selecting one of a previously represented possible word choice.
43. The method as in claim 4, 5 or 6 further comprising the step of storing said machine-readable phonetic symbols.
44. A method for facilitating speech transcription comprising the steps of:
(a) receiving a human voice speech input as an analog signal;
(b) converting said speech input from said analog signal to a digital signal;
(c) processing said digital signal to extract spectral speech features;
(d) identifying phonemes by comparing said spectral speech features with a reference set of phoneme models and speaker-specific training data;
(e) generating corresponding machine-readable phonetic symbols for said phonemes;
(f) outputting said machine-readable phonetic symbols into a software pre-processor;
(g) displaying said machine-readable phonetic symbols;
(h) editing said machine-readable phonetic symbols;
(i) identifying word boundaries within a substring of said machine-readable phonetic symbols;
(j) editing said word boundaries;
(k) combining said substring into a possible word choice;
(l) ranking said possible word choice; and
(m) visually representing said possible word choice.
45. The method as in claim 44 further comprising the step (n) of selecting one of said represented possible word choices.
46. The method as in claim 45 wherein said steps (a) through (n) are repeated for a word within said human voice speech input and said step (i) of identifying word boundaries includes the use of adaptive feedback from the step (n) of selecting one of said represented possible word choice which occurred for the word preceding said word within said human voice speech input.
47. A method for storing speech as machine-readable phonetic symbols comprising the steps of: (a) receiving a human voice speech input as an analog signal;
(b) converting said speech input from said analog signal to a digital signal;
(c) processing said digital signal to extract spectral speech features;
(d) identifying phonemes from said spectral speech features; and
(e) generating corresponding machine-readable phonetic symbols for said phonemes.
48. A system for storing speech as machine-readable phonetic symbols comprising:
(a) a device which receives an analog signal representing human speech;
(b) an analog-to-digital convertor in communication with said device for converting said analog signal into a digital signal;
(c) a spectral processor in communication with said analog-to-digital convertor for receiving said digital signal and extracting spectral speech features from said digital signal; and
(d) a phoneme labeller in communication with said spectral processor for receiving said spectral speech features and for identifying phonemes from said spectral speech features.
49. The system as in claim 48 wherein said analog signal representing human speech is an electromagnetic signal.
50. A system for storing speech as machine-readable phonetic symbols comprising:
(a) a device which receives a digital signal representing human speech;
(b) a spectral processor in communication with said device for receiving said digital signal and extracting spectral speech features from said digital signal;
(c) a phoneme labeller in communication with said spectral processor for receiving said spectral speech features, for identifying phonemes from said spectral speech features, and for generating corresponding machine-readable phonetic symbols for said phonemes; and
(d) a storage device for storing said machine- readable phonetic symbols.
51. The system as in claim 50 wherein said digital signal representing speech is an electromagnetic signal.
52. A method of speech reproduction comprising re-synthesis of stored machine-readable phonetic symbols which represent a compressed form of speech.
53. The method as in claim 4 further comprising the step of ranking said possible word choice in order of the most recent usage, wherein the word choice with the least number of words since the last usage of said word choice is ranked first.
54. The method as in claim 53 wherein said step (i) of visually representing said possible word choice comprises displaying said possible word choice in order of the most recent usage, wherein the word choice with the least number of words since the last usage of said word choice is ranked first.
55. The method as in claim 4 further comprising the step of ranking said possible word choice in order of frequency used, wherein the word previously used most often is ranked first.
56. The method as in claim 55 wherein said step (i) of visually representing said possible word choice comprises displaying said possible word choice in order of frequency used, wherein the word previously used most often is ranked first.
57. The method as in claim 4 further comprising the step of ranking said possible word choice by the time elapsed since said possible word choice was last used, wherein the word choice temporally closest to its last usage is ranked first.
58. The method as in claim 57 wherein said step (i) of visually representing said possible word choice comprises displaying said possible word choice by the time elapsed since said possible word choice was last used, wherein the word choice temporally closest to its last usage is ranked first.
PCT/US1995/009130 1994-07-21 1995-07-19 System and method for facilitating speech transcription WO1996003741A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU31368/95A AU3136895A (en) 1994-07-21 1995-07-19 System and method for facilitating speech transcription

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US27826694A 1994-07-21 1994-07-21
US08/278,266 1994-07-21

Publications (2)

Publication Number Publication Date
WO1996003741A1 WO1996003741A1 (en) 1996-02-08
WO1996003741A9 true WO1996003741A9 (en) 1996-03-14

Family

ID=23064337

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1995/009130 WO1996003741A1 (en) 1994-07-21 1995-07-19 System and method for facilitating speech transcription

Country Status (2)

Country Link
AU (1) AU3136895A (en)
WO (1) WO1996003741A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60113787T2 (en) * 2000-11-22 2006-08-10 Matsushita Electric Industrial Co., Ltd., Kadoma Method and device for text input by speech recognition
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US7697827B2 (en) 2005-10-17 2010-04-13 Konicek Jeffrey C User-friendlier interfaces for a camera
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US9551033B2 (en) 2007-06-08 2017-01-24 Genentech, Inc. Gene expression markers of tumor resistance to HER2 inhibitor treatment
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
DK201670539A1 (en) * 2016-03-14 2017-10-02 Apple Inc Dictation that allows editing
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US11314214B2 (en) 2017-09-15 2022-04-26 Kohler Co. Geographic analysis of water conditions
US11093554B2 (en) 2017-09-15 2021-08-17 Kohler Co. Feedback for water consuming appliance
US11099540B2 (en) 2017-09-15 2021-08-24 Kohler Co. User identity in household appliances
US10887125B2 (en) 2017-09-15 2021-01-05 Kohler Co. Bathroom speaker
US10448762B2 (en) 2017-09-15 2019-10-22 Kohler Co. Mirror

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0248593A1 (en) * 1986-06-06 1987-12-09 Speech Systems, Inc. Preprocessing system for speech recognition
US5289523A (en) * 1992-07-31 1994-02-22 At&T Bell Laboratories Telecommunications relay service method and apparatus
EP0645757B1 (en) * 1993-09-23 2000-04-05 Xerox Corporation Semantic co-occurrence filtering for speech recognition and signal transcription applications

Similar Documents

Publication Publication Date Title
WO1996003741A9 (en) System and method for facilitating speech transcription
WO1996003741A1 (en) System and method for facilitating speech transcription
US5220639A (en) Mandarin speech input method for Chinese computers and a mandarin speech recognition machine
US7280968B2 (en) Synthetically generated speech responses including prosodic characteristics of speech inputs
US5758023A (en) Multi-language speech recognition system
US5865626A (en) Multi-dialect speech recognition method and apparatus
US5787230A (en) System and method of intelligent Mandarin speech input for Chinese computers
Donovan Trainable speech synthesis
US5995928A (en) Method and apparatus for continuous spelling speech recognition with early identification
Zwicker et al. Automatic speech recognition using psychoacoustic models
US20030154080A1 (en) Method and apparatus for modification of audio input to a data processing system
Fendji et al. Automatic speech recognition using limited vocabulary: A survey
US20090240499A1 (en) Large vocabulary quick learning speech recognition system
JPH06214587A (en) Predesignated word spotting subsystem and previous word spotting method
Pellegrino et al. Automatic language identification: an alternative approach to phonetic modelling
Bhatt et al. Continuous speech recognition technologies—a review
Hatala Speech recognition for Indonesian language and its application to home automation
Cettolo et al. Automatic detection of semantic boundaries based on acoustic and lexical knowledge.
Burileanu et al. Spontaneous speech recognition for Romanian in spoken dialogue systems
Caranica et al. On the design of an automatic speaker independent digits recognition system for Romanian language
Syadida et al. Sphinx4 for indonesian continuous speech recognition system
Tun et al. A speech recognition system for Myanmar digits
Prasangini et al. Sinhala speech to sinhala unicode text conversion for disaster relief facilitation in sri lanka
Huckvale 14 An Introduction to Phonetic Technology
Reddy et al. Automatic pitch accent contour transcription for Indian languages