EP0364501A1 - Speech processing apparatus and methods
Info
- Publication number
- EP0364501A1 (application EP88908441A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- values
- value
- frequency
- spectrum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- the present invention relates to speech processing apparatus and methods. More particularly, the present invention relates to apparatus and methods for use in automatic speech recognition applications and research.
- Speech, as it is perceived, can be thought of as being made up of segments, or speech sounds. These are the phonetic elements, the phonemes, of a spoken language, and they can be represented by a set of symbols, such as International Phonetic Association symbols.
- Stage 1 is an auditory-sensory analysis of the incoming acoustic waveform whereby representation of the signal is achieved in auditory-sensory terms.
- Stage 2 is an auditory-perceptual transformation whereby the spectral output of stage 1 is transformed into a perceptual form relevant to phonetic recognition.
- the spectral descriptions are transformed into dimensions more directly relevant to perception.
- the perceptual form may be related to articulatory correlates of speech production or auditory features or pattern sequences.
- In stage 3, the perceptual dimensions of stage 2 are transformed by a phonetic-linguistic transformation into strings of phonemes, syllables, or words.
- Stages 2 and 3 also are influenced by top-down processing wherein stored knowledge of language and events and recent inputs, including those from other senses as well as language, are brought into play.
- Diphthongs, glides, and r-colored vowels are speech sounds that are all generically referred to as glides herein. Analysis of these sounds continues to pose difficult problems among the many faced in the field of automatic speech recognition. A paper which discusses some of these types of speech sounds is "Transitions, Glides, and Diphthongs" by I. Lehiste et al., J. Acoust. Soc. Am., Vol. 33, No. 3, March 1961, pp. 268-277.
Summary of the Invention
- Among the objects of the present invention are to provide improved speech processing apparatus and methods which more effectively process speech into segments; to provide improved speech processing apparatus and methods which more effectively and automatically recognize glide phonetic elements in speech; and to provide improved speech processing apparatus and methods which are alternatives to prior speech processing apparatus and methods.
- speech processing apparatus includes a circuit for electronically deriving from speech over time a series of coordinate values representing positions of points on a path in a mathematical space which path positions represent the speech. Also, an electronic memory prestores phonetic representations in correspondence with indicia of a glide in the path which indicia represent a nucleus in the space at which the glide begins and a range of directions of offglide on the path from the nucleus. A further circuit electronically computes a trajectory parameter from the series of coordinate values and, when both the trajectory parameter satisfies a predetermined condition for significance and a coordinate value currently reached by the speech is within a predetermined region of such values, produces a signal.
- the speech on the path is electronically analyzed for occurrence of a position in the nucleus and an offglide in said range of directions which offglide happens before another significant trajectory parameter occurs.
- the phonetic representation corresponding to the glide indicia is obtained from the electronic memory.
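The glide test described above can be sketched in software. This is an illustrative two-dimensional sketch, not the patent's implementation: the function name, the use of frame-to-frame speed as the trajectory parameter, and the representation of the offglide direction range as an angular interval are all assumptions made for the example.

```python
import numpy as np

def detect_glide(path, nucleus_center, nucleus_radius,
                 dir_lo, dir_hi, speed_threshold=1.0):
    """Report a glide when the path enters the nucleus and then leaves
    it with an offglide direction (angle in radians) inside
    [dir_lo, dir_hi] and a trajectory parameter -- here simply the
    frame-to-frame speed -- exceeding a significance threshold."""
    path = np.asarray(path, dtype=float)
    for prev, cur in zip(path[:-1], path[1:]):
        prev_in = np.linalg.norm(prev - nucleus_center) <= nucleus_radius
        cur_in = np.linalg.norm(cur - nucleus_center) <= nucleus_radius
        if prev_in and not cur_in:
            step = cur - prev
            speed = np.linalg.norm(step)          # trajectory parameter
            angle = np.arctan2(step[1], step[0])  # offglide direction
            if speed >= speed_threshold and dir_lo <= angle <= dir_hi:
                return True
    return False
```

A path that dwells in the nucleus and then departs rightward (angle near 0) within the permitted direction range would be reported as a glide; the same departure at right angles to the range would not.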
- speech processing apparatus includes circuitry that electronically derives frequency spectra from speech in successive time intervals respectively and computes a series of coordinate values of points on a path in a mathematical space from the frequency spectra of the speech. Further circuitry generates a segmentation index signal representing a function of the difference between the greatest and the least coordinate value occurring in a time period encompassing a predetermined number of the time intervals. As a result, the segmentation index signal indicates how the speech is to be segmented.
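The segmentation index just described can be sketched as follows. The window length and the exact function of the spread (here the per-coordinate range, summed across coordinates) are illustrative assumptions; the patent specifies only a function of the difference between the greatest and least coordinate values over the period.

```python
import numpy as np

def segmentation_index(coords, window=5):
    """For each sliding window of `window` time intervals, compute the
    difference between the greatest and least value of each coordinate
    and sum the differences, yielding one index value per window."""
    coords = np.asarray(coords, dtype=float)   # shape (frames, dims)
    n = len(coords) - window + 1
    idx = np.empty(n)
    for i in range(n):
        w = coords[i:i + window]
        idx[i] = np.sum(w.max(axis=0) - w.min(axis=0))
    return idx
```

The index stays near zero while the path lingers in one region and rises sharply when the path jumps, which is what makes it usable as a segmentation cue.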
- Fig. 1 is a block diagram of a speech processing apparatus
- Fig. 2 is a graph of voltage versus time of a typical speech waveform
- Fig. 3 is a diagram of operations of an interrupt routine of a unit CPU1 of Fig. 1;
- Fig. 4 is a diagram of operations of a main routine of CPU1 of Fig. 1;
- Fig. 5 is a graph of amplitude versus log-frequency of a ten-millisecond sample of the speech waveform of Fig. 2, showing a frequency spectrum thereof;
- Fig. 5A is a diagram of a table in a memory for CPU1 for holding a set of spectral values corresponding to multiples K of a basic frequency;
- Figs. 6, 7, 8 and 9 are a set of graphs of spectral envelopes in decibels versus log-frequency for illustrating a method for analyzing different frequency spectra of speech
- Fig. 10 is a diagram of three spectral envelopes in decibels versus log-frequency for showing how a quantity called speech goodness depends on shapes of spectra;
- Fig. 11 is a graph of speech goodness versus width of one or more peaks in a spectrum
- Fig. 12 is a graph of a quantity called speech loudness versus a decibel sum
- Figs. 13A and 13B are two parts of a diagram further detailing operations in the main routine of Fig. 4;
- Fig. 13C is a diagram of a spectral envelope in decibels versus log-frequency representing a "voice bar" for showing a method of determining a "first sensory formant" value SF(1);
- Fig. 14 is a diagram of operations according to a method for generating a spectral reference value SR
- Fig. 15 is a diagram of operations according to a method in a unit CPU2 of Fig. 1 for converting from sensory pointer coordinate values to coordinate values on a path having perceptual significance;
- Fig. 15A is a diagram of a table for use in the operations of Fig. 15;
- Fig. 16 shows an illustration of a mathematical model for converting from sensory pointer coordinates to coordinates X p , Y p and Z p of a perceptual pointer in a three dimensional mathematical space;
- Fig. 17 is a simplified diagram of the mathematical space of Fig. 16, showing target zones for two phonetic elements, and showing a trajectory or path traced out by the perceptual pointer in the mathematical space;
- Fig. 18 shows an X,Y,Z coordinate system and an X',Y',Z' coordinate system in the mathematical space
- Figs. 19 and 20 show two different views of a vowel slab with target zones for the vowels in the mathematical space relative to the X',Y',Z' coordinate system of Fig. 18 and viewing along the X' axis in Fig. 19 and along the Z' axis in Fig. 20;
- Fig. 21 depicts target zones in the mathematical space for voiceless stops, voiced stops and voice bars as viewed along the Y axis of Fig. 18;
- Figs. 22A and 22B depict target zones in the mathematical space for nasal consonants as viewed respectively along the X' and Z' axes of Fig. 18;
- Fig. 23 depicts target zones in the mathematical space for voiceless fricatives of American English as viewed along the Y axis of Fig. 18;
- Fig. 24 depicts target zones in the mathematical space for voiced fricatives and the phonetic approximates as viewed along the Z' axis of the X', Y', Z' coordinate system of Fig. 18;
- Fig. 25 depicts target zones in the mathematical space for the voiced fricatives and the phonetic approximates of Fig. 24 as viewed along the X' axis of the X', Y', Z' coordinate system of Fig. 18;
- Fig. 26 is a diagram of inventive operations of a CPU3 of Fig. 1 of the inventive apparatus in analyzing the path in the mathematical space and obtaining phonetic elements when phonetically significant events occur;
- Fig. 27 is a diagram of a table for use in the operations of Fig. 26;
- Fig. 28 is a pictorial of an X, Y, Z coordinate system with target zones marked with identification numbers instead of phonetic element representations;
- Fig. 29 is a diagram of a table in ROM3 of CPU3 in Fig. 1, which table relates phonetic element representations to the target zone identification numbers of Fig. 28 and to various flags for purposes of a complex target zone method;
- Fig. 30 is a diagram of some perceptual paths in an X',Y' coordinate system which paths represent different occurrences of the same diphthong /AY/;
- Fig. 31 is a diagram of some perceptual paths in the same X',Y' coordinate system of Fig. 30 which paths represent different occurrences of the diphthong /EY/;
- Fig. 32 is a diagram of various nucleus zones, or nuclei, in the X', Y' coordinate system for explaining a glide-detection method of the invention;
- Fig. 32A is a diagram of a generalized perceptual path in the X',Y' coordinate system for explaining a glide-detection method of the invention
- Figs. 32B and 32C are diagrams of a nucleus for a w-glide respectively shown in Y',Z' and in X',Y' coordinates for explaining a glide-detection method of the invention
- Figs. 32D and 32E are diagrams of a nucleus for a j-glide (as in "yuh") respectively shown in Y',Z' and in X',Y' coordinates for explaining a glide-detection method of the invention
- Fig. 33 is a diagram of inventive operations in an alternative to Fig. 26 for implementing a complex target zone method and an inventive glide detection method;
- Fig. 34 is a flow diagram of inventive operations in a glide subroutine portion of Fig. 33 for the inventive glide detection method
- Fig. 34A is a table of coordinate values for use in a monotonicity test in the glide detection method of Fig. 34;
- Fig. 35 is a diagram of inventive operations in a further alternative to Fig. 26 for implementing a complex target zone method and an inventive glide detection method;
- Fig. 36 is a flow diagram of inventive operations in a glide subroutine portion of Fig. 35 for the inventive glide detection method
- Fig. 37 is a flow diagram of operations in an output subroutine portion of Fig. 36 for the inventive glide detection method
- Figs. 38, 39 and 40 are flow diagrams of operations according to an alternative to Fig. 14 for generating the spectral reference value
- Figs. 41, 42, 43 and 44 are diagrams of decibels versus log-frequency for illustrating a method of separating a spectrum of speech of Fig. 41 by use of a harmonic sieve to detect a periodic line spectrum;
- Fig. 45 is a flow diagram of inventive operations according to a more detailed version of Fig. 4 for the main routine of CPU1 of Fig. 1;
- Fig. 46 is a flow diagram of inventive operations of a step in Fig. 45 for separating periodic and aperiodic spectra shown in Figs. 41-44;
- Fig. 47 is a table showing one example of burst friction and glottal source flag values determined according to operations of Figs. 13A and 13B for each of the periodic and aperiodic spectra separated according to the operations of Fig. 46;
- Fig. 48 shows three hypothetical graphs of values in mathematical perceptual space for three coordinates over time, for use in describing inventive operations involving an alternative segmentation index approach to deriving a trajectory parameter in Fig. 26, 33 or 35; and
- Fig. 49 is a flow diagram of inventive operations for the alternative segmentation index approach to deriving a trajectory parameter in Fig. 26, 33 or 35.
- a speech processing system 1 of the invention has a microphone 11 for converting sound pressure variations of an acoustic waveform of speech to an analog electrical signal on a line 13.
- System 1 performs a short-term analysis on the speech waveform that allows it to represent, every few milliseconds, the spectral shape and the auditory state of the incoming speech.
- This sensory processing serves as an input to a higher level perceptual electronic system portion.
- the perceptual electronic system portion integrates sensory information over time, identifies auditory-perceptual events (or "sounds"), and converts the sensory input into a string of symbols or category codes corresponding to the phonetic elements of a human language.
- the electrical signal on line 13 is filtered by an antialiasing low pass filter 15 and fed to a sample-and-hold (S/H) circuit 17.
- S/H circuit 17 is enabled by an oscillator 19 at a sampling frequency such as 20 kHz and supplies samples of the analog electrical signal to an analog-to-digital converter (ADC) 21 where the samples are converted in response to oscillator 19 to parallel digital form on a set of digital lines 23 connected to data inputs of a first central processing unit CPU1.
- CPU1 reads in the latest sample in digital form upon interrupt by oscillator 19 at interrupt pin IRQ every 50 microseconds.
- CPU1 is one of four central processing units CPU1, CPU2, CPU3 and CPU4 in Fig. 1, which respectively have programmable read only memory (ROM1, ROM2, ROM3 and ROM4), random access memory (RAM1, RAM2, RAM3 and RAM4), and a video terminal-keyboard unit (TERMKBD1, TERMKBD2, TERMKBD3 and TERMKBD4).
- CPU1 generates data for CPU2 which is buffered by a data buffer 25.
- CPU2 generates data for CPU3 which is buffered by a data buffer 27.
- CPU3 generates data for CPU4 which is buffered by a data buffer 29.
- CPU3 has a memory 31 of approximately 2 megabytes or otherwise sufficient capacity that holds prestored information indicative of different phonetic representations, target zone identifications, and glide zone (glide nucleus or radical) identifications corresponding to respective sets of addresses in the memory.
- CPU3 is provided with a printer 33 for recording phonetic element information in the order obtained by it from memory 31.
- CPU4 is in one application shown in Fig. 1 programmed as a lexical access processor for converting the phonetic element information into plaintext and printing it out on a printer 35 to accomplish automatic dictation.
- CPU4 in some applications, such as a hearing aid embodiment or other intelligent sound system embodiment, is programmed additionally, or instead, to process the phonetic elements and synthesize speech therefrom and make it audible using an electroacoustic output transducer in a manner adapted to ameliorate hearing deficiencies or otherwise produce modified speech based on that entering microphone 11.
- CPU4 in still other applications acts as a bandwidth compressor to send the phonetic elements through a telecommunication system along with other phonetic elements from a different speech channel with which the first speech phonetic elements are multiplexed.
- CPU4 in yet further applications is programmed with artificial intelligence or expert systems software to interpret the phonetic elements and to produce a printed response, a synthesized speech response, a robotic response controlling computers or other electronic devices or electromechanical apparatus in home, office or factory, or to produce any other appropriate response to the speech sensed on line 13.
- Fig. 2 shows a portion of an electrical waveform 51 of speech.
- the waveform 51 generally has several peaks and troughs over a time interval, or window, of about ten milliseconds, as well as higher frequency behavior.
- CPU1 is interrupted 20000 times per second so that in each ten millisecond time interval a set of 200 samples is obtained from ADC 21.
- an interrupt routine 70 of CPU1 commences upon interrupt at pin IRQ with a BEGIN 71 and proceeds to a step 73 to read the latest sample into an address location in a section of N1 (e.g. 80) addresses in RAM1. Then in a step 75 both the address and a sample count N are incremented by one. In a decision step 77, the count N is compared with the number N1 to determine if the latest set of samples is complete. If so, then in a step 79 the sample count N is returned to zero and a flag FLG is set to 1 as a signal that the latest set of samples is complete.
- the address location for the next sample is reset to a predetermined location ADR0 at the beginning of the section of N1 addresses, whence a RETURN 81 is reached. If the latest set of samples is not complete, the operations branch from step 77 to RETURN 81 whence a main program resumes in CPU1 at an operation where the interrupt occurred.
- a set of variables or quantities, herein called an auditory state code, are all initialized to zero.
- the variables in the auditory state code of the present embodiment are: burst-friction BF, glottal-source GS, nasality NS, loudness indices LIBF and LIGS for burst-friction and glottal-source sounds respectively, and speech goodness values GBF and GGS for burst-friction and glottal-source sounds respectively.
- variables are included in the auditory state code for some or all of a variety of source characteristics of speech including nasality, voicing, frication, aspiration, whisper, loudness and goodness.
- In a step 107, the flag FLG is checked to confirm that a full set of N1 samples is available.
- the interrupt operations of Fig. 3 are collecting the next set of N1 samples as the operations of Fig. 4 are executed. If the system 1 has just been turned on, CPU1 will wait until the first set of samples has been obtained and FLG has been set to 1 in the interrupt routine, which wait occurs by a branch from step 107 back to itself. When FLG becomes one, a full set of samples is present and FLG is reset to zero in a step 109.
- a set of digital values representing a frequency spectrum corresponding to the latest N1 samples from ADC 21 is computed according to a Discrete Fourier Transform (DFT) procedure. In other words each such set of digital values represents the frequency spectrum of the speech in each successive ten millisecond interval or frame.
- DFT is the Discrete Fourier Transform
- e is the base of natural logarithms
- j is the square root of minus one
- pi is the ratio of circumference to diameter of a circle.
- f is a basic frequency equal to the reciprocal of the time required to collect a set of N1 samples (when time is 10 milliseconds, f is 100 Hertz)
- Kf is an integer multiple of the frequency f at which one of the lines 113 in the spectrum is to be computed.
- CPU1 computes the DFT by the Fast Fourier Transform algorithm familiar to the art for frequency multiples K from 1 to a number M.
- The values of D(Kf) are stored as illustrated in Fig. 5A in a spectrum table in RAM at successive addresses corresponding to the K values respectively.
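The DFT step above can be sketched as follows. With a 10 ms window of N1 samples at the sampling rate, the basic frequency f is the reciprocal of the window length (100 Hz), and the FFT bins fall exactly at the multiples Kf. M = 40 follows the illustrative value given later in the text; the normalization by window length is an assumption of the sketch.

```python
import numpy as np

def line_spectrum(samples, fs=20000, M=40):
    """Compute spectral values D(Kf) for K = 1..M from one window of
    samples, keyed by K as in the spectrum table of Fig. 5A."""
    n = len(samples)
    f = fs / n                          # basic frequency: 1 / window length
    spectrum = np.abs(np.fft.rfft(samples)) / n
    return {k: spectrum[k] for k in range(1, M + 1)}, f
```

For a 200-sample window at 20 kHz, a 300 Hz tone lands exactly in the K = 3 line of the table.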
- the speech waveform is multiplied by time-window weighting functions of 5-40 millisecond duration but shifted in 1.0-2.5 millisecond steps.
- the successive time intervals defining the windows can be either overlapping or distinct.
- the window duration and step size as related to bursts, transitions and relatively steady-state segments are adjusted for best performance.
- the short-term spectrum is calculated for each segment by either DFT or linear prediction analysis (LPA).
- the DFT produces a line spectrum with components at integral multiples of the reciprocal of the window length while the LPA produces a smoothed spectral envelope--transfer function--with detail dependent on the number of LP-parameters selected.
- Either spectrum is represented in log-magnitude by log-frequency dimensions. Operations accomplish or approximate the following.
- the spectrum is "windowed" in the log frequency domain so that the amplitudes are represented in sensation levels or loudness levels.
- the spectrum is subjected to smoothing filters one of which is similar to the critical-band. Another minimizes confusing minor spectral peaks.
- the spectral envelope is subjected to high-pass filtering in the log-frequency domain to eliminate spectral tilt.
- the resulting spectra preferably have formant peaks of nearly uniform height--tilt having been removed and have minor irregularities removed by the smoothing filters.
- a nasal wave can be detected in the lower half of the speech spectrum by looking for a weakened and broadened first formant, by windowing the processed spectral envelope in the appropriate range of log-frequency units and band-pass filtering that segment in search of the nasal wave, or by using correlational signal processing techniques.
- a real time filter bank circuit is used to produce the spectrum for CPU1.
- a filter bank advantageously reduces the computing required of CPU1, and in such embodiment the spectrum table is updated from the real time filter bank at regular intervals such as every ten milliseconds or even more frequently, for example every 1-2.5 milliseconds.
- signal processing chips for inexpensively and rapidly computing spectra are available such as the Texas Instruments TMS 320.
- In Fig. 5 the spectrum has several peaks 115, 116 and 117 which decline in height, or "tilt", with increasing frequency.
- an envelope 119 is drawn on Fig. 5 which envelope has the same peaks 115, 116 and 117.
- Envelope 119 is redrawn dashed in Fig. 6 with the spectral lines 113 being understood but suppressed in Fig. 6 for clarity.
- CPU1 in a step 121 of Fig. 4 converts the spectrum to decibels (dB) of sensation level according to the equation
- SL(Kf) = 20 log10( D(Kf) / ref )
- where D(Kf) is each spectral value at frequency Kf, and ref is the normal human threshold for that frequency in sound pressure.
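The conversion of step 121 can be sketched directly; the 20 log10 amplitude-to-decibel relation is the standard form for pressure-like values, and the function name is illustrative.

```python
import numpy as np

def to_sensation_level(d, ref):
    """Convert each spectral value D(Kf) to decibels re the normal
    human threshold of hearing at that frequency."""
    d = np.asarray(d, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return 20.0 * np.log10(d / ref)
```

A value ten times threshold maps to 20 dB sensation level; a value at threshold maps to 0 dB.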
- the spectrum is smoothed by sliding a critical-bandlike weighting function along the log-frequency or pitch-like axis, and spectral tilt or "combing" is also eliminated by passing the smoothed spectrum through a high-pass lifter defined in the log-frequency or pitch-like domain.
- the resulting smooth envelope is rectified (straightened) to eliminate low-level excursions, including those some fixed number of decibels below the highest spectral peaks as well as those below the threshold of hearing, since these are irrelevant to phonetic perception.
- the processed spectral envelope is tested for the presence, location and strength of the nasal wave. After determination of nasalization, which can be removed by further spectrum processing in some embodiments, the spectral envelope is examined for low and high frequency cutoffs and significant spectral prominences.
- In a step 123 of Fig. 4, the tilt suggested by a dashed line 125 in Fig. 6 is eliminated from the spectrum by adding values to the spectrum that increase with frequency at a rate of C dB per ten-fold increase in frequency.
- the value of the constant C is determined using a linear regression analysis of the spectrum.
- each of the M values (where M is illustratively 40) of the spectrum in decibels is respectively added to a corresponding value computed according to equation (3) for each K from 1 to M.
- the resulting spectrum is suggested by an envelope 127 of Fig. 6 having three peaks P1, P2 and P3 in order of increasing frequency.
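Steps 123 and the regression-based estimate of C can be sketched together. Fitting level against log10(frequency) follows the linear-regression description in the text; details such as restricting the fit to peaks are omitted and would be refinements.

```python
import numpy as np

def remove_tilt(db_spectrum, f_basic=100.0):
    """Estimate the spectral tilt C (dB per ten-fold increase in
    frequency) by linear regression of level on log10(frequency),
    then add a compensating ramp so the tilt is removed."""
    db = np.asarray(db_spectrum, dtype=float)
    k = np.arange(1, len(db) + 1)
    logf = np.log10(k * f_basic)
    slope, intercept = np.polyfit(logf, db, 1)  # slope = -C dB per decade
    return db - slope * (logf - logf[0])
```

Applied to a spectrum that declines linearly in dB per decade, the result is flat, corresponding to envelope 127 of Fig. 6 with the tilt of line 125 removed.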
- the above-described short-term spectral analysis of the time-windowed speech waveform identifies the amplitudes and frequencies of tonal components in the speech waveform and at the same time produces a power spectrum of any significant aperiodic energy or other unresolved high-frequency components in the speech waveform.
- This information is used to distinguish aperiodic, periodic, and mixed segments and to establish an effective lower frequency F0 or low pitch, of the periodic and mixed segments.
- This same short-term spectral information undergoes further processing to generate auditory-spectral patterns that can be called sensory-excitation patterns, auditory- sensory spectra, or auditory-spectral envelopes.
- Voice pitch plays a role in the identification of voiced phonetic segments such as vowels like a, e, i, o and u. Detection of aperiodic energy in speech is very important for the recognition of aspiration sounds as in /h/, /p/, /k/ and /t/ and of fricatives such as /s/ and /f/ and so on. Voiced fricatives such as /z/, /zh/ and /v/ have a mixture of periodic and aperiodic energy and are a combination of both glottal-source and burst-friction spectra.
- Figs. 7, 8 and 9 show envelopes illustrating different types of spectra associated with different types of speech sounds. These spectra have different numbers and shapes of prominences, or peaks, at different frequencies compared with envelope 127 of Fig. 6. Clearly the spectra resulting from steps 111, 121 and 123 of Fig. 4 can vary widely as different sets of speech samples are processed by CPU1. To characterize these spectra with relatively few variables, each latest spectrum is analyzed in a step 131 of Fig. 4. In this step, three spectral frequencies SF1, SF2 and SF3 are computed.
- SF1, SF2 and SF3 are in some cases the frequencies at which peaks occur such as P1, P2 and P3 in Fig. 6, and the manner of determining them is described more specifically in connection with Figs. 13A and 13B hereinafter. Distinct lower and higher values SF1L and SF1H are computed for SF1 when nasality is present.
- a spectral frequency reference SR is also computed to indicate the overall general pitch (timbre) of the speech so that voices with high pitch (timbre) and voices with low pitch (timbre) are readily processed by the system 1.
- auditory state code quantities BF, GS, NS, LIGS, LIBF, GGS and GBF are determined from the spectrum.
- In step 133 the speech goodness values GGS and GBF are tested and the loudness index values LIGS and LIBF are tested, and if none is positive or otherwise significant, operations branch to a step 135.
- In step 135 a set of registers in CPU1 (corresponding to a set of three coordinates called sensory pointer coordinates X s , Y s and Z s ) is loaded with a code "*" indicating that the coordinates are undefined.
- the contents of the registers for X s , Y s and Z s are sent to CPU2 through buffer 25 of Fig. 1.
- If in decision step 133 the speech goodness is positive, operations proceed to a step 143 where sensory pointer coordinate value X s is set equal to the logarithm of the ratio of SF3 to SF2, pointer value Y s is set equal to the logarithm of the ratio of SF1L to SR, and pointer value Z s is set equal to the logarithm of the ratio of SF2 to SF1H, whence step 137 is reached.
- The equations of step 143 are computed once, except when glottal-source and burst-friction spectra are simultaneously present, as in voiced fricatives, in which case step 143 is executed twice to compute sensory pointer coordinates X gs , Y gs , Z gs for the glottal-source spectrum and X bf , Y bf , Z bf for the burst-friction spectrum.
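The step-143 coordinate equations can be written directly as code. The formant values in the usage example are hypothetical vowel-like numbers, not figures from the patent.

```python
import math

def sensory_pointer(sf1l, sf1h, sf2, sf3, sr):
    """Step 143: sensory pointer coordinates as logarithms of
    frequency ratios (Xs, Ys, Zs)."""
    xs = math.log10(sf3 / sf2)   # log of SF3 to SF2
    ys = math.log10(sf1l / sr)   # log of SF1L to the reference SR
    zs = math.log10(sf2 / sf1h)  # log of SF2 to SF1H
    return xs, ys, zs

# Hypothetical formant values for a vowel-like glottal-source spectrum:
xs, ys, zs = sensory_pointer(sf1l=500.0, sf1h=500.0,
                             sf2=1500.0, sf3=2500.0, sr=168.0)
```

For a voiced fricative the same function would simply be called twice, once with the glottal-source frequencies and once with the burst-friction frequencies.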
- After sensory pointer coordinate values X s , Y s and Z s are sent to CPU2 in step 137, the auditory state code quantities BF, GS, NS, LIGS, LIBF, GGS and GBF are also sent in a step 145 to CPU2 through buffer 25. Then in a step 147, a test is made to determine if an OFF-ON switch is on, and if not, operations terminate at END 149. If the switch is on, as is normal, operations loop back to step 105 for obtaining the next spectrum, analyzing it and sending information to CPU2 as described above. CPU1 thus executes operations continually to obtain spectral information about the samples of speech as they arrive in real time.
- the auditory-spectral pattern at any moment in time is given by the auditory-spectral envelope in dB (Phons or Sensation Level or equivalent) against log frequency, as shown in Fig. 5.
- a sensory pointer for vocalic portions of speech has a position in a mathematical space, or phonetically relevant auditory-perceptual space, computed in step 143 of Fig. 4.
- This pointer is called a glottal-source sensory pointer (GSSP).
- SF1, SF2 and SF3 are the center frequencies of the first three spectral prominences in the auditory-spectral envelope 127 of Fig. 6.
- SF3 is interpreted as the upper edge of the spectral envelope when no clear peak P3 can be observed, such as when peaks P2 and P3 merge during a velar segment, or is taken as being a fixed logarithmic distance above SR when P3 is absent.
- Spectral frequency SF1 generally corresponds to the center frequency of the first significant resonance of the vocal tract. However, during nasalization two peaks, or one broadened peak, appear near the first significant resonance, as in Figs. 7 and 8 respectively. To take account of such spectral differences steps 131 and 143 of Fig. 4 are made sufficiently flexible to compute the sensory pointer position differently for nasalization spectra than for other spectra. In another major class of spectra suggested by the envelope of Fig. 9, there is no major prominence in the area of peak P1 of Fig. 6. In other words, the latter two of the three prominences of Fig. 6 may occur without the first prominence in this class of spectra.
- Burst-friction spectra are associated with burst sounds and sustained friction sounds and are produced by a talker with supraglottal sources, such as when the tongue meets or approximates the velum, palate, or teeth, or at the teeth and lips themselves. These spectra are referred to as burst-friction (BF) spectra herein.
- a BF spectrum is analyzed differently from a GS spectrum by CPU1 in order to produce the spectral frequency values SF1, SF2 and SF3 and sensory reference value SR, and the position of the resulting sensory pointer values computed in step 143 of Fig. 4 is generally in the X s , Z s plane.
- These pointer values are regarded as defining the position of a pointer called the burst-friction sensory pointer (BFSP) which is distinct from the GSSP.
- the glottal-source GS value is set to 1 in the auditory state code whenever a glottal-source spectrum is above the auditory threshold.
- the GSSP is regarded as moving through a mathematical space, or auditory-perceptual space. The path of the GSSP is interrupted by silences and by burst-friction spectra. Then the GS value is set to zero and the BF value is set to 1 in the auditory state code. In such case, the GSSP is replaced by the BFSP.
- the GSSP can be regarded as moving through the mathematical space as the glottal-source spectrum changes shape and sometimes this movement is nearly continuous as in the case of the sentence, "Where were you a year ago?", where the only interruption would occur during the friction burst of "g" in "ago.”
- the quantity GS in the auditory state code can remain at a value of one (1) through many spectra in various examples of speech, but the quantity BF in the auditory state code when set to one is generally reset to zero very shortly thereafter, because spectra which are not of the burst-friction type occur so soon thereafter.
- burst-friction sensory pointer BFSP will usually appear and disappear shortly thereafter as friction sounds are inserted in the speech stream.
- the BFSP may exhibit considerable jitter, and it usually will not move in any smooth, continuous way in the mathematical space.
- the quantity BF in the auditory state code is 1 when the quantity GS is zero, and vice versa.
- both BF and GS are equal to one simultaneously.
- both of the sensory pointers are simultaneously present as one is associated with the glottal-source spectrum of the voiced part of the voiced fricative speech sound and the other is associated with the burst-friction spectrum of the friction part of the sound.
- CPU1 computes goodness values and loudness values in the auditory state code for the GS and BF spectra.
- the speech goodness is a measure of the degree to which the sound represented by the latest spectrum is like a sound of speech, and is regarded as the cross-correlation between an ideal spectrum for a given speech sound and the latest actual spectrum of that sound. Since calculation of the cross-correlation itself represents a significant computer burden, the goodness value is estimated in the preferred embodiment. As shown in Fig. 10, the speech goodness value is low when an actual spectrum consists of a few pure tones showing up as very narrow peaks 171, 173 and 175; and the goodness value is also low when the spectrum is very broad-band with tiny bumps for peaks as in envelope 177. On the other hand, the goodness value is high for carefully produced natural speech of high fidelity, which has distinct moderately-wide prominences 181, 183 and 185 with distinct valleys between them.
- the goodness value is estimated, for instance, by determining when the width of at least one of the peaks in the frequency spectrum, such as P2, is within a predetermined range.
- the width is illustratively defined as the difference of the nearest two frequencies higher and lower than the center frequency of the peak at which the DFT value in decibels is at least a predetermined number of decibels (e.g. 15 dB) below the maximum decibel level of the peak itself.
- the goodness value is set to zero if the width is outside the range.
- the goodness value when the width is in range is a triangular function 191 which peaks at unity for a best width value and illustratively declines linearly on either side of the best value to a value of 0.25 at a width of zero and to a value of zero at an upper limit of the range.
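The triangular goodness function of Fig. 11 can be sketched as follows. The best-width and upper-limit values are hypothetical placeholders; the patent states only that such a range exists.

```python
def goodness(width_hz, best=200.0, upper=600.0):
    """Triangular goodness function (Fig. 11): peaks at 1.0 for the
    best peak width, declines linearly to 0.25 at zero width and to
    0.0 at the upper limit; zero outside the accepted range.
    'best' and 'upper' are assumed example values."""
    if width_hz < 0.0 or width_hz >= upper:
        return 0.0                       # width outside the range
    if width_hz <= best:
        # rising side: 0.25 at zero width up to 1.0 at the best width
        return 0.25 + 0.75 * width_hz / best
    # falling side: 1.0 at the best width down to 0.0 at the upper limit
    return (upper - width_hz) / (upper - best)
```

This keeps the estimate cheap compared with a full cross-correlation against an ideal spectrum, which is the stated motivation for the approximation.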
- the loudness index is estimated from the sum of the decibel levels (or total power) of the lines of a spectrum within the width of at least one (and preferably all) of the prominences or peaks, wherein the width is defined as in the previous paragraph. As illustrated by the graph of Fig. 12, this decibel sum is then compared with a value T indicative of a hearing threshold, and if the sum is less than T, the loudness index L is zero. The decibel sum is compared with a value U indicative of adequate loudness as for everyday conversational speech, and if the sum exceeds U, the loudness index L is 1. Between the levels T and U the decibel sum is converted into loudness index L by the function
- CPU1 in a step 203 finds the maximum value MAX, or highest peak, of the spectrum. This is illustratively accomplished by first setting to zero all spectral values which are less than a predetermined threshold decibel level, so that low sound levels, noise and periods of silence will not have apparent peaks. The nonzero values remaining, if any, are checked to find the highest value among them to find the value MAX.
- a loudness L is computed as discussed above in connection with Fig. 12.
- an appropriate preset value, such as 15 dB or preferably 10 dB, is subtracted from the maximum value MAX to yield a reference level REF.
- the level REF is subtracted from all of the M values in the DFT spectrum and all of the resulting negative values are set to zero to normalize the spectrum so that the reference line is zero dB and spectral values that fall below the reference are set to zero dB.
- the values in the spectrum at this point in operations are called normalized spectral values and are suggested in Fig. 6 by the portions of envelope 127 lying above the dashed horizontal line marked REF.
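Steps 203 through 209 amount to a short normalization routine, sketched below. The 30 dB floor is an assumed threshold; the patent calls it only a predetermined decibel level.

```python
import numpy as np

def normalize(spectrum_db, floor_db=30.0, drop_db=10.0):
    """Steps 203-209: zero out sub-threshold values so silence and
    noise have no apparent peaks, find the highest peak MAX, form
    REF = MAX - drop_db, then subtract REF and set negative results
    to zero so the reference line is 0 dB."""
    s = np.where(spectrum_db < floor_db, 0.0, spectrum_db)
    if not s.any():
        return s                        # all zero: silence, no peaks
    ref = s.max() - drop_db             # reference level REF
    return np.clip(s - ref, 0.0, None)  # normalized spectral values

norm = normalize(np.array([20.0, 50.0, 45.0, 10.0]))
```

In the example only the two values within 10 dB of the 50 dB maximum survive, matching the portions of envelope 127 above the REF line in Fig. 6.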
- In a step 211 following step 209 the fundamental frequency is found by a pitch-extraction algorithm such as that of Scheffers, M.T.M. (1983), "Simulation of auditory analysis of pitch: An elaboration of the DWS pitch meter," J. Acoust. Soc. Am. 74, 1716-1725 (see Fig. 6), and is stored as a spectral frequency SF0, or pitch.
- the spectrum is analyzed in each of three frequency bands B1, B2 and B3, if the spectrum is a glottal-source spectrum, as suggested beneath Fig. 8; and otherwise analyzed in two frequency bands B2 and B3 with different numerical limits, as suggested beneath Fig. 9.
- These frequency bands are used as a way of discriminating the P1, P2 and P3 peaks and the frequency values selected to define each band are adjusted for best results with a variety of speaking voices.
- CPU1 determines whether there are any positive normalized spectral values lying in the band B1, which is defined as 0 ≤ log10(f/SR) ≤ 0.80, where SR is the spectral reference and f is frequency in Hertz. If there are no such positive normalized spectral values, it is concluded that the spectrum is a burst-friction spectrum (although this may also be a period of silence) and a branch is made to a step 215 where quantity BF is set to 1 in the auditory state code and the spectral lower and higher frequency values SF1L and SF1H are both set equal to SR.
- the burst-friction loudness index LIBF is set equal to the loudness L computed in step 205. (During silence the loudness is zero, and there is no harm in subsequent operations in having BF equal 1.)
- the frequency band B2 is established as 0.6 ≤ log10(f/SR) ≤ 1.45, and frequency band B3 is established as 1.0 ≤ log10(f/SR) ≤ 1.65.
- If in step 213 there is any positive normalized spectral value in band B1, then operations proceed to a step 217 in which CPU1 scans the normalized spectral values in order of increasing address values corresponding to frequency multiplier K until the first normalized spectral value is found which is succeeded by a lower normalized spectral value at the next higher value of K. That first normalized spectral value is regarded as the lowest-frequency peak in frequency band B1, and the spectral frequency values SF1 and SF1L are set equal to the K value representing the frequency of this peak.
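The step-217 scan for the lowest-frequency peak can be sketched as a simple loop. Restriction of the scan to band B1 is omitted here for brevity, and the example values are hypothetical.

```python
def first_peak_index(values):
    """Step 217: scan normalized spectral values over increasing K and
    return the index of the first positive value that is succeeded by
    a lower value at the next higher K; that K gives SF1 and SF1L.
    Returns None when no such peak exists (e.g. silence)."""
    for k in range(len(values) - 1):
        if values[k] > 0.0 and values[k + 1] < values[k]:
            return k
    return None
```

Note that a flat-topped peak is assigned the index of its last sample, since only a strictly lower successor ends the scan.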
- In step 217 the spectrum is checked for the presence of a "voice bar", which is a condition of the lowest frequency peak being extremely prominent (e.g., 30 dB or more above any higher frequency peaks).
- a voice bar or "murmur" is a periodic sound that occurs with the oral vocal tract stopped and usually with the nasal tract stopped. In a voice bar the vocal folds are vibrated by moving air which either cannot escape and puffs the cheeks or which does escape through the nose. It is observed that voice bars that are associated with the voiced stop consonants b, d, and g, for example, have the characteristic prominence.
- In step 217 the spectrum is accordingly analyzed to detect any peak 30 dB or more above any higher frequency peaks. (An alternative test is to detect a tilt value in step 123 of Fig. 4 which is in excess of a preset value.)
- processor unit CPU1 acts as an example of a means for detecting an occurrence of a lowest frequency peak which lies in a defined frequency band indicative of a first formant peak wherein the peak substantially exceeds any other peak in its spectrum in intensity by a predetermined amount, and upon detection of such an occurrence storing a lower frequency value than that of the detected peak as if it were the actual frequency value of the peak in the spectrum.
- CPU1 subsequently electrically derives from the lower frequency value and from the sets of digital values over time a series of coordinate values representing positions of points on a path in a mathematical space which path positions represent the speech.
- the glottal-source quantity GS is set to one in the auditory state code.
- the glottal-source loudness index LIGS is set equal to the loudness L computed in step 205.
- the frequency band B2 is established as 0.6 ≤ log10(f/SR) ≤ 1.18, and frequency band B3 is established as 1.0 ≤ log10(f/SR) ≤ 1.40.
- a decision step 219 determines whether there is a second peak at a higher frequency than SF1L in frequency band B1. If so, operations branch to a step 221 where nasality NS is set to one in the auditory state code, and proceed to a step 223 where the frequency of the second peak is determined and stored at a location SF1H.
- the value SF1 is set equal in a step 224 to the geometric mean of SF1L and SF1H in Hertz, i.e. to the square root of the product of SF1L and SF1H, which is equivalent to the arithmetic average of their values of log-frequency.
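The step-224 geometric mean, and its equivalence to averaging log-frequencies, can be shown in a few lines; the split first-peak edge values are hypothetical.

```python
import math

# Step 224: SF1 is the geometric mean of the split first-peak edges.
sf1l, sf1h = 400.0, 900.0            # hypothetical nasalized P1 edges
sf1 = math.sqrt(sf1l * sf1h)         # square root of the product

# Equivalently, the arithmetic average of the log-frequencies:
log_sf1 = 0.5 * (math.log10(sf1l) + math.log10(sf1h))
```

For these example edges the geometric mean is 600 Hz, which lies between the two edges on a logarithmic frequency axis.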
- If in decision step 219 no second peak is found in band B1, operations proceed to another decision step 225 where the width of the peak is compared with a predetermined width W1 (such as 300 Hz at 10 dB down) to determine whether the peak is wider than a typical GS peak would be without nasality. If the predetermined width is exceeded, a branch is made to a step 227 where nasality NS is set to one. Also in step 227 the edges of the nasally broadened P1 peak are defined by setting the lower frequency SF1L to SF0 and the higher frequency SF1H to the frequency at the upper edge of the P1 peak where a normalized spectral value again is zero, whence step 224 is reached. If the predetermined width W1 is not exceeded in step 225, however, operations proceed to a step 229 where the value SF1H is set equal to SF1L because there is only the P1 peak and no nasality.
- Operations of CPU1 proceed from any of the steps 215, 224 or 229 in Fig. 13A through a point X to a decision step 231 of Fig. 13B.
- In step 231 CPU1 tests the normalized spectral values to determine whether there is a peak P2 in band B2 above the peak having value SF1H.
- Band B2 is already established to correspond with the BF or GS nature of the spectrum. The testing begins above value SF1H if SF1H lies in band B2, to avoid confusing the peak sought with a peak found earlier. If a peak P2 exists, then operations proceed to a step 233 where second spectral frequency value SF2 is set to the frequency K value of the first peak above frequency SF1H in band B2, and a decision step 237 is reached.
- If there is no peak found in step 231, operations branch from step 231 to a decision step 238 where the value of SF1H is tested to determine whether it is in the band B2. If not, operations branch to a step 239 where the value SF2 is set equal to SF1H and SF1H is not affected, whence operations reach step 237. If in decision step 238 the value of SF1H is in band B2, then operations proceed to a step 240 where the value SF2 is set equal to SF1H. Also, in step 240 SF1H is set equal to value SF1L and the nasality NS is reset to zero because nasality is not regarded as being present after all. Operations then pass from step 240 to step 237.
- means are provided for deriving a set of digital values representative of a frequency spectrum of the speech from the samples in digital form, for selectively storing in distinct locations in the memory the values of frequency of one or more frequency peaks in the spectrum wherein a selected one or more of the distinct memory locations in which the frequency value of a given peak is stored depends on whether the peak lies in a first predetermined band of frequencies and on whether or not any other peak lies both in the first band and a second band overlapping the first band, and for generating a set of digital values corresponding to coordinate values in a mathematical space depending both on the stored values of frequency and on the distinct locations of the stored values of frequency.
- means are thus provided for selecting values of end frequencies for both the second band and a third band overlapping the second band, the selected values depending on whether or not a peak exists in the first predetermined band of frequencies.
- means are in this way provided for selecting values of end frequencies for both the second band and a third higher band overlapping the second band and for determining whether or not one of the peaks is the only peak in the third band and lies in both the second and third bands, and if so, storing in one of the distinct locations another frequency value corresponding to an upper frequency edge of the one peak.
- means are thus provided for determining whether or not one of the peaks lies in a third band which is generally higher in frequency than the second band and overlaps it, and if none of the peaks lies in the third band, storing another frequency value in one of the distinct locations, the other frequency value lying in the third band and being a function of a reference frequency value determined from at least two of the spectra.
- means are thus provided for storing as a lower first frequency the value of frequency of any lowest frequency peak in the first predetermined band of frequencies and as a higher first frequency the value of frequency of any next higher frequency peak in the first band, and for storing as a second frequency the value of frequency of any peak in the second band higher in frequency than the higher first frequency if the higher first frequency is also in the second band, and if there is no peak in the second band higher in frequency than the higher first frequency when it is in the second band then storing as the second frequency the value of frequency originally stored as the higher first frequency and storing as the higher first frequency the value of frequency stored as the lower first frequency also. Also provided thus is means for identifying lower and higher first frequencies descriptive of a peak which is widened or split upon at least one occurrence of nasality and for producing a signal indicative of the occurrence of nasality.
- In step 237 CPU1 tests the normalized spectral values over increasing frequency K values to determine whether there is a peak P3 above any peak having value SF2 in band B3.
- Band B3 is already established to correspond with the BF or GS nature of the spectrum. The testing begins above value SF2 if SF2 lies in band B3, to avoid confusing the peak sought with any peak P2 found earlier. If a peak P3 is found, then operations proceed to a step 241 where third spectral frequency value SF3 is set to the frequency K value of the first peak above frequency SF2 in band B3.
- the speech goodness from step 235 is calculated based on a weighted average of the width of both peaks P2 and P3 using the function of Fig. 11 in the manner described hereinabove, and a calculation step 245 for SR is reached.
- If there is no P3 peak found in step 237, operations branch to a step 247 where spectral frequency SF2 is tested to determine if it is in band B3. If so, operations proceed to a step 249 where SF3 is set at the upper edge of the spectral envelope, whence step 243 is reached. If SF2 is not in band B3, operations branch to a step 251 where value SF3 is set to a value equal to reference SR multiplied by ten to the 1.18 power, whence step 243 is reached.
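The steps 247-251 fallback for SF3 can be sketched as below, using the glottal-source band B3 limits from the earlier band definitions; returning None to signal "use the envelope's upper edge" is a convention of this sketch.

```python
import math

def sf3_fallback(sf2, sr, band3=(1.0, 1.40)):
    """Steps 247-251 when no P3 peak is found: if SF2 already lies in
    band B3 (1.0 <= log10(f/SR) <= 1.40 for GS spectra), the caller
    takes SF3 from the upper edge of the spectral envelope (signalled
    here by None); otherwise SF3 = SR * 10**1.18."""
    lo, hi = band3
    if lo <= math.log10(sf2 / sr) <= hi:
        return None                    # step 249: envelope's upper edge
    return sr * 10.0 ** 1.18           # step 251: fixed log distance over SR
```

The 10**1.18 factor places the substitute SF3 a fixed logarithmic distance above the reference SR, consistent with the handling of an absent P3 described earlier.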
- In step 245 the spectral reference value SR is illustratively set equal to the fundamental frequency SF0 determined in step 211 if the spectrum is a GS spectrum and SF0 is greater than zero.
- CPU1 automatically computes spectral reference value SR (step 245 of Fig. 13B).
- the value SR is so defined that it is influenced by the geometric mean of SF0 across the adult population (approximately 168 Hertz), by the geometric mean of the pitch of the current talker, and by modulations in the pitch of the current talker, filtered so as to eliminate slow pitch changes such as those associated with pitch declination and to eliminate the very rapid transients at voice onset and offset.
- K1 is a constant of about 168
- GMTFO is the geometric mean of the current talker's pitch
- a is a constant equal to about 1/3
- FIL(SF0 i ) is the instantaneous value of the filtered modulations in the talker's SF0 for GS spectra.
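The SR formula itself is not reproduced in this text. From the constants listed, one plausible reading, which is an assumption of this sketch and not the patent's equation, is a geometric blend of the population constant K1 and the talker's geometric mean pitch GMTFO, with the filtered modulation term added afterwards as in step 321.

```python
def sensory_reference(gmtf0, fil=0.0, k1=168.0, a=1.0 / 3.0):
    """Hypothetical reconstruction of the SR recalculation: blend the
    population geometric mean K1 (about 168 Hz) with the talker's
    geometric mean pitch GMTFO using exponent a (about 1/3), then add
    the filtered pitch-modulation value FIL (step 321)."""
    return k1 * (gmtf0 / k1) ** a + fil
```

Under this reading a talker whose mean pitch equals the population mean gets SR = K1 exactly, and higher-pitched talkers get a compressed, rather than proportional, increase in SR.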
- In Fig. 14 operation commences with BEGIN 301 and proceeds to a decision step 309 in which the spectrum is tested to determine whether it includes a periodic component.
- This test is performed according to any appropriate procedure, such as the spectral analysis disclosed in L.J. Siegel et al., "Voiced/unvoiced/mixed excitation classification of speech," IEEE Trans. Acoust. Speech Signal Processing, 1982, ASSP-30, pp. 451-460. If there is no periodic component, then operations proceed to a RETURN 311 directly from step 309. If GS is 1, then in a step 315 a recalculation of the value of SR commences according to the formulas
- In step 319 the software bandpass filter for pitch modulation is illustratively implemented by maintaining a table of the values SF0 of periodic spectra of glottal-source type. This table is analyzed for any discernible pitch modulation in the frequency range between 1.5 Hz and 50 Hz. Then a value FIL, which is originally initialized to zero, is updated with the size of the pitch modulation determined from the output of the pitch modulation software filter. Each pass through the operations of Fig. 4 accesses step 245, so the table has an entry added regularly when a glottal-source speech sound is in progress.
- the value of SR is increased in a step 321 by the value of FIL, whence RETURN 311 is reached.
- CPU1 constitutes means for computing at least one of the values in the sets of first-named coordinate values (e.g. sensory pointer values) as a function of a reference frequency value which is a function of frequency values (e.g. values of SF0) determined from at least two of the spectra.
- CPU1 also constitutes means for computing at least one of the values in the sets of first-named coordinate values as a function of a reference frequency value which is a function of a geometric mean of frequency values determined from at least some glottal-source spectra over time.
- CPU1 additionally constitutes means for computing at least one of the values in the sets of first-named coordinate values as a function of a reference frequency which is a function of A) a frequency of pitch modulation of the speech and B) a mean of frequency values determined from at least some of the spectra of the speech over time.
- One or more processors are needed to accomplish the operations described for CPU1.
- the block of Fig. 1 marked CPU1 represents a single processor.
- In other embodiments, processors are used in a multiprocessing arrangement to compute several spectra at the same time and then to analyze the spectra so obtained in order to accomplish real time analysis of the speech waveform.
- For instance, P microprocessors are multiplexed to line 23 from ADC 21 of Fig. 1 so that they take turns inputting the latest set of N1 samples in overlapping manner.
- each microprocessor need only input and compute the spectrum of every Pth set of N1 samples. Then the spectra can be supplied to one or more additional processors to analyze and output the auditory state code and the sensory pointer values X s , Y s and Z s .
- In Fig. 15 the flow of operations in CPU2 for converting from sensory to perceptual coordinates is detailed.
- a vector difference equation, or set of three difference equations for the respective coordinates, is solved by CPU2 point by point by executing a loop continually.
- the difference equations are the numerical versions of three differential equations discussed hereinbelow.
- The sensory-perceptual transformation, or transformation from sensory coordinates to perceptual coordinates, operates as an integrative-predictive function.
- the fundamental concept of the sensory-perceptual transformation is that sensory pointers GSSP and BFSP as illustrated in Fig. 16 attract a perceptual pointer PP in the three dimensional mathematical space, or auditory-perceptual space having a coordinate system defined by three mutually perpendicular axes X, Y and Z, and induce the perceptual pointer to move through the auditory-perceptual space and trace out a perceptual path.
- Perceptual pointer PP has coordinate values X p , Y p and Z p .
- the perceptual pointer PP almost instantaneously, that is within a few milliseconds, takes on the summed loudnesses of the sensory pointers GSSP and BFSP. However, when the sensory pointers disappear, the loudness of the perceptual pointer decays slowly over a period of 100 to 200 milliseconds. In this way the perceptual response is maintained during brief silences in the acoustic input.
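The fast-attack, slow-decay loudness behavior can be sketched as a first-order update. The specific time constants (a few milliseconds for attack, about 150 ms for decay) are illustrative values within the ranges the text gives.

```python
def update_perceptual_loudness(lp, ls, dt_ms, attack_ms=3.0, decay_ms=150.0):
    """One time step of perceptual loudness: the perceptual pointer
    takes on the summed sensory loudness ls within a few milliseconds,
    but decays slowly (~100-200 ms) when the sensory pointers vanish,
    so the percept persists through brief silences."""
    tau = attack_ms if ls > lp else decay_ms
    return lp + (ls - lp) * min(1.0, dt_ms / tau)

lp = update_perceptual_loudness(0.0, 1.0, dt_ms=5.0)   # onset: fast attack
lp2 = update_perceptual_loudness(lp, 0.0, dt_ms=5.0)   # brief silence: slow decay
```

After a single 5 ms step the loudness has fully reached the sensory value, while 5 ms of silence removes only a small fraction of it.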
- the perceptual pointer, like the sensory pointer, is regarded as having at any moment an auditory state, for which a perceptual auditory state code is computed.
- fixed pointers called neutral points NPGS and NPBF affect the motion of the perceptual pointer PP in the absence of the sensory pointers.
- At least one neutral point advantageously provides a home location for the perceptual pointer when a lengthy period of silence occurs. During such a period of silence, an attractive force from the neutral point NPGS causes the perceptual pointer PP to migrate toward it. Moreover, the use of at least one neutral point also remarkably allows the system to interpret even periods of silence in phonetically relevant ways in a manner similar to human speech perception. (For instance, many listeners hear "split" when a talker says “s” followed by brief silence followed by "lit.")
- the neutral point NPGS attracts the perceptual pointer immediately upon GS changing from one to zero in the auditory state code if BF is already zero.
- the attraction by NPGS lasts as long as the period of silence does, and the neutral point NPBF does not attract pointer PP at all.
- the neutral point NPBF attracts the perceptual pointer immediately upon BF changing from one to zero.
- the attraction by NPBF lasts about 120 milliseconds and is replaced upon the expiration of the 120 milliseconds by an attraction from the neutral point NPGS which lasts for the remainder of the period of silence until either GS or BF become one again.
- the sensory pointers GSSP and BFSP are conceived as being attached by springs to the perceptual pointer PP which is regarded as having mass and inertia.
- the stiffness of a spring depends on the goodness value and the loudness value of its associated sensory pointer. In this way, near-threshold spectra with little resemblance to speech have almost no influence on the perceptual response while moderately loud speech-like spectra have a strong influence on the perceptual response.
- the analogy to a spring is used because the attractive force of a sensory pointer or neutral point increases with the distance from the perceptual pointer PP.
- the position of any sensory pointer or neutral point is not influenced by the spring, and all of the force acts on the perceptual pointer PP.
- the auditory-perceptual space is regarded as being a viscous medium and the perceptual pointer encounters resistance which not only varies with velocity but varies with the location of the perceptual pointer in a remarkable way in some embodiments.
- the particular mathematical model of the sensory-perceptual transformation is illustrative and can be modified in the practice of the invention by the skilled worker as additional experimental information about the process of auditory perception is obtained.
- the foregoing concepts are expressed in mathematical form by the difference equations which are solved by CPU2 to accomplish the sensory-perceptual transformation.
- the difference equations are expressed in terms of variables which are coordinate values exponentiated. Since the sensory pointers of Fig. 16 have coordinates which are expressed in terms of logarithmic functions of frequency ratios in step 143 of Fig. 4, the mathematical space of Fig. 16 is called a "log space" herein. Because the coordinates are exponentiated in a first set of the difference equations, only the frequency ratios remain and the expression "ratio space" is adopted herein to refer to the domain in which the difference equations are expressed. It is contemplated that in some embodiments, no logarithms are calculated in step 143 of Fig. 4, to avoid subsequently exponentiating in CPU2 to recover the ratios themselves. Subsequent analysis by CPU3 occurs in log space, however. (In still other embodiments, as discussed later herein, the difference equations themselves are expressed in log space.)
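- The relationship between the log-space and ratio-space coordinates can be sketched as follows (an illustrative Python fragment, not part of the patent disclosure; the function names are hypothetical):

```python
import math

def to_ratio_space(log_coords):
    # Exponentiate base-10 log-space coordinates so that only the
    # frequency ratios themselves remain (the "ratio space" values).
    return [10.0 ** c for c in log_coords]

def to_log_space(ratio_coords):
    # Common logarithms recover the log-space coordinates from ratios.
    return [math.log10(c) for c in ratio_coords]
```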
- ZRSGS = 10^ZsGS, where ZsGS = log(SF2/SF1), so that exponentiating the log-space coordinate ZsGS recovers the frequency ratio itself.
- NPBF Burst-Friction Neutral Point
- GSSP Glottal-Source Sensory Pointer
- CPU1 and CPU2 together electrically derive a series of coordinate values of points on a path in the mathematical space from frequency spectra of the speech occurring in successive time intervals respectively.
- In Fig. 15 the operations of CPU2 commence with a START 401 and proceed to a step 403 to initialize a table 405 of Fig. 15A with two triplets of initial values XRP0, YRP0, ZRP0 and XRP1, YRP1, ZRP1 for the set of coordinates XRP, YRP, ZRP in ratio space.
- row zero (suffix zero on the variables) is regarded as earliest in time, row one as next in time, and row 2 as latest in time and to be solved for.
- the initial position coordinates are in row zero and are 10 raised to the power of the respective coordinates XNGS, YNGS, ZNGS of the neutral point NPGS.
- the initial velocity is assumed to be zero in both ratio space and log space so all the entries in row one are the same as in row zero, because there is no change in position initially.
- CPU2 reads the sensory pointer values Xs, Ys and Zs for either the BF sensory pointer or the GS sensory pointer or both, and the auditory state code values BF, GS, LIBF, LIGS, GBF, GGS and NS from CPU1. Then a computation step 413 occurs in which the difference equations involving the sensory pointer values in ratio space are solved to obtain the next in a series of coordinate values Xp, Yp and Zp on a path in the mathematical space.
- the difference equations are solved for the entries for row 2 of table 405, and subsequently the logs of the entries in row 2 are computed in order to obtain perceptual pointer coordinates X p , Y p and Z p in log space.
- the perceptual pointer coordinates X p , Y p and Z p are regarded as tracing out a path in the mathematical log space of Fig. 16 which path has a perceptual significance.
- the difference equations solved in step 413 are now described.
- CPU2 utilizes values of the coordinates XRP, YRP and ZRP from the two next-previous time intervals represented by rows zero and one of table 405, as well as quantities from the auditory state code and the sensory pointer coordinates in ratio space.
- Row two (2) of the table of Fig. 15A represents the unknown latest coordinate values on the path of the perceptual pointer in the ratio space which are to be obtained by solving the difference equations.
- Row one (1) of table 405 in general represents the next-previous coordinate values of the perceptual pointer which were found in the next previous pass through the computation loop of Fig. 15 by CPU2.
- Row zero (0) of the table generally represents the second-next-previous coordinate values of the perceptual pointer which were found in the second-next-previous pass through the computation loop of Fig. 15 by CPU2.
- the derivative of XRP is approximated by Equation (7): dXRP/dt ≈ H(XRP2 − XRP1)
- H is the reciprocal of the time interval between spectra, e.g. 1/(1 millisecond) or 1000 Hertz.
- XRP2 is the latest X-coordinate value in ratio space to be solved for, and XRP1 is the next previous such X-coordinate value.
- In Equation (8) the second derivative of XRP is approximated as d²XRP/dt² ≈ H²(XRP2 − 2XRP1 + XRP0). The quantity H is the same as in Equation (7).
- XRP2 (table 405, row 2, column XRP) is the latest X coordinate value to be solved for and XRP1 is the next previous X coordinate value (table 405, row 1, column XRP).
- XRP0 is the second-next-previous X coordinate value (row 0, column XRP).
- the factor H² occurs in Equation (8) because the second derivative is the derivative of the first derivative.
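- The finite-difference approximations of Equations (7) and (8) can be sketched as follows (an illustrative Python fragment; the function names are hypothetical, and H is the 1000-Hertz spectrum rate noted above):

```python
H = 1000.0  # reciprocal of the 1-millisecond interval between spectra, in hertz

def first_difference(x2, x1):
    # Equation (7): dXRP/dt is approximated by H times the difference
    # between the latest value and the next-previous value.
    return H * (x2 - x1)

def second_difference(x2, x1, x0):
    # Equation (8): the second derivative is approximated by H-squared
    # times (XRP2 - 2*XRP1 + XRP0).
    return H * H * (x2 - 2.0 * x1 + x0)
```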
- Equation (9A): 0 = H²(XRP2 − 2XRP1 + XRP0) + …
- Equation (9B): 0 = H²(YRP2 − 2YRP1 + YRP0) + rH(YRP2 − YRP1)/B^ABS(YRP2 − YRNGS) + …
- CPU2 is programmed to perform an iterative or other suitable computation method to solve each of the three equations 9A, 9B and 9C for the latest coordinate values XRP2, YRP2 and ZRP2 of the perceptual pointer PP in the mathematical space.
- the absolute value function is represented by ABS.
- Coordinate values XRP1, YRP1, ZRP1 and XRP0, YRP0 , ZRP0 are previously calculated from the equations 9A, 9B and 9C and are available in the table 405 of Fig. 15A. Values of constants are illustratively set forth as follows:
- the viscous drag term is typified by the term rH(YRP2 − YRP1)/B^ABS(YRP2 − YRNGS) in Equation 9B, which amounts to velocity times r/B^ABS(YRP2 − YRNGS).
- B is a base for the exponentiation, and the viscous drag factor is about equal to constant r near the neutral point NPGS (which has a Y coordinate of YRNGS in ratio space) because the exponent for B is about zero.
- the denominator is B^ABS(1 − 10^YNGS), or very roughly B².
- the argument for the exponent is the distance (or sum of squares) along a straight line in either ratio space or log space connecting NPGS with the latest position of the perceptual pointer.
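- The solution of one such equation in step 413 can be sketched in scalar form as follows. This is a Python sketch only: the single spring term with stiffness k toward a sensory coordinate ys, the bisection method, and all parameter values are assumptions; only the second-difference and viscous-drag terms follow the forms given above:

```python
def solve_latest(y1, y0, ys, yn, H=1000.0, r=500.0, B=10.0, k=1.0e4):
    """Solve 0 = H^2(y2 - 2*y1 + y0) + r*H*(y2 - y1)/B**abs(y2 - yn)
    + k*(y2 - ys) for the latest coordinate y2.

    y1, y0: next-previous and second-next-previous coordinate values;
    ys: sensory pointer coordinate; yn: neutral point coordinate.
    The spring term and all parameter values are hypothetical."""
    def f(y2):
        accel = H * H * (y2 - 2.0 * y1 + y0)
        drag = r * H * (y2 - y1) / B ** abs(y2 - yn)
        spring = k * (y2 - ys)
        return accel + drag + spring

    # f increases with y2 (the H-squared term dominates), so bisect.
    lo, hi = y1 - 1.0, y1 + 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

With the drag and spring constants set to zero, the solution reduces to the pure second-difference result y2 = 2*y1 − y0, which is a convenient sanity check.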
- the variables LIGS, GGS, GS, LIBF, GBF, and BF are in the auditory state code supplied by CPU1. These variables activate or deactivate (state switch) appropriate terms in Equations 9A, 9B and 9C depending on which sensory pointer(s) or neutral point is exerting an attraction on perceptual pointer PP. Then since the burst-friction flag BF and glottal-source flag GS are each either 0 or 1 and the loudness and goodness are zero during silence, the appropriate terms of the equations 9A, 9B and 9C figure in the solution computations or are advantageously neglected as circumstances require. A neutral flag NF is included in the neutral point terms (the last two terms in each of the difference equations).
- Neutral flag NF is controlled by a timer in CPU2 which monitors the states of BF and GS in the auditory state code. If either BF or GS is 1, flag NF is 0. If BF is zero and GS makes a transition from 1 to zero, flag NF becomes 1 until either GS or BF becomes 1. If BF is 1 and GS is 0, and then BF makes a transition from 1 to zero as detected by step 407, then a 120 millisecond timer in CPU2 is activated to keep flag NF zero until the 120 milliseconds expires, whence flag NF is set to 1.
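- The neutral-flag logic just described can be sketched as follows (a Python sketch updated once per 1-millisecond frame; the class and method names are hypothetical):

```python
class NeutralFlagTimer:
    """Sketch of the NF flag rules: NF clears whenever BF or GS is 1;
    NF sets immediately when GS falls to zero with BF already zero;
    when BF falls to zero, NF is held at zero for delay_ms first."""

    def __init__(self, delay_ms=120):
        self.delay_ms = delay_ms
        self.nf = 0
        self.timer = None   # remaining milliseconds of the BF-release delay
        self.prev_bf = 0
        self.prev_gs = 0

    def update(self, bf, gs):
        if bf == 1 or gs == 1:
            # any burst-friction or glottal-source activity clears NF
            self.nf = 0
            self.timer = None
        else:
            if self.prev_gs == 1:
                # GS fell from 1 to 0 with BF zero: NF set immediately
                self.nf = 1
                self.timer = None
            elif self.prev_bf == 1:
                # BF fell from 1 to 0: start the delay before NF is set
                self.timer = self.delay_ms
            if self.timer is not None:
                self.timer -= 1
                if self.timer <= 0:
                    self.timer = None
                    self.nf = 1
        self.prev_bf, self.prev_gs = bf, gs
        return self.nf
```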
- the last term (for neutral point NPBF) in each difference equation is activated for 120 milliseconds and then is replaced by the second-to-last term (for neutral point NPGS) in each difference equation.
- Each term for a sensory pointer or neutral point is regarded as providing a contribution to the position of the perceptual pointer PP.
- means are provided for deriving sets of digital values representative of frequency spectra of the speech from the samples in digital form, for generating one of a plurality of auditory state codes for each of the sets of digital values and supplying at least two sets of coordinate values in a mathematical space, and for computing a series of other coordinate values of points defining a path with selected contributions from one or more of the sets of first-named coordinate values depending on which auditory state code is generated.
- CPU1 is also advantageously programmed to perform operations to compute different loudnesses and goodnesses specific to the glottal-source and burst-friction portions of the same spectrum of a voiced fricative or other speech sound, which values LIBF, LIGS, GGS and GBF are transmitted from CPU1 to CPU2, and two sets of sensory pointer values XsGS, YsGS, ZsGS, XsBF, YsBF and ZsBF are sent for the glottal-source pointer GSSP and the burst-friction pointer BFSP, instead of one triplet Xs, Ys and Zs.
- means are provided for producing a first of the two sets of first-named coordinate values from one of the sets of digital values representing spectra when the auditory state code indicates a glottal-source auditory state and for also producing the second of the two sets of first-named coordinate values from the same one set of digital values when the auditory state code simultaneously indicates a burst-friction auditory state.
- the use of at least one neutral point as well as at least one sensory pointer in CPU2 provides means for producing a first of two sets of first-named coordinate values from the sets of digital values representing spectra and wherein the second set (e.g. neutral point values) of the first-named coordinate values is independent of the sets of digital values representing spectra.
- In Equations 9A, 9B and 9C the value A is an exponent, illustratively 0, indicating that a neutral point attracts the perceptual pointer PP with a force that does not vary with distance.
- the value of A is made positive if experimental observations suggest that the force should increase with distance, or A is made negative if the force should decrease with distance. It was earlier believed that the best value of A was zero but more recent work indicates that the neutral points should act with spring-like forces for which A is unity (one).
- the equations 9A, 9B and 9C are collectively regarded as expressing one vector difference equation for the vector position of the perceptual pointer PP.
- all sensory inputs to microphone 11 of Fig. 1, including bursts, transitions, steady-states, and silences, are integrated into a single perceptual response by the sensory-perceptual transformation.
- the perceptual pointer PP position depends not only on the position of the sensory pointers but also their dynamics.
- a sensory pointer may rapidly approach and veer away from a target location, and yet it induces the perceptual pointer to overshoot and reach that desired location in the mathematical space.
- Operations by CPU2 in solving the difference equations are advantageously arranged to be analogous to such overshooting behavior, particularly in the cases of stop consonants and very rapid speech.
- In step 415 of Fig. 15 the latest values XRP2, YRP2, ZRP2 resulting from solution of Equations 9A, 9B and 9C are stored in row 2 of table 405 of Fig. 15A. Then in a step 417 common logarithms of these latest values are sent as Xp, Yp, Zp to CPU3. Operations proceed to a decision step 419 to determine if CPU2 is to remain ON. If ON, then a loop is made back to step 407, where a new set of sensory pointer coordinates and auditory state code information is received.
- Table 405 is maintained in a cyclic manner to prepare for the next pass through the computation step 413, so that in table 405 the values in row 2 become the next-previous values and the values in row 1 become the second-next-previous values for purposes of XRP1, YRP1, ZRP1 and XRP0, YRP0, ZRP0 respectively. Equations 9A, 9B and 9C are solved again in step 413 and operations continue in the loop of Fig. 15 until CPU2 is not ON at decision step 419, whence operations terminate at an END 421.
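- The cyclic maintenance of table 405 can be sketched as follows (a Python sketch; the function name is hypothetical):

```python
def cycle_rows(table):
    # After row 2 has been solved and its logs emitted, shift the rows:
    # row 2 becomes the next-previous row 1 and row 1 becomes the
    # second-next-previous row 0. Row 2 is then overwritten by the next
    # solution of the difference equations on the following pass.
    table[0] = table[1]
    table[1] = table[2]
    return table
```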
- the difference equations are solved in log space and are given as:
- table 405 has columns for Xp, Yp and Zp, and the initial position coordinates in rows zero and one are the respective coordinates XNGS, YNGS, ZNGS of the glottal-source neutral point NPGS.
- Auditory-perceptual events or perceived sounds occur when the behavior of the perceptual pointer PP meets certain criteria. For example: (a) an auditory-perceptual event occurs when the perceptual pointer undergoes a period of low velocity; (b) an auditory-perceptual event occurs when the perceptual pointer undergoes sharp deceleration; and (c) an auditory-perceptual event occurs when the path of the perceptual pointer has high curvature.
- CPU3 is appropriately programmed to determine such events.
- the computations can involve any one or more of the criteria, or other criteria such as segmentation index SI discussed later hereinbelow, and time constraints can be added such that a velocity must be maintained for a predetermined number of milliseconds, or that a path or a certain locus and curvature have to be traversed within certain time limits.
- the auditory-perceptual event is regarded as associated with a position along the path in log space of a peak in magnitude of acceleration (determined now in log space and not ratio space in the preferred embodiment) of the perceptual pointer PP.
- the position of the perceptual pointer PP in log space is a vector defined by the coordinate values Xp, Yp and Zp.
- Its velocity is a vector quantity equal to speed in a particular direction relative to the X, Y, Z frame of reference.
- the velocity has the components dXp/dt, dYp/dt and dZp/dt, which are the time derivatives of Xp, Yp and Zp.
- Speed is the magnitude, or length, of the velocity vector at any given time and is equal to the square root of the sum of the squares of the velocity components dXp/dt, dYp/dt and dZp/dt.
- the magnitude, or length, of any vector is equal to the square root of the sum of the squares of its components.
- Acceleration is a vector which represents change of velocity or rate of such change, as regards either speed or direction or both.
- the components of acceleration are the time derivatives of the components of the velocity vector respectively.
- the acceleration has components d²Xp/dt², d²Yp/dt² and d²Zp/dt², which are the time derivatives of dXp/dt, dYp/dt and dZp/dt.
- the event is associated with a position along the path of a peak in magnitude of acceleration of the perceptual pointer PP because a period of low velocity results from a deceleration which amounts to a peak in magnitude of acceleration.
- a sharp deceleration is a peak in magnitude of acceleration because deceleration is negative acceleration and a negative sign does not affect the magnitude which involves sums of squares.
- the acceleration is a vector peaking in magnitude and pointing centripetally from the path.
- CPU3 in some of the embodiments acts as at least one or more of the following: A) means for identifying coordinate values approximating at least one position along the path of a peak in magnitude of acceleration, generating a memory address as a function of the position coordinate values and obtaining from said memory means the phonetic representation information prestored at that memory address; B) means for computing a parameter approximating the curvature of the path and, when the parameter exceeds a predetermined magnitude at a point on the path, identifying the coordinate values of that point to approximate the position of a peak in magnitude of acceleration; C) means for computing a speed along the path and identifying the coordinate values of a position where the speed decreases by at least a predetermined amount within a predetermined time, to approximate the position of a peak in magnitude of acceleration; or D) means for computing a speed along the path and identifying the coordinate values of a position where a decrease in speed occurs that is both preceded and succeeded by increases in speed within a predetermined time, to approximate the position of a peak in the magnitude
- Each auditory-perceptual event is said to leave a trace or tick mark that fades in time.
- a cloud of ticks occurs, that is, when a region of high density of ticks surrounded by a region of lower density is formed, as would be the case for an oft-repeated speech sound, it is postulated that in human beings, the nervous system automatically places an envelope around the cloud of tick marks and creates a target zone capable of issuing a neural symbol or a category code. Under most circumstances such target zones are temporary and dissolve with time. Other target zones, such as those for the phones of one's native language and dialect, are formed during infancy and childhood under certain circumstances, such that they are nearly permanent and difficult to modify.
- the large memory 31 for target space storage is a memory means for holding prestored information indicative of different phonetic representations corresponding to respective sets of addresses in the memory.
- CPU1, CPU2, and CPU3 together constitute means for electrically deriving a series of coordinate values of points on a path in a mathematical space from frequency spectra of the speech occurring in successive time intervals respectively, for identifying coordinate values approximating at least one position along the path of a peak in magnitude of acceleration, generating a memory address as a function of the position coordinate values and obtaining from said memory means the phonetic representation information prestored at that memory address.
- the target zones for stop phonemes such as /b/, /d/, /g/, /k/, /p/ and /t/ (Fig. 21) are associated with respective sets of addresses in the memory corresponding to a negative-Y region of the mathematical space which cannot be entered by sensory pointer values X s , Y s and Z s but which can be entered by the coordinate values X p , Y p and Z p because of underdamping in the sensory-perceptual transformation.
- CPU3 finds a peak in the magnitude of acceleration, or otherwise finds a significant value of a trajectory parameter.
- the coordinates on the path at which a latest peak occurs are converted to integer values along each axis X, Y and Z.
- the target zones lie within ranges for X between 0 and 2, Y between -0.5 and 1.5 and Z between 0 and 2.
- each axis is regarded as having 200 divisions which, for example, include 150 divisions along the positive Y axis and 50 divisions along the negative Y axis. In this way, the shape of each target zone is definable with considerable precision. Therefore, the Xp, Yp and Zp values at which the latest peak occurs are multiplied by 100 and rounded to the nearest integer by a function INT.
- a number of memory addresses equal to the cube of 200, or 8 million (8 megabytes at one byte per address), is used. In other words, 23 bits are used to express each memory address in binary form, since 2²³ is about 8 million.
- the coordinates are converted to a memory address by the equation
- CPU3 finds a peak in the magnitude of acceleration by velocity analysis, curvature analysis, acceleration analysis, segmentation index analysis or other trajectory parameter analysis for significance or saliency, it then generates memory address ADR according to the above equation or the equivalent and obtains from the memory 31 the phonetic representation information, target zone identification or glide nucleus identification information prestored at that address.
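- One plausible form of the coordinate-to-address conversion can be sketched as follows (a Python sketch; the term ordering and the placement of the +50 offset for the 50 negative-Y divisions are assumptions, not necessarily the exact Equation (10)):

```python
def make_address(xp, yp, zp):
    # Convert path coordinates to integer divisions: multiply by 100 and
    # round to the nearest integer (the INT function described above).
    ix = int(round(100.0 * xp))
    iy = int(round(100.0 * yp)) + 50   # map negative-Y divisions to >= 0
    iz = int(round(100.0 * zp))
    # Pack the three indices into a single address over 200-division axes.
    return ix + 200 * iy + 200 * 200 * iz
```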
- a binary code representing each phoneme, or phonetic element generally, of a language is stored at each of a set of addresses in the memory. The 8 bits in a byte provide ample flexibility to provide distinct arbitrary binary designations for the different phonemes in a given human language.
- memory 31 supplies the binary code stored at that address.
- CPU3 converts the binary code to a letter or other symbol representation of the phoneme and displays it on the video screen of its terminal and prints it out on printer 33.
- the targets for the nonsustainable speech sounds are placed outside of the octant of positive X, Y, and Z.
- the sensory pointer BFSP can only approach a target zone such as 451 for a sound such as "p" and must do so with appropriate dynamics such that the perceptual pointer actually reaches the target zone in the negative Y region. For example, suppose a talker is just finishing saying the word "Stop." The perceptual pointer has just made a sharp curve while passing through a target zone 453 for the vowel sound in "stop" under the influence of the glottal-source sensory pointer GSSP, now absent, and the suddenly appearing burst-friction sensory pointer BFSP.
- the sensory pointers thus can in some cases only go to approach zones in such a way as to induce the perceptual pointer PP to reach the more distant perceptual target zone. However, the target zones such as 453 for the vowels are able to be entered by both the sensory and perceptual pointers. The perceptual response should reach vowel target zones when starting from neutral point NPGS in about 50 milliseconds.
- Fig. 18 shows the axes X, Y and Z of a coordinate system for the mathematical space.
- additional axes X', Y' and Z' which intersect at a point in the first octant of the X, Y, Z system and are inclined relative to the axes X, Y and Z.
- Fig. 19 is a view of an approximately planar slab 465 in the X', Y', Z' coordinate system (also called SLAB coordinates) which has been found to hold the target zones for the vowels.
- Fig. 19 shows the slab 465 edge as viewed along the X' axis.
- the neutral point NPGS is approximately centered in the vowel slab. Even though the vowel slab is thin, lip-rounding moves the vowel to the back of the slab, while retroflection as in r-coloring moves the position far back toward the origin so that even with the vowels alone, the use of three-dimensions is beneficial.
- consonants fall in or near the vowel slab or in another slab that is orthogonal to the vowel slab, further supporting the use of a three-dimensional space. It is contemplated, however, that in some embodiments of the invention the slabs can be unfolded and unwrapped in such a way that a two-dimensional space can be used. Also, it is contemplated that the slabs be mapped into the memory 31 addresses in such a way that the available memory capacity is efficiently used only for the slabs.
- Fig. 20 is a view of the slab 465 face-on and viewed along the Z' axis in the X', Y', Z' coordinate system. Outlines for the target zones for the vowels are shown, from which outlines sets of addresses are derived for prestoring codes representing each of the vowel symbols in memory 31 of Fig. 1. Ranges in the Z' coordinate for each of these target zones of Fig.
- codes are prestored by manually entering them for each of the addresses corresponding to a point within each of the target zones.
- the codes can be prestored by preparing 3-dimensional position acquisition equipment 467 such as a Perceptor unit from Micro Control Systems, Inc., of Vernon, Connecticut.
- the unit has a teflon coated, precision-ground aluminum reference plate, on which is mounted a precision machined digitizing arm.
- a circuit that performs electrical data acquisition functions is housed beneath the reference plate. Dual RS-232 ports let the unit transmit data.
- the digitizing arm has five preloaded ball bearing supported joints which allow the arm to move. Potentiometers housed in the joints transmit electrical information about the angles of rotation of each segment of the arm.
- a Z-80A microprocessor in the unit computes the x, y, and z coordinates of the position of the arm's pointer tip.
- the shapes of the target zones are recorded relatively rapidly for use in automatically programming the memory 31 of Fig. 1.
- Fig. 21 shows target zones in the mathematical space for voiceless stops as viewed along the Y axis of Fig. 18. The legend for this Figure is found in Table 1.
- the corresponding voiced stops g, d, b (legend in Table 2) occupy negative Y values in a range -0.055 to -0.02.
- Voice bars vbg, vbd, and vbb occupy positive Y values in a range +0.02 to +0.03.
- Figs. 22A and 22B depict target zones in the mathematical space for nasal consonants as viewed along the X' axis in Fig. 22A and the Z' axis in Fig. 22B.
- the legend for these Figures is found in Table 2.
- Fig. 23 depicts target zones in the mathematical space for voiceless fricatives of American English as viewed along the Y axis of Fig. 18.
- the legend for this Figure is found in Table 3.
- Fig. 24 depicts target zones in the mathematical space for voiced fricatives and the phonetic approximates as viewed along the Z' axis of the X', Y', Z' coordinate system of Fig. 18.
- Fig. 25 depicts target zones in the mathematical space for the voiced fricatives and the phonetic approximates of Fig. 24 as viewed along the X' axis of the X', Y', Z' coordinate system of Fig. 18.
- the legend for Figs. 24 and 25 is found in Table 4. These target zones are generally juxtaposed in or near the vowels, so the X', Y', Z' coordinate system is used.
- Figs. 24 and 25 are interpreted in the manner of an orthographic projection to define the three-dimensional shapes of the target zones.
- a superficial comparison of Figs. 20 and 24 might suggest that the target zones for /er/ and /r/ in Fig. 24 conflict with the target zones of some of the vowels of Fig. 20, but this is not the case.
- Fig. 25 makes it clear that /er/ and /r/ fall behind the vowels in the log space.
- /w/ occupies two noncontiguous target zones according to present observations. In general target zones do not overlap.
- a legend for the vowel Figs. 19 and 20 is found in Table 5.
- In Fig. 26 operations of CPU3 of Fig. 1 commence with a START 501 and proceed to a step 503 where the coordinate values Xp, Yp and Zp of the latest point on the path in the mathematical space are input from CPU2 and stored in a table 504 of Fig. 27.
- In step 505 the significant parameters of the trajectory are computed, so that it can subsequently be determined when a significant speech event occurs.
- the coordinate values result from sampling by S/H 17 at equal intervals in time and analyzing the spectra at a repetition rate expressed by the quantity H hereinabove. Therefore, the magnitude of acceleration is computed from the latest coordinate values and the two previous triplets of coordinate values from table 504.
- the subscripts zero (0), one (1) and two (2) are used to indicate the latest triplet, the next previous triplet, and the triplet before that.
- the magnitude of acceleration is illustratively computed according to the equation
- MAGACCEL = H²·SQRT((Xp0 − 2Xp1 + Xp2)² + (Yp0 − 2Yp1 + Yp2)² + (Zp0 − 2Zp1 + Zp2)²)   (12)
- equation (12) defines a magnitude of acceleration that implicitly includes components that can be either normal (perpendicular) or tangential to the instantaneous velocity at any given point on the path. Therefore, equation (12) provides a trajectory parameter that includes curvature but is not limited to it.
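- The computation of MAGACCEL per Equation (12) can be sketched as follows (a Python sketch; subscript 0 denotes the latest triplet, per the convention stated above, and the function name is hypothetical):

```python
import math

H = 1000.0  # spectra arrive at 1-millisecond intervals

def magaccel(p0, p1, p2):
    # Equation (12): p0 is the latest (Xp, Yp, Zp) triplet, p1 the next
    # previous triplet, and p2 the triplet before that.
    return H * H * math.sqrt(sum((a - 2.0 * b + c) ** 2
                                 for a, b, c in zip(p0, p1, p2)))
```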
- a further alternative embodiment computes the curvature only according to the following equations:
- S01 = SQRT((Xp1 − Xp0)² + (Yp1 − Yp0)² + (Zp1 − Zp0)²)
- S12 = SQRT((Xp2 − Xp1)² + (Yp2 − Yp1)² + (Zp2 − Zp1)²)
- S 01 is the length of a distance interval between two successive points on the path in the X, Y, Z coordinate system and S 12 is the length of another distance interval adjacent the first one.
- S AV is the average of S 01 and S 12 .
- CURV1 is an example of one of a number of formulas for curvature; it is a numerical approximation to the magnitude of the second partial derivative of an instantaneous position vector with respect to path length. Another such formula recognizes that the curvature equals the magnitude of the vector cross product of the velocity vector with the acceleration vector, |v x a|, divided by the cube of the magnitude of the velocity vector (the three-halves power of the sum of squares of the components of the velocity vector).
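- The cross-product formula for curvature can be sketched as follows (a Python sketch; the function and variable names are hypothetical):

```python
import math

def curvature_from_v_a(v, a):
    # Curvature = |v x a| / |v|^3, from velocity and acceleration vectors.
    cx = v[1] * a[2] - v[2] * a[1]
    cy = v[2] * a[0] - v[0] * a[2]
    cz = v[0] * a[1] - v[1] * a[0]
    cross_mag = math.sqrt(cx * cx + cy * cy + cz * cz)
    speed = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return cross_mag / speed ** 3
```

For uniform motion around a circle of radius R this yields 1/R, and for straight-line motion it yields zero, as expected of a curvature measure.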
- CURV1 advantageously eliminates any tangential components of acceleration from the curvature calculations in embodiments where doing so provides a more accurate segmentation of the path in X, Y, Z space for speech recognition purposes.
- Another related alternative procedure computes a quantity called curvature index CI according to the formula
- Two line segments of equal length with a common end subtend an interior angle A.
- the cosine of angle A is equal to the inner product of unit vectors in the direction of the two line segments representing two intervals along the path in X,Y,Z space.
- COSA = [(X3 − X2)(X2 − X1) + (Y3 − Y2)(Y2 − Y1) + (Z3 − Z2)(Z2 − Z1)] divided by the product of the lengths of the two line segments.
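- The inner-product computation of COSA can be sketched as follows (a Python sketch; the function name is hypothetical):

```python
import math

def cos_interior_angle(p1, p2, p3):
    # Inner product of unit vectors along the two successive path
    # intervals p1->p2 and p2->p3, giving the cosine of angle A.
    u = [b - a for a, b in zip(p1, p2)]
    w = [c - b for b, c in zip(p2, p3)]
    dot = sum(x * y for x, y in zip(u, w))
    nu = math.sqrt(sum(x * x for x in u))
    nw = math.sqrt(sum(x * x for x in w))
    return dot / (nu * nw)
```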
- Each latest value of the magnitude of acceleration MAGACCEL (or alternatively CURV, CURV1, CI, or other trajectory parameter) is stored during step 505 in table 504, which holds it and four previous values of MAGACCEL (or other parameter). Similar tabular analysis of the curvature CURV or CURV1 or curvature index CI is applied where curvature is used.
- the argument in the square root SQRT function of Equation (12) is sufficient for use as a parameter related to the magnitude of acceleration also. It is emphasized that there are many ways of calculating significant trajectory parameters to accomplish an equivalent analysis of the path of the perceptual pointer in the mathematical space.
- a still further alternative method of determining significant trajectory parameters involves a segmentation index SI as discussed next and in connection with Figs. 48 and 49.
- SI segmentation index
- SI(i) = W1*SIX(i) + W2*SIY(i) + W3*SIZ(i)   (21)
- SIX(i), SIY(i), and SIZ(i) are the maximum variations of the X', Y' and Z' coordinate values in the 80-msec region centered at the ith frame.
- W1 and W2 are constants set to one.
- W3 is a function of Z' (Z-prime) and varies inversely with Z' when Z' is below 0.6 (i.e., left of vowel slab in Fig. 19).
- a suitable threshold value for significance for the segmentation index when used as a trajectory parameter is that SI exceed 0.10 to be significant.
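- The segmentation index of Equation (21) can be sketched as follows. The exact inverse dependence of W3 on Z' below 0.6 is not specified above, so the particular form used here is an assumption:

```python
def max_variation(values):
    # Maximum variation of one coordinate over the 80-ms analysis window.
    return max(values) - min(values)

def segmentation_index(xs, ys, zs, z_prime, w1=1.0, w2=1.0):
    # Equation (21): SI = W1*SIX + W2*SIY + W3*SIZ. W3 is taken as one at
    # or above Z' = 0.6 and as varying inversely with Z' below it; the
    # constant of that inverse dependence (0.6) is an assumption.
    w3 = 1.0 if z_prime >= 0.6 else 0.6 / z_prime
    return (w1 * max_variation(xs) + w2 * max_variation(ys)
            + w3 * max_variation(zs))
```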
- In step 507 the table 504 holding five values of MAGACCEL is tested to determine if a significant peak has occurred.
- a suitable test is that a peak has occurred if the table has a tested value entered therein which exceeds a predetermined level and is preceded and succeeded in the table 504 by values less than the tested value. If this test is not passed, operations pass from step 507 to a decision step 509 where CPU3 checks an ON/OFF switch to determine whether it is to continue, and if so operations loop back to step 503.
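- The five-entry peak test can be sketched as follows (a Python sketch; treating "preceded and succeeded" as the immediate neighbors in the table is an assumption):

```python
def significant_peak(values, threshold):
    # True when some value in the five-entry table exceeds the threshold
    # and its immediate neighbors are both smaller than it.
    for i in range(1, len(values) - 1):
        if (values[i] > threshold
                and values[i - 1] < values[i]
                and values[i + 1] < values[i]):
            return True
    return False
```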
- if the test is passed, a phonetically significant event has occurred at step 507 and operations proceed to a step 511 to generate an address ADR according to Equation (10) hereinabove.
- In a step 513 the address ADR is asserted by CPU3 to memory 31 of Fig. 1 to obtain a prestored phonetic element code PHE byte identifying the phonetic element of the target zone where the significant Xp, Yp, Zp coordinate values lie.
- this PHE value is stored in a memory space holding PHE values in the order in which they are obtained.
- the PHE value, or byte, is looked up in a table providing instructions for writing the corresponding phonetic symbol or category code corresponding to a phone of the language to printer 33 of Fig. 1.
- CPU4 is a lexical access processor which converts the PHE values to a series of words spelled according to a language chosen, or is any other processing apparatus chosen for other applications as noted hereinabove.
- From step 519 operations proceed to ON decision step 509, and loop back to step 503 unless CPU3 is to be no longer ON, whence operations terminate at an END 521.
- In a complex target zone approach to identifying some of the phonetic elements PHE in step 513 of Fig. 26, it is recognized that experimental observations may require that some PHEs occupy the same target zone.
- the PHE is not associated with a target zone, and instead the target zone is assigned an arbitrary numerical target zone identification number as shown in Fig. 28 for target zones 601, 603 and 605.
- Target zones which are independent of flag values have identical PHEs tabulated in a column as in column 3. The effect of nasality to produce a different PHE value is shown in column 4.
- a null identification "-" is readily incorporated when a dummy target zone is desired.
- CPU1 detects which of the characteristics of the speech is present, for example, as indicated by the flags in the auditory state code.
- ROM3 acts as an example of an electronic memory for prestoring information relating at least one of the event identifications or address set identifications (e.g. a target zone identification number TZ in memory 31) to different phonetic representations depending on which of a plurality of characteristics of speech is present.
- CPU3 acts as an example of a means obtaining from the electronic memory the address set identification for the set of addresses containing the address so determined, and when the address set identification so obtained is related to more than one phonetic element representation in the memory according to characteristics of speech, then obtaining from the electronic memory the phonetic representation which corresponds to the detected characteristic of speech.
- Fig. 30 shows numerous differently-located segments 751.1, 751.2, ... and 751.8 of perceptual paths of speech which are all examples of what is heard as the diphthong /AY/ (same sound as in the word "buy"). All of the segments are at least as long as some minimum length 755 or some minimum time duration, are relatively straight or unidirectional (low curvature), appear as arrows with relatively consistent angular orientation from tail to tip, originate in either the AA or AH target zone, and terminate in either the EH or AE target zone if they originate in the AA target zone and otherwise terminate in either the IY or IH target zone if they originate in the AH target zone. Moreover, all of the segments are within a relatively well-defined pipe 757 and originate in a lower portion of that pipe 757 which portion is designated as a radical or nucleus [al n ] in perceptual space as shown in Fig. 30.
- diphthong /AY/ lead to corresponding tests which can be performed with electronic hardware or software to detect the presence of one or more indicia of the diphthong or other glide, and when the indicia are present, the diphthong or other glide is automatically recognized.
- in Fig. 31 four differently-located segments of perceptual paths of speech 761.1-.4 are all examples of what is heard as the diphthong /EY/ (same sound as in the word "bait").
- all of the segments are at least as long as minimum length 755 or some minimum time duration, are relatively straight or unidirectional (low curvature), appear as arrows with relatively consistent angular orientation from tail to tip, originate in a nucleus [el n ], and terminate in the IY target zone.
- all of the segments are within another relatively well-defined pipe 765 as shown in Fig. 31.
- diphthongs and the more general category of glide sounds are susceptible of automatic recognition by using tests directed to one or more of the various indicia above.
- Test Type I is a test for what is herein called a "nucleus-offglide sequence", wherein a region including the tail of the glide arrow is called the glide nucleus or radical and the position of any given tail is called the "root position". It is noted that the meaning of nucleus as used herein refers to a region in mathematical perceptual space and does not have the linguistic meaning associated with it by Lehiste et al. hereinabove-cited.
- angular orientation 601 is computed as the arctangent of the ratio of a distance 603 in a monotonic sequence traveled along a first coordinate (e.g. Y') from the root position 605 divided by a distance 607 traveled along a second coordinate (e.g. X').
- a Y' glide is monotonic in the increasing Y' direction.
- a decreasing X' glide is monotonic in the decreasing X' direction.
- a decreasing Z' glide is monotonic in the decreasing Z' direction.
- Discussion of Figs. 32B-32D is found hereinbelow in connection with Figs. 33 and 34.
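The angular-orientation computation described above (items 601-607) can be sketched as below. Using `atan2` rather than a bare arctangent of the ratio is an implementation choice, made here so that the sign of each displacement is preserved; the coordinate pairing (Y' over X') follows the example in the text.

```python
import math

def glide_orientation(root, tip):
    """Orientation of a glide arrow from its root position (tail) to
    its tip: arctangent of the distance traveled along one coordinate
    (e.g. Y') over the distance traveled along another (e.g. X').
    Points are (X', Y') pairs; result is in degrees."""
    dx = tip[0] - root[0]
    dy = tip[1] - root[1]
    return math.degrees(math.atan2(dy, dx))
```

The returned angle can then be compared against a prestored angle range for each candidate glide.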
- for Test Type I a table of glides is prepared, and for each glide representation there is stored one or more target zone identifications where a nucleus for the glide can occur, and there is also stored an angle range for that glide. This can be accomplished by judicious selection of entries in a table analogous to that shown in Fig. 29.
- a separate section of memory 31 is provided for target zones corresponding only to glide nuclei, since they overlap other target zones as shown in Fig. 32.
- a nucleus and angle are determined for each glide when it occurs and are compared in electronic apparatus such as CPU3 with the prestored information.
- the prestored information is structured and based on experimental data such as that reflected in Figs. 32 and 32A-D.
- Speech sounds such as the diphthongs, at least some versions of the approximants or glides /w/ and /j/, and perhaps r-coloration of certain vowels such as in boar, bar, beer, and so on, can be treated as nucleus-offglide sequences.
- Such nucleus-offglide sequences are sometimes referred to as radical-glide sequences in linguistics.
- Test Type II is a test for what is herein called a "phonetic element doublet" or pair, i.e. a succession of two different phonetic elements as determined by a trajectory analysis test (e.g. curvature, acceleration magnitude, low velocity segment, segmentation index).
- if the two different phonetic elements occur within a time interval that is in a predetermined range of time values, then they are regarded not as two separate phonetic elements but as one. For example, if a phonetic element doublet originates in the AA target zone and terminates in either the EH or AE target zone, it passes Test Type II and is identified as an /AY/ phonetic element. If instead it originates in the AH target zone and terminates in either the IH or IY target zone, it also passes Test Type II and is also identified as an /AY/ phonetic element. If as a further example, it originates in the IH target zone and terminates in the IY target zone, it passes Test Type II and is identified as an /EY/ phonetic element. A direction test can also be included.
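The doublet test can be sketched as a table lookup gated by the time interval between the two trajectory events. The table entries below follow the /AY/ and /EY/ examples just given; the 30-100 ms window echoes the range mentioned later for step 973, and both limits are illustrative rather than prescribed.

```python
# Hypothetical doublet table keyed on (origin zone, termination zone);
# entries follow the /AY/ and /EY/ examples in the text.
DOUBLET_TABLE = {
    ("AA", "EH"): "AY",
    ("AA", "AE"): "AY",
    ("AH", "IH"): "AY",
    ("AH", "IY"): "AY",
    ("IH", "IY"): "EY",
}

def classify_doublet(first_zone, second_zone, interval_ms, lo_ms=30, hi_ms=100):
    """Return a glide phonetic element if the pair of trajectory events
    forms a recognizable doublet within the allowed time window,
    otherwise None."""
    if not (lo_ms <= interval_ms <= hi_ms):
        return None
    return DOUBLET_TABLE.get((first_zone, second_zone))
```

Because the table is keyed on ordered pairs, the lookup itself supplies the direction test: a reversed pair simply finds no entry.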
- for Test Type II a table of glides is prepared, and for each glide representation there is stored one or more pairs of phonetic element PHE or other target zone identifications. Then, as phonetic element PHE values are stored in memory, the latest pair of them is compared in electronic apparatus with the prestored information.
- the prestored information is structured and based on experimental data such as that reflected in Figs. 32 and 32A-D so that either the comparison yields a null identification in that the observed glide corresponds to nothing in the prestored information, or the comparison yields a match with at most one glide for which there is prestored information. Then the latest pair of PHE values or event identifications in memory is replaced with the phonetic element PHE identification of a glide for which a match is found.
- in Fig. 33, operations in an embodiment (alternative to Fig. 26) for implementing Test Type I for a glide are marked with 800+ numbers in a manner analogous to the 500+ numbers of Fig. 26 to facilitate comparison.
- Operations in Fig. 33 commence with a BEGIN 801 and proceed to a step 802 to set a glide flag GLF and a glide number NG to zero.
- in a step 803 the coordinate values X p , Y p and Z p of the latest point on the path in the mathematical space are input from CPU2 and stored in a table such as 504 of Fig. 27.
- in a step 805 a latest value of a parameter (such as curvature CURV of equation (13) or CURV1 of equation (17)) of the trajectory is computed.
- from step 805 operations proceed to a step 807 to determine whether a significant peak in CURV1 has occurred by the same table analysis procedure as described for MAGACCEL in step 505 for example.
- the operations of Fig. 33 are capable of extracting phonetically important information not only when the test of step 807 is passed but also in some circumstances when the test of step 807 is not passed. If the test of step 807 is not passed, operations proceed to a step 808 to test the glide flag GLF. If GLF is still zero, as it is initially, then operations proceed to a decision step 809 where CPU3 checks an ON/OFF switch to determine whether it is to continue, and if so operations loop back to step 803. If in step 808 the glide flag had been set (as described in further operations hereinbelow), then operations proceed from step 808 to a Glide Subroutine 810 after execution of which step 809 is reached.
- when a significant trajectory parameter value is detected in step 807, operations proceed to a step 811 to generate an address ADR of target space storage 31 according to ADR Equation (10) hereinabove.
- in a step 813 the address ADR is asserted by CPU3 to memory 31 of Fig. 1 to obtain an identification number TZ prestored at each of a set of addresses including the address ADR asserted.
- TZ is an identification number for complex target zone purposes as discussed in connection with Figs. 28 and 29.
- the step 813 of Fig. 33 does not access a PHE byte directly. Instead, the address ADR is asserted by CPU3 to memory 31 to obtain a target zone identification TZ identifying the target zone where the significant X p , Y p and Z p values lie.
- a suitable test or preestablished condition for the test region is that Y be greater than +0.03.
- the X', Y', and Z' coordinates are also computed according to equations (11A), (11B) and (11C) and stored as values X' o , Y' o and Z' o for glide identification purposes.
- the glide flag GLF and glide number NG are both reset to zero.
- in a step 815 the phonetic element PHE value obtained from the table of Fig. 29 in step 814 is stored in a memory space holding PHE values in the order in which they are obtained.
- in a step 817 the next-to-latest PHE value, if any, in the same memory space is obtained and used to look up in a table providing instructions for writing the corresponding phonetic symbol or category code corresponding to a phone of the language to printer 33 of Fig. 1.
- in a next step 819 all PHE values stored in the order in which they are obtained are sent to CPU4 of Fig. 1. On a real-time basis this involves sending the next-to-latest PHE value just discussed, if any, to CPU4.
- after step 819, operations proceed to ON decision step 809, and loop back to step 803 unless the CPU3 is to be no longer ON, whence operations terminate at an END 821.
- in Fig. 34, operations in the Glide Subroutine 810 commence with a BEGIN 851 and proceed to a step 853 to increment glide number NG by one and to compute the vowel slab coordinates X', Y' and Z' for the latest coordinate values X p , Y p and Z p which have been reached on the path of the perceptual pointer.
- These latest coordinates X', Y' and Z' are entered into a table 854 of Fig. 34A in a row corresponding to the value of the glide number NG .
- a step 855 tests the values in the table 854 to determine whether a monotonicity criterion is met.
- monotonicity is a condition in which all of the X', Y' or Z' values in table 854 are either listed in increasing order, or listed in a decreasing order.
- the coordinate values X' o , Y' o and Z' o stored in step 814 are regarded as the "root position" for a possible glide, the existence of which is to be detected in subroutine 810.
- Step 855 determines whether a segment of the path of the perceptual pointer PP, which segment begins in the nucleus and ends with the latest position coordinate values, is going in any particular direction or not by using this monotonicity test as a threshold test which a true glide must pass.
- in a step 861 a test determines whether a distance DES traveled is at least a predetermined amount DS0 (e.g. 0.1 log units, or a sum of squares in X', Y' and Z' of table 854 exceeding 0.01 log units squared).
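The monotonicity threshold test of step 855 and the distance test of step 861 can be sketched together as follows. Treating ties as monotonic is an assumption, since the text only speaks of values listed in increasing or decreasing order; the squared-distance threshold uses the 0.01 log-units-squared example from the text.

```python
def is_monotonic(values):
    """Step-855-style test: True if the column of table values is
    entirely nondecreasing or entirely nonincreasing."""
    inc = all(a <= b for a, b in zip(values, values[1:]))
    dec = all(a >= b for a, b in zip(values, values[1:]))
    return inc or dec

def distance_sufficient(root, tip, ds0_sq=0.01):
    """Step-861-style test: sum of squared displacements in X', Y' and
    Z' from the root position exceeds DS0 squared."""
    return sum((t - r) ** 2 for r, t in zip(root, tip)) > ds0_sq
```

A true glide must pass the monotonicity test in at least one coordinate and then accumulate sufficient distance DES before a recognizable nucleus-offglide sequence is sought.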
- If the test of step 861 is not passed, a branch is made to RETURN 859 to accumulate more path data, assuming that no trajectory event in step 807 is detected before a repeated loop through steps 803, 805, 807, 808, 810 and 809 in Fig. 33 occurs for enough times for the glide number to reach NG0.
- if the test of step 861 is passed, a step 863 is executed to identify a recognizable nucleus-offglide sequence if possible.
- An exemplary procedure for the identification is discussed hereinbelow. When the procedure has been executed, and the nucleus and glide correspond to a recognizable speech sound according to an ID check in test step 865, then operations proceed to a step 867 to replace the last PHE value in the memory space, or cyclic store, (discussed in connection with steps 815 and 817) with a sequence identification PHE value representative of the recognizable speech sound.
- CPU3 in glide subroutine 810 and step 867 advantageously recognizes that the phonetic element PHE value last stored in the memory space according to step 815 by trajectory analysis (e.g. by curvature) should be replaced in appropriate instances with a glide or sequence identification value indicative of a diphthong or other glide speech sound which has been identified in step 863.
- a step 869 resets the glide flag GLF and glide number NG back to zero, and the flag is not set for glide detection again until a significant trajectory parameter again occurs in the vowel slab (cf. steps 807 and 814). If no recognizable speech sound is found in step 863, then operations branch from step 865 to step 869 as well, whence RETURN 859 is reached.
- a root position is in an identifiable glide nucleus if it lies within any of the regions in perceptual space identified as the respective glide nuclei in Figs. 32, 32B and 32C, or 32D and 32E.
- Fig. 32 shows various diphthong nuclei which occupy the Z' width of the vowel slab of Fig. 19 and have outlines in the X', Y' plane as shown.
- Figs. 32B and 32C show the w-glide nucleus from views along the X' axis and Z' axis respectively.
- Figs. 32D and 32E show the j-glide (yuh) nucleus from views along the X' axis and Z' axis respectively.
- CPU3 in its operations of Figs. 33 and 34 acts as an example of a means for electronically computing a trajectory parameter from the series of coordinate values and, when both the trajectory parameter satisfies a predetermined condition for significance and a coordinate value currently reached by the speech is within a predetermined region of such values, producing a signal, for electronically analyzing the speech on the path in response to the signal for occurrence of a position in the nucleus and an offglide in said range of directions which offglide happens before another significant trajectory parameter occurs, and upon occurrence of such indicia of a glide, obtaining from the electronic memory the phonetic representation corresponding to the glide indicia.
- in Fig. 35, operations in an embodiment (alternative to Fig. 26) for implementing Test Type II for a glide are described.
- Operations in Fig. 35 commence with a BEGIN 901 and proceed to a step 902 to set a glide flag GLF and a glide number NG to zero.
- Glide number NG operates as a glide timer between trajectory events in this embodiment.
- in a step 903 the coordinate values X p , Y p and Z p of the latest point on the path in the mathematical space are input from CPU2 and stored in a table such as 504 of Fig. 27.
- in a step 905 a latest value of a parameter (such as curvature CURV of equation (13) or CURV1 of equation (17)) of the trajectory is computed.
- from step 905 operations proceed to a step 907 to determine whether a significant peak in CURV1 has occurred by the same table analysis procedure as described for MAGACCEL in step 505 for example. If the test of step 907 is not passed, operations proceed to a step 908 to test the glide flag GLF. If GLF is still zero, as it is initially, then operations proceed directly to a decision step 909 where CPU3 checks an ON/OFF switch to determine whether it is to continue, and if so operations loop back to step 903. If in step 908 the glide flag had been set (as described in further operations hereinbelow), then step 908 increments the glide number NG whence step 909 is reached.
- when a significant trajectory parameter value is detected in step 907, operations proceed to a step 911 to generate an address ADR of target space storage 31 according to ADR Equation (10) hereinabove.
- in a step 913 the address ADR is asserted by CPU3 to memory 31 of Fig. 1 to obtain an identification number TZ as discussed for step 813.
- in a step 914 the table of Fig. 29 is accessed as described for step 814.
- here the test region includes all glottal-source target zones, which encompass the vowels and a number of the consonants from which glides may be constituted.
- the glide flag GLF is maintained at zero or reset to zero.
- a glide subroutine 916 of Fig. 36 is executed to detect any glides and output phonetic element representation(s) as appropriate, whence operations loop back to ON test 909. If CPU3 is turned off, a branch from step 909 to an END 921 is made.
- in Fig. 36, operations in the Glide Subroutine 916 of Fig. 35 are described in further detail. Operations commence with a BEGIN 951 and proceed to a step 953 to test glide flag GLF. If GLF is set to one, operations proceed to a step 955 to test the glide number NG to determine whether it exceeds zero. Initially, NG is zero when GLF is first set by step 914 of Fig. 35, so a branch is made from step 955 of Fig. 36 to a step 957 where a first backup register PHE1 is loaded with the PHE value from the step 914 table access. Then a RETURN 959 is reached so that operations continue in Fig. 35 and execute a loop through steps 909, 903, 905, 907, 908 repeatedly so that glide number NG increases with time until a significant trajectory parameter is detected in step 907 and subroutine 916 is again executed after steps 911, 913 and 914.
- Operations once again enter subroutine 916 of Fig. 36. If a phonetic element outside the test region of step 914 was detected, for example, glide flag GLF is not one in step 953 and operations branch to a step 961 to test glide number NG. If NG exceeds zero, then a phonetic element in the test region may have earlier been detected but has not yet been output even though it ought to be. In such case operations proceed to a step 962 to make sure that this is not the result of a second event identification in a previous glide doublet (which ought not to be output). If there is no glide identification in step 962, operations go on to a step 963 to load first backup register PHE1 into an output subroutine register PHEX. The glide number NG is reset to zero.
- An output subroutine of Fig. 37 is called, as indicated by the phrase "GOSUB OUTPUT" in step 963. If there is a glide identification symbol in step 962, operations branch to a step 964 to reset glide number NG to zero and null the glide identification symbol. Since the phonetic element hypothetically found outside the test region in step 914 also should be output, operations pass from either step 963 or 964 (or directly from step 961 if NG was zero) to a step 965. Step 965 loads the phonetic element representation PHE resulting from the most recent table access in step 914 into output subroutine register PHEX. Then in a step 967, the Output subroutine of Fig. 37 is called, whence RETURN 959 is reached.
- If in step 914 of Fig. 35 another PHE value in the test region had instead been found after the first one in the test region, then glide flag GLF is found to be one in step 953 of Fig. 36. Then operations proceed to step 955 where glide number NG is found to exceed zero due to above-described looping over time in Fig. 35. From step 955 operations proceed to a step 971 in which the as-yet unprinted phonetic element in backup register PHE1 is loaded into register PHEX for possible output. Next in a step 973, the value of the glide number NG is tested to determine whether it represents an interval of time between predetermined limits of a range such as 30 to 100 milliseconds and whether distance DES exceeds predetermined minimum distance DS0.
- If the NG number is out of range or the distance is too little, then there is no glide represented by the latest pair of phonetic element representations PHEX and PHE from the table access in step 914. Consequently, the latest phonetic element PHE is stored in first backup register PHE1 in case it is part of a glide yet to be detected, and NG is reset to zero in a step 975. After step 975, the OUTPUT subroutine is executed on PHEX in step 967.
- If on the other hand in step 973 the value of glide number NG indicates that a time interval in the appropriate range has elapsed between two phonetic events in the test region and distance DES is sufficient, then operations go to a step 981 to load the latest phonetic element representation PHE into a second backup register PHE2. Also in step 981, glide number NG is reset to zero whence a step 983 is reached.
- Step 983 executes an important aspect of Test Type II for glides in that a Glide Table is searched to obtain and load into register PHE any glide identification which corresponds to the target zone or nucleus of the event that resulted in element PHE1 and which is associated with a direction corresponding to a detected angular orientation in a tabulated range. It is noted that in contrast to Figs. 33 and 34, a pair of trajectory events is detected in step 907 of Fig. 35 before this glide identification of step 983 is undertaken. The direction is computed based on the coordinate positions of the two trajectory events that resulted in elements PHE1 and PHE2, instead of a monotonicity table as in Figs. 34 and 34A.
- a step 985 loads the contents of PHE2 into PHE1. Then in a test step 987 if no glide was identified, a null identification exists in PHE due to step 983 and a branch is made to step 967 to output the last phonetic element loaded into PHEX in step 971. If there was a glide in step 983, then operations proceed from step 987 to step 965 to load the PHE value of the glide identification into the output subroutine register PHEX. PHEX is then output by the Output subroutine in step 967 whence RETURN 959 is reached.
- CPU3 in glide subroutine 916 advantageously recognizes that the last two phonetic element values PHE1 and PHE2 detected by trajectory analysis (e.g. by curvature) should be replaced in appropriate instances with a glide identification value PHE which has been identified in step 983.
- Step 985 recognizes that the latest phonetic element value in PHE2 should be and is stored in first backup register PHE1 in case it is part of a glide or another glide yet to be detected. In this way when there is no glide, the first of two phonetic element representations which were detected by trajectory analysis is output, and the second phonetic element representation is retained for later operations which determine whether it should be output as well, or regarded as part of a subsequent actual glide. When there is a glide, the glide representation is output, but the second event identification in the doublet is retained for later operations in case there is a subsequent glide.
- Glide subroutine 916 as shown in Fig. 36 advantageously detects two successive glides which are constituted by successive pairs of phonetic elements to which the glides correspond.
- subroutine 916 advantageously retains a phonetic element representation which may be part of a glide yet to be detected even though the phonetic element was part of a latest pair which failed the glide test of step 983. And a retained phonetic element representation which turns out not to be part of any glide is ultimately output in a subsequent pass through the subroutine 916.
- CPU3 acts as an example of a means for electronically selecting and supplying an event identification depending on the sound of the speech when the speech satisfies a preestablished condition, for measuring a time interval between successive times when the preestablished condition is satisfied and, when the time interval is within a predetermined range, retrieving from the electronic memory the phonetic representation corresponding to the pair of different event identifications when they have successively occurred.
- CPU3 acts as an example of a means for generating two phonetic element representations respectively corresponding to each of first and second event identification symbols in the pair when the particular value of a derived coordinate for the speech for which the first event identification symbol is supplied is in a predetermined test region of the coordinates and not in a glide and the particular value of another derived coordinate for the speech for which the second event identification symbol is supplied is outside of the predetermined test region.
- CPU3 acts as an example of means for determining whether any prestored phonetic element representation corresponds to an actually occurring pair having an earlier event identification and a later event identification that are consecutively supplied, and if not, then generating a phonetic element representation corresponding to the earlier event identification in the pair and retaining the later event identification for use as an earlier event identification in a later pair.
- In the OUTPUT subroutine called in the glide subroutine of Fig. 36, operations commence with a BEGIN 1001 of Fig. 37. Then in a step 1015, the contents of variable PHEX are stored in a memory space or cyclic store holding PHEX values corresponding to each instance when the OUTPUT subroutine was called. In a step 1017, the latest PHEX is used to look up in a table providing instructions for writing the corresponding phonetic symbol or category code corresponding to a phone of the language to printer 33 of Fig. 1. In a next step 1019 the latest PHEX value is sent to CPU4 of Fig. 1. After step 1019 is completed, a RETURN 1020 is reached.
- the value SR is also called a spectral reference herein.
- An improvement for computing SR in CPU1 of the apparatus of Fig. 1 utilizes operations as illustrated in Figs. 38-40. Generally, these operations take account of previously computed values designated SF(1), SF(2), and SF(3), keeping track of the glottal-source or burst-friction nature of SF(2) and SF(3) by segregating the burst-friction peaks which might otherwise be designated SF(2) and SF(3) as SF(4) and SF(5) for purposes of the SR calculation. Silence and other no-speech is appropriately accounted for.
- in Fig. 38 operations commence with a BEGIN 1101 and go to a step 1103 to determine whether speech is present.
- a suitable test for speech is that its loudness and goodness exceed respective predetermined levels. If so, an SR Routine 1105 is executed to update the value of sensory reference SR whence a RETURN 1107 is reached.
- if step 1103 detects silence or other no-speech, operations go to a step 1111 to test whether a timer STIMER is at zero. If so, operations start STIMER in a step 1113 whence a step 1115 is reached. If STIMER is already started in step 1111, operations bypass step 1113. Then in step 1115, a test is made to determine whether STIMER is timed out by reaching or exceeding a preset value corresponding to three (3) continuous seconds of no-speech, for example. If STIMER is not timed out, RETURN 1107 is reached directly, without affecting the SR value.
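The silence-timeout bookkeeping of steps 1111-1117 can be sketched as a small per-frame state machine. The once-per-analysis-frame call convention and a frame-count timeout are assumptions; the text specifies only a preset value corresponding to, for example, three continuous seconds of no-speech.

```python
class SilenceTimer:
    """Sketch of the STIMER logic: signal when a continuous stretch of
    no-speech has lasted long enough that SR state should be
    reinitialized.  tick() is assumed to be called once per frame."""
    def __init__(self, timeout_frames):
        self.timeout = timeout_frames
        self.count = 0          # STIMER; zero means not started

    def tick(self, speech_present):
        if speech_present:
            self.count = 0      # speech resets STIMER (cf. step 1153)
            return False
        self.count += 1         # start or advance STIMER
        return self.count >= self.timeout   # timed out -> reinitialize
```

When `tick` returns True, the caller would perform the step-1117 reinitialization of FF, FS, FSUM, GMTF, N2 and SR.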
- if STIMER is timed out, a step 1117 reinitializes a set of variables including a set of five factor flags FF(1)...FF(5), a set of five factor summand flags (or latching factor flags) FS(1)...FS(5), a factor sum FSUM, a set of six geometric mean values GMTF(0)...GMTF(5), a set of six averaging divisors N2(0)...N2(5) for computing the geometric mean values, and the sensory reference value SR itself.
- RETURN 1107 is reached.
- CPU1 thereby acts as an example of a means for initializing the sensory reference value and electronically analyzing the sets of digital values (e.g. spectra) to determine whether sound indicative of speech is present, and if no speech is present for a predetermined period of time then reinitializing the sensory reference value and preventing any spectra of earlier speech from thereafter affecting the sensory reference value.
- a set of population geometric mean values POF(0)...POF(5) for the various formants in Hertz are estimated as listed in the following Population Mean Table and prestored in memory ROM1:
- each geometric mean value GMTF(N) is initialized to its corresponding population mean value POF(N) of Table II.
- Sensory reference SR is initialized to 168 Hertz for example. All other variables FF, FS, FSUM, and N2 are initialized to zero.
- Speech as it is heard may initially be of a glottal source type for which SR should be computed as a geometric mean on the basis of formants F0, F1, F2 and F3, that is, a fourth root of the product of four factors. If the speech is burst-friction, then SR should be a geometric mean on the basis of burst-friction formants F2 and F3 (SF(4) and SF(5)) and the latest previous value for F0 even if it be the estimate 168 Hertz.
- as speech continues, the operations accumulate various geometric means which are used to move the initial value of SR from 168 Hertz to a computed value based on the actual talker's speech, which depends on the departure of the sensed geometric means GMTF(N) of the various formants of the talker's speech relative to population values POF(N).
- Fig. 39 shows a preferred embodiment for achieving the SR values in an improved way.
- Operations in the SR Routine 1105 commence with a BEGIN 1151 and set no-speech timer STIMER to zero in a step 1153. Then a step 1155 tests the burst-friction flag BF. If BF is set to one, then operations go to a step 1157 to set two factor flags FF(4) and FF(5) to one and to set the latching factor flags FS(4) and FS(5) to one also. If BF is not 1 in step 1155, a branch is made to a step 1159 to set factor flags FF(4) and FF(5) to zero without doing any operation on the latching factor flags FS(4) and FS(5).
- after step 1157 or 1159, operations proceed to a step 1161 to test glottal-source flag GS. If GS is set to one, then a step 1163 sets three factor flags FF(1), FF(2) and FF(3) to one and corresponding latching factor flags FS(1), FS(2) and FS(3) to one also. If GS is not 1 in step 1161, a branch is made to a step 1165 to set factor flags FF(1), FF(2) and FF(3) to zero without doing any operation on the corresponding latching factor flags. Also, since the sound is not glottal-source in nature, registers SF(4) and SF(5) are loaded with the F2 and F3 formant values SF(2) and SF(3) in preparation for calculations in Fig. 40.
- after steps 1163 and 1165, operations go to a step 1167 to test the spectrum to determine whether it has a periodic component according to the pitch-extraction algorithm of Scheffers, M.T.M. (1983) cited hereinabove. If so, then a step 1169 sets an aperiodic flag AP to zero, and sets both factor flag FF(0) and latching factor flag FS(0) to one. If the spectrum is aperiodic in step 1167, then a step 1171 sets aperiodic flag AP to one, sets factor flag FF(0) to zero and sets latching factor flag FS(0) to one. This latter operation on FS(0) can be omitted if it is desired to initially exclude any stabilizing contribution by the F0 estimate when BF speech (e.g. "s" in "so") breaks a silence. However, FS(0) is preferably set to one in step 1171 to obtain this stabilizing contribution even though no actual F0 may be initially sensed.
- step 1173 updates the geometric mean values GMTF(N) and factor sum FSUM as appropriate, and provides a latest computation of sensory reference SR as described in more detail in Fig. 40.
- steps 1175 and 1177 analyze a TF0 table for pitch modulation and augment the value of SR appropriately, as discussed in connection with steps 319 and 321 of Fig. 14, followed by a RETURN 1179.
- In Fig. 40, operations in the update routine 1173 of Fig. 39 commence with a BEGIN 1181 and then in a step 1183 initialize factor sum FSUM to zero and set an index N to zero in preparation for a looping series of steps 1185, 1187, 1189 and 1191.
- In step 1185, a test is made of each factor flag FF(N) in turn to determine whether it is set to one. Only if the factor flag for a given factor is set does CPU1 execute an update on the corresponding geometric mean GMTF(N) of the corresponding formant of the talker's speech.
- In step 1187, the computation of equation (6A) is generalized to the Nth formant according to the following formula:
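The formula itself does not survive in this text. A recursive geometric-mean update consistent with the surrounding description (a running mean GMTF(N) maintained with an averaging divisor N2(N) that is incremented after each update) would be, as a sketch:

```python
import math

def update_gmtf(gmtf, n2, freq):
    """Fold one new formant-frequency sample into the running geometric
    mean GMTF using averaging divisor N2, then return the incremented
    divisor (a plausible reading of step 1187; the patent's exact
    equation (6A) may differ)."""
    if n2 == 0:
        return freq, 1               # first sample seeds the mean
    new_gm = math.exp((n2 * math.log(gmtf) + math.log(freq)) / (n2 + 1))
    return new_gm, n2 + 1
```

Equivalently, the update is an incremental arithmetic mean carried out in the log-frequency domain.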
- The respective averaging divisor N2(N), which was used for updating GMTF(N) in step 1187, is then incremented by one.
- After step 1187, or after step 1185 if FF(N) is not set for a given value of index N, operations go to a step 1189 to determine how many factors will participate in the final geometric mean calculation for SR. In other words, this number of factors, represented by FSUM, determines whether a cube root, fourth root, fifth root, or sixth root will be needed.
- In step 1189, FSUM is incremented for each latest latching factor flag FS(N) which is set to 1. Also in step 1189, the index N is incremented in preparation for a test 1191. If in test 1191 index N does not exceed 5, operations loop back to step 1185 to repeat the loop based on the incremented index N.
- In the SR computation, a first ratio GMTF(0)/POF(0) is raised to the 1/3 power (see the "3" in the denominator of the exponent of that ratio) and then integrated into the overall geometric mean.
- FSUM in the denominator of the exponent of each ratio GMTF(N)/POF(N) signifies the root for the geometric mean.
- the latching flag FS(N) is one or zero depending on whether or not a given ratio is permitted to participate as a factor in the geometric mean calculation.
- the sensory reference SR is thus computed according to the formula:
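The formula does not survive in this text. From the surrounding description — ratios GMTF(N)/POF(N) gated by latching flags FS(N), FSUM as the root of the geometric mean, and an extra "3" in the denominator of the F0-ratio exponent — a hedged reconstruction in code is:

```python
def sensory_reference(gmtf, pof, fs_flags, sr_neutral=1.0):
    """Reconstructed SR computation. sr_neutral, a baseline scale factor,
    is an assumption; the patent's exact formula may differ in scaling."""
    fsum = sum(fs_flags)                 # number of participating factors
    if fsum == 0:
        return sr_neutral                # no factors yet: neutral reference
    sr = sr_neutral
    for n in range(6):                   # factors N = 0..5 per the loop of Fig. 40
        if fs_flags[n]:
            # the F0 ratio (N = 0) carries an extra 3 in its root
            root = 3 * fsum if n == 0 else fsum
            sr *= (gmtf[n] / pof[n]) ** (1.0 / root)
    return sr
```

With all ratios equal to one, SR reduces to the neutral baseline, as expected of a talker whose formants sit at the population averages POF(N).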
- Upon completion of step 1193, a RETURN 1195 is reached and operations continue with step 133 of Fig. 4 in the overall operations of CPU1 of Fig. 1.
- CPU1 as described thus advantageously electronically generates a signal representing a sensory reference value over time as a function of frequencies of a plurality of peaks in each of a plurality of the frequency spectra.
- CPU1 also electronically generates the signal to represent the sensory reference value over time as a function of respective ratios of the frequencies of a plurality of peaks in each of a plurality of the frequency spectra to predetermined frequency values corresponding to population averages for the peaks.
- CPU1 also detects the presence of a spectrum having peaks and indicative of a burst friction sound when such sound is present and electronically generates the signal to represent the sensory reference value over time as a function of respective ratios of the frequencies of two of the peaks indicative of the burst friction sound to predetermined frequency values corresponding to population averages for the two peaks. It also detects the presence of a spectrum having peaks and indicative of a glottal source sound when such sound is present and electronically generates the signal to represent the sensory reference value over time as a function of the frequency of at least one of the peaks associated with the spectrum of the glottal source sound.
- Still further it electronically generates the signal to represent the sensory reference value over time as a function of a fundamental frequency of the spectrum when the detecting step indicates that the spectrum includes a periodic component. Additionally, it recursively electronically computes respective geometric means of frequencies of a plurality of peaks in each of a plurality of the frequency spectra and then electronically computes the sensory reference value as a function of the recursively computed geometric means.
- CPU1 thus electronically computes the sensory reference as a function of a geometric mean of factors which are functions of first, second, and third frequencies corresponding to first, second, and third formant peaks in the sets of digital values for spectra of glottal source sounds, and of fourth and fifth frequencies corresponding to second and third formant peaks in the sets of digital values for spectra of burst friction sounds.
- the geometric mean of factors also includes a factor which is a function of a fundamental frequency of spectra that include a periodic component.
- the sensory reference value is repeatedly computed and CPU1 selectively stores in respective locations in memory representing the different formants the values of frequency of different peaks in a spectrum wherein the respective memory location in which the frequency value of a given peak is stored depends on whether the peak lies in one or more defined bands of frequencies, the bands being defined as a function of a value of the sensory reference which has already been computed prior to said selective storing of frequencies of the formant peaks for use in computing a subsequent value of the sensory reference.
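The band-dependent storage of formant peaks can be sketched as follows; the band edges and their simple linear scaling by SR are illustrative assumptions, not the patent's values, and only three glottal-source slots are shown:

```python
def assign_formant_slots(peaks_hz, sr, base_edges=(200.0, 800.0, 1800.0, 3500.0)):
    """Store each spectral peak in the formant slot (1 = F1, 2 = F2, 3 = F3)
    whose frequency band, scaled by the current sensory reference SR,
    contains that peak. Because SR was computed from earlier spectra, the
    bands track the talker before the next SR computation uses the slots."""
    edges = [e * sr for e in base_edges]     # bands defined as a function of SR
    slots = {}
    for f in peaks_hz:
        for n in range(len(edges) - 1):
            if edges[n] <= f < edges[n + 1]:
                slots.setdefault(n + 1, f)   # keep the first peak found per band
                break
    return slots
```

For a unity SR the bands sit at their base positions; a talker with uniformly higher formants raises SR, which in turn raises every band edge proportionally.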
- the sensory reference value is computed to include a geometric mean of factors corresponding to particular characteristics of speech sounds including periodicity, glottal source spectral features, and burst friction spectral features. The factors are introduced and retained in the same order in which speech sounds having the particular characteristics initially occur.
- Figs. 41-44 illustrate a method for processing a latest incoming spectrum 1201 of Fig. 41 by a harmonic sieve (conceptually indicated as a movable template 1203 in the frequency domain) to obtain a periodic line spectrum 1205 of Fig. 42 which is smoothed to produce a smoothed periodic spectrum 1207 of Fig. 43.
- the periodic line spectrum 1205 is subtracted from the spectrum 1201 of Fig. 41 as indicated by empty segments 1209, 1211, 1213, 1215 and 1217.
- the spectrum 1201 with the empty segments is then smoothed to produce a smoothed aperiodic spectrum 1221 of Fig. 44.
- A spectrum 1201 in general comprises a periodic spectrum 1207 and an aperiodic spectrum 1221 at the same time.
- In unvoiced speech the aperiodic spectrum 1221 predominates and the periodic spectrum has low loudness.
- In voiced speech the periodic spectrum 1207 predominates and the aperiodic spectrum has low loudness.
- In breathy speech both the periodic and aperiodic spectra are present with significant loudness.
- The periodic spectrum can have either a glottal-source GS characteristic or a burst-friction BF characteristic depending on whether or not the first formant F1 is present.
- Likewise, the aperiodic spectrum can have either a glottal-source GS characteristic or a burst-friction BF characteristic depending on whether or not the first formant F1 is present.
- In Fig. 45, operations according to an improvement of those of Fig. 4 commence with a START 1301 and housekeeping and initialization steps 1303, 1305, 1307 and 1309 as described for steps 101, 103, 105, 107 and 109 of Fig. 4.
- In a step 1311, CPU1 computes an FFT (Fast Fourier Transform) spectrum with a resolution of 2 to 5 Hertz on a current window sample. For example, with a sampling rate of 20,000 Hertz, there are 20 samples per millisecond. Using a 24 millisecond Kaiser-Bessel window, there are 480 samples. For computation purposes, the 24 milliseconds is then padded out with enough zeros to form an effective transformable time domain function having 8192 (8K) points, or about 410 milliseconds (2.5 Hertz resolution). Accordingly, the Fast Fourier Transform is computed on the 480 samples plus 7712 zeros in step 1311.
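The arithmetic of step 1311 can be checked with a short sketch; NumPy's Kaiser window stands in for the Kaiser-Bessel window, and its shape parameter is an assumption:

```python
import numpy as np

fs = 20_000                # sampling rate: 20 samples per millisecond
n_window = 24 * 20         # 24 ms window -> 480 samples
n_fft = 8192               # padded to 8K points (~410 ms of time base)

rng = np.random.default_rng(0)
frame = rng.standard_normal(n_window) * np.kaiser(n_window, 8.0)

padded = np.zeros(n_fft)
padded[:n_window] = frame              # the 480 samples plus 7712 zeros
spectrum = np.abs(np.fft.rfft(padded))

resolution = fs / n_fft                # ~2.44 Hz per FFT bin
```

The bin spacing 20000/8192 is about 2.44 Hertz, within the 2 to 5 Hertz resolution cited, and 8192 samples at 20,000 Hertz span about 410 milliseconds.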
- a step 1313 converts the spectrum so derived to decibels as discussed in connection with step 121.
- a step 1315 separates the periodic and aperiodic spectra as discussed in connection with Figs. 41-44 to obtain a smoothed periodic spectrum 1207 and a smoothed aperiodic spectrum 1221 corresponding to the latest incoming spectrum from step 1311.
- The separation process utilizes, for example, a harmonic sieve procedure described in the Scheffers article cited hereinabove, or any other procedure which suffices to accomplish the separation.
- The set of harmonic numbers that best fits the set of resolved component frequencies can be determined by using the sieve. To this end, the sieve is successively set at a number of positions with respect to the components. Each position is fully characterized by the fundamental of the sieve, which varies from 50-500 Hz. A step size between successive positions of 3% of the fundamental frequency is chosen so that there is a slight overlap, to minimize the chance of a component being missed.
- a criterion value is calculated for the match of the sieve in this position to the set of components.
- This criterion is basically a measure of the difference between the pattern of the sieve and that of the resolved components. The value is diminished for each component that passes a mesh and augmented (a) for each component that cannot pass a mesh (a spurious component) and (b) for each mesh through which no component passes (a missing harmonic). The last decision cannot be taken for components with frequencies above the frequency to which the 12th mesh corresponds. These components are therefore disregarded in the criterion. The same is done for empty meshes above the highest one through which a component passes.
- The mathematics underlying the criterion are extensively described by Duifhuis et al. (1982). The criterion Q is given in Eq. (1).
- step 1315 in Fig. 46 commences with a BEGIN 1371 and proceeds to initialize an index I to zero in a step
- a criterion such as Q of Scheffers is computed and stored in a table according to index I.
- A test step 1381 determines whether index I equals or exceeds the index value Io (e.g. 46) for the upper end of the range. If not, operations loop back to step 1375 until the sieve has been slid along the spectrum through its range of 50-500 Hertz. Then in a step 1383, the table of criterion values Q is searched for the maximum value, and the corresponding index IM is determined.
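The sliding-sieve search of Fig. 46 can be sketched as below; the scoring here is a simplified stand-in for the Q of Scheffers, crediting matched harmonics and penalizing spurious components and missing harmonics within a 3% mesh tolerance:

```python
def best_fundamental(components, f_lo=50.0, f_hi=500.0, step=0.03,
                     tol=0.03, n_mesh=12):
    """Slide a harmonic sieve from f_lo to f_hi in steps of 3% of the
    fundamental, score each position against the resolved components,
    and return the best-scoring candidate F0."""
    best_q = float("-inf")
    best_f0 = None
    f0 = f_lo
    while f0 <= f_hi:
        matched, spurious = set(), 0
        for f in components:
            h = round(f / f0)                      # nearest mesh number
            if 1 <= h <= n_mesh and abs(f - h * f0) <= tol * h * f0:
                matched.add(h)                     # component passes mesh h
            elif f <= n_mesh * f0:
                spurious += 1                      # component passes no mesh
            # components above the 12th mesh are disregarded entirely
        highest = max(matched, default=0)
        missing = highest - len(matched)           # empty meshes below highest hit
        q = len(matched) - spurious - missing      # higher is better
        if q > best_q:
            best_q, best_f0 = q, f0
        f0 *= 1.0 + step                           # 3% of the fundamental per step
    return best_f0
```

Subharmonic candidates such as F0/2 score poorly because every odd-numbered mesh goes empty, which the missing-harmonic penalty catches.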
- the periodic line spectrum 1205 of Fig. 42 is obtained and temporarily stored.
- the periodic line spectrum is subtracted from the spectrum 1201 of Fig. 41 resulting in an unsmoothed aperiodic spectrum with empty segments 1209, 1211, 1213, 1215 and 1217 as shown in Fig. 41 and stored separately.
- a step 1389 of Fig. 46 smooths the periodic line spectrum 1205 of Fig. 42 by convolving it with a predetermined bell-shaped distribution of stored values 1390 (see Fig. 42), and storing the resulting smoothed spectrum as spectrum 1207 of Fig. 43.
- the unsmoothed aperiodic spectrum of Fig. 41 with the empty segments is smoothed to produce smoothed aperiodic spectrum 1221 of Fig. 44 and stored, whence a RETURN 1393 is reached.
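The smoothing in steps 1389 and 1391 amounts to convolution with a stored bell-shaped kernel; a minimal sketch, using a Gaussian as the assumed bell shape:

```python
import numpy as np

def smooth_line_spectrum(line_spectrum, width_bins=9, sigma=2.0):
    """Convolve a line spectrum with a bell-shaped distribution of stored
    values (a Gaussian here) to produce a smoothed spectrum."""
    half = width_bins // 2
    x = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()                   # unit area preserves total level
    return np.convolve(line_spectrum, kernel, mode="same")
```

Normalizing the kernel to unit area means each spectral line is spread across neighboring bins without changing its total contribution.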
- A step 1323 is analogous to step 123 of Fig. 4, wherein the periodic spectrum and aperiodic spectrum are processed to eliminate tilt from each. It is contemplated that the skilled worker provide sufficient computer speed in CPU1 or provide an auxiliary DMA (direct memory access) processor to accomplish the processing described for the various operations detailed herein.
- operations execute a step 1331 by executing the operations of Figs. 13A and 13B first for the smoothed periodic P spectrum and then for the smoothed aperiodic AP spectrum obtained as hereinabove-described.
- the various values and flags respective to the two spectra are separately stored temporarily.
- steps 1333 and 1335 are analogous to steps 133 and 135 of Fig. 4.
- a further step 1334 provides BF and GS flag logic to determine that the proper spectrum or spectra are used in a step 1343 to compute the sensory pointer coordinates for each of glottal source and burst friction pointers BFSP and GFSP. These are output and operations occur in steps 1337, 1345, 1347 and 1349 which are analogous to steps 137, 145, 147 and 149 of Fig. 4.
- Regarding step 1334, it is noted that for many speech sounds the aperiodic AP spectrum lacks a first formant F1, and analysis of it in step 1331 therefore results in the burst-friction flag BF being set.
- Meanwhile, the periodic P spectrum has a first formant F1, causing the glottal-source flag GS to be set in step 1331. Still other sounds have both glottal source and burst friction components occurring simultaneously, as in "v" or "z".
- The aperiodic AP spectrum provides the values for computation of the coordinates Xs, Ys and Zs of the burst-friction sensory pointer BFSP, and the periodic P spectrum provides the values for computation of the coordinates Xs, Ys and Zs of the glottal source sensory pointer GSSP.
- For sounds in which the glottal source component predominates and the burst friction component is weak or nonexistent, the BFSP, if computed, exerts a negligible influence since its loudness is low or zero.
- For sounds in which the burst friction component predominates and the glottal source component is weak or nonexistent, the GSSP, if computed, exerts a negligible influence since its loudness is low or zero.
- a loudness test can be provided in step 1334 to turn off the BF or GS flag respective to a given AP or P spectrum if the AP or P spectrum respectively falls below a predetermined loudness level, instead of relying on low loudness to eliminate the influence of the weak spectrum in the difference equations (9A-C) and (9A'-C).
- A table of Fig. 47 illustrates that ordinarily, when both the BF and GS flags are set, they correspond to the aperiodic AP spectrum and periodic P spectrum respectively. However, it is possible in cases of breathy speech and some electronically synthesized speech for both the aperiodic AP spectrum and periodic P spectrum to turn on the same flag (e.g. GS).
- a logic sequence searches the table of Fig. 47 to determine whether either row of the table indicates that the same flag is set for both the P and AP spectra. Ordinarily, as illustrated in Fig. 47, this does not occur.
- step 1334 determines which spectrum P or AP should be used in step 1343 to compute the coordinates for the sensory pointer (e.g. GSSP) associated with that flag.
- the spectrum with the greater loudness is used to determine the BF or GS nature of the sound.
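The flag-resolution logic of step 1334 can be sketched as follows; the names are hypothetical, and loudness values are assumed to be available for each spectrum:

```python
def resolve_pointer_spectra(p_flags, ap_flags, p_loudness, ap_loudness):
    """Decide which spectrum (P or AP) supplies each sensory pointer.
    If the P and AP spectra assert the same flag (e.g. both GS, as in
    breathy speech), the louder spectrum keeps it."""
    chosen = {}
    for flag in ("GS", "BF"):
        in_p, in_ap = flag in p_flags, flag in ap_flags
        if in_p and in_ap:
            chosen[flag] = "P" if p_loudness >= ap_loudness else "AP"
        elif in_p:
            chosen[flag] = "P"
        elif in_ap:
            chosen[flag] = "AP"
    return chosen
```

In the ordinary case of Fig. 47, GS resolves to the periodic spectrum and BF to the aperiodic one without any loudness comparison being needed.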
- CPU1 thus electronically produces sets of values representing both a periodic spectrum and an aperiodic spectrum from one of the frequency spectra of the speech and generates two sets of signals representing a glottal-source sensory pointer position and a burst-friction sensory pointer position from the sets of values representing the periodic spectrum and the aperiodic spectrum.
- Fig. 48 illustrates a method of computing a segmentation index (SI) value from three series of coordinate values for the primed coordinates X'p, Y'p, and Z'p (see Fig. 18).
- Processor CPU3 of Fig. 1 generates a segmentation index signal representing a function of the differences between the greatest and the least primed coordinate value occurring in a time period encompassing a predetermined number of time intervals of the speech for which spectra are computed.
- CPU3 computes each segmentation value represented by the signal as a weighted sum of respective differences SΔX(i), SΔY(i), SΔZ(i), for each primed coordinate, between the greatest and the least value for the respective primed coordinate occurring in a time period encompassing a predetermined number of the time intervals. For example, this means that if a peak (local maximum) occurs and is centered in the window, the applicable difference in the window for the coordinate in which the peak occurs is the difference between the top of the peak and the least value in the window, even if a higher peak is also in the window. If a dip (local minimum) occurs and is centered in the window, the difference in the window for the coordinate in which the dip occurs is the difference between the bottom of the dip and the largest value in the window, even if a deeper dip is also in the window.
- Weights W1 and W2 for coordinates X'p and Y'p are set to unity, and a weight W3 for coordinate Z'p is set to unity if Z'p exceeds 0.6 and otherwise is set to a value of 3/Z'i.
- In this way, the weight W3 is an inverse function of that Z' coordinate over a range of its values.
- the segmentation index signal SI is generated in response to an occurrence of the peak, with the time period encompassing a peak time when the peak occurs and the peak time being approximately centered in the window time period for purposes of the segmentation index relating to its corresponding peak 1501.
- In Fig. 49, operations in an alternative version of any of steps 505, 805 or 905 of Figs. 26, 33 or 35 respectively commence with a BEGIN 1511. Then in a step 1513, CPU3 searches the three series of values for X'p, Y'p, and Z'p for a peak in any of them. Next, a step 1515 tests to determine whether the peak is yet centered in the last 25 frames (or the number of frames corresponding to a desired window time period). If not, operations branch to a point 1517 representing the "NO" output of step 507, 807, or 907 of Figs. 26, 33 or 35 respectively, since there is no significant trajectory parameter at this time.
- When the peak is centered as determined by test 1515, operations proceed to a step 1519 to compute respective differences SΔX(i), SΔY(i), SΔZ(i) for each coordinate between the local maximum and minimum values in the window.
- a step 1521 sets each of the weights W1, W2 and W3 to unity.
- If the Z' coordinate value of the centered peak does not exceed 0.6, a branch is made to a step 1525 to set W3 to the abovementioned inverse function of the Z coordinate value of the latest centered peak 1501.
- From step 1523 or 1525, operations proceed to a step 1527 to compute the segmentation index SI according to the formula:
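The formula does not survive in this text. A hedged reconstruction from the description above — a weighted sum of the max-minus-min ranges SΔX(i), SΔY(i), SΔZ(i), with W3 inversely weighted for small Z' — is:

```python
def segmentation_index(xp, yp, zp):
    """Segmentation index SI for one window of primed coordinate values.
    The weighted-sum form and the weight rules follow the text; the
    patent's exact formula (and any normalization against the 0.10
    reference value) may differ."""
    s_dx = max(xp) - min(xp)                 # S-delta-X(i): range of X' in window
    s_dy = max(yp) - min(yp)
    s_dz = max(zp) - min(zp)
    z_peak = max(zp)                         # Z' value of the centered peak
    w1 = w2 = 1.0
    w3 = 1.0 if z_peak > 0.6 else 3.0 / z_peak   # inverse weighting for small Z'
    return w1 * s_dx + w2 * s_dy + w3 * s_dz
```

A quiet stretch of speech (small ranges in all three coordinates) yields a small SI, while an abrupt excursion in any coordinate pushes SI above the segmentation threshold.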
- a RETURN 1529 is reached, completing the generation of the segmentation index signal.
- the segmentation index represented by the signal is compared with a preset reference value of approximately 0.10. If the segmentation index exceeds the reference value, a significant trajectory parameter is considered to exist for segmentation purposes in analyzing the speech to be recognized.
- A system for studying target zones, used to refine system 1 from examples of talkers' speech, displays and analyzes the target zones in a three-dimensional display of the mathematical space.
- Such a system has an Evans and Sutherland PS300 Graphic System and a VAX-750 or uVAX-II computer, a special purpose "coordinate transformer” and appropriate peripherals that allow three-dimensional viewing of line figures.
- Features of the display include knob control of "zoom", and knob control of rotation or translation relative to the system's axes.
- the mathematical space, or auditory-perceptual space is displayed with axes.
- Three-dimensional target zones are created with programs in the system.
- a target zone can be located in the space with a specified color, orientation and size as well as with a phonetic symbol located near it as desired.
- A quadruple set of values F0, F1, F2, F3 is entered for each time t at which the fundamental and the first three spectral prominences are estimated using current speech-analysis techniques. These quadruples comprise a file. Next a value of a constant R is selected and quadruples (t, log(F3/F2), log(F1/R), log(F2/F1)) are formed, where R is a reference. These are the logarithms of the formant ratios and comprise a second file. When F1 is not defined, log(F1/R) is arbitrarily set to zero.
- linear interpolation is performed by the computer to provide a file of the quadruples spaced at 5 or 10 millisecond intervals.
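The log-ratio file and its interpolation can be sketched as follows; the particular value of the reference R is an assumption, and only its role as a fixed reference matters here:

```python
import numpy as np

R = 168.0   # reference frequency for log(F1/R); this particular value is an assumption

def log_ratio_file(times, f1, f2, f3, step_ms=5.0):
    """Form (log(F3/F2), log(F1/R), log(F2/F1)) coordinate triples at the
    analysis times, then linearly interpolate onto a regular 5 ms grid
    (the text also allows 10 ms)."""
    f1, f2, f3 = (np.asarray(v, dtype=float) for v in (f1, f2, f3))
    coords = np.column_stack([
        np.log10(f3 / f2),
        np.log10(f1 / R),          # set to zero by convention when F1 is undefined
        np.log10(f2 / f1),
    ])
    grid = np.arange(times[0], times[-1] + 1e-9, step_ms)
    interp = np.column_stack(
        [np.interp(grid, times, coords[:, k]) for k in range(coords.shape[1])]
    )
    return grid, interp
```

The interpolated triples are the sensory coordinates through which the displayed pointer moves; a steady vowel produces a stationary point, while a formant transition traces a segment through the space.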
- a line segment connecting each set of coordinates can be displayed at user option.
- On the tip of each such segment, a pyramid, appropriately oriented, is displayed to represent the sensory pointer.
- the line segments and pyramids are stored in a third file.
- the mathematical space is displayed with appropriate selection of target zones.
- the user selects a sensory path, e.g. the syllable "dud" as spoken by a particular speaker.
- a rate of display such as five times real time, is selected and the run is started.
- The display shows the sensory pointer moving through the mathematical space, and its path is shown by the segments.
- The interpolated log ratio file is converted into a table representing perceptual coordinates by applying the sensory-perceptual transformation to the sensory coordinates; n second-order resonators serve as the transformation. In this way, certain rates of spectral modulation are emphasized and others attenuated. The results are stored in a fourth file.
- the perceptual path is displayed in the same way as the sensory path. Further programs enable the study of the magnitudes of velocity v, acceleration a, and curvature k as either the sensory pointer or perceptual pointer moves through the space.
- Appropriately scaled displays permit viewing of x, y, z, v, a, and k as a function of time, or similarly of log(F3), log(F2), log(F1), log(F0), v, a, and k as a function of time.
- Knob control of a cursor permits marking points of interest and determination of the values of the coordinates and dynamic parameters at those points.
- Modeling of the sensory-perceptual transformation as a single second-order resonator with a center frequency of 55 Hz and a damping factor of 0.6 results in perceptual paths that are orderly and reasonable, although experimental refinements can be made.
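A sketch of that single-resonator model, discretized for coordinate tracks sampled at the 5 millisecond intervals mentioned above (a 200 Hertz frame rate, which is an assumption), using a matched-z pole placement normalized for unity DC gain:

```python
import numpy as np

fc, zeta, fs = 55.0, 0.6, 200.0   # 55 Hz center, damping 0.6; 200 Hz frame rate assumed
w0 = 2 * np.pi * fc

# Discretize the resonator poles s = -zeta*w0 +/- j*w0*sqrt(1 - zeta^2)
# and normalize the numerator so the filter has unity gain at DC.
r = np.exp(-zeta * w0 / fs)
theta = w0 * np.sqrt(1 - zeta ** 2) / fs
a1, a2 = -2 * r * np.cos(theta), r ** 2
b0 = 1 + a1 + a2                  # unity DC gain by construction

def resonate(x):
    """Run one sensory coordinate track through the second-order resonator."""
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b0 * xn - a1 * y1 - a2 * y2
        out.append(yn)
        y2, y1 = y1, yn
    return np.array(out)

perceptual = resonate(np.ones(400))   # a step input settles to 1.0
```

Unity gain at DC means steady coordinate values pass through unchanged, while spectral-modulation rates near the 55 Hz center are passed with slight emphasis and faster rates are attenuated, in line with the emphasis described above.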
- top-down processing in a great many listening situations is significant, and the separation of the perceptual and sensory aspects of phonetic processing advantageously permits top-down processing by CPU2, CPU3 and/or CPU4.
- information derived by the system by pattern recognition apparatus, prestorage or other means is suitably used to generate additional contributions in Equations 9A, 9B and 9C that attract the perceptual pointer toward particular target zones.
- the perceptual pointer is driven not only by the sensory pointer(s) and the other factors previously mentioned, but also by the other information derived by the system as they are controlled by context, knowledge of the language, and so on.
- top-down processing involves information such as visual cues and information from other senses resulting in attractive or repulsive forces on the perceptual pointer. For example, mouth movements can be observed by pattern recognition apparatus and used to add forces that attract the perceptual pointer PP to various target zones and thus influence phonetic perception. Even more complicated forms of top-down processing are contemplated. For example, the sizes and shapes of the target zones are changed depending on the speech characteristics of the talker such as having a foreign accent, deaf speech, and so on.
- the PHE information prestored in the memory is accompanied by confidence level information bits representing a confidence between 0 and 1.
- PHE information for volume elements deep in the interior of a target zone has a high confidence, while PHE information for volume elements near the surface of a target zone has a low confidence.
- the confidence information derived from the target zones when a peak in acceleration magnitude occurs is compared with confidence information derived from the pattern recognition apparatus and a decision is made as to the most probable interpretation of the speech. Similar analyses are executed in embodiments of the invention at the lexical access level by CPU4 to identify words and meanings.
- CPU3 forms and refines the target zones in memory 31 automatically. Streams of speech are fed to the system 1 and phonetically significant events identify addresses in memory 31.
- CPU3 tabulates the frequencies of events in regions of the memory and assigns distinctive binary category codes to regions having clusters of events. The category codes are listed in a table, and the skilled worker assigns conventional phonetic symbols to the tabulated category codes generated by the system, so that the system prints out the conventional symbols needed for human interpretation of the category codes generated by the system in a manner analogous to teaching the system to spell at a phonetic element level.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Electrically Operated Instructional Devices (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
A speech processing apparatus (1) includes a memory (31) containing prestored information (PHE) indicative of different phonetic representations corresponding to respective sets of addresses (ADR) in the memory (31). Circuitry (CPU1, CPU2 and CPU3) in the apparatus (1) electrically derives a series of coordinate values (Xp, Yp, Zp) of points on a path in a mathematical space from frequency spectra (D(K)) of speech occurring in successive time intervals respectively, identifies the coordinate values (Xp, Yp, Zp) of at least a position along the path at a peak (455) in magnitude of acceleration, generates a memory address (ADR) as a function of the position coordinate values (Xp, Yp, Zp) and obtains from the memory (31) the phonetic representation information (PHE) prestored at that memory address (ADR). Methods and other apparatus for speech processing are also described.
Description
Claims
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US60246 | 1987-06-09 | ||
US07/060,246 US4820059A (en) | 1985-10-30 | 1987-06-09 | Speech processing apparatus and methods |
US07/060,397 US4813076A (en) | 1985-10-30 | 1987-06-09 | Speech processing apparatus and methods |
US60397 | 1993-05-10 |
Publications (2)
Publication Number | Publication Date |
---|---|
EP0364501A1 true EP0364501A1 (en) | 1990-04-25 |
EP0364501A4 EP0364501A4 (en) | 1993-01-27 |
Family
ID=26739731
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19880908441 Withdrawn EP0364501A4 (en) | 1987-06-09 | 1988-06-08 | Speech processing apparatus and methods |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP0364501A4 (en) |
WO (1) | WO1988010413A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5313531A (en) * | 1990-11-05 | 1994-05-17 | International Business Machines Corporation | Method and apparatus for speech analysis and speech recognition |
US5450522A (en) * | 1991-08-19 | 1995-09-12 | U S West Advanced Technologies, Inc. | Auditory model for parametrization of speech |
CN115881118B (en) * | 2022-11-04 | 2023-12-22 | 荣耀终端有限公司 | Voice interaction method and related electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4284846A (en) * | 1978-05-08 | 1981-08-18 | John Marley | System and method for sound recognition |
EP0119835A1 (en) * | 1983-03-16 | 1984-09-26 | Figgie International Inc. | Speech recognition system based on word state duration and/or weight |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3076932A (en) * | 1963-02-05 | Amplifier | ||
US3368069A (en) * | 1967-09-14 | 1968-02-06 | David H. Trott | Globe and bulb mounting for signal light |
US3679830A (en) * | 1970-05-11 | 1972-07-25 | Malcolm R Uffelman | Cohesive zone boundary detector |
FR2150174A5 (en) * | 1971-08-18 | 1973-03-30 | Dreyfus Jean | |
JPS57147781A (en) * | 1981-03-06 | 1982-09-11 | Nec Corp | Pattern matching device |
US4570232A (en) * | 1981-12-21 | 1986-02-11 | Nippon Telegraph & Telephone Public Corporation | Speech recognition apparatus |
US4608708A (en) * | 1981-12-24 | 1986-08-26 | Nippon Electric Co., Ltd. | Pattern matching system |
JPS58132298A (en) * | 1982-02-01 | 1983-08-06 | 日本電気株式会社 | Pattern matching apparatus with window restriction |
JPS59226400A (en) * | 1983-06-07 | 1984-12-19 | 松下電器産業株式会社 | Voice recognition equipment |
EP0243479A4 (en) * | 1985-10-30 | 1989-12-13 | Central Inst Deaf | Speech processing apparatus and methods. |
-
1988
- 1988-06-08 EP EP19880908441 patent/EP0364501A4/en not_active Withdrawn
- 1988-06-08 WO PCT/US1988/001977 patent/WO1988010413A1/en not_active Application Discontinuation
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4284846A (en) * | 1978-05-08 | 1981-08-18 | John Marley | System and method for sound recognition |
EP0119835A1 (en) * | 1983-03-16 | 1984-09-26 | Figgie International Inc. | Speech recognition system based on word state duration and/or weight |
Non-Patent Citations (6)
Title |
---|
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, vol. 25, no. 3, 1st June 1977, pages 252-256, New York, US; J.Y. CHEUNG et al.: "Computer recognition of linguistic stress patterns in connected speech" * |
IEEE TRANSACTIONS ON ACOUSTICS,SPEECH AND SIGNAL PROCESSING. vol. 25, no. 3, 1 June 1977, NEW YORK US pages 252 - 256 CHEUNG ET AL 'Computer recognition of linguistic stress patterns...' * |
IEEE TRANSACTIONS ON AUDIO AND ELECTROACOUSTICS. vol. 21, no. 3, 1 June 1973, NEW YORK US pages 239 - 249 ITAHASHI ET AL 'Discrete word recognition utilizing a word dictionnary and phonological rules' * |
INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING vol. 1, 3 May 1982, PARIS FRANCE pages 550 - 553 RUSKE 'Automatic recognition of syllabic speech segments using spectral and temporal features' * |
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA vol. 58, no. 4, October 1975, NEW YORK US pages 880 - 883 MERMELSTEIN 'Automatic segmentation of speech into syllabic units' * |
See also references of WO8810413A1 * |
Also Published As
Publication number | Publication date |
---|---|
EP0364501A4 (en) | 1993-01-27 |
WO1988010413A1 (en) | 1988-12-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US4813076A (en) | Speech processing apparatus and methods | |
US4820059A (en) | Speech processing apparatus and methods | |
US5758023A (en) | Multi-language speech recognition system | |
US4783807A (en) | System and method for sound recognition with feature selection synchronized to voice pitch | |
US5715367A (en) | Apparatuses and methods for developing and using models for speech recognition | |
Yin et al. | Automatic cognitive load detection from speech features | |
Cole et al. | Feature-based speaker-independent recognition of isolated English letters | |
CN114446268B (en) | Audio data processing method, device, electronic equipment, medium and program product | |
CN110718210B (en) | English mispronunciation recognition method, device, medium and electronic equipment | |
US4707857A (en) | Voice command recognition system having compact significant feature data | |
Nedjah et al. | Automatic speech recognition of Portuguese phonemes using neural networks ensemble | |
EP0364501A1 (en) | Speech processing apparatus and methods | |
Markey | Acoustic-based syllabic representation and articulatory gesture detection: prerequisites for early childhood phonetic and articulatory development | |
Broad | Formants in automatic speech recognition | |
Yavuz et al. | A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model. | |
Jones et al. | Using relative duration in large vocabulary speech recognition. | |
Singh et al. | Speech recognition system for north-east Indian accent | |
WO1989003519A1 (en) | Speech processing apparatus and methods for processing burst-friction sounds | |
Mermelstein | Computer Simulation of Articulatory Activity in Speech Production. | |
EP0245252A1 (en) | System and method for sound recognition with feature selection synchronized to voice pitch | |
Ouhnini et al. | Vocal Parameters Analysis for Amazigh Phonemes Recognition System | |
JP3899122B2 (en) | Method and apparatus for spoken interactive language teaching | |
JP3899122B6 (en) | Method and apparatus for spoken interactive language teaching | |
Bengio et al. | Use of multilayer networks for the recognition of phonetic features and phonemes | |
Schwartz | The Graduate Project of Kwangsu Han is approved |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 19891208 |
|
AK | Designated contracting states |
Kind code of ref document: A1 |
Designated state(s): DE FR GB NL SE |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: CHANG, HISAO, MING |
Inventor name: MILLER, JAMES, D. |
|
A4 | Supplementary search report drawn up and despatched |
Effective date: 19921208 |
|
AK | Designated contracting states |
Kind code of ref document: A4 |
Designated state(s): DE FR GB NL SE |
|
17Q | First examination report despatched |
Effective date: 19940505 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 19940916 |