WO1989003519A1 - Speech processing methods and apparatus for processing burst-friction sounds - Google Patents

Speech processing methods and apparatus for processing burst-friction sounds

Info

Publication number
WO1989003519A1
WO1989003519A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
burst
frequency
peak
values
Prior art date
Application number
PCT/US1988/003374
Other languages
English (en)
Inventor
Allard Jongman
James D. Miller
Original Assignee
Central Institute For The Deaf
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US07/107,488 (US4809332A)
Application filed by Central Institute For The Deaf
Publication of WO1989003519A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Definitions

  • the present invention relates to speech processing apparatus and methods. More particularly, the present invention relates to improved apparatus and methods for use in automatic speech recognition technology to process a class of speech sounds called burst-friction sounds.
  • Speech, as it is perceived, can be thought of as being made up of segments or speech sounds. These are the phonetic elements, the phonemes, of a spoken language, and they can be represented by a set of symbols, such as International Phonetic Association symbols.
  • Burst-friction spectra are involved in the perception of voiced plosives (e.g. /g/, /d/, and /b/) and voiceless aspirated and unaspirated stops or plosives (e.g. sounds of k, t or p), voiceless fricatives (e.g. s, h, sh, th in "both", f and wh) and voiced fricatives (e.g. z, zh, j, v and th in "the").
  • burst-friction spectra participate in a large part of the speech sound inventory of most languages.
  • Stage 1 is an auditory-sensory analysis of the incoming acoustic waveform whereby representation of the signal is achieved in auditory-sensory terms.
  • Stage 2 is an auditory-perceptual transformation whereby the spectral output of stage 1 is transformed into a perceptual form relevant to phonetic recognition.
  • the spectral descriptions are transformed into dimensions more directly relevant to perception.
  • the perceptual form may be related to articulatory correlates of speech production or auditory features or pattern sequences.
  • In stage 3, the perceptual dimensions of stage 2 are transformed by a phonetic-linguistic transformation into strings of phonemes, syllables, or words.
  • Stages 2 and 3 also are influenced by top-down processing wherein stored knowledge of language and events and recent inputs, including those from other senses in addition to language as heard, are brought into play.
  • Among the objects of the present invention, in addition to those of the Miller incorporated patent application, are: to provide improved speech processing apparatus and methods for automatically selecting, for automatic speech recognition purposes, particular spectral peaks in burst-friction speech sounds of unknown identity, which more frequently correspond with the peaks that scientific specialists, reviewing spectra and knowing beforehand the identity of each uttered speech sound, would pick out as characterizing the particular speech sounds; to provide speech processing apparatus and methods which are feasible alternatives to those already known in the art; to provide improved speech processing apparatus and methods which impose a relatively low computational burden when implemented in software and are relatively uncomplicated when implemented in hardware; and to provide improved speech processing apparatus and methods which are accurate, economical and reliable.
  • one form of the invention is a speech processing apparatus including an electronic memory and circuitry that derives from speech sets of digital values representative of frequency spectra.
  • the spectra have peaks at frequencies associated therewith.
  • the peaks include a highest magnitude peak for each spectrum.
  • the circuitry also generates an auditory state signal representing the presence or absence of a burst-friction auditory state of the speech.
  • Circuitry further electronically identifies, when the auditory state signal indicates the presence of a burst-friction auditory state, the highest magnitude peak for each spectrum as well as each peak having a magnitude within a range of magnitudes less than the magnitude of the highest magnitude peak, and selectively stores in distinct locations in the memory, respectively representative of normally occurring prominences of a burst-friction sound, the values of frequency of the lowest two frequencies associated with the identified peaks.
    • Fig. 1 is a block diagram of speech processing apparatus according to and operating by methods of the present invention;
  • Fig. 2 is a flow diagram of operations improved according to methods of the present invention for a main routine of CPU1 of Fig. 1;
  • Figs. 3A and 3B are two parts of a flow diagram further detailing operations in the main routine of Fig. 2 improved according to methods of the present invention;
  • Figs. 4, 5 and 6 are graphs of intensity or sound pressure level SPL in decibels versus frequency of the sound in kiloHertz comprising spectra of three speech sounds;
  • Fig. 7 is a flow diagram of a burst-friction processing method of the invention for use in a step in Fig. 3A;
  • Fig. 8 is a graph of breadth D in decibels of an intensity range for use in a method of the invention, versus intensity in decibels of a highest magnitude peak in a speech sound spectrum;
  • Fig. 9 is a graph of frequency in kiloHertz versus time in milliseconds depicting values of frequency BF2 and BF3 for storage in distinct locations in an electronic memory of Fig. 1 respectively representative of normally occurring prominences of a burst-friction sound;
  • Fig. 10 is a memory map of an electronic memory of Fig. 1 associated with spectrum diagrams and legends to illustrate methods of the invention;
  • Fig. 11 is an illustration of a mathematical model for converting from sensory pointer coordinates to coordinates Xp, Yp and Zp of a perceptual pointer in a three-dimensional mathematical space;
  • Fig. 12 is a simplified diagram of the mathematical space of Fig. 11, showing target zones for two phonetic elements, and showing a trajectory or path traced out by the perceptual pointer in the mathematical space;
  • Fig. 13 is a diagram of operations according to a method in a unit CPU2 of Fig. 1 for converting from sensory pointer coordinate values to coordinate values on a path having perceptual significance; and Fig. 14 is a flow diagram of operations for calculating trajectory parameters and testing them to determine points on the path where a predetermined condition is satisfied, and for implementing a complex target zone method and a glide detection method.
  • Corresponding reference characters indicate corresponding parts throughout the several views of the drawings.

Detailed Description of Preferred Embodiments
  • a speech processing system 1 of the invention has a microphone 11 for converting sound pressure variations of an acoustic waveform of speech to an analog electrical signal on a line 13.
  • System 1 performs a short-term analysis on the speech waveform that allows it to represent, once every millisecond, the spectral shape and the auditory state of the incoming speech. This sensory processing serves as an input to a higher level perceptual electronic system portion.
  • the perceptual electronic system portion integrates sensory information over time, identifies auditory-perceptual events (or
  • the electrical signal on line 13 is filtered by an antialiasing low pass filter 15 and fed to a sample-and-hold (S/H) circuit 17.
  • S/H circuit 17 is enabled by an oscillator 19 at a sampling frequency such as 20 kHz and supplies samples of the analog electrical signal to an analog-to-digital converter (ADC) 21, where the samples are converted in response to oscillator 19 to parallel digital form on a set of digital lines 23 connected to data inputs of a first central processing unit CPU1.
  • CPU1 reads in the latest sample in digital form upon interrupt by oscillator 19 at interrupt pin IRQ every 50 microseconds.
  • CPU1 is one of four central processing units CPU1, CPU2, CPU3 and CPU4 in Fig. 1, which respectively have programmable read only memory (ROM1, ROM2, ROM3 and ROM4), random access memory (RAM1, RAM2, RAM3 and RAM4), and a video terminal-keyboard unit (TERMKBD1, TERMKBD2, TERMKBD3 and TERMKBD4).
  • CPU1 generates data for CPU2 which is buffered by a data buffer 25.
  • CPU2 generates data for CPU3 which is buffered by a data buffer 27, and CPU3 generates data for CPU4 which is buffered by a data buffer 29.
  • CPU3 has a memory 31 of approximately 2 megabytes or otherwise sufficient capacity that holds prestored phonetically relevant information indicative of different phonetic representations, target zone identifications, and glide zone (glide nucleus or radical) identifications corresponding to respective sets of addresses in the memory.
  • CPU3 is provided with a printer 33 for recording phonetic element information in the order obtained by it from memory 31.
  • CPU4 is in one application shown in Fig. 1 programmed as a lexical access processor for converting the phonetic element information into plaintext and printing it out on a printer 35 to accomplish automatic dictation.
  • CPU4 in some applications is programmed additionally, or instead, to process the phonetic elements and synthesize speech therefrom and make it audible using an electroacoustic output transducer in a manner adapted to ameliorate hearing deficiencies or otherwise produce modified speech based on that entering microphone 11.
  • CPU4 in still other applications acts as a bandwidth compressor to send the phonetic elements through a telecommunication system along with other phonetic elements from a different speech channel with which the first speech phonetic elements are multiplexed.
  • CPU4 in yet further applications is programmed with artificial intelligence or expert systems software to interpret the phonetic elements and to produce a printed response, a synthesized speech response, a robotic response controlling computers or other electronic devices or electromechanical apparatus in home, office or factory, or to produce any other appropriate response to the speech sensed on line 13.
  • Referring to Fig. 2, operations of CPU1 commence with a main routine, as follows.
  • CPU1 computes an FFT (Fast Fourier Transform) spectrum with a resolution of 2 to 5 Hertz on a current window sample. For example, with a sampling rate of 20,000 Hertz, there are 20 samples per millisecond. Using a 24 millisecond time weighting function such as a Hamming window or a Kaiser-Bessel window, there are 480 samples. For computation purposes, the 24 milliseconds is then padded out with enough zeros to form an effective transformable time domain function having 8192 (8K) points, or about 410 milliseconds (2.5 Hertz resolution). Accordingly, the Fast Fourier Transform is computed on the 480 samples plus 7712 zeros in step 1311.
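As an illustration of the windowing and zero-padding arithmetic just described, a minimal Python/NumPy sketch of one such spectrum computation follows. The 20 kHz rate, 24 ms window, and 8192-point transform come from the text; the function and constant names are invented for the example, and the Hamming option is assumed.

```python
import numpy as np

FS = 20_000      # sampling rate, Hz (20 samples per millisecond)
WINDOW_MS = 24   # time-weighting window duration from the text
NFFT = 8192      # padded length: 8192 / 20000 s is about 410 ms, ~2.44 Hz bins

def spectrum_db(samples: np.ndarray):
    """Steps 1311/1313 sketch: Hamming-weight 480 samples, pad with
    7712 zeros, transform, and convert the magnitudes to decibels."""
    n = FS * WINDOW_MS // 1000                     # 480 samples
    frame = samples[:n] * np.hamming(n)
    mag = np.abs(np.fft.rfft(frame, NFFT))         # zero-padding via NFFT
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)
    db = 20.0 * np.log10(np.maximum(mag, 1e-12))   # guard against log(0)
    return freqs, db
```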
  • step 1313 converts the spectrum so derived to decibels as discussed in connection with step 121 of the incorporated patent application.
  • a step 1315 separates the periodic and aperiodic spectra as discussed in connection with Figs. 41-44 of the incorporated patent application to obtain a smoothed periodic spectrum and a smoothed aperiodic spectrum corresponding to the latest incoming spectrum from step 1311.
  • the separation process utilizes, for example, a harmonics sieve procedure or any other procedure which suffices to accomplish the separation.
  • step 1323 analogous to step 123 of Fig. 4 of the incorporated application wherein the periodic spectrum and aperiodic spectrum are processed to eliminate tilt from each.
  • the speech waveform is multiplied by time-window weighting functions of 5-40 millisecond duration but shifted in 1.0-2.5 millisecond steps.
  • the window duration and step size as related to bursts, transitions and relatively steady-state segments are adjusted for best performance.
  • the short-term spectrum is calculated for each segment by either DFT or linear prediction analysis (LPA).
  • The DFT, of course, produces a line spectrum with components at integral multiples of the reciprocal of the window length, while the LPA produces a smoothed spectral envelope (transfer function) with detail dependent on the number of LP parameters selected. Either spectrum is represented in log-magnitude by log-frequency dimensions.
  • operations execute a step 1331 by executing the operations of Figs. 3A and 3B first for the smoothed periodic P spectrum and then for the smoothed aperiodic AP spectrum obtained as hereinabove described.
  • the various values and flags respective to the spectra are separately stored temporarily.
  • each latest spectrum is analyzed in step 1331 so that three spectral frequencies SF1, SF2 and SF3 are computed.
  • SF1, SF2 and SF3 are in some cases the frequencies at which peaks occur, and the manner of determining them is described more specifically in connection with Figs. 3A and 3B hereinafter. Distinct lower and higher values SF1L and SF1H are computed for SF1 when nasality is present.
  • a spectral frequency reference SR is also computed to indicate the overall general pitch (timbre) of the speech so that voices with high pitch (timbre) and voices with low pitch (timbre) are readily processed by the system 1.
  • an auditory state code or signal representing presence or absence of burst-friction auditory state BF, glottal source auditory state GS, nasality NS, loudness LIGS of glottal source sound, loudness LIBF of burst-friction sound, goodness GGS of glottal source sound and goodness GBF of burst-friction sound are determined from the spectrum.
  • In a step 1333, the speech goodness values GGS and GBF are tested and the loudness index values LIGS and LIBF are tested, and if none is positive or otherwise significant, speech is absent and operations branch to a step 1335.
  • In step 1335, a set of registers in CPU1 or RAM1 (corresponding to a set of three coordinates called sensory pointer coordinates Xs, Ys and Zs) is loaded with a code "*" indicating that the coordinates are undefined.
  • In step 1337, the contents of the registers for Xs, Ys and Zs are sent to CPU2 through buffer 25 of Fig. 1.
  • If in decision step 1333 the speech goodness and loudness are positive, operations proceed to a step 1334 which provides BF (burst-friction) and GS (glottal-source) flag logic to determine that the proper spectrum or spectra are used in a step 1343 to compute sensory pointer coordinates for each of the burst-friction and glottal-source sensory pointers BFSP and GSSP.
  • In step 1343, sensory pointer coordinate value Xs is set equal to the logarithm of the ratio of SF3 to SF2, pointer value Ys is set equal to the logarithm of the ratio of SF1L to SR, and pointer value Zs is set equal to the logarithm of the ratio of SF2 to SF1H, whence step 1337 is reached.
  • The equations of step 1343 are computed once except when glottal-source and burst-friction spectra are simultaneously present, as in voiced fricatives, in which case step 1343 is executed twice to compute sensory pointer coordinates Xgs, Ygs, Zgs for the glottal-source spectrum and Xbf, Ybf, Zbf for the burst-friction spectrum.
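The step 1343 equations are three logarithmic ratios, transcribed directly below. Base-10 logarithms are assumed (consistent with the log10(f/SR) band definition used later in the text), and the function name is hypothetical.

```python
import math

def glottal_source_pointer(sf1l, sf1h, sf2, sf3, sr):
    """Step 1343: Xs = log(SF3/SF2), Ys = log(SF1L/SR), Zs = log(SF2/SF1H).
    Base-10 logarithms are assumed here."""
    xs = math.log10(sf3 / sf2)
    ys = math.log10(sf1l / sr)
    zs = math.log10(sf2 / sf1h)
    return xs, ys, zs
```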
  • After sensory pointer coordinate values Xs, Ys and Zs are sent to CPU2 in step 1337, the auditory state signal coded quantities BF, GS, NS, LIGS, LIBF, GGS and GBF are also sent in a step 1345 to CPU2 through buffer 25. Then in a step 1347, a test is made to determine if an OFF-ON switch is on, and if not, operations terminate at END 1349. If the switch is on, as is normal, operations loop back to step 1305 for obtaining the next spectrum, analyzing it and sending information to CPU2 as described above. CPU1 thus executes operations continually to obtain spectral information about the samples of speech as they arrive in real time.
  • the auditory-spectral pattern at any moment in time is given by the auditory-spectral envelope in decibels dB (Phons or Sensation Level or equivalent) against log frequency.
  • the frequency values of a sensory reference SR, as well as SF1, SF2 and SF3, are found for the vocalic portions of speech.
  • Vocalic portions are those segments or spectral components that ordinarily result from an acoustic source at the glottis and have the vocal tract, with or without the nasal tract, as a transmission path to the external air.
  • Vocalic portions include voiced speech, which has periodic spectra, and whispers or aspirated sounds, which have aperiodic spectra. Both are glottal-source (GS) spectra having a low-frequency prominence P1.
  • a sensory pointer for vocalic portions of speech has a position in a mathematical space, or phonetically relevant auditory-perceptual space, computed in step 1343 of Fig. 2.
  • This pointer is called a glottal-source sensory pointer (GSSP).
  • SF1, SF2 and SF3 are the center frequencies of the first three spectral prominences in the auditory-spectral envelope 127 of Fig. 6 of the incorporated application.
  • SF3 is interpreted as the upper edge of the spectral envelope when no clear peak P3 can be observed, such as when peaks P2 and P3 merge during a velar segment, or is taken as being a fixed logarithmic distance above SR when P3 is absent.
  • Spectral frequency SF1 generally corresponds to the center frequency of the first significant resonance of the vocal tract. However, during nasalization two peaks, or one broadened peak, appear near the first significant resonance. To take account of such spectral differences, steps 1331 and 1343 of Fig. 2 herein are made sufficiently flexible to compute the sensory pointer position differently for nasalization spectra than for other spectra.
  • a BF spectrum is analyzed differently from a GS spectrum by CPU1 in order to produce spectral frequency values SF2 and SF3 and sensory reference value SR, and the position of the resulting sensory pointer values computed in step 1343 of Fig. 2 is in the Xs, Zs plane.
  • Frequency values SF2 and SF3 are for the present purposes denominated BF2 and BF3 respectively when a BF spectrum is processed.
  • the sensory reference SR value takes the place of SF1 (and SF1H and SF1L) in the calculations.
  • the calculations of step 1343 then define the position of a pointer called the burst-friction sensory pointer (BFSP) which is distinct from the GSSP.
  • CPU1 then acts as an example of means for electronically producing a set of signals representing coordinate values with a first coordinate value which is a function of a ratio of the values of frequency stored in the distinct memory locations (e.g. log(BF3/BF2)), a second coordinate value which is substantially constant (e.g. log(SR/SR)), and a third coordinate value which is a function of a ratio of the lowest frequency associated with an identified peak to a reference frequency value (e.g. log(BF2/SR)).
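Since step 1615 (described below) sets SF1L = SF1H = SR for a burst-friction spectrum, the step-1343 equations collapse to exactly these three expressions. A usage fragment reusing the hypothetical function from the earlier sketch, with the Fig. 4 prominences and a placeholder sensory reference:

```python
# For a burst-friction spectrum: SF2 = BF2, SF3 = BF3, SF1L = SF1H = SR,
# so Xs = log10(BF3/BF2), Ys = log10(SR/SR) = 0, Zs = log10(BF2/SR).
bf2, bf3 = 2924.0, 3905.0   # the Fig. 4 prominences, in Hz
sr = 200.0                  # placeholder sensory reference; illustrative only
xs, ys, zs = glottal_source_pointer(sr, sr, bf2, bf3, sr)  # ys == 0.0
```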
  • the glottal-source GS code value is set to 1 in the auditory state signal whenever a glottal-source spectrum is above the auditory threshold.
  • the GSSP is regarded as moving through a mathematical space, or auditory-perceptual space.
  • the path of the GSSP is interrupted by silences and by burst-friction spectra.
  • When a burst-friction spectrum occurs, the GS value is set to zero and the BF value is set to 1 in the auditory state signal or code. In such case, the GSSP is replaced by the BFSP.
  • the GSSP can be regarded as moving through the mathematical space as the glottal-source spectrum changes shape and sometimes this movement is nearly continuous as in the case of the sentence, "Where were you a year ago?", where the only interruption would occur during the friction burst of "g" in "ago.”
  • the quantity GS in the auditory state code can remain at a value of one (1) through many spectra in various examples of speech, but the quantity BF in the auditory state code when set to one is generally reset to zero very shortly thereafter, because spectra which are not of the burst-friction type occur so soon thereafter.
  • burst-friction sensory pointer BFSP will usually appear and disappear shortly thereafter as friction sounds are inserted in the speech stream. As burst-friction spectra are unstable, the BFSP may exhibit considerable jitter, and it usually will not move in any smooth, continuous way in the mathematical space.
  • In voiced fricatives, both BF and GS are equal to one simultaneously.
  • both of the sensory pointers are simultaneously present as one is associated with the glottal-source spectrum of the voiced part of the voiced fricative speech sound and the other is associated with the burst-friction spectrum of the friction part of the sound.
  • Regarding step 1334, it is noted that for many speech sounds the aperiodic AP spectrum lacks a first prominence, and analysis of it in step 1331 therefore results in the burst-friction flag BF being set. Also, in many speech sounds the periodic P spectrum has a first prominence, causing glottal-source flag GS to be set in step 1331. Still other sounds have both glottal-source and burst-friction components occurring simultaneously, as in "v" or "z".
  • the aperiodic AP spectrum provides the values for computation of the coordinates Xs, Ys and Zs of the burst-friction sensory pointer BFSP, and the periodic P spectrum provides the values for computation of the coordinates Xs, Ys and Zs of the glottal-source sensory pointer GSSP.
  • The BFSP, if computed, exerts a negligible influence since its loudness is low or zero.
  • The GSSP, if computed, exerts a negligible influence since its loudness is low or zero. If the skilled worker elects, a loudness test can be provided in step 1334 to turn off the BF or GS flag respective to a given AP or P spectrum if the AP or P spectrum respectively falls below a predetermined loudness level, instead of relying on low loudness to eliminate the influence of the weak spectrum in the difference equations (9A-C) and (9A'-C') of the incorporated patent application.
  • step 1334 determines which spectrum P or AP should be used in step 1343 to compute the coordinates for the sensory pointer (e.g. GSSP) associated with that flag.
  • the spectrum with the greater loudness is used to determine the BF or GS nature of the sound.
  • CPU1 thus electronically produces sets of values representing both a periodic spectrum and an aperiodic spectrum from one of the frequency spectra of the speech and generates two sets of signals representing a glottal-source sensory pointer position and a burst-friction sensory pointer position from the sets of values representing the periodic spectrum and the aperiodic spectrum.
  • CPU2 electronically derives coordinate values on a path of a perceptual pointer from both the glottal-source sensory pointer position and burst-friction sensory pointer position.
  • CPU1 in a step 203 finds the maximum value MAX, or highest peak, of the spectrum with tilt removed. This is illustratively accomplished by first setting to zero all spectral values which are less than a predetermined threshold decibel level, so that low sound levels, noise and periods of silence will not have apparent peaks. The nonzero values remaining, if any, are checked to find the highest value among them to find the value MAX.
  • In a step 207, an appropriate preset value such as 15 dB, or preferably 10 dB, is subtracted from the maximum value MAX to yield a reference level REF.
  • In a step 209, the level REF is subtracted from all of the M values in the DFT spectrum and all of the resulting negative values are set to zero to normalize the spectrum, so that the reference line is zero dB and spectral values that fall below the reference are set to zero dB.
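Steps 203 through 209 amount to thresholding, locating MAX, and clipping everything more than REF below it. A sketch under that reading follows; the 30 dB silence threshold is an assumption, since the text says only "a predetermined threshold decibel level".

```python
import numpy as np

def normalize_spectrum(spec_db: np.ndarray, floor_db: float = 30.0,
                       ref_offset_db: float = 10.0) -> np.ndarray:
    """Steps 203-209 sketch: zero out values below a threshold so silence
    has no apparent peaks, find MAX, then re-reference to REF = MAX - 10 dB
    and clip negative results to zero."""
    s = np.where(spec_db < floor_db, 0.0, spec_db)
    if not s.any():
        return s                          # silence: nothing above threshold
    ref = s.max() - ref_offset_db         # REF, step 207
    return np.maximum(s - ref, 0.0)       # step 209: reference line is 0 dB
```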
  • In a step 211 following step 209, the fundamental frequency is found by a pitch-extraction algorithm such as that of Scheffers, M.T.M. (1983), "Simulation of auditory analysis of pitch: An elaboration of the DWS pitch meter," J. Acoust. Soc. Am. 74, 1716-25 (see Fig. 6 of the incorporated patent application), and stored as a spectral frequency SF0, or pitch.
  • CPU1 determines whether there are any positive normalized spectral values lying in a band B1, which is defined by 0 ≤ log10(f/SR) ≤ 0.80, where SR is the spectral reference and f is frequency in Hertz.
  • If at least one positive normalized spectral value is present in band B1, the spectrum is regarded as a glottal-source spectrum, and the spectrum (with tilt eliminated per step 1323) is next analyzed in each of three frequency bands B1, B2 and B3, as suggested beneath Fig. 8 of the incorporated patent application. These frequency bands are used as a way of discriminating the P1, P2 and P3 peaks, and the frequency values selected to define each band are adjusted for best results with a variety of speaking voices. Steps 217-243 and steps 247-251 in Figs. 3A and 3B for processing glottal-source spectra are the same as described in connection with Figs. 13A and 13B of the incorporated patent application, and need no further description herein.
  • If step 213 of Fig. 3A determines that there are no positive normalized spectral values in the band B1, it is concluded that the spectrum is a burst-friction spectrum (although this may also be a period of silence) and a branch is made to a step 1615 where auditory state signal BF is set to 1 and the spectral lower and higher frequency values SF1L and SF1H are both set equal to sensory reference SR, so that the later operations of step 1343 of Fig. 2 compute the sensory pointer coordinate values correctly for the burst-friction spectrum.
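The glottal-source versus burst-friction decision of step 213 thus reduces to a band-membership test on the normalized spectrum. A sketch, assuming the normalized values and bin frequencies produced by the earlier sketches:

```python
import numpy as np

def is_glottal_source(norm_db: np.ndarray, freqs: np.ndarray, sr: float) -> bool:
    """Step 213 sketch: glottal-source if any positive normalized value
    lies in band B1, i.e. 0 <= log10(f/SR) <= 0.80."""
    ratio = np.log10(np.maximum(freqs, 1e-12) / sr)
    in_b1 = (ratio >= 0.0) & (ratio <= 0.80)
    return bool(np.any(norm_db[in_b1] > 0.0))
# If this returns False, the BF flag is set (step 1615) and
# SF1L = SF1H = SR for the later pointer computation.
```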
  • A spectrum that is a burst-friction spectrum is not analyzed in the manner of Fig. 9 of the incorporated patent application and step 215 of Fig. 3A. Instead, step 1615 retrieves the set of digital data representing the corresponding stored spectrum with tilt remaining (see discussion of step 1323 hereinabove).
  • Step 1615 processes each set of digital data representative of that burst-friction spectrum with tilt, ignoring any information above 6 kHz, as described in detail hereinbelow in connection with spectrum Figs. 4-6 herein and in the flow diagram of operations of Fig. 7.
  • From step 1615, operations pass through a point Y to step 243 of Fig. 3B where the speech goodness is computed.
  • the loudness of the latest spectrum is computed according to a procedure described in Stevens, S.S., "Perceived Level of Noise by Mark VII and Decibels (E)," J. Acoust. Soc. Am., Vol. 51, No. 2, pp. 575-601 (1972), and used to calculate LIBF or LIGS and stored in a register for LIBF or LIGS depending on whether the latest spectrum is burst-friction or glottal-source respectively.
  • Operations proceed from step 1644 to calculate sensory reference SR in step 245 whence a RETURN 257 is reached.
  • Burst-friction processing operations of step 1615 are now illustrated and discussed in connection with Figs. 4-6. The determinations that are needed are seemingly contradictory and unpredictable. However, operations according to the flow diagram of Fig. 7 have been discovered which provide accurate characterizations of burst-friction spectra.
  • Fig. 4 shows a burst spectrum of [t] as produced in the word 'teen' by a male speaker.
  • a highest magnitude peak P(max) is located at 4523 Hz, with an amplitude of 69 dB.
  • Operations of step 1615 should establish burst-friction prominences BF2 at 2924 Hz (65 dB), and BF3 at 3905 Hz (65 dB), even though the two peaks selected are not as prominent as highest magnitude peak P(max).
  • Fig. 5 shows a burst spectrum of [k] as produced in the word 'cot' by a male speaker.
  • the highest magnitude peak P(max) for this spectrum is located at 1400 Hz (60 dB).
  • Operations of step 1615 should establish both burst-friction prominences BF2 and BF3 at the same frequency of 1400 Hz since the next peak (4366 Hz, 54 dB) is separated too widely in frequency from the first peak to be of interest.
  • Fig. 6 shows a burst spectrum of [k] as produced in the word 'ken' by a female speaker.
  • the highest magnitude peak P(max) is located at 3112 Hz and is much more prominent than higher-frequency peaks located nearby. Operations of step 1615 should disregard the higher-frequency peaks even though they are not separated very widely from the first peak.
  • a burst-friction processing part of step 1615 of Fig. 3A commences with a BEGIN 1651 of Fig. 7 and then goes to a step 1653 in which undefined frequencies BF2 and BF3 which are ultimately to be determined are first initialized.
  • In a step 1655, all peaks below 6 kHz (6000 Hertz) are selected by a peak detection routine. In this way only those peaks which are below a preset frequency of 6 kHz are identified, in a frequency band of 0-6 kHz. The number of these peaks is counted and stored as a number K.
  • the frequencies of the peaks are temporarily stored in serial order such that the frequency P(i) of the peak with index i is less than the frequency of the peak P(i+1) with the next higher index i+1. Also, the peak intensity in decibels of sound pressure level (SPL) is indexed and stored as a series of values PDB(i).
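Step 1655's peak listing can be sketched as a local-maximum scan restricted to 0-6 kHz. The function name and the tie-breaking rule at flat plateaus are assumptions not specified in the text.

```python
import numpy as np

def peaks_below_6khz(spec_db: np.ndarray, freqs: np.ndarray, fmax: float = 6000.0):
    """Step 1655 sketch: list local maxima below 6 kHz in ascending
    frequency order; returns P(i), PDB(i) and the count K."""
    sel = freqs < fmax
    s, f = spec_db[sel], freqs[sel]
    # a bin is a peak if it tops its left neighbor and is >= its right one
    idx = np.where((s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:]))[0] + 1
    return f[idx], s[idx], len(idx)
```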
  • the method of Fig. 7 electronically identifies, when the auditory state signal indicates the presence of a burst-friction auditory state, the highest magnitude peak for each spectrum as well as each peak having a magnitude within a range of magnitudes less than the magnitude of the highest magnitude peak.
  • This range of magnitudes has a breadth D, also called an amplitude cutoff, as shown in Figs. 4-6.
  • the range of magnitudes is substantially constant in breadth D, preferably 10 decibels, when the highest magnitude peak P(max) is above a predetermined intensity such as 55 decibels.
  • This 10 decibel range is quite suitable in processing speech signals at conversational intensity levels ranging from 55 to 75 dB.
  • Below that intensity, the breadth D is made to vary directly with the intensity in decibels of the highest magnitude peak, as shown in Fig. 8. In Fig. 8, amplitude cutoff D as a function of highest magnitude peak intensity PDB(M) is level or constant between 55 and 75 dB, and decreases as PDB(M) intensity decreases. This important feature is implemented in Fig. 7, wherein step 1657 computes breadth D according to the equation

    D = (10/55) PDB(M) for PDB(M) below 55 dB, and D = 10 dB at or above 55 dB.    (1)
  • Equation (1) corresponds to the graph of Fig. 8 of range breadth D in decibels versus intensity of the highest magnitude peak PDB(M).
  • Below 55 dB, breadth D is 10/55 of the intensity PDB(M), so that a range narrower than 10 dB is used in Figs. 4-6 when the highest peak is less than 55 dB. Between 55 and 75 dB, the breadth D is 10 dB.
  • a search in Fig. 7 is made through the peaks indexed by index i.
  • a step 1659 sets index i to one.
  • a step 1661 determines whether the peak indexed by the current value of index i is within the decibel range that has breadth D. Step 1661 thus tests whether the difference PDB(M) less PDB(i) is less than or equal to breadth D. If not, the current peak is ignored, and operations go to a step 1663 to increment index i by one. If index i is still less than or equal to the number K of peaks in a step 1665, operations loop back to step 1661 until a peak within the magnitude range is found, whence operations branch to a test step 1667.
  • Step 1667 determines whether BF2 still is not determined, as indicated by BF2 still having its initialized value from step 1653. If BF2 is undetermined, operations go to a step 1669 to set BF2 to the frequency P(i) corresponding to the current value of index i, whence the loop is reentered at step 1663. If BF2 is already determined (BF2 unequal to its initialized value), then a branch is made from step 1667 to a step 1671 to set BF3 to the frequency P(i) corresponding to the current value of index i instead.
  • If in step 1665 it is found that all of the peaks have been examined, operations proceed to a step 1673 to determine whether BF3 remains undefined (BF3 still equal to its initialized value). If still undefined, then a branch is made to a step 1675 to set BF3 equal to the frequency P(M) of the highest magnitude peak. After either of steps 1671 and 1675, or if test step 1673 finds BF3 is defined, operations proceed to a step 1677 to determine whether the burst-friction prominence values BF2 and BF3 so determined have frequencies which differ by more than 2500 Hertz. The 2500 Hertz range is thus independent of the sensory reference SR, and effectively floats in the 6 kHz voice band.
  • If BF2 and BF3 do differ by more than 2500 Hertz, a branch is made to a step 1679 to set BF3 equal to BF2, whence a RETURN 1681 is reached. If BF2 and BF3 do not differ by more than 2500 Hertz, RETURN 1681 is reached directly from step 1677.
  • the Fig. 7 burst-friction processing moves from about 60 Hz to 6000 Hz in the spectrum, and often the first two peaks within 10 dB of the level of the maximum peak are picked as BF2 and BF3. Thus, in those cases where there are two peaks within 10 dB of, and to the left of, the maximum peak, the maximum itself would not be picked, as illustrated in Fig. 4.
  • When BF2 and BF3 would otherwise differ by more than 2500 Hz, the frequency value of BF2 is also used as that for BF3, due to steps 1677 and 1679, as is shown in Figs. 5 and 7. If there are no other peaks within 10 dB of the maximum peak, the frequency value for the maximum peak is used for both BF2 and BF3, due to steps 1673 and 1675, as shown in Figs. 6 and 7.
  • Another way of understanding the operations is that they establish a test whether there are two peaks that differ in frequency by 2500 Hz or less. If not, then the value of frequency of the identified peak which is lowest in frequency is stored in both distinct memory locations BF2 and BF3. If there is only one peak, it is the identified peak which is lowest in frequency and thus its frequency is stored as BF2 and BF3.
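Pulling these rules together, the Fig. 7 selection logic (cutoff D per equation (1), ascending search for the first two in-range peaks, the single-peak fallback of steps 1673-1675, and the 2500 Hz proximity rule of steps 1677-1679) can be sketched as follows. This is one reading of the flow diagram, not the patented code itself.

```python
def select_bf_prominences(P, PDB):
    """Fig. 7 sketch: choose BF2 and BF3 from peak frequencies P (Hz,
    ascending order) with levels PDB (dB SPL). Returns (BF2, BF3)."""
    if len(P) == 0:
        return None, None                         # no peaks: silence
    m = max(range(len(P)), key=lambda i: PDB[i])  # highest magnitude peak M
    # step 1657, equation (1): D = 10 dB at conversational levels,
    # scaled down in proportion below 55 dB
    D = min(10.0, 10.0 * PDB[m] / 55.0)
    bf2 = bf3 = None
    for i in range(len(P)):                       # steps 1659-1671
        if PDB[m] - PDB[i] <= D:                  # step 1661: within cutoff
            if bf2 is None:
                bf2 = P[i]                        # step 1669: lowest in-range peak
            else:
                bf3 = P[i]                        # step 1671: next one
                break
    if bf3 is None:                               # steps 1673/1675: lone peak
        bf3 = P[m]
    if abs(bf3 - bf2) > 2500.0:                   # steps 1677/1679: too far apart
        bf3 = bf2
    return bf2, bf3

# Fig. 5 example: peaks at 1400 Hz (60 dB) and 4366 Hz (54 dB)
# -> BF2 = BF3 = 1400.0, since the peaks are separated by more than 2500 Hz.
print(select_bf_prominences([1400.0, 4366.0], [60.0, 54.0]))
```

Applied to the Fig. 4 peaks, the same sketch returns BF2 = 2924 Hz and BF3 = 3905 Hz, passing over the 4523 Hz maximum, and for Fig. 6 it returns the maximum's frequency for both values, matching the behavior the text describes.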
  • the method was experimentally tested by hand for speech stops /p/, /t/ and /k/ using 36 spectra from a male speaker and 36 spectra from a female speaker.
  • a 24 millisecond Hamming window was centered over the onset of the burst.
  • the spectral peak with maximum amplitude below 6 kHz was located.
  • the center frequencies of the first two peaks within 10 dB of the maximum were picked as the burst-friction components BF2 and BF3, with the exceptions discussed above such as when there is only one peak or a higher frequency peak was separated from BF2 by 2500 Hz or more.
  • the process yielded determinations of the three speech stops which were correct in 96% of the cases.
  • BF2 and BF3 can be graphed as a function of time, as shown in Fig. 9.
  • a burst spectrum of [t] as produced in the word 'teen' by a male speaker is graphed as a function of time.
  • BF2 and BF3 are displayed for each millisecond of burst-friction signal as lower and upper plus signs respectively. Due to the scale of the graph, the plus signs merge into line segments. However, presence on the same line segment does not indicate identity of interpretation.
  • Occasionally a spectrum 1693, which has only one peak, occurs, and the frequency of the one peak is stored as the value of both BF2 and BF3.
  • the processing selectively stores in distinct locations in the memory (e.g. BF2 and BF3), respectively representative of normally occurring prominences of a burst-friction sound, the values of frequency of the lowest two frequencies associated with the identified peaks.
  • the processing instead stores in both distinct locations in the memory the value of frequency of the highest magnitude peak when there are no other identified peaks.
  • it stores in both distinct locations in the memory the lowest value of frequency associated with an identified peak when the lowest two frequencies associated with the identified peaks differ by at least a predetermined value of frequency.
  • circuits and methods are provided for electronically producing a set of coordinate values for a sensory pointer in a mathematical space from the values (e.g. BF2 and BF3) in the distinct memory locations, and electronically deriving a series of coordinate values of a perceptual pointer on a path in a mathematical space with a contribution from the set of coordinate values for the sensory pointer.
  • the memory or target space storage 31 holds prestored information representative of a speech stop sound at addresses corresponding to a region 451 of the mathematical space which cannot be entered by the sets of coordinate values for the sensory pointer BFSP.
  • Fig. 13 shows operations for implementing the model of Figs. 11 and 12 in a digital computer so that difference equations are solved for the latest points on the path of the perceptual pointer PP.
  • a series of coordinate values of the perceptual pointer PP on a path in a mathematical space are electronically derived with a contribution from the set of coordinate values for the sensory pointer.
  • Fig. 14 further shows operations for detecting points such as 455 and 457 of Fig. 12 where perceptual pointer PP has a significant trajectory parameter.
  • Target space storage memory 31 holds prestored phonetically relevant information such as identifiers of phonetic elements corresponding to respective sets of addresses in the memory.
  • CPU2 electronically derives per Fig. 13 a series of coordinate values of points on a path in the mathematical space as a function of repeatedly determined values of frequency (e.g. BF2 and BF3) selectively stored in the distinct memory locations when the auditory state signal represents a burst-friction sound.
  • CPU3 electronically computes values of a trajectory parameter from the series of coordinate values.
  • CPU3 determines an address in memory 31 corresponding to coordinates of a position where the preestablished condition is satisfied, and obtains from the memory 31 the prestored phonetically relevant information corresponding to the address so determined. Glides are also detected.
  • the various operations of Fig. 14 are numbered to correspond with the description of Fig. 33 of the incorporated patent application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A speech processing apparatus (1) includes an electronic memory (RAM1) and circuitry (CPU1) that derives from speech sets of digital values representative of frequency spectra (1691, 1693). The spectra have peaks (i) at frequencies (P(i)) associated therewith. The peaks include a highest magnitude peak (M) for each spectrum. The circuitry also generates an auditory state signal (BF) representing the presence or absence of a burst-friction auditory state of the speech. The circuitry further electronically identifies, when the auditory state signal (BF) indicates the presence of a burst-friction auditory state, the highest magnitude peak (M) for each spectrum as well as each peak (i) having a magnitude (PDB(i)) within a range (D) of magnitudes less than the magnitude (PDB(M)) of the highest magnitude peak (M), and selectively stores in distinct locations in the memory (RAM1), respectively representative of normally occurring prominences (BF2, BF3) of a burst-friction sound, the values of frequency (P(i)) of the lowest two frequencies associated with the identified peaks. Other speech processing apparatus and methods are also disclosed.
PCT/US1988/003374 1987-10-08 1988-09-30 Speech processing methods and apparatus for processing burst-friction sounds WO1989003519A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US107,488 1987-10-08
US07/107,488 US4809332A (en) 1985-10-30 1987-10-08 Speech processing apparatus and methods for processing burst-friction sounds

Publications (1)

Publication Number Publication Date
WO1989003519A1 true WO1989003519A1 (fr) 1989-04-20

Family

ID=22316888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1988/003374 WO1989003519A1 (fr) 1987-10-08 1988-09-30 Speech processing methods and apparatus for processing burst-friction sounds

Country Status (1)

Country Link
WO (1) WO1989003519A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0743613A2 (fr) * 1995-05-17 1996-11-20 Olympus Optical Co., Ltd. Data reproduction system for reproducing and outputting multimedia information using a printer
US11153472B2 (en) 2005-10-17 2021-10-19 Cutting Edge Vision, LLC Automatic upload of pictures from a camera

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4060695A (en) * 1975-08-09 1977-11-29 Fuji Xerox Co., Ltd. Speaker identification system using peak value envelop lines of vocal waveforms
US4087630A (en) * 1977-05-12 1978-05-02 Centigram Corporation Continuous speech recognition apparatus
US4087632A (en) * 1976-11-26 1978-05-02 Bell Telephone Laboratories, Incorporated Speech recognition system
US4435617A (en) * 1981-08-13 1984-03-06 Griggs David T Speech-controlled phonetic typewriter or display device using two-tier approach
US4610023A (en) * 1982-06-04 1986-09-02 Nissan Motor Company, Limited Speech recognition system and method for variable noise environment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4060695A (en) * 1975-08-09 1977-11-29 Fuji Xerox Co., Ltd. Speaker identification system using peak value envelop lines of vocal waveforms
US4087632A (en) * 1976-11-26 1978-05-02 Bell Telephone Laboratories, Incorporated Speech recognition system
US4087630A (en) * 1977-05-12 1978-05-02 Centigram Corporation Continuous speech recognition apparatus
US4435617A (en) * 1981-08-13 1984-03-06 Griggs David T Speech-controlled phonetic typewriter or display device using two-tier approach
US4610023A (en) * 1982-06-04 1986-09-02 Nissan Motor Company, Limited Speech recognition system and method for variable noise environment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0743613A2 (fr) * 1995-05-17 1996-11-20 Olympus Optical Co., Ltd. Data reproduction system for reproducing and outputting multimedia information using a printer
EP0743613A3 (fr) * 1995-05-17 2000-04-12 Olympus Optical Co., Ltd. Data reproduction system for reproducing and outputting multimedia information using a printer
US11153472B2 (en) 2005-10-17 2021-10-19 Cutting Edge Vision, LLC Automatic upload of pictures from a camera
US11818458B2 (en) 2005-10-17 2023-11-14 Cutting Edge Vision, LLC Camera touchpad

Similar Documents

Publication Publication Date Title
US4809332A (en) Speech processing apparatus and methods for processing burst-friction sounds
US4820059A (en) Speech processing apparatus and methods
US4783807A (en) System and method for sound recognition with feature selection synchronized to voice pitch
US5873062A (en) User independent, real-time speech recognition system and method
US6553342B1 (en) Tone based speech recognition
EP2083417B1 (fr) Sound processing device and program
JPH08263097 (ja) Method for recognizing spoken words and system for identifying spoken words
US4707857A (en) Voice command recognition system having compact significant feature data
WO2007017993A1 (fr) Sound signal processing device capable of identifying a sound production period, and sound signal processing method
CN112712823 (zh) Drawl detection method, apparatus, device and storage medium
JP3174777B2 (ja) Signal processing method and apparatus
Broad Formants in automatic speech recognition
Tchorz et al. Estimation of the signal-to-noise ratio with amplitude modulation spectrograms
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
WO1989003519A1 (fr) Speech processing methods and apparatus for processing burst-friction sounds
EP0364501A4 (en) Speech processing apparatus and methods
KR930010398B1 (ko) Method for detecting transition intervals in speech signal waveforms using the asymmetry rate
Vicsi et al. Continuous speech recognition using different methods
Waardenburg et al. The automatic recognition of stop consonants using hidden Markov models
WO1987003127A1 (fr) System and method for sound recognition with feature selection synchronized to voice pitch
JPS6148898 (ja) Voiced/unvoiced decision device for speech
Lea et al. Algorithms for acoustic prosodic analysis
KR19990087730 (ko) Real-time speech recognition system for unspecified speakers and method thereof
Kaminski Developing A Knowledge-Base Of Phonetic Rules
Kolokolov Preprocessing and Segmentation of the Speech Signal in the Frequency Domain for speech Recognition

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): DE FR GB NL SE