WO1987003127A1 - System and method for sound recognition with feature selection synchronized to voice pitch - Google Patents

System and method for sound recognition with feature selection synchronized to voice pitch

Info

Publication number
WO1987003127A1
Authority
WO
WIPO (PCT)
Prior art keywords
producing
signal
vector
fricative
sound
Prior art date
Application number
PCT/US1985/002229
Other languages
English (en)
Inventor
John Marley
Original Assignee
John Marley
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by John Marley
Priority to EP19850905990 (EP0245252A1)
Priority to PCT/US1985/002229 (WO1987003127A1)
Publication of WO1987003127A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition

Definitions

  • The invention relates to systems and methods for speaker-independent continuous and connected speech recognition and characteristic sound recognition, and more particularly to systems and methods for dealing with both rapid and slow transitions between phonemes and characteristic sounds, for dealing with silence and distinguishing between certain closely related phonemes and characteristic sounds, and for processing the phonemic recognition in real time.
  • speaker-dependent voice command recognition systems there are a number of devices presently available. They are capable of receiving, for example, simple word commands and producing corresponding digital command codes which are transmitted to a computer. Typically, such voice command systems must be "trained" to recognize particular command words spoken by a particular speaker. It should be appreciated that an average person can not speak the same word in exactly the same way twice. In fact, there is a great variation in the speech waveforms produced when an average person tries to speak the same word a number of times. Present speaker-dependent voice command recognition systems are not capable of storing digitized speech waveform data for only one utterance of a particular command, and then later reliably recognizing the same word spoken by the same speaker.
  • the presently available systems are "trained" by instructing the speaker to speak the desired word into the system's microphone a number of times.
  • The microphone signal for each repetition of the word is amplified and digitized, typically by using zero-crossing techniques, and sometimes by using analog-to-digital converters and processing the resulting digital output.
  • Some of the available systems compare each stored version of that word with the digitized version of a later spoken utterance of that command word to try to match the spoken command with one of the stored versions of it.
  • Various auto-correlation operations are performed to determine if there is a match.
  • The invention provides a system and method for producing a binary signal having a "1" level during positive pressure wave portions of a sound signal and a "0" level during negative pressure wave portions of the sound signal, detecting the time point of the major peak positive and negative excursions of each pitch cycle of the sound signal and producing a corresponding pitch cycle marker signal that occurs substantially at the beginning of each pitch cycle, producing a first number that represents the duration of a "1" level of the binary signal most closely following a pulse of the pitch cycle signal and producing a second number that represents the duration of the following "1" level of the binary signal, composing a vector from the first and second numbers, comparing that vector with a plurality of stored vector domains to determine if the present vector falls within any of the stored vector domains, and producing a character or signal or a phoneme identifying signal or code representing a phoneme or sound corresponding to a one of the stored vectors that most nearly matches the present input vector.
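As a rough illustration of the vector comparison described above, the following Python sketch forms a two-element vector from the durations of the first two "1" levels that follow a pitch cycle marker and tests it against stored domains. The rectangular domain shape, the numeric bounds, the labels "AH" and "EE", and the function names are all assumptions made for illustration; the source does not specify how the stored vector domains are represented.

```python
# Hypothetical rectangular domains: (min_d1, max_d1, min_d2, max_d2), in microseconds.
# The real stored vector domains are not given in the text.
DOMAINS = {
    "AH": (400, 700, 300, 600),
    "EE": (100, 300, 100, 300),
}


def first_two_one_durations(runs):
    """Return the durations of the first two "1" runs after a pitch marker.

    runs is a list of (level, duration) pairs that starts at the marker.
    Returns None when fewer than two "1" runs are present.
    """
    ones = [duration for level, duration in runs if level == 1]
    if len(ones) < 2:
        return None
    return ones[0], ones[1]


def classify(vector, domains=DOMAINS):
    """Return the label of the first domain containing the vector, else None."""
    if vector is None:
        return None
    d1, d2 = vector
    for label, (lo1, hi1, lo2, hi2) in domains.items():
        if lo1 <= d1 <= hi1 and lo2 <= d2 <= hi2:
            return label
    return None
```

A call such as `classify(first_two_one_durations(runs))` returns a label only when the vector falls inside a stored domain, which mirrors the "falls within any of the stored vector domains" test in the text.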
  • the system also produces a first "running average" of the durations of the "0" levels for a plurality of the pitch cycles of the binary signals and also produces a summation of the durations of the "1" and “0” levels of each and every pitch cycle, divides that summation by the sum of the running averages of the "1" and “0” intervals, and, if there is a remainder, uses that remainder to obtain a correction factor that is added to or multiplied by the input vector formed from the successive "1" levels to compensate for inaccuracies in the input vector caused by the influence of mismatches between the pitch cycle of the sound presently uttered by a speaker and the resonant frequency of the present configuration of the mouth cavity of the speaker.
  • the change or "velocity" of the durations of the "1" levels is computed on a real time basis, as is the rate of change, or “acceleration” of the "1" levels.
  • accelerations and/or velocities which represent the velocities and/or accelerations of various parts of the speaker's articulation apparatus character demarcation signals or phoneme demarcation signals or codes that represent the beginning or end of a character or phoneme represented by the present sound signal are produced.
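The "velocity" and "acceleration" of the "1" level durations can be pictured as first and second differences of the duration sequence. The sketch below assumes one duration value per pitch cycle and uses simple differencing; the source does not state the exact formula, so this is only one plausible reading, and the function names are hypothetical.

```python
def velocity(durations):
    """First difference: change in "1" duration from one pitch cycle to the next."""
    return [b - a for a, b in zip(durations, durations[1:])]


def acceleration(durations):
    """Second difference: change in velocity between successive pitch cycles."""
    return velocity(velocity(durations))
```

Under this reading, a demarcation signal could be raised when the magnitude of either quantity crosses a chosen limit, although the text does not give such a limit.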
  • fricative vectors are computed between the "1" and "0" running averages and are compared with a plurality of stored fricative vector domains to determine the best match of the stored fricative domains with the presently computed fricative vector.
  • a fricative phoneme signal corresponding to a one of the stored fricative vector domains that best matches the presently computed fricative vector is produced if there is such a matching, indicating non-voiced fricative phonemes or fricative content in a voiced phoneme.
  • Intervals of silence, indicated by quite small durations of sporadic "1" intervals, are assigned a "0" level.
  • the identified interval is used to help identify the plosive phoneme corresponding to the matched stored silence duration. All of the foregoing information is utilized to generate various phoneme code signals that can be used to drive a printer, display a sequence of phoneme characters that accurately represent the incoming speech or other characteristic sounds or provide various pre-arranged activities in a wide variety of machines or apparatus which can benefit from the guidance of speech and sound signals.
  • an apparatus and method for providing a compact digitizing of a speech or sound waveform, by producing the above first and second running averages of the durations of the "1" and "0" levels, respectively, sampling the first and second running averages at predetermined intervals, operating on the sampled data to identify and produce event data for significant events in the speech or sound waveform, including rapidly rising and rapidly falling values of the first running average, "silence" values of the second running average to produce a very compact digital representation of the meaning of the speech or sound waveform.
  • this event data is arranged in "windows" of significant events of the speech waveform, and the event data in each window is compared to previously stored event data which previously has been similarly obtained by speaking selected words into the apparatus and producing and storing significant data, identifying matches of significant event data in the window to stored reference event data, and repeating the procedure for event data in another window that begins immediately after a matching is detected, until a predetermined number of consecutive failures to match an event in a particular window to that number of reference events occurs, or until all events of the utterance have been tested against reference events.
  • event data for each significant event of a first window is compared to reference event data for a first reference event, and any matching is identified, in accordance with weighting criteria, and a score representing the degree of mismatching is accumulated.
  • a new window of subsequent significant events of the speech or sound waveform is compared to a subsequent reference event.
  • a higher degree of recognition of voiced commands is achieved at a lower degree of complexity than has been previously achieved.
  • a microphone, amplifying and filter circuitry, inflection point detecting circuitry, and a microprocessor are included in a circuit to produce the sampled values of the first and second running averages.
  • the operations on the sampled values of the first and second running averages are performed by a desk top computer coupled by a cable to the unit, and the instructions for conducting these operations are contained in a program loaded into the desk top computer.
  • Fig. 1 is a block diagram of an embodiment of the invention.
  • Fig. 2 is a diagram showing a portion of a speech waveform and a binary waveform corresponding to major inflection points of the speech waveform.
  • Fig. 3 is a diagram of a decision tree in which input vectors computed in real time, silence intervals, and swoop trajectory directions computed in real time are compared to entries in a stored vector map or lookup table to identify the most recently uttered phoneme.
  • Fig. 4 is a circuit schematic diagram of the microphone signal amplifier of Fig. 1.
  • Fig. 5 is a circuit schematic diagram of the audio band pass amplifier of Fig. 1.
  • Fig. 6 is a circuit schematic diagram of the pitch band pass amplifier of Fig. 1.
  • Fig. 7 is a circuit schematic diagram of the major peak detector circuit in Fig. 1.
  • Fig. 8 is a diagram showing several cycles of a speech waveform, and is useful in describing the operation of the circuit of Fig. 7.
  • Fig. 9 is a circuit schematic diagram of the inflection point detector circuit of Fig. 1.
  • Fig. 10A is a diagram including two waveforms useful in describing the operation of the inflection point detector circuit of Fig. 9.
  • Fig. 10B is a diagram showing two waveforms that are also useful in describing the operation of the circuit of Fig. 9.
  • Fig. 11 is a circuit schematic diagram of the threshold limiting circuit of Fig. 1.
  • Fig. 12 is a circuit schematic diagram of the pulse shaper and latch circuit of Fig. 1.
  • Figs. 13A and 13B constitute a flow chart of a program executed by microcomputer 10 of Fig. 1 in accordance with the phoneme recognition method of the present invention.
  • Fig. 14 is a flow chart of a program executed by the microcomputer 10 of Fig. 1 to process binary data produced by the inflection point circuit of Fig. 1.
  • Fig. 15 is a flow chart of a subroutine executed by the microcomputer 10 of Fig. 1 to service an interrupt request produced by the shaper and latch circuit of Fig. 1.
  • Fig. 16 is a flow chart of a subroutine executed by microcomputer 10 of Fig. 1 to compute updated values of certain variables used in the execution of the program of Figs. 13A and 13B.
  • Fig. 17 is a generalized fricative vector domain map useful for identifying an unknown fricative phoneme.
  • Fig. 18 is a generalized voiced phoneme vector domain map useful in identifying an unknown voiced phoneme.
  • Fig. 19 is a diagram illustrating another embodiment of the invention, which provides a speaker-dependent voiced command recognition system.
  • Fig. 20 is a graph of two waveforms that are useful in describing the operation of the device of Fig. 19.
  • Fig. 21 is a memory map useful in describing the operation of the device of Fig. 19.
  • Fig. 22 is a diagram useful in explaining comparison of event data derived from a voiced command spoken into the device of Fig. 19 with stored reference event data previously spoken into the device of Fig. 19 during a "training" session.
  • Fig. 23 is a flow chart of a program executed by a computer to effectuate the event comparison procedure illustrated in Fig. 22.
  • Fig. 24 is a block diagram of an embodiment of the invention with a higher degree of circuit integration than the embodiment of Fig. 1.
  • a speech analyzer circuit 1 includes a microphone 2 which receives audible speech sounds and produces an electrical signal that is applied to the input of a microphone signal amplifier circuit 3.
  • speech analyzer circuit 1 produces electrical signals on bus 21 that identify speech phonemes and other audible sound-representing electrical signals.
  • the electrical signals produced on bus 21 can be ASCII signals or the like for causing phonemic symbols or other characters to be printed or displayed on a suitable screen.
  • the signals on conductor 21 can be utilized to control a variety of other kinds of electromechanical apparatus, such as desk-top computers, automobiles, robots and the like, in response to voice commands.
  • the signals on conductor 21 can also be utilized to operate apparatus such as devices for aiding speech-impaired persons, to operate phonetic typewriters, and can find many other applications in the general field of speech-to-machinery communication.
  • the electrical signals on bus 21 do not, however, represent speech that has been recognized in the semantic sense, nor do the signals on bus 21 represent the correct spelling used to represent sounds and words in various human languages.
  • the electrical signals on bus 21 only represent characteristic sounds, such as phonemes which are generally used in all spoken languages.
  • the output produced by amplifier 3 is applied by conductor 4 to the input of an audio band pass filter amplifier 5, which has a center frequency of approximately 550 hertz and passes a band of frequencies adequate for phoneme recognition.
  • the output of filter amplifier 5 is produced on conductor 7 and applied to the input of an inflection point detector circuit 8, which performs the function of generating a binary output signal on conductor 9.
  • the positive and negative transitions of binary signal 9 occur at precisely the times of occurrence of major inflections of the input signal on conductor 7.
  • a major inflection is one that occurs when the positive-going or negative-going waveform on conductor 7 endures for at least approximately 50 microseconds before the next point of inflection arrives and the slope of the waveform reverses direction.
  • the binary waveform on conductor 9 has the appearance of the signal 44C in Fig. 2, which is a duplicate of speech waveform 33 and binary signal 44C in Fig. 7 of my prior patent 4,284,846 entitled "SYSTEM AND METHOD FOR SOUND RECOGNITION", issued August 18, 1981 and entirely incorporated herein by reference.
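The definition of a major inflection can be sketched in software, assuming a uniformly sampled version of the waveform on conductor 7. This is only a digital approximation for illustration: the patent performs this step with the analog comparator circuit of Fig. 9, and the function name, the sampling assumption, and the handling of flat segments are not from the source.

```python
def major_inflections(samples, dt_us, min_run_us=50.0):
    """Return times (in microseconds) of slope reversals that end a monotonic
    run lasting at least min_run_us.

    samples: waveform values taken every dt_us microseconds (an assumption;
    the circuit described in the text works on the analog signal directly).
    """
    times = []
    direction = 0       # +1 rising, -1 falling, 0 not yet known
    run_start = 0.0     # time at which the current monotonic run began
    for i in range(1, len(samples)):
        diff = samples[i] - samples[i - 1]
        if diff == 0:
            continue    # flat segment: treated as a continuation of the run
        new_direction = 1 if diff > 0 else -1
        if direction == 0:
            direction = new_direction
            run_start = (i - 1) * dt_us
        elif new_direction != direction:
            t = (i - 1) * dt_us          # time of the extreme sample
            if t - run_start >= min_run_us:
                times.append(t)          # run was long enough: major inflection
            direction = new_direction
            run_start = t
    return times
```

Short wiggles whose preceding run is under the limit are ignored, which is the filtering effect the 50 microsecond criterion is meant to provide.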
  • Inflection point detector circuit 8 has an adjustment circuit 22 which is connected to a ground conductor 23. Adjustment circuit 22 allows normalizing of the binary waveform on conductor 9 to a "standard" waveform. (As subsequently explained, the input circuitry used in inflection point detector circuit 8 has various input offset leakage currents which must be compensated for to produce a standard offset.)
  • the binary waveform on conductor 9 is a real time binary signal that is fed into a capture register 11 of a single chip microcomputer 10, which can be any of a variety of available devices, such as a Motorola MC68701.
  • the above described path from microphone 2 to capture register 11 of microcomputer 10 is one of the two input signal paths to microcomputer 10.
  • the other signal path applies the amplified analog speech signal of conductor 4 to the input of a band pass "pitch” filter amplifier 6, the output of which is applied by conductor 12 to the input of a negative peak detector circuit 13.
  • the band pass "pitch" filter amplifier 6 has a much lower pass band than the audio band pass filter amplifier 5, because the purpose of bandpass filter 6 is to encompass the "pitch" or vocal cord frequency of average human voices.
  • the output signal 14, which identifies the times of occurrence of major negative peaks of the band pass filter output signal on conductor 12, is applied to the input of a threshold limiting circuit 15.
  • Threshold limiting circuit 15 allows the microcomputer 10 to provide a threshold adjustment voltage on conductor 16 to raise or lower the sensitivity of threshold limiting circuit 15, so that only the negative peaks on conductor 14 having the greatest amplitude are passed onto conductor 17. These maximum amplitude signals on conductor 17 are referred to as "pitch trigger" signals or pulses.
  • the "pitch trigger” signals such as 24 are applied by conductor 17 to an input of a pulse shaper and latch circuit 18.
  • The purpose of circuit 18 is to greatly shorten the time of the leading edge of the trigger pulses to operate the binary latch.
  • When the latch circuit in block 18 is set, it produces a negative edge, such as the one indicated by reference numeral 25.
  • This negative edge 25 is interpreted as an interrupt by interrupt circuitry 27 inside microcomputer 10.
  • After the microcomputer has interpreted and serviced the interrupt request signal on conductor 19, microcomputer 10 produces a "clear" signal 26 on conductor 20 to clear the latch circuitry in block 18.
  • this circuitry provides a means of increasing or lessening, under microcomputer control, the select number of pitch trigger signals that might be considered in the phoneme analyzing process of the invention. Otherwise, there would be frequent instances of multiple pitch triggers on strong or stressed sounds.
  • the selected pitch trigger signal serves as a pointer to locate the first negative pressure wave related to the onset of each glottal pulse of the speaker, in order to allow locating and measuring the durations of the following first two positive pressure waves, which, in accordance with my recent discoveries, contain a major portion of the information that allows accurate determination of the identity of the phoneme or characteristic speech sound presently being input to microphone 2.
  • the input capture register 11 of microcomputer 10 operates in conjunction with a software subroutine that detects the occurrence of any negative-going or positive-going transition of the binary waveform on conductor 9 and stores the time of that occurrence in a 16 bit software capture register. This provides accuracy of "capture" of each transition of the binary speech waveform 9 to within one microsecond.
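Because the capture times are held in a 16 bit register, the interval between two transitions has to be computed modulo 65,536 so that a counter wrap between captures does not corrupt the result. The sketch below assumes a free-running counter that ticks once per microsecond and intervals shorter than one full wrap; both are assumptions, and the function names are hypothetical.

```python
def interval_ticks(start, end, bits=16):
    """Elapsed ticks between two captured counter values, allowing one wrap."""
    return (end - start) & ((1 << bits) - 1)


def run_durations(capture_times):
    """Durations between successive transitions of the binary waveform."""
    return [interval_ticks(a, b) for a, b in zip(capture_times, capture_times[1:])]
```

With a one microsecond tick, each value returned by `run_durations` is the length in microseconds of one "1" or "0" level of the binary waveform.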
  • the disclosed arrangement allows microcomputer 10 to capture, in real time and with very high accuracy, each major inflection point (as defined above) of analog speech waveforms on conductor 7.
  • the binary waveform on conductor 9 is, in effect, a "piece-wise linear" approximation of the analog signal on conductor 7.
  • the characteristic ratios or vectors between such major inflection points can be compared to stored phonemic vectors by means of the decision tree shown in Fig. 3 (which is nearly identical to Fig. 8 in my above mentioned patent 4,284,846) to rapidly identify, in real time, the phoneme presently being uttered by a speaker into microphone 2 or the characteristic sound presently being received by microphone 2.
  • a "vector map" representation of the first two positive pressure wave time durations such as the vector maps shown in Fig. 17 and 18, provides a more accurate decision tree than can be achieved using calculated ratios, which are scalar quantities, compared to vectored quantities which include scalar amplitudes and directions.
  • a detailed circuit schematic diagram of a well known microphone amplifier circuit 3 is shown. It includes an inexpensive electret microphone cell 2 coupled to ground conductor 23 and also coupled by a bias resistor 31 to a five volt supply conductor 32.
  • the output of microphone cell 2 which is the same as microphone 2 in Fig. 1, is connected to one terminal of a 4.7 microfarad capacitor 34, the other terminal of which is connected by means of a 2.2 kilohm resistor 35 to the negative input of an operational amplifier 37, which can be a National Semiconductor LM324.
  • the positive input of operational amplifier 37 is connected to conductor 39, to which a bias voltage of approximately two volts is applied.
  • Conductor 36 is connected to the negative input of operational amplifier 37, and is coupled by a 100 kilohm feedback resistor 38 to the output of operational amplifier 37, which is also connected to conductor 4, also shown in Fig. 1.
  • the electret microphone 2 has a flat pass band from under 100 hertz up to over 15 kilohertz.
  • Capacitor 34 provides audio frequency coupling that blocks off the bias voltage from the input of operational amplifier 37.
  • the gain of this amplifier circuitry is, of course, determined by the ratio of the resistance of resistor 38 to the resistance of resistor 35. However, care has to be taken to ensure that even slight coupling from the digital signal lines associated with microcomputer 10 onto conductors 36 and 39 as a result of the physical layout configuration of microphone amplifier circuit 3 is avoided.
  • a well known audio band pass amplifier circuit 5 has its input connected to conductor 4.
  • Conductor 4 is coupled by 18 kilohm resistor 41 to conductor 43.
  • Conductor 43 is coupled by 5.6 kilohm resistor 45 to ground, and is also coupled by .01 microfarad capacitor 42 to conductor 48, and by .01 microfarad capacitor 44 to conductor 47.
  • a 240 kilohm feedback resistor 46 is coupled between conductors 47 and 48.
  • Conductor 48 is connected to the negative input of operational amplifier 50, which can be a National LM324 operational amplifier. Its output is connected to conductor 47 and its positive input is connected to conductor 49, to which a bias voltage of approximately two volts is applied.
  • The purpose of this circuit is to eliminate very high frequency, rapidly fluctuating signals that could create pulses that are so fast, and are of such short duration (below approximately 50 microseconds) that microcomputer 10 cannot reliably interpret them.
  • the circuit of Fig. 5 must pass an adequate number of pulses representing fricative phonemes.
  • An adequate range of frequencies for fricatives and voiced sounds has a center frequency of approximately 560 hertz and a pass band going from about 200 hertz to approximately 2300 hertz.
  • the above indicated component values produce this pass band and provide a gain of approximately 10.
  • Fricative sounds typically have frequencies of approximately 4,000 hertz to 6,000 hertz. Some fricative high frequency components do get through the simple band pass filter 5, which high frequency components are necessary to provide all of the "clues" needed to adequately distinguish various fricative sounds.
  • the audio band pass amplifier 5, with the above-indicated component values, provides an output range of signal amplitudes from approximately 60 millivolts peak-to-peak to approximately 3.3 volts peak-to-peak as inputs to the major inflection point detector circuit 8. In response to this input, inflection point detector circuit 8 produces a constant amplitude digital signal which can represent the wide "dynamic range" of spoken sounds.
  • the pitch band pass amplifier circuit 6 has its input connected to conductor 4, which is coupled by 7.5 kilohm resistor 142 to conductor 143.
  • Conductor 143 is coupled by 15 kilohm resistor 52 to ground conductor 23, by .047 microfarad capacitor 145 to conductor 146, and by .047 microfarad capacitor 149 to conductor 12.
  • Conductor 146 is coupled to the negative input of an operational amplifier circuit 147, which can be a National Semiconductor LM324 operational amplifier, and by 68 kilohm feedback resistor 53 to conductor 12, which is also connected to the output of operational amplifier 147.
  • the positive input of operational amplifier 147 is connected to conductor 148, to which a bias voltage of approximately two volts is applied.
  • This circuit has a center frequency of approximately 190 hertz, and has a pass band of approximately 100 hertz to approximately 360 hertz, and has a gain of approximately four.
  • This circuit is a multiple feedback band pass filter which is well known. Its pass band covers the normal frequency range or pitch range of the glottal pulses uttered by most humans.
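The stated center frequency and gain can be checked against the textbook design equations for a multiple feedback band pass filter with equal capacitors. The sketch below assumes an ideal operational amplifier and maps the listed parts onto the usual roles (7.5 kilohm input resistor, 15 kilohm resistor to ground, 68 kilohm feedback resistor, 0.047 microfarad capacitors); the formulas are standard filter theory rather than something given in the source, and the function name is hypothetical.

```python
import math


def mfb_bandpass(r_in, r_gnd, r_fb, c):
    """Center frequency (Hz) and magnitude of center-frequency gain for an
    equal-capacitor multiple feedback band pass filter with an ideal op amp."""
    f0 = math.sqrt((r_in + r_gnd) / (r_in * r_gnd * r_fb)) / (2 * math.pi * c)
    gain = r_fb / (2 * r_in)
    return f0, gain


# Component values quoted in the text for pitch band pass amplifier 6.
f0, gain = mfb_bandpass(r_in=7.5e3, r_gnd=15e3, r_fb=68e3, c=0.047e-6)
print(f"center frequency ~ {f0:.0f} Hz, gain ~ {gain:.1f}")
```

Under these assumptions the result comes out at roughly 184 hertz with a gain of about 4.5, which is in line with the approximate figures of 190 hertz and four quoted above.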
  • the gain of filter amplifier 6 is only four because it is desired to detect only relatively large amplitude negative pressure wave peaks in the analog voice waveform produced by amplifier 3.
  • This "pitch band pass amplifier" 6 is designed to be very selective, so as to cancel out the higher frequency components and sharp background noise as much as possible, and to provide amplification only in the band in which human glottal pulses occur normally.
  • the major peak detector circuit 13 detects major negative peaks in the complex audio waveform that is produced on conductor 4 by amplifier 3, and then is filtered by pitch filter 6 to produce the signal on conductor 12.
  • Peak detector circuit 13 has its input connected to conductor 12, which is coupled by a 2 kilohm resistor 54 to one terminal of .01 microfarad capacitor 55, the other terminal of which is connected to conductor 56.
  • Conductor 56 is connected to the negative input of operational amplifier 60, which can be a National Semiconductor LM324, and is also coupled by 240 kilohm resistor 57 to conductor 14.
  • Conductor 14 is connected to the output of operational amplifier 60.
  • the positive input of operational amplifier 60 is connected to the ground conductor 23.
  • The operation of peak detecting circuit 13 can best be understood with reference to the waveforms of Fig. 8.
  • waveform 61 designates a signal resulting from glottal vibrations of the vocal cords that result in producing of the signal on conductor 12, which we may call the analog pitch signal.
  • Waveform 62 is the output of peak detecting circuit 13 on conductor 14.
  • the above circuit detects the most negative excursion of the analog pitch signal 61, namely negative excursion 64.
  • the input signal 61 rides on a two volt DC bias voltage.
  • each glottal pulse initiates a reverberatory sequence of negative and positive peaks whose positions and excursions depend on the cavity resonances of the speaker's articulation apparatus at the present moment. These excursions decay as the sound energy is dissipated, until another glottal pulse occurs to re-energize the speaker's resonant cavities.
  • Circuit 13 has a shortcoming in that it will detect moderate amplitude negative excursions in addition to larger negative excursions of the waveform 61. This results in output pulses on conductor 14 corresponding to waveform 62 in Fig. 8.
  • peak detector circuit 13 produces pulse 66 in response to the negative excursion indicated by reference numeral 64 and produces pulse 67 in response to the negative excursion 65 of input signal 61.
  • My above-described circuitry can detect only the maximum amplitude negative excursion of each pitch cycle for a four to one range of pitches.
  • the second negative excursion such as 65 in Fig. 8
  • It was necessary to produce the threshold limiting circuit 15, and provide computer-controlled adjustment of the threshold of circuit 15.
  • a high sensitivity comparator 71 which can be a Motorola MC3302 comparator.
  • Conductor 7 is also coupled by 3.3 kilohm resistor 69 to conductor 70, which is connected to the positive input of comparator 71, and is also connected to the terminal of a capacitor 72.
  • the other terminal of capacitor 72 is connected to ground conductor 23.
  • Conductor 70 is also coupled by 220 kilohm resistor 73 to the tap of a potentiometer 74, which is connected between +5 volt conductor 32 and ground.
  • The output of comparator 71 is connected to conductor 9 and is also coupled by 22 kilohm pull-up resistor 75 to the +5 volt conductor 32.
  • the binary output signal 44C shown in Fig. 2 is produced on conductor 9. (Other examples of the binary output signal on conductor 9 are shown in Figs. 10A and 10B, subsequently described.)
  • a very important aspect of the improvement of the present invention is the providing of a band-limited waveform such as the one produced on conductor 7, and, by means of a very simple circuit such as 8, producing a "piece-wise linear" binary representation of this band-limited waveform which is invariant with respect to the amplitude of the audio signal and contains enough information to make phoneme recognition, especially speaker-independent phoneme recognition, possible.
  • the strong pitch signal produced in response to the glottal pulses of the speaker has been attenuated on the low end of the frequency range, and a great deal of the very high frequency signals such as high frequency fricatives and background noise of the type commonly produced by fans and air conditioners, has been eliminated from the waveform on conductor 7.
  • the filtered waveform on conductor 7 then has a bare minimum of features which nevertheless have enough information to accurately identify phonemes and other characteristic sounds represented by the speech waveform produced by the microphone 2.
  • a major challenge in providing the improvements of the present invention was to determine how much detail could be eliminated from the analog speech waveform while leaving enough detail that a relatively simple circuit could extract that detail and make it available in binary form for processing by a microcomputer, especially a single chip microcomputer, in such a way as to accurately detect, independently of the unique characteristics of a particular speaker, phonemes and characteristic sounds that make up speech and produce reliable identifying signals that allow the speech to be accurately represented by phonemic signals.
  • Binary waveform 78 in Fig. 10A is the response of inflection point detector circuit 8 to the filtered analog speech waveform 77, and the narrow pulses 77-1, 77-2 of binary waveform 78 can readily be seen to occur at the times of the above-indicated inflection points caused by high frequency components contained in waveform 77.
  • the objective is to find a way of simplifying speech for improved transmission in digital form over electrical lines, in essentially the same way that conventional delta modulator circuits attempt to improve digital transmission of speech over long lines and then reconstruct the original analog speech signal from the received digital signal.
  • the Stewart reference does not suggest that the binary output produced by this circuit has the necessary information and accuracy to allow recognition of phonemes.
  • inflection point detector circuit will produce the desired constant "0" level in response to silences or "white noise” consisting of very short "1" level pulses, as indicated by waveform 80, and also in response to very low-level fricative pulses in the "silence" waveform 79 of Fig. 10B.
  • fricative pulses include sharp, narrow peaks with long durations of silence between them.
  • the periods of "silence" between the fricative pulses shown in waveform 79 result in "0" levels in the binary output waveform produced on conductor 9 by inflection point detector circuit 8 of Fig. 9.
  • the procedure for adjusting the desired "offset” is to apply a 200 millivolt 1 kilohertz sine wave input to conductor 7, and adjust potentiometer 74 so that a binary output is produced on conductor 9.
  • the resulting offset voltage on the positive input of comparator 71 compensates for input offset variations that will occur from unit to unit with the Motorola MC3302 circuit used to implement comparator 71.
  • In Fig. 11, the threshold limiting circuit of Fig. 1 is shown.
  • This circuit receives the peak pulses 62 (Fig. 8) on conductor 14 produced by peak detector circuit 13, and couples them by a 110 kilohm resistor 83 to conductor 84.
  • Conductor 84 is connected to the positive input of operational amplifier 85, which can be a National Semiconductor LM324.
  • Conductor 84 is also coupled by 510 kilohm resistor 86 to conductor 17, which produces a pitch period signal to pulse shaper and latch circuitry 18 of Fig. 12.
  • Conductor 17 is also connected to the output of operational amplifier 85.
  • the negative input of operational amplifier 85 is connected to conductor 87, on which a threshold reference voltage having a "default" value of approximately 1.8 volts is produced by the resistive division of a +5 volt supply voltage on conductor 32 by resistors 88 and 89, which are connected in series between conductor 32 and ground conductor 23.
  • Conductor 87 is coupled by resistor 91 to threshold adjusting conductor 16, which is connected to a suitable output port of microcomputer 10 to provide computer controlled adjustment of the threshold of circuit 15.
  • a capacitor 90 is also connected between conductor 87 and ground conductor 23 to provide a time constant of approximately 15 milliseconds with resistors 88, 89 and 91.
  • the pitch trigger signal produced on conductor 14 by peak detector circuit 13 has amplitudes that vary considerably.
  • Threshold limiting circuit 15 performs the function of selecting only the highest amplitude pulses of these peak signals in each period.
  • the ratio of resistors 88 and 89 establishes a reference level on the negative input of operational amplifier 85. This reference level is approximately 1.8 volts. Any voltage on conductor 14 that exceeds this 1.8 volt threshold level will cause a positive pulse to be produced on conductor 17.
  • the 1.8 volts reference level produces a single pitch pulse per pitch period of the speaker's voice on conductor 17, for most voiced sounds.
  • the microcomputer 10 produces a sequence of "0" level pulses on conductor 16, which effectively reduce the charge on capacitor 90 temporarily, and thus lower the threshold level at the negative input of operational amplifier 85 in order to pass the peaks of the smaller pulses on conductor 14.
  • microcomputer 10 produces a sequence of "1" level pulses on conductor 16 to increase the charge on capacitor 90 temporarily, in order to reduce the number of pulses passed on conductor 17.
  • the microcomputer 10 may determine that if there are too many positive pulses being produced on conductor 17, the sound being received is not a "voiced” sound because no human vocal cords are capable of producing positive pressure waves with such rapidity.
  • Other criteria involved in the subsequently described "autocorrelation" analysis also can be used by the microcomputer program to appropriately adjust the threshold voltage adjustment signal on conductor 87.
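  • The threshold control described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the program of the appendices: the function name `adjust_threshold` and the string return codes are hypothetical, and a real implementation would also apply the autocorrelation criteria mentioned above.

```python
def adjust_threshold(pulses_in_period: int) -> str:
    """Choose the level to pulse on conductor 16 after one pitch period.

    Returns "0" to lower the threshold (no pitch pulse got through),
    "1" to raise it (more than one pulse got through), or "hold" when
    exactly one pulse per pitch period was seen. The return codes are
    illustrative assumptions, not values taken from the patent.
    """
    if pulses_in_period < 1:
        return "0"   # bleed charge from capacitor 90: pass smaller peaks
    if pulses_in_period > 1:
        return "1"   # add charge to capacitor 90: reject weaker peaks
    return "hold"


# One decision per pitch period, given the number of pulses counted.
decisions = [adjust_threshold(count) for count in (0, 1, 3)]
```

  • Because capacitor 90 gives the reference a time constant of roughly 15 milliseconds, each such decision would nudge the threshold gradually rather than step it.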
  • Conductor 98 is coupled to the S* (set) input of an RS latch 92, the R* (reset) input of which is connected to conductor 20.
  • conductor 20 is connected to an output port of microcomputer 10.
  • the Q* output of latch 92 is connected to the IRQ (interrupt request) conductor of microcomputer 10.
  • the relatively wide pitch trigger pulse on conductor 17 may be as much as 300 microseconds wide.
  • Relatively small capacitor 93 and relatively large resistor 96 differentiate the pitch trigger signal, thereby producing a much narrower pulse on conductor 94.
  • By applying this narrow pulse to the base of NPN transistor 97, which functions as an inverter, a narrow (approximately 10 microseconds) pulse is applied to the S* input of latch 92.
  • When the latch 92 is set, the Q* output goes to a low level, causing an interrupt request flag to be set inside microcomputer 10.
  • microcomputer 10 can then reset latch 92 by means of conductor 20, often blanking out some of the spurious strong pulses, since the pitch duration has been established.
  • Microcomputer 10 of Fig. 1 can be implemented by a Motorola MC68701 microcomputer with 2048 bytes of programmable memory and read only memory. Average instruction execution times of 2 to 4 microseconds are required.
  • the microcomputer must have 16 bit arithmetic capability, and 128 bytes of scratch pad memory.
  • the microcomputer must also be capable of measuring elapsed times between two input events as close together as 100 microseconds and with a resolution of 2 microseconds and must be capable of forming internal and external timed alarms of equivalent resolution.
  • the microcomputer must be capable of providing parallel output data or serial codes representing identified sounds, phonemes, and allophones to various devices, such as desk-top computers which receive the signals as command words, or to a phonetic typewriter, or to tactile matrix devices to stimulate the skin of deaf persons, or to a low bit communication modem, or to a "hands off" instrument control panel, etc.
  • Other suitable microcomputers that could be used include the Hitachi HD6301, the Intel 8096, or the Motorola MC68HC11.
  • the software executed by microcomputer 10 includes a "foreground” analysis routine which gathers information and stores it in its internal random access memory.
  • This software also includes a "background” analysis program that includes an algorithm which processes the gathered binary data produced by the "foreground” routine and determines how it should be processed and when to take appropriate action.
  • the heart of the background program executed by microcomputer 10 is shown in the flow chart of Figs. 13A and 13B.
  • the data that is stored in appropriate locations of the random access memory includes binary data representing the real times of the transitions between levels produced on conductor 9 in Fig. 1. These "captured" transition times are used to compute time intervals in accordance with the foreground flow chart of Fig. 14.
  • PVAL The (time) duration of any positive-going slope in the audio waveform between peak points of inflection.
  • NVAL The duration of any negative-going slope in the audio waveform between points of inflection, including any longer periods of zero slope as found in very weak fricatives or silent intervals.
  • PAVG or P* A rolling average of successive PVAL time intervals using a number of stages of software filtering in order to indicate the trend of the P value.
  • NAVG or N* A rolling average of successive NVAL values using a number of stages of software filtering in order to indicate the trend of the N value.
  • Q The sum of any successive pair of PVAL and the following NVAL times which determines the total period of this resonant cycle.
  • VALID P The first P greater than some constant of time i.e., 200 microseconds, which follows a SILENCE interval.
  • FRICATIVE A region of the audio waveform characterized by short P durations, i.e., under 400 microseconds for PAVG, interspersed with various random times of N.
  • VOICED A region of the audio waveform characterized by various durations of orderly Ps of moderate length, i.e., over 450 microseconds, interspersed with various durations of orderly Ns of similar durations. These orderly sequences repeat in a cyclic fashion every 1, 2, 3, 4,... sets of Q.
  • DELTA P and DELTA N The time differences between P durations and between N durations, respectively, stored in sections of the rolling average software filters.
  • SWOOP DELTA P and/or DELTA N have exceeded prescribed values indicating that the speaker's articulating apparatus is moving to another position.
  • TRAJECTORY The direction of motion in a swoop calculated by using the magnitude and sign of both DELTA values.
  • PITCH CYCLE The time required for a cycle pattern of Q values to repeat.
  • PITCH PULSE A pointer indicating the region in each PITCH CYCLE where a new burst of energy from the vocal cords has arrived to initiate the next cycle of cavity resonances.
  • P1, N1, P2 The three significant segments following the PITCH PULSE. These pressure wave segments are least influenced by the "personality" of the speaker's overall resonant cavity details.
  • PITCH TIMEOUT The predicted point in time where the next trigger should be. It is used to control the pitch trigger threshold activity and to re-assess the classification of the sound (silence, fricative, mixed, voiced) and the detailed identity of the phoneme.
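  • As a rough illustration of how the quantities defined above relate to one another, the following sketch derives PVAL, NVAL, and Q from captured transition times and applies one step of a running average. The list-of-tuples input format, the function names, and the four-stage filter constant are assumptions made for the example; the text does not specify the filter.

```python
def intervals(transitions):
    """Split captured edges of the binary waveform into P and N durations.

    `transitions` is a list of (time_in_microseconds, new_level) pairs,
    an assumed representation of the captured transition times. A "1"
    level is treated as a P interval and a "0" level as an N interval.
    """
    pvals, nvals = [], []
    for (t0, level), (t1, _next_level) in zip(transitions, transitions[1:]):
        (pvals if level == 1 else nvals).append(t1 - t0)
    return pvals, nvals


def rolling_average(previous, new_value, stages=4):
    """One step of a simple software filter (stage count is an assumption)."""
    return previous + (new_value - previous) / stages


edges = [(0, 1), (300, 0), (1000, 1), (1400, 0), (2000, 1)]
pvals, nvals = intervals(edges)                   # [300, 400], [700, 600]
q_values = [p + n for p, n in zip(pvals, nvals)]  # Q = PVAL + following NVAL
pavg = rolling_average(pvals[0], pvals[1])        # trend of the P value
```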
  • Figs. 13A and 13B the program for operating on such data gathered by the "foreground" routine of Figs. 14, 15, and 16 is entered via label 100 after the foreground analysis is finished.
  • the program of Figs. 13A and 13B goes from label 100 to decision block 101, where the program waits for the "data ready" signal from the foreground program signifying that new data is in the RAM (random access memory). If the needed data is available in the RAM, the program rapidly determines if the present data indicates a present condition of "silence", or if a fricative sound is being received, or if a voiced sound (one produced by vocal cords) is being received. If it is determined in decision block 101 that no new data has arrived, the program moves back to label 100.
  • In decision block 102, if a determination is made that there is presently the condition of silence, the N value will be too long, and the program recognizes that such long values of the N duration are not possible in human speech. More specifically, the program determines that if the N duration is more than a prescribed constant, such as 5 milliseconds, then the present condition is one of silence, and the program then enters block 105 and "accumulates" silent time by counting successive "time-out" alarms until a valid P interval indicative of fricative or voicing activity is detected.
  • the program goes to block 117 and sets a "long silence" condition, indicating, for example, that the speaker has left and that the equipment can be put on a standby condition until new data arrives. If the determination of decision block 118 is that the present silence condition is not one that can be categorized as a "very long silence", the program goes to decision block 120 and determines if the criteria for a "medium silence condition" are met by the time count that has been "accumulated" in block 105.
  • Decision block 120 causes the program to determine if the accumulated silence time corresponds to a normal pause time that occurs between words and phrases in ordinary human speech. If the determination of decision block 120 is affirmative, the program goes to decision block 119 and categorizes the silence as a "pause". For example, in a word processing application, a silence of a particular length could be interpreted as meaning that a carriage return operation should be effectuated, while a shorter pause can delineate word separation. After this categorization, the program returns to label 100 and waits for new data to arrive.
  • If the determination of decision block 120 is negative, the program goes to decision block 122 and determines if the accumulated silence time fits into the category of being a "short silence" of the type that can be extremely useful in phoneme recognition. These short silence durations are caused by lung "pressure buildups" that occur, for example, as a result of the glottal "catch" that precedes a plosive phoneme, such as a "p" sound. Glottal catches are usually associated with the change in the position of the articulators of the mouth, throat, and tongue mechanisms that precede resuming of "voiced" sounds of speech as speech is continued. These short silence durations are typically in the range of 30 to 150 milliseconds. Durations of the "medium silence" times tested for in decision block 120 are typically large fractions of a second.
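  • The silence categories walked through above can be summarized as a small classifier. Only the 30 to 150 millisecond range for a short silence comes from the text; the 2 second boundary for a long silence, the function name, and the category labels are assumptions chosen for illustration.

```python
SHORT_MIN_MS = 30     # from the text: short silences span 30 to 150 ms
SHORT_MAX_MS = 150
LONG_MIN_MS = 2000    # ASSUMPTION: the text gives no figure for "long"


def classify_silence(duration_ms: float) -> str:
    """Map an accumulated silence time onto the categories of Fig. 13A."""
    if duration_ms >= LONG_MIN_MS:
        return "long"     # e.g. the speaker has left; go to standby
    if duration_ms > SHORT_MAX_MS:
        return "pause"    # normal gap between words and phrases
    if duration_ms >= SHORT_MIN_MS:
        return "short"    # e.g. glottal catch before a plosive
    return "none"         # too brief to categorize


labels = [classify_silence(ms) for ms in (10, 80, 500, 5000)]
```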
  • microcomputer 10 usually has enough time to execute the entire flow chart of Figs. 13A and 13B and Figs. 14-16 between such pulses.
  • this decision block is negative if the newest "foreground" data parameters stored as variables in the random access memory contain positive pressure durations of significant persistence, for example, if PVAL is over 150 microseconds, or if PAVG is over 50 microseconds.
  • the program then goes to decision block 103. If the determination of decision block 103 is that the PAVG positive pressure wave average duration is too short for human voicing, then the determination is affirmative. This means that the sound is a fricative, a tapping or clicking sound, or some other background sound.
  • the program then goes to decision block 104 and determines if the sound is just beginning to produce a significant PAVG time duration. If this is the case, the program goes to block 106 and bypasses some of the processing steps that are performed on comparatively long voiced sounds, because it is now known that the present sounds are unvoiced, short duration pulses.
  • fricative identification vector consisting of PAVG for one coordinate of a vector map and NAVG for the other coordinate of the vector map
  • identity of fricatives such as SH, F, TH, and so forth can be determined from a stored matrix or look-up table of the type shown in Fig. 17 and described in detail in my prior patent 4,284,846. Fricative sounds have no "voiced" components. Therefore they are speaker-independent, i.e., pitch-independent.
  • Fig. 17 shows the typical regions where various fricatives are located, as I have empirically determined.
  • Fig. 17 is a vector map of the fricative regions plotting N* (NAVG) versus P* (PAVG), and is referred to in block 123 in the program of Fig. 13A.
  • Fricatives are composed of rather short duration positive pressure waves separated by relatively long durations of inactivity or gaps between the positive pressure waves.
  • the boundaries delineated by the dotted lines in the fricative vector map of Fig. 17 show the approximate regions where the indicated fricative sounds are located, such as "H", "F", and the various other fricative symbols shown in Fig. 17 that are defined in my prior patent, which is incorporated by reference herein.
  • This fricative categorization process gets carried out in block 123 of Fig. 13A.
  • the program then goes to block 124 and assigns a predetermined code to the fricative that was identified by reference to the look-up map or table in block 123, and waits for the fricative to be completed.
  • decision block 125 the program determines if the present fricative is complete by waiting until either silence occurs (i.e., N exceeds a certain value) or a voiced sound occurs (i.e., PAVG exceeds a certain value).
  • the program goes to block 126 and re-configures the appropriate "foreground" operations of Figs. 14-16 to the more elaborate analysis required for voiced sounds, and then returns to label 100 of Fig. 13A.
  • If the determination of decision block 103 is that PAVG is too long for the present sound to be a fricative, then this negative determination causes the program to go to block 107 in Fig. 13B.
  • the present sound then is probably a voiced sound, or perhaps a musical sound.
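  • The top-level branching between silence, fricative, and voiced sounds can be sketched as follows. The numeric limits are the example figures quoted in the text (an N duration over 5 milliseconds for silence, PAVG under 400 microseconds for fricatives, over 450 microseconds for voiced sounds); the function name and the "undetermined" label for the gap between 400 and 450 microseconds are assumptions.

```python
def classify_sound(n_us: float, pavg_us: float) -> str:
    """Coarse sound category from the latest N duration and PAVG trend."""
    if n_us > 5000:        # N longer than is possible in human speech
        return "silence"
    if pavg_us < 400:      # short positive pressure waves
        return "fricative"
    if pavg_us > 450:      # orderly P durations of moderate length
        return "voiced"
    return "undetermined"  # ASSUMPTION: the text leaves this band open


categories = [
    classify_sound(6000, 0),
    classify_sound(900, 250),
    classify_sound(700, 600),
]
```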
  • Fig. 13B received new "foreground" data, in the form of durations of PVAL and NVAL between the various transitions of the binary waveform on conductor 9 in Fig. 1 and values of PAVG, NAVG, QAVG and the velocity DELTA P and the velocity DELTA N computed in accordance with Figs.
  • the program then goes to decision block 110 and determines if there is one pulse per pitch period. If this determination is negative, the program goes to block 109 and causes microcomputer 10 to adjust the pitch pulse sensitivity by appropriately varying the threshold adjustment voltage on conductor 16 of Fig. 1, as well as by monitoring the clearing of the pulse latch using conductor 20 from the microcomputer 10. The program then returns to block 107.
  • the program then goes to block 127 and computes a "swoop trajectory". This can be graphically illustrated by considering a plot of the instantaneous P values versus instantaneous N values, and determining if the present rate of change of the P and N values falls closest to 0° or closest to one of the 30° multiples of the complete 360° of the possible swoop directions. This "direction" is stored and provides a rapid indication of the "directions" in which the articulators of the speaker's mouth are moving, as indicated in block 128.
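  • One plausible way to quantize a swoop direction to the nearest 30° multiple is shown below. The text does not say which axis carries P and which carries N, nor how the angle is computed, so measuring the angle from the DELTA P axis with `atan2` is an assumption, as is the function name.

```python
import math


def swoop_direction(delta_p: float, delta_n: float) -> int:
    """Return the swoop direction rounded to the nearest multiple of 30 degrees.

    The angle is measured counter-clockwise from the DELTA P axis, which
    is an assumed convention for this sketch.
    """
    angle = math.degrees(math.atan2(delta_n, delta_p)) % 360
    return (round(angle / 30) * 30) % 360


directions = [
    swoop_direction(1, 0),      # P rising, N steady
    swoop_direction(0, 1),      # N rising, P steady
    swoop_direction(-1, 0),     # P falling, N steady
    swoop_direction(1, -0.01),  # just below the P axis, wraps back to 0
]
```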
  • swoop trajectory directions are helpful in identifying voice plosives at the start or finish of a sequence of sounds. If the determination of decision block 111 is that a vocal swoop is not presently occurring, the program goes to block 112 and computes a pitch factor equal to the average pitch divided by a sum comprised of a running average PAVG plus the running average NAVG. This sum represents the average reverberation cycle time of the present phoneme in the speaker's speech apparatus.
  • the program then goes to block 113 and, using the time of occurrence of the pitch trigger pulse, locates a "tag" on the most significant positive wave front duration times, namely P1 and P2 of the present pitch cycle. These most significant segments occur after the PITCH PULSE pointer. In accordance with an important aspect of the invention, these individual time values of P1 and P2 are adjusted by any remainder obtained in the division computed in block 112. Block 114 indicates the making of this correction.
  • This step is believed to be important in establishing a high degree of speaker-independence of the phoneme recognition method of the present invention. This can be done by using the remainder of the above division to access a stored look-up table from which empirically determined quantities can be obtained for addition to or multiplication by the P1 and P2 values. Should there be no remainder, then the reverberation sequence of the phoneme is harmonic with and exactly fits with the present pitch period, so no correction is required.
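  • The division of block 112 and the remainder used in block 114 can be illustrated with integer arithmetic. The function name is hypothetical, and the sketch stops at the remainder: the empirically determined correction table is not reproduced in the text, so no table values are invented here.

```python
def pitch_factor(pitch_us: int, pavg_us: int, navg_us: int):
    """Return (pitch factor, remainder) for one pitch period.

    The divisor PAVG + NAVG is the average reverberation cycle time.
    A remainder of zero means the reverberation sequence fits the pitch
    period exactly, so P1 and P2 need no correction.
    """
    cycle_us = pavg_us + navg_us
    return divmod(pitch_us, cycle_us)


harmonic = pitch_factor(8000, 600, 400)   # (8, 0): no correction required
offset = pitch_factor(8300, 600, 400)     # (8, 300): remainder indexes a table
```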
  • the program then goes to block 115 and, guided by the decision tree of Fig. 3, categorizes the voiced phoneme on the basis of the best match between the voiced phoneme vector domain map of Fig. 18 and the vector formed between the "adjusted" values of P1 and P2, and also on the basis of the swoop trajectory direction of block 128, and obtains a code that represents the present voiced phoneme.
  • the setting of the swoop trajectory direction means writing this information into random access memory, thereby making this information readily available for steps in block 115.
  • the usefulness of the swoop trajectory lies in the fact that for certain phonemes, it is not possible to determine what the prior phoneme was until the next phoneme is being uttered.
  • the real-time phoneme recognition will occur very soon after the actual utterance thereof.
  • the phoneme identification will lag by approximately one phoneme behind, because for such sounds the "total evidence" produced by the various positive pressure wave transitions, including the swoop trajectories produced during the transitions between the prior phoneme and the following one, must be converted to binary data and analyzed before correct identification of the phoneme can take place.
  • the information available to the program for phoneme-identifying decisions includes information as to the existence of silence durations of the various lengths and the time of occurrence of the beginning of every pitch cycle.
  • the program in block 116, analyzes and uses this information, and also information regarding the presence of a fricative, information indicating the identification of the fricative, information as to the swoop trajectory direction, the pitch factor, and the remainder correction factor in order to determine the identity of the most recent phoneme from the voiced phoneme vector domain map of Fig. 18. If the same result is obtained in two consecutive pitch periods or default values of the pitch period during fricative and silence intervals, then the phoneme-identifying code will be output by microcomputer 10 to a suitable receiving device via bus 21.
  • This procedure is indicated in the output decision routine that is entered from label 130.
  • the step of waiting for the pitch time-out, which can be a real pitch time-out or a default value, is performed in block 131. If the same phoneme "candidate" is consecutively identified twice, the output decision is made in block 132, and the phoneme-identifying code is transmitted in accordance with block 133. A return to the background analysis occurs via return labels 134 and 100.
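  • The rule that a candidate must be identified in two consecutive pitch periods before it is output can be sketched as a small stateful filter. The class and method names are hypothetical, and the phoneme labels in the example are placeholders.

```python
class PhonemeOutput:
    """Emit a phoneme code only when it repeats in consecutive periods."""

    def __init__(self):
        self._last = None

    def update(self, candidate):
        """Feed one candidate per pitch time-out; return a code or None."""
        confirmed = candidate is not None and candidate == self._last
        self._last = candidate
        return candidate if confirmed else None


gate = PhonemeOutput()
emitted = [gate.update(c) for c in ("AH", "AH", "EE", "AH", "AH")]
```

  • In this sketch a phoneme sustained for three periods would be emitted twice; suppressing such repeats is left out because the text does not describe how they are handled.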
  • In Fig. 18, I have empirically determined that generally increasing values of P1 indicate a moving of the tongue position from the front of the mouth to the back of the mouth, as indicated by arrow 279.
  • Increasing values of P2 in Fig. 18 tend to indicate moving of the mouth from a relatively open or slack configuration to a closed position so that a partial fricative type of sound is produced.
  • vector 280 in Fig. 18 represents a fairly "loose" or
  • Vector 281 identifies a back vowel that is close to the point of being a fricative, for example, the Germanic back vowel in "hoch", the German word for high.
  • This vector map is the one which is referred to in block 115 of the flow chart of Fig. 13B.
  • Figs. 13A and 13B indicated how the basic category of sound is determined in accordance with the present invention.
  • each branch radiating from the center of the map includes a sequence of phonemes based on their frequency of usage in average American or English speech.
  • the search of each branch can be done in a sequence based on this decreasing order of usage of phonemes, as represented graphically by the decreasing size of the rectangles.
  • the probabilities, therefore, are in favor of the location of the present phoneme in the shortest average search time, usually within five matching tests. If no match is found in the course of searching from the center of the phoneme map to the end of the particular branch, the search is exited from that branch and returns to the center of the phoneme map. Then, based on broad evidence, the routine searches through the next most likely branch, etc.
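  • The branch-by-branch search described above might look like the following sketch. The function name, the `matches` callback, and the placeholder labels are all hypothetical; the actual phoneme map and matching tests are not reproduced here.

```python
def search_map(branches, matches):
    """Search branches in order of likelihood for the first matching phoneme.

    `branches` is a list of branches, most likely first; each branch lists
    its phonemes in decreasing order of usage. `matches` is a caller-supplied
    test applied to one phoneme at a time.
    """
    for branch in branches:
        for phoneme in branch:
            if matches(phoneme):
                return phoneme
        # No match on this branch: return to the center, try the next one.
    return None


tested = []


def is_b1(phoneme):
    tested.append(phoneme)   # record the order in which tests are made
    return phoneme == "b1"


found = search_map([["a1", "a2"], ["b1", "b2"]], is_b1)
```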
  • the main phonemic branch "directions" radiating from the center of the phoneme map include:
  • a foreground subroutine executed by microcomputer 10 for jumping to interrupt flags and measuring intervals based on the binary waveform on conductor 9 is shown.
  • This foreground subroutine performs the data acquisition function. If a falling edge of the binary waveform is received, then the time of the interrupt is automatically recorded in the capture register 11 of microcomputer 10 (Fig. 1), indicating the "1" level is over, and the subroutine is entered via label 150.
  • In block 151, the microcomputer 10 clears a contingency software time out alarm for long "0" durations and enters block 152.
  • Block 152 of the program computes the duration of the just completed positive voice level or "P” value and assigns it to a variable called PVAL, measured in microseconds.
  • the program then goes to decision block 153 and determines if there is an interrupt request signal on conductor 19 of Fig. 1. If the determination is affirmative, the program goes to block 154 and sets a variable called "PULSE" equal to a logical "1". If not, the routine goes to block 155 and sets the value of the variable "PULSE" equal to a logical "0", indicating that no pitch pulse has been recently received on conductor 19. In this case, the routine enters block 156 and computes the velocity and acceleration, and also the "trend" or running average, PAVG, of the instantaneous value PVAL.
  • the foreground routine of Fig. 14 then enters decision block 157 and determines if the voice level has already gone back to a high level. If it has, then it is known that the duration of the present "0" level is very short. This determination is made because it is known that the sequence of events from label 150 to the end of block 156 requires a certain amount of time, and ordinarily the following edge level should not have reversed direction yet. If the direction has reversed, this indicates that "splitting" of the negative-going waveform represented by the high level presently on the binary waveform conductor 9 has occurred.
  • the routine goes back to block 152 and repeats the foregoing sequence of steps, in order to make a complete analysis of the positive undulation that must have occurred in the negative slope of the speech waveform. If the determination of decision block 157 is negative, the program enters block 158, which is a software "silence" time out alarm (should the present "0" level be the start of silence) and then exits the binary waveform evaluation routine of Fig. 14. (The execution of this interruption path is designed to be as fast as possible in order to maximize the time spent in the background routine of Figs. 13A and 13B.)
  • label 160 of the flow chart of Fig. 14 is entered by interruption from the background routine of Figs. 13A and 13B.
  • This subroutine first goes to block 161 and computes the duration of the just completed "0" level and assigns it to a variable NVAL measured in microseconds.
  • the subroutine then goes to block 162 and computes the velocity, acceleration, and trend or average value of NVAL.
  • the program then goes to block 163 and computes the present resonance cycle duration Q, which is equal to the sum of the instantaneous values of NVAL and PVAL.
  • the program then goes to decision block 164. If the binary level of conductor 9 is already low, then a minimum default time of PVAL is set, as indicated in block 165, and the routine goes back to block 152. If the voice level is not low, the routine is exited.
  • If a glottal pulse (pitch trigger) is received, the routine is entered via label 170. This routine then enters block 171 and sets a flag indicating that a pitch pulse has been detected. The routine then goes to decision block 172 to wait for "go ahead" to reset the pulse latch 15 of Fig. 1 and exits.
  • the time-out alarm will occur, as indicated in block 175, because of too long of a negative interval on binary waveform 9 without the occurrence of a rising edge. If this occurs, the subroutine enters block 176 and forces NVAL to have a prescribed time out value. The program then goes to block 177, where an artificial value of PVAL is forced in order to stabilize the trend of PAVG. Next, the routine enters block 178, which forces all computation involved with the P values, and then exits.
  • Appendix A attached hereto is a printout of a computer program represented by the flow chart of Figs. 13A and 13B.
  • Appendix B is a printout of the program represented by Figs. 14-16.
  • Fig. 19 shows a voice command module 180 which includes some of the hardware of Fig. 1, including microphone 2, microphone signal amplifier 3, audio band pass filter 5, inflection point detector 8, and a microprocessor 10 as shown in Fig. 1, but does not include the "pitch extraction" circuitry including blocks 6, 13, 15, and 18 of Fig. 1.
  • Voice command module 180 is coupled by means of a cable 182 to a typical computer, such as a desk top computer 183.
  • a prompting light 186 is used to guide the speaker.
  • the microphone 181 picks up ambient noise as well as the speaker's voice, and in general accordance with the background routine of Fig. 13A, evaluates the existing noise level to establish if an adequate low level of "silence” exists to merit making a voice record. When one half second of "silence" has been achieved, the prompting light 186 indicates to the speaker that a valid "listening" condition exists.
  • the incoming binary waveform on conductor 9 is operated upon generally in accordance with the "foreground" subroutine previously described with reference to Figs. 14-16.
  • voice command module 180 makes no attempt to create or compare the phoneme vectors previously described. Instead, the "foreground" data, including the above-mentioned P* and N* averages are computed and are output via conductors 21 in Fig. 1, which are the conductors of the cable 182 in Fig. 19.
  • Microprocessor 10 in the voice command module 180 outputs samples of the P* and N* running averages every 10 milliseconds.
  • the information required to utilize this "sampled" running average data is contained in a program stored on floppy disc 184 in Fig. 19, which is then inserted into computer 183, as indicated by arrow 185.
  • This program operates on the sampled running average data to compare it with stored "reference" data of a similar kind previously stored on floppy disc 184 in response to words spoken by the speaker whose voice commands are to be recognized during a "training session". These reference words are digitized and sampled by the voice command module 180 and stored on floppy disc 184.
  • each ordinate of waveform 195 represents the present value of the sampled running average of the negative pressure wave portions of the incoming speech waveforms.
  • each ordinate of waveform 196 represents the present value of the sampled running average of the durations of the positive pressure wave portions of the incoming speech waveform.
  • the program stored on floppy disc 184 (Fig. 19) operates on the data of Fig. 20 to isolate and produce data corresponding to "significant events" that can be identified from the features of waveforms 195 and 196. More particularly, these significant events include the characteristics listed in Table 2 below:
  • R designates a rapidly rising portion of the P* waveform 196.
  • the rapidly rising portions of waveform 196 are identified by R.
  • rapidly falling portions of P* graph are identified by F.
  • S designates periods of silence.
  • D and C designate portions of the P* graph which are relatively constant, and which change only very gradually, respectively.
  • the floppy disc 184 contains a simple routine which identifies the R states in accordance with the following formula: $(P^*_n - P^*_{n-1}) + (P^*_{n+1} - P^*_n) > 96$ microseconds, where $n$ is the sample number.
  • the state identification subroutine produces an F state in accordance with the expression
  • the state identification subroutine identifies the S state in accordance with the expression N* > 1700 microseconds.
  • the D state is identified in accordance with the condition that the magnitude of $P^*_n - P^*_{n-1}$ is less than or equal to 48 microseconds
  • the C state is identified by determining whether P* changes slowly by at least 64 microseconds during a D state.
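  • The state tests quoted above can be collected into one per-sample classifier, sketched below. Only the R, S, and D conditions come from the text. The F expression is not given here, so the mirror-image test used for F is an assumption, as are the order in which the tests are applied and the function name. The C state, which depends on the history of a D state, is omitted.

```python
def classify_sample(p_prev, p_now, p_next, n_now):
    """Label one 10 ms sample as 'S', 'R', 'F', 'D', or None.

    All arguments are running-average values in microseconds: three
    consecutive P* samples and the current N* sample.
    """
    if n_now > 1700:                      # S: silence
        return "S"
    two_step_change = (p_now - p_prev) + (p_next - p_now)
    if two_step_change > 96:              # R: rapidly rising P*
        return "R"
    if two_step_change < -96:             # F: ASSUMED mirror of the R test
        return "F"
    if abs(p_now - p_prev) <= 48:         # D: relatively constant P*
        return "D"
    return None


states = [
    classify_sample(500, 560, 620, 800),
    classify_sample(620, 560, 500, 800),
    classify_sample(500, 510, 520, 800),
    classify_sample(500, 510, 520, 2000),
]
```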
  • Fig. 21 there are shown three bytes identified as bytes 1, 2, and 3, respectively, that represent the event data corresponding to one event.
  • Byte 1 includes the three least significant bits, which identify which of the five states R, F, S, D, and C has occurred. Bits 3-7 of byte 1 represent the duration of the most recent event.
  • Byte 2 includes the value of the running average P* at the end of the present event.
  • Byte 3 represents the present value of the running average N* at the end of the present event.
  • the "duration" of the present state is the number of samples, each spaced 10 milliseconds from the last, that the present state lasted.
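  • The three-byte event format lends itself to a short packing example. The sketch assumes the state code occupies bits 0-2 of byte 1, with the duration in bits 3-7 as stated; the numeric code assigned to each state and the scaling of P* and N* into a single byte are not given in the text and are assumptions here.

```python
STATE_CODES = {"R": 0, "F": 1, "S": 2, "D": 3, "C": 4}  # ASSUMED numbering
STATE_NAMES = {code: name for name, code in STATE_CODES.items()}


def pack_event(state, duration_samples, p_avg, n_avg):
    """Pack one event into three bytes.

    `duration_samples` counts 10 ms samples and must fit in five bits.
    `p_avg` and `n_avg` are assumed to be pre-scaled to the range 0-255.
    """
    if not 0 <= duration_samples <= 31:
        raise ValueError("duration must fit in bits 3-7 of byte 1")
    if not (0 <= p_avg <= 255 and 0 <= n_avg <= 255):
        raise ValueError("averages must each fit in one byte")
    byte1 = (duration_samples << 3) | STATE_CODES[state]
    return bytes([byte1, p_avg, n_avg])


def unpack_event(data):
    """Reverse of pack_event: returns (state, duration, p_avg, n_avg)."""
    byte1, p_avg, n_avg = data
    return STATE_NAMES[byte1 & 0x07], byte1 >> 3, p_avg, n_avg


event = pack_event("D", 5, 120, 200)   # a D state lasting 5 samples (50 ms)
decoded = unpack_event(event)
```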
  • an important application of the foregoing technique is "speaker-dependent" voice command recognition, which is achieved by the voice command system 179 of Fig. 19. The event data is compared with previously stored reference data obtained by "training" the system, a term understood in the art to mean entering into the system data corresponding to particular words spoken by a particular person, which words are later to be recognized as commands by the system when spoken by the same person.
  • the graph in Fig. 22 illustrates how incoming event data 188 is compared to reference event data 187 previously stored during the "training" of voice command recognition system 179.
  • the previously stored "reference events" 187 are plotted on the vertical axis, and are identified by event numbers, each event number 1, 2, ... 24 corresponding to a "significant event" (i.e., R, F, S, D, or C) of the voiced command "learned" by the system 179 during the "training" session.
  • the commands are spoken into the microphone 181 of the voice command module 180, resulting in "raw" sampled reference event data produced in the format of Fig. 21 and stored in a predetermined number of preselected locations on floppy disc 184.
  • event numbers 1, 2, ... 20 are generally indicated by reference numeral 188 in Fig. 22.
  • reference numeral 189 identifies a "window" including a group of four significant events of the present spoken word to be recognized.
  • the event data for each of the significant events in window 189 is compared to the previously stored reference event data for stored reference event No. 1 until a match occurs.
  • the comparison program of Fig. 23, subsequently explained, is stored on floppy disc 184 and is executed by computer 183 to determine if any matching occurs. If a match does occur, the comparison program moves to a new "window" that begins immediately after the matching occurs. If no matching occurs, the same window of four significant unknown events of the present utterance is compared to the stored reference event data No. 2, as indicated by reference number 190 in Fig. 22.
  • the graph indicates that none of the four significant unknown events in the first window of the present utterance matches stored reference event data No. 1 or No. 2 sufficiently closely, so the program applies the same data window to stored reference event data No. 3, as indicated by reference numeral 191.
  • the comparison routine of Fig. 23 finds that unknown event data No. 2 matches stored reference event data No. 3.
  • the X in data window 191 indicates this matching.
  • the comparison routine of Fig. 23 repeats its comparison operations for another unknown events window 192, which begins immediately after the X in window 191.
  • the comparison routine in Fig. 23 determines that the present utterance cannot be recognized and does not match the stored reference events 187 in Fig. 22. The routine then goes to the next stored reference word and repeats the above procedure. This technique identifies matching features between the spoken command and the previously stored command, despite the "absence" or "addition" of extraneous features that may occur each time the word is spoken.
  • the width of windows such as 189 and 190 allows effective "synchronizing" of the comparison of the sequence of features in the presently spoken command to those in the previously stored reference command, despite these extraneous differences in the significant features of the same word when it is spoken at different times by the same speaker.
  • the above-mentioned significant event comparison subroutine is entered via label 200, and goes to block 201.
  • in block 201, two variables called SCORE and STRIKE are each set to zero.
  • the routine then goes to block 202 and sets an utterance event pointer UT equal to 1.
  • the routine then goes to block 203 and loops through the previously stored reference event numbers designated by reference numeral 187 in Fig. 22, incrementing a reference pointer RR, which indicates at which matching unknown utterance event to begin the next window.
  • UU in blocks 205 and 206 points to unknown utterance events 188.
  • the routine then goes to block 204 and sets the value of four variables SUBSCORE(1) ... SUBSCORE(4) to zero.
  • the latter four variables correspond to "scores" which are computed for each of the four events in each of the event windows in Fig. 22.
  • the significant event comparison routine then goes to block 205 and loops through the present utterance event window, and for each comparison of an utterance event in a particular window with the corresponding previously stored reference event, obtains a "difference number", assigns to it a "weighted" score, and sets the appropriate one of the variables SUBSCORE(1), etc. to that score, based on the difference between the utterance event and the stored reference event.
  • the routine then goes to decision block 207 and determines if there are more events in the present window. If this determination is affirmative, the routine goes back to block 205 and repeats. If the determination is negative, the routine goes to block 208 and selects the smallest of the four subscores. The routine then goes to decision block 209 and determines if the value of the smallest subscore variable is zero. If this is the case, it indicates a perfect match of the corresponding unknown utterance event with the present reference event, and the routine then goes to block 210.
  • the routine sets the UT variable, which points to the first event in the window of the unknown utterance event, to point to the next event after the perfect match.
  • the routine then goes to block 213 and sets the variable STRIKE equal to zero, and also diminishes the value of the variable SCORE by a suitable amount to "reward" the significant event comparison routine for locating a perfect match between an unknown utterance event in the present window and the present stored reference event.
  • if the smallest subscore is not zero, the routine instead enters block 211 and increments the variable STRIKE.
  • the routine then goes to block 212 and adds the total amount of mismatch of each of the four utterance events in the present window to the cumulative value of SCORE.
  • the routine then goes to decision block 214 and determines if STRIKE exceeds 5, or another suitable empirically determined number. If this determination is affirmative, it is assumed that the present unknown utterance does not match the present stored reference utterance, and the routine is exited via block 215.
  • the routine goes to block 216 and determines if all of the significant events of the present utterance have been compared to the stored reference event data for the present stored reference utterance. If this determination is affirmative, the routine goes to block 217, and sets STRIKE back to zero, and also suitably diminishes or "rewards" the variable SCORE. The routine then is exited via label 219. The routine then returns to label 200 and repeats the algorithm of Fig. 23 for additional stored reference utterances to determine if any of them match the present utterance better than the just completed analysis.
  • if the decision of decision block 216 is negative, the routine enters decision block 218 and determines if all significant events of the present utterance have been tested against stored reference event data for the present stored reference utterance. If this determination is negative, the routine goes back to block 203, but otherwise goes to block 217 and then exits via label 219.
  • the "rewards" used in block 217 are required in case some other word in the lexicon has a similar or identical sequence in the early part of matching, but exhibits a discrepant latter part. Any word which finishes the algorithm of Fig. 23 therefore has a large reduction in the accumulated score (such as a reduction of one half) in order to portray or encourage a favorable final decision when the scores of all matched words are evaluated.
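The windowed comparison of Fig. 23 can be sketched in Python as follows. This is a simplified reading of the flowchart under stated assumptions, not the program stored on floppy disc 184: events are represented as hypothetical `(state, duration, p_avg, n_avg)` tuples, the weights inside `difference` are invented for illustration, a "perfect match" means exact equality, and the helper names `compare` and `recognize` do not come from the source. The window width of four, the STRIKE limit of 5, and the halving of SCORE as a reward do follow the description.

```python
WINDOW = 4        # unknown utterance events examined per reference event
MAX_STRIKES = 5   # consecutive misses tolerated before giving up


def difference(utt_event, ref_event):
    """Weighted difference number; zero only for identical events.

    ASSUMPTION: the weights (100 for a state mismatch, 4 per sample of
    duration difference) are illustrative and not taken from the text.
    """
    u_state, u_dur, u_p, u_n = utt_event
    r_state, r_dur, r_p, r_n = ref_event
    state_penalty = 0 if u_state == r_state else 100
    return state_penalty + 4 * abs(u_dur - r_dur) + abs(u_p - r_p) + abs(u_n - r_n)


def compare(utterance, reference):
    """Return an accumulated SCORE (lower is better), or None on rejection."""
    if not utterance:
        return None
    score = 0
    strike = 0
    ut = 0  # index of the first utterance event in the current window
    for ref_event in reference:
        window = utterance[ut:ut + WINDOW]
        subscores = [difference(event, ref_event) for event in window]
        if min(subscores) == 0:
            # Perfect match: the next window starts right after it.
            ut += subscores.index(0) + 1
            strike = 0
            score = int(score / 2)      # reward for the match
        else:
            # No match: keep the same window for the next reference event.
            strike += 1
            score += sum(subscores)
            if strike > MAX_STRIKES:
                return None             # utterance does not match this word
        if ut >= len(utterance):
            break                       # every utterance event was consumed
    return int(score / 2)               # reward for finishing


def recognize(utterance, lexicon):
    """Return the lexicon word with the lowest score, or None if all fail."""
    best_word, best_score = None, None
    for word, reference in lexicon.items():
        score = compare(utterance, reference)
        if score is not None and (best_score is None or score < best_score):
            best_word, best_score = word, score
    return best_word
```

Because a miss leaves the window where it is and a hit moves the window to just past the matching event, an extra event in the utterance or a missing one only costs a few strikes rather than throwing the whole comparison out of step, which is the "synchronizing" effect described above.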
  • Fig. 24 illustrates another implementation of the system shown in Fig. 1, wherein blocks 5, 6, 8, 13, 15, and 18 are omitted, and their functions are performed or accomplished by means of a high speed analog to digital converter of the type referred to in the art as a "flash analog to digital converter", and by means of two digital filter programs. More specifically, in Fig. 24, the system 285 includes a microphone 289, a microphone amplifier 290, the output of which is applied as an analog input to a microcomputer 286. Microcomputer 286 includes a processor 292, which can be implemented by means of any of a large number of presently commercially available microprocessors. It is coupled to a digital bus 293, which can include the bus 21 in Fig. 1 on which phoneme identification signals are provided, or it can simply provide a sampled, compact digital representation of information of the type conducted on cable 182 in the embodiment of Fig. 19.
  • Computer 286 includes a flash analog to digital converter 288, which can be implemented by means of several commercially available analog to digital converters that are capable of performing conversions in roughly 30 microseconds or less.
  • Microprocessor 292 executes two digital filter routines, which are schematically indicated in Fig. 24 by block 287. Dotted lines 296 schematically represent the execution of the two digital filter routines 287 by microprocessor 292.
  • Reference numeral 282 represents an RC time constant circuit which is coupled by conductor 283 to digital circuitry within microcomputer 286 which must be provided and controlled by microprocessor 292 in order to execute the two digital filter routines.
  • One of the two digital filter routines and associated digital filter circuitry performs the function of digitally filtering the voice band components of the analog signal on conductor 291.
  • Reference numeral 284 represents an RC time constant circuit which is coupled by conductor 295 to additional digital filtering circuitry which is controlled by microprocessor 292 in the course of executing the second digital filter routine in block 287, to provide a properly filtered pitch band signal.
  • This circuitry and digital filter routine perform the function of the pitch band pass filter 6 in Fig. 1.
  • Microprocessor 292 could easily execute a subroutine which would look at the outputs of the flash analog to digital converter 288 and could easily perform a comparison of each point of the output with the adjacent points thereof to detect the peaks of the analog speech waveform produced on conductor 291. It would be a straightforward matter for the microprocessor to then mathematically determine the major inflection points and compute all of the various running averages, rates of change, and acceleration variables that have been previously described herein. The undesired "intervening" numbers produced by analog to digital converter 288 could be discarded, to achieve the same degree of significant event data compaction previously described herein.
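The neighbour comparison described above can be sketched in Python. This is a minimal illustration only: the function names, the tie-breaking rule for flat tops, and the use of a fixed sample period are assumptions, and the text does not give the subroutine itself.

```python
def find_peaks(samples):
    """Return (index, value) for each local peak in a list of ADC samples.

    A sample counts as a peak when it is greater than its left neighbour and
    not smaller than its right neighbour, so a flat top is reported once.
    All other ("intervening") samples are simply not returned.
    """
    peaks = []
    for i in range(1, len(samples) - 1):
        if samples[i - 1] < samples[i] >= samples[i + 1]:
            peaks.append((i, samples[i]))
    return peaks


def peak_intervals_us(peaks, sample_period_us):
    """Time between successive peaks, given a fixed sample period.

    ASSUMPTION: samples are evenly spaced; sample_period_us is supplied by
    the caller and is not a value fixed by the text.
    """
    return [(b[0] - a[0]) * sample_period_us for a, b in zip(peaks, peaks[1:])]
```

Keeping only the peak positions and the intervals between them reduces a dense sample stream to a short list, which is the kind of data compaction the passage refers to.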
  • FRICATIVE_BEGINS = true; call speed_up_foreground_data_acquisition; enddo; elseif FRICATIVE_BEGINS = true then do
  • TOKEN = fricative_look_up(PAVG, NAVG); call wait_for_end_of_fricative; call OUTPUT(TOKEN);
  • FRICATIVE_BEGINS = false; enddo; enddo; /* Fricative */
  • NVAL = CURRENT_TIME - LAST_EDGE_TIME
  • NAVG = running_average(NVAL);
  • N_TREND = trend(NAVG);
  • PVAL = DEFAULT_PVAL; goto COMPUTE_P; enddo;
  • NEW_P1 = correction(PITCH_FACTOR, P1);
  • TOKEN = vowel_look_up(P1, P2, SWOOP_VECTOR); call OUTPUT(TOKEN); enddo; enddo; /* Vowel */ enddo; /* whenever */ return; /* BACKGROUND */
  • Procedure CRUNCH is useful in implementing the Procedure */
  • DURATION = BUFFER_INDEX - LAST_INDEX; write(DURATION, LAST_STATE, P, N);
  • LAST_INDEX = BUFFER_INDEX; return;
  • LAST_DWELL_P = QUEUE(2, 1); call STORE_EVENT(QUEUE(2, 1), QUEUE(2, 2), CURRENT_STATE); goto STATE_CASE; /* Process */ while not(END_OF_BUFFER) or SILENCE_COUNT ⁇ TERMINATE_THRESHOLD then call CALCULATE;
  • SILENCE_COUNT = SILENCE_COUNT + 1; otherwise do call STORE_EVENT(P_SILENCE_THRESHOLD, N_SILENCE_THRESHOLD,
  • UTTERANCE(n, 1..4) -- has same configuration */ array REFERENCE(1..REF_LENGTH, 1..4), UTTERANCE(1..UTT_LENGTH, 1..4); array SUBSCORE(0..3);
  • SCORE = truncate(SCORE / 2); /* Reward SCORE */ enddo; otherwise do
  • SCORE = truncate(SCORE / 2); /* Reward score for finishing */ return; /* end COMPARE */

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

A system for converting speech signals into phoneme identification signals includes a simple inflection point detection circuit (8) that produces "1"s and "0"s during the negative and positive pressure wave portions of the corresponding analog speech signal. For the "voiced" portions of the speech signals, the major peaks of each pitch cycle of the speech signal are detected (13) to produce pitch cycles that are analyzed by a microcomputer (10) in order to detect the start of each pitch cycle and to detect and record the durations of successive "1" and "0" levels. For each pitch cycle, an input vector is formed from the durations of the "1" levels that most directly follow the start of the pitch cycle. This input vector is corrected by an amount derived by determining the disparity between the speaker's present voice pitch and the resonant frequencies due to the configuration of the speaker's oral cavity. The velocity and acceleration representing the duration of the two "1" levels at the beginning of each pitch cycle are analyzed by the microcomputer (10), which sets a boundary between the beginning and end of each phoneme. For the fricative portions of the speech signal, a fricative vector is computed and compared with a fricative sound map in order to identify a fricative consonant. Silent intervals are marked by the failure of the inflection point detector (8) to indicate a substantial running average "1" level. Plosive phonemes are identified in part by the durations of the preceding silent intervals. Long-duration phonemes use the slope between the running average "1" and "0" duration values to aid the procedure. Identification of all phonemic time-sequence vectors is obtained by using different empirically derived reference vector domain maps.
PCT/US1985/002229 1985-11-08 1985-11-08 Systeme et procede de reconnaissance des sons avec selection de caracteres synchronisee a l'intonation de la voix WO1987003127A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19850905990 EP0245252A1 (fr) 1985-11-08 1985-11-08 Systeme et procede de reconnaissance des sons avec selection de caracteres synchronisee a l'intonation de la voix
PCT/US1985/002229 WO1987003127A1 (fr) 1985-11-08 1985-11-08 Systeme et procede de reconnaissance des sons avec selection de caracteres synchronisee a l'intonation de la voix

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US1985/002229 WO1987003127A1 (fr) 1985-11-08 1985-11-08 Systeme et procede de reconnaissance des sons avec selection de caracteres synchronisee a l'intonation de la voix

Publications (1)

Publication Number Publication Date
WO1987003127A1 true WO1987003127A1 (fr) 1987-05-21

Family

ID=22188935

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1985/002229 WO1987003127A1 (fr) 1985-11-08 1985-11-08 Systeme et procede de reconnaissance des sons avec selection de caracteres synchronisee a l'intonation de la voix

Country Status (2)

Country Link
EP (1) EP0245252A1 (fr)
WO (1) WO1987003127A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100902112B1 (ko) 2006-11-13 2009-06-09 한국전자통신연구원 키 재동기 구간의 음성 데이터를 예측하기 위한 벡터 정보삽입 방법, 전송 방법 및 벡터 정보를 이용한 키 재동기구간의 음성 데이터 예측 방법
KR100906766B1 (ko) 2007-06-18 2009-07-09 한국전자통신연구원 키 재동기 구간의 음성 데이터 예측을 위한 음성 데이터송수신 장치 및 방법
CN103390409A (zh) * 2012-05-11 2013-11-13 鸿富锦精密工业(深圳)有限公司 电子装置及其侦测色情音频的方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3989896A (en) * 1973-05-08 1976-11-02 Westinghouse Electric Corporation Method and apparatus for speech identification
GB2020467A (en) * 1978-05-08 1979-11-14 Marley J System for speech recognition
US4284846A (en) * 1978-05-08 1981-08-18 John Marley System and method for sound recognition
US4470150A (en) * 1982-03-18 1984-09-04 Federal Screw Works Voice synthesizer with automatic pitch and speech rate modulation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3989896A (en) * 1973-05-08 1976-11-02 Westinghouse Electric Corporation Method and apparatus for speech identification
GB2020467A (en) * 1978-05-08 1979-11-14 Marley J System for speech recognition
US4181813A (en) * 1978-05-08 1980-01-01 John Marley System and method for speech recognition
US4284846A (en) * 1978-05-08 1981-08-18 John Marley System and method for sound recognition
US4470150A (en) * 1982-03-18 1984-09-04 Federal Screw Works Voice synthesizer with automatic pitch and speech rate modulation

Also Published As

Publication number Publication date
EP0245252A1 (fr) 1987-11-19

Similar Documents

Publication Publication Date Title
US4783807A (en) System and method for sound recognition with feature selection synchronized to voice pitch
US4284846A (en) System and method for sound recognition
US4181813A (en) System and method for speech recognition
US4809332A (en) Speech processing apparatus and methods for processing burst-friction sounds
JP3162994B2 (ja) 音声のワードを認識する方法及び音声のワードを識別するシステム
US6553342B1 (en) Tone based speech recognition
US5602960A (en) Continuous mandarin chinese speech recognition system having an integrated tone classifier
US5220639A (en) Mandarin speech input method for Chinese computers and a mandarin speech recognition machine
CN100587806C (zh) 语音识别方法和语音识别装置
Hansen et al. Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/) with application to accent classification
US4707857A (en) Voice command recognition system having compact significant feature data
RU2466468C1 (ru) Система и способ распознавания речи
JP2980438B2 (ja) 人間の音声を認識するための方法及び装置
WO1987003127A1 (fr) Systeme et procede de reconnaissance des sons avec selection de caracteres synchronisee a l'intonation de la voix
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
JPS6138479B2 (fr)
Clapper Automatic word recognition
JPH0640274B2 (ja) 音声認識装置
JPS5934600A (ja) 音声認識装置
JPH0682275B2 (ja) 音声認識装置
WO1989003519A1 (fr) Procedes et appareil processeurs de la parole servant a traiter des sons plosifs-fricatifs
KR960007132B1 (ko) 음성인식장치 및 그 방법
JPS5969799A (ja) 音声登録方法
JPS61180300A (ja) 音声認識装置
JPS63131196A (ja) 鼻子音識別装置

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BR DK FI JP KP NO

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LU NL SE