US3909532A - Apparatus and method for determining the beginning and the end of a speech utterance - Google Patents

Apparatus and method for determining the beginning and the end of a speech utterance

Info

Publication number
US3909532A
Authority
US
United States
Prior art keywords
signal
representative
energy
output signals
developing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US456027A
Inventor
Lawrence Richard Rabiner
Lewis Hyman Rosenthal
Ronald William Schafer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
Bell Telephone Laboratories Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bell Telephone Laboratories Inc
Priority to US456027A
Priority to CA214,938A (CA1036271A)
Priority to AU79266/75A (AU495754B2)
Priority to GB12245/75A (GB1487291A)
Application granted
Publication of US3909532A
Anticipated expiration
Expired - Lifetime

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S52/00 Static structures, e.g. buildings
    • Y10S52/13 Hook and loop type fastener

Definitions

  • FIG. 1 A first figure.
  • Speech must be stored in digital form.
  • a file of speech is created and stored in a suitable memory
  • an adaptive speech encoder, e.g., an adaptive differential pulse code modulator (ADPCM)
  • ADPCM adaptive differential pulse code modulator
  • an adaptive speech encoder effectively exhibits a form of automatic gain control useful in determining the endpoints of an utterance.
  • Coded output words of such a coder, it has been found, exhibit high energy during both voiced and ... beginning of a speech utterance is detected when the code word energy exceeds a predetermined threshold for a
  • FIG. 1 depicts a prior art ADPCM coder which may be used in the practice of this invention
  • FIG. 2 displays the code word sequence for the utterance oh
  • FIG. 3 displays the decoded speech waveform corresponding to the code word sequence of FIG. 2;
  • FIG. 4 is a block diagram of apparatus used in the practice of this invention to determine code word energy;
  • FIG. 5 is a block diagram of apparatus used in the practice of this invention to determine the beginning and end of a speech utterance;
  • FIG. 5A is a block diagram depicting the system operation of this invention.
  • FIG. 6 displays the code word sequence for the beginning of the utterance three
  • FIG. 7 displays the code word energy corresponding to the code word sequence of FIG. 6;
  • FIG. 8 displays the decoded speech waveform corresponding to the code word sequence of FIG. 6;
  • FIG. 9 displays the energy of the speech waveform of FIG. 8;
  • FIG. 10 displays the code word sequence for the end of the utterance three.
  • FIG. 11 displays the speech waveform corresponding to the code word sequence of FIG. 10.
  • FIG. 1 depicts a prior art adaptive differential pulse code modulation circuit which is described in detail in "Adaptive Quantization in Differential PCM Coding of Speech," Bell System Technical Journal, Vol. 52, pp. 1105-1118, September 1973.
  • differential input amplifier or network 11 develops an output signal proportional to the difference between an applied sampled speech signal and a signal which is an estimate of the incoming speech signal. This difference signal is quantized in adaptive quantizer 12 and applied to encoder 13 and to summing amplifier or network 14.
  • logic network 16 utilizes the coded speech signals (code words) emanating from encoder 13 to determine optimum quantization steps. That is, logic network 16 monitors the coded output of encoder 13 and provides for adaptation of the step size on the basis of the most recent encoded quantizer output. For example, if the code word corresponds to one of the higher levels, the quantizer is overloaded and the step size is increased. On the other hand, if the code word corresponds to one of the lower levels, the step size is decreased.
  • coded speech signals code words
  • Step size adaptation effectively compensates for amplitude variations to the extent that the quantizer treats low level unvoiced speech signals, e.g., fricatives, much the same as high level voiced speech signals.
  • the objective is that each of the quantizer levels be used a significant portion of the time regardless of the absolute amplitude level of the incoming speech samples.
  • the adaptation logic insures that the step size will seek its minimum value and the difference signal will then fall within the lowest quantization levels.
  • FIG. 2 is an exemplary display of code word activity for the voiced utterance oh.
  • Each line, A, B, C, D of FIG. 2 corresponds to approximately 256 samples (6 kHz sampling rate) of the applied speech utterance, i.e., approximately 40 milliseconds of the signal.
  • Line B is to be considered a continuation of line A, line C a continuation of line B, etc. It is noted that for line A, and for most of line B, the code words show little activity, remaining for the most part within a limited range of quantization levels. This first part of the code word sequence corresponds to background silence. However, at almost the end of line B, and then for the remainder of lines C and D, the code word sequence fluctuates much more rapidly and with greater amplitude.
  • the largest negative quantization level is represented by the binary code word 0000, while the largest positive quantization level is represented by the binary code word 1111, corresponding to the decimal number 15.
  • the code word 0000, while the largest positive quantization level is represented by the binary code word 1111, corresponding to the decimal number 15.
  • a different coding implementation may be utilized which inherently has a zero average value. Since the number 7.5, corresponding to the average value, may not be conveniently represented in digital form, the following definition of energy may be utilized:
  • the code word energy is computed at each sample of the speech signal and compared with a threshold which is established at a level intermediate to the measured energy of silence and the average measured energy of the speech utterance. When the code word energy exceeds this threshold for approximately 320 consecutive samples, corresponding to about 50 milliseconds of speech, the word c(n) at which the energy first exceeded the threshold is defined as the beginning of an utterance.
  • the code word energy-threshold comparison is continued, and when the code word energy falls below the threshold for approximately 1,024 consecutive samples, corresponding to about 160 milliseconds of speech, the point at which the energy first fell below the threshold is defined as the end of the utterance.
  • the 160 millisecond criterion insures that a stop consonant within a word or phrase will not be mistaken for the end of the utterance.
  • Apparatus for determining the energy of the code words in accordance with Eq. (3) is illustrated in FIG. 4.
  • a code word, c(i), emanating from encoder 13 of FIG. 1, is applied to digital doubler 17, wherein it is doubled in value to develop a signal 2 c(i), which is twice the digital value of the applied code word.
  • Digital doubler 17 may be of any well-known configuration, e.g., a shift left by one bit register will double the value of an applied binary signal.
  • Digital subtractor 18 subtracts from signal 2 c(i), a signal supplied by digital reference register 19.
  • the signal stored in register 19 is proportional to the dc level or average of the code words. In a particular embodiment, the digital signal stored in register 19 is equal to fifteen as required by Eq. (2).
  • Digital multiplier 21 multiplies the output signal of subtractor 18 by itself to achieve a squared signal which corresponds to the function a(i) of Eq. (2). Both subtractor 18 and multiplier 21 may be conventional digital arithmetic circuits.
  • the signal output, a(i), of multiplier 21 is applied to shift register 22.
  • Register 22 which preferably has a digital capacity of one hundred and two words, sequentially shifts digital signal a(i) through the register at the system clock rate. It is to be understood that in the circuitry of FIG. 4, and also in that of FIG. 5, that all operations are performed in synchronism with the master sampling clock of the coder of FIG. 1, which has not been depicted in order not to obfuscate the operation of the instant invention.
  • the first and last words of register 22 are combined in conventional digital subtractor 23 to form a difference signal, a(n+50) - a(n-51).
  • This difference signal is applied to conventional digital adder 24 which, in conjunction with delay network 25, develops a signal representative of the code word energy as defined in Eq. (3)
  • Delay network 25 may be of conventional design and is utilized to delay the output of adder 24 by one clock period.
  • the output signal E(n), of adder 24, is applied to digital comparator 26 of FIG. 5.
  • Comparator 26 compares the energy of each code word E(n) with a signal stored in register 27 to determine whether or not the energy of the code word is above or below a predetermined threshold.
  • the threshold is generally empirically determined and may be approximately equal to a point midway between the measured energy of background silence and the average measured energy of the speech signal, which is readily obtained by averaging the output of the apparatus of FIG. 4.
  • the point at which the energy function first exceeded the threshold is defined as the beginning of an utterance.
  • the apparatus of FIG. 5 is utilized to determine when this has occurred.
  • the apparatus of FIG. 5 continues to make a comparison of the energy of subsequent code words with the threshold signal stored in register 27.
  • the threshold signal stored in register 27.
  • input lead 46 to NAND gate 31 is at a logical 1 state and input lead 47 to NAND gate 31 is at a logical 1 state.
  • lead 48, connecting the output of NAND gate 31 and one of the inputs to NAND gate 32, is at a 0 state and lead 51, one of the inputs to NAND gate 38, is also at a 0 state.
  • Clock input 49 to NAND gate 32 is presumed to enable NAND gate 32 upon the presence of a logical 1 on lead 49.
  • output lead 54 of NAND gate 32 is at a logical 1 state; counter 33 is presumed to be incremented upon the presence of a 0 level input on line 54.
  • output leads 55, 56 and 57 of counter 33 which correspond to the 10th, 8th and 6th powers, respectively, of the binary base two, are at a logical 0 state.
  • Output lead 58 of NAND gate 35 is thus at a logical 1 state as is output lead 59 of NAND gate 36.
  • Input leads 53 and 52 to NAND gate 38 are also at a logical 1 state, thus establishing output lead 61 of NAND gate 38 at a logical 1 state and output lead 62 of inverter circuit 37 at a logical 0 state. Since this is the clear input to counter 33, a logical 0 state is presumed to clear the counter.
  • output lead 43 of comparator 26 assumes a logical 1 state and output lead 45 of comparator 26 assumes a 0 state.
  • Output lead 46 of NAND gate 28 is then at a logical 0 state and output lead 47 of NAND gate 29 is at a logical 1 state.
  • Output lead 48 of NAND gate 31 assumes a logic 1 state as does lead 51, which is one of the inputs to NAND gate 38.
  • output lead 59 of NAND gate 36 assumes a logical 0 state indicating the beginning of a speech utterance.
  • the presence of a 0 level signal on output lead 59 resets flip-flop 34 so that a logical 1 signal appears on output lead 39 and a logical 0 signal appears on output lead 41.
  • Output lead 58 of NAND gate 35 remains at a logical 1 state.
  • the resetting of flip-flop 34 causes output lead 59 to return to a logical 1 state and in turn causes input lead 44 to NAND gate 29 to assume a logical 1 state and input lead 42 to NAND gate 28 to assume a logical 0 state.
  • output lead 43 is still at a logical 1 state, but since input lead 42 to NAND gate 28 is now at a logical 0 state, output lead 46 of NAND gate 28 assumes a logical 1 state.
  • Output lead 45 of comparator 26 is still at a 0 state, but input lead 44 to NAND gate 29 is now at a logical 1 state.
  • output lead 47 of NAND gate 29 is at a logical 1 state.
  • output lead 48 of NAND gate 31 assumes a logical 0 state as does input lead 51 to NAND gate 38.
  • Input lead 54 to counter 33 assumes a logical 1 state and counter 33 is not incremented.
  • output lead 47 of NAND gate 29 assumes a logical 0 state.
  • output lead 48 of NAND gate 31 is at a logical 1 state as is input lead 51 to NAND gate 38.
  • output lead 54 assumes a logical 0 state and increments counter 33. Assuming the input energy level of the code words remains below the predetermined threshold, counter 33 will be successively incremented but no change in the logic states of the circuit will occur until lead 55 of counter 33, which corresponds to the 10th power of two, assumes a logical 1 state. This state corresponds to a count of 1024.
  • output lead 58 assumes a logical 0 state indicating the end of the speech utterance while output lead 59 remains at a logical 1 state.
  • the occurrence of a 0 logic state on output lead 58 sets flip-flop 34 back to its original state, i.e., output lead 39 assumes a 0 state and output lead 41 assumes a 1 state.
  • Output lead 58 accordingly returns to a logical 1 state and the apparatus of FIG. 5 has returned to the conditions initially assumed prior to the beginning of the speech utterance.
  • the waveforms appearing at output leads 59 and 58 of the apparatus of FIG. 5 indicate the logic state transition, respectively, at the beginning and end of a speech utterance.
  • the signals appearing on leads 58 and 59 may be utilized to activate an alarm circuit to indicate to an operator that the beginning and end of a speech utterance has occurred.
  • a register which temporarily stores the code words of the apparatus of FIG. 1 so that the code words of the speech utterance, determined by the apparatus of FIG. 5, may be conveyed to a permanent store.
  • FIG. 5A is a block diagram depicting the overall operation of this invention, as discussed above.
  • Adaptive encoder 501 corresponds to the encoder shown in FIG. 1
  • code word energy detector 502 corresponds to the apparatus depicted in FIG. 4
  • threshold detector 503 corresponds to the apparatus shown in FIG. 5.
  • FIG. 6 displays the sequence of code words corresponding to the beginning of the word three.
  • the left-half of line A shows very little code word variation and corresponds to low level noise.
  • the right-half of line A, and the next two lines, B and C correspond to the initial fricative th of the word three.
  • the code words show markedly greater variation as does the last line, D, which corresponds to the beginning of voicing, i.e., ree.
  • the marker in the middle of line A denotes the beginning point of the speech utterance, as determined by this invention.
  • FIG. 7 displays the energy of the code words of FIG. 6, as determined by this invention.
  • the marker on line A denotes the point at which the energy of the code words exceeded the threshold and remained above the threshold for approximately 50 milliseconds, as discussed above. It is noted that the code word energy is roughly the same for both the voiced and unvoiced segments of the utterance while the energy is significantly lower when no speech is present.
  • FIG. 8 displays the actual speech waveform represented by the code word sequence of FIG. 6. The beginning of the word three is not nearly as evident as in the code word sequence; indeed, it is hardly discernible.
  • FIG. 10 displays the code word sequence at the end of the word three.
  • the marker on line B indicates the end point of the utterance as determined by the instant invention.
  • FIG. 11 displays the speech waveform corresponding to the code word sequence of FIG. 10. The end point of the utterance is clearly not apparent from an examination of the
  • the instant invention has been tested extensively in determining the beginning and end speech entries for a voice response system vocabulary, and has proved to be very reliable.
  • Two other aspects of the coded speech signal i.e., the energy of the difference signal of the coder of FIG. 1, and the energy of the quantizer output were also studied as possible considerations for use in the instant invention.
  • the results based on the coded word samples themselves were found to be far more accurate.
  • Apparatus for determining a boundary of an applied speech utterance comprising:
  • said threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
  • Apparatus for determining a boundary of an applied speech utterance comprising:
  • said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
  • Apparatus for detecting the beginning of a speech utterance including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:
  • said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
  • said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
  • fourth means for sequentially storing a predetermined number of said squared signals
  • sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals
  • Apparatus for determining the beginning of a speech signal including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:
  • Apparatus for detecting the end of a speech utterance including an adaptive differential pulse code modulation circuit responsive to said speech utterance,
  • said means for developing a signal representative of the energy of said digitally coded output signals comprises:
  • Apparatus for determining the end of a speech signal including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:
  • Apparatus for detecting the boundaries of a speech utterance including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:
  • code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals
  • comparator means for comparing said digital representative signal with an applied digital threshold signal
  • Apparatus for determining the boundaries of a speech signal including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:
  • code word energy means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals
  • comparator means for comparing said representative signal with an applied threshold signal
  • Apparatus for detecting the boundaries of a speech utterance including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:
  • code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals
  • fourth means for sequentially storing a predetermined number of said squared signals; fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;
  • sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals
  • digital comparator means responsive to said signal representative of the energy of said coded output signals and to said applied digital threshold signal for developing a signal at a first output terminal when said representative energy signal is greater than said threshold signal and for developing a signal at a second output terminal when said representative energy signal is less than said threshold signal;
  • bistable circuit having first and second output terminals, and set and reset terminals
  • a fourth logic circuit responsive to the output signal of said third logic circuit and to an applied clock signal
  • a counter circuit having a plurality of output terminals, responsive to the output signal of said fourth logic circuit
  • a fifth logic circuit for developing said signal indicative of the end of said speech utterance, responsive to the signal at the second output terminal of said bistable circuit and to the signal at a preselected one of said plurality of counter circuit output terminals;
  • a sixth logic circuit for developing said signal indicative of the beginning of said speech utterance, responsive to the signal at the first output terminal of said bistable circuit and to the signals at the other of said plurality of counter circuit output terminals;
  • a seventh logic circuit responsive to the output signals of said third, fifth, and sixth logic circuits for developing a control signal for said counter circuit, said control signal returning said counter to a predetermined initial state;
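
The FIG. 4 energy apparatus described in the list above (doubler 17, subtractor 18 with reference register 19, multiplier 21, the 102-word shift register 22, subtractor 23, and adder 24 with delay 25) can be sketched in software. This is only an illustrative sketch: the function name `running_energy`, its parameters, and the choice to start with a cleared register are assumptions, not part of the patent. Each output lags the input by 50 samples, since the window is centered on sample n.

```python
from collections import deque

def running_energy(codes, window=101, reference=15):
    """Recursive code word energy in the style of FIG. 4 (hypothetical sketch).

    Each 4-bit code word c is doubled and offset by `reference` (15), then
    squared to give a. The energy is updated as
        E(n) = E(n-1) + a(newest) - a(oldest)
    using a shift register that holds window + 1 = 102 words.
    """
    # Assumption: the shift register starts cleared (all zeros).
    register = deque([0] * (window + 1), maxlen=window + 1)
    energy = 0  # value held by the one-clock delay
    out = []
    for c in codes:
        a = (2 * c - reference) ** 2   # doubler, subtractor, multiplier
        register.append(a)             # newest word in, oldest word falls out
        energy += register[-1] - register[0]  # subtractor 23 and adder 24
        out.append(energy)
    return out
```

Once 101 code words have been shifted in, each output equals the sum of the 101 most recent squared values, which is the quantity a direct summation would give, at the cost of one addition and one subtraction per sample.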

Abstract

It has been discovered that the energy of the code words at the output of an adaptive speech encoder may be utilized to accurately determine the beginning and end of an encoded speech utterance. The beginning of an utterance is detected when the code word energy exceeds a predetermined threshold for a fixed duration of time. Likewise, the end of an utterance is detected when the code word energy falls below the threshold for another fixed duration of time.
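
The detection rule summarized in the abstract can be sketched as a small run-length check over an energy sequence. This is a minimal sketch, not the NAND-gate circuit of FIG. 5: the function name `find_endpoints` and its parameter names are hypothetical, and the default run lengths of 320 and 1,024 samples are the values quoted elsewhere in this document for the beginning and end criteria.

```python
def find_endpoints(energy, threshold, begin_count=320, end_count=1024):
    """Return (begin, end) sample indices, or None where not found.

    begin: first index of a run of at least `begin_count` consecutive
           samples whose energy exceeds the threshold.
    end:   first index of a later run of at least `end_count` consecutive
           samples whose energy is below the threshold.
    """
    begin = end = None
    run_start = None
    for n, e in enumerate(energy):
        looking_for_begin = begin is None
        qualifies = e > threshold if looking_for_begin else e < threshold
        if not qualifies:
            run_start = None          # run broken: clear the counter
            continue
        if run_start is None:
            run_start = n             # remember where this run started
        needed = begin_count if looking_for_begin else end_count
        if n - run_start + 1 >= needed:
            if looking_for_begin:
                begin = run_start
                run_start = None
            else:
                end = run_start
                break
    return begin, end
```

Because the end criterion needs a much longer run than the beginning criterion, a short low-energy gap inside a word (for example, a stop consonant) resets the run without being reported as the end of the utterance.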

Description

United States Patent
Rabiner et al.
[45] Sept. 30, 1975
[54] APPARATUS AND METHOD FOR DETERMINING THE BEGINNING AND THE END OF A SPEECH UTTERANCE
[75] Inventors: Lawrence Richard Rabiner, Berkeley Heights, N.J.; Lewis Hyman Rosenthal, Cambridge, Mass.; Ronald William Schafer, New Providence, N.J.
[73] Assignee: Bell Telephone Laboratories, Incorporated, Murray Hill, N.J.
[22] Filed: Mar. 29, 1974
[21] Appl. No.: 456,027
[52] U.S. Cl. 179/1 SC; 325/36 B
[51] Int. Cl. G10L 1/04
[58] Field of Search 179/1 SA, 1 SC; 325/38 B, 325/62 326
[56] References Cited
UNITED STATES PATENTS
3,750,024 7/1973 Dunn et al. 325/38 B
OTHER PUBLICATIONS
Johnson, C. et al., Adaptive Rate Delta Modulator, IBM Tech. Disclosure, April 1973.
Primary Examiner: Kathleen H. Claffy
Assistant Examiner: E. S. Kemeny
Attorney, Agent, or Firm: G. E. Murphy
27 Claims, 12 Drawing Figures
[Front-page figure, FIG. 5A: SPEECH INPUT, ADAPTIVE ENCODER 501, c(n), CODE WORD ENERGY DETECTOR 502, E(n), THRESHOLD DETECTOR 503, BEGIN and END outputs]
APPARATUS AND METHOD FOR DETERMINING THE BEGINNING AND THE END OF A SPEECH UTTERANCE

... of extensive research. Generally, for these applications, speech must be stored in digital form. Typically, a file of speech is created and stored in a suitable memory, e.g., a fixed head disk or drum. In order to efficiently store speech, it is necessary that individual words and phrases be stored in memory without intervening periods of silence between entries. Thus, the need to automatically locate the beginning and end of a speech utterance frequently arises in speech processing for man-machine communication.

DESCRIPTION OF THE PRIOR ART

Conventionally, the task of determining the endpoints of a speech utterance has been accomplished by manual editing, utilizing a combination of auditory and visual examinations of the speech waveform. However, manual editing is both time-consuming and subject to the inaccuracies concomitant with human judgment. Furthermore, repeatable results are not normally obtained. One reason for this is that the wide dynamic range of speech renders the combination of ear and eye a poor determinant of word boundaries. This is especially true when an unvoiced segment of speech, e.g., the fricative at the beginning of the word three, appears at the beginning or end of a word. Consequently, manual editing usually results in shortening the speech, both at the beginning and at the end of the utterance. Thus, the words are chopped, and when they are concatenated to form a message, the effects are quite discernible and also distracting.

It is thus an object of this invention to efficiently, accurately, and automatically detect the beginning and end of a speech utterance.

SUMMARY OF THE INVENTION

This and other objects of this invention are accomplished by utilizing an adaptive speech encoder, e.g., an adaptive differential pulse code modulator (ADPCM), an adaptive delta modulator, etc. It has been discovered by us that because of the step size adaptation used in developing adaptive encoded speech, an adaptive speech encoder effectively exhibits a form of automatic gain control useful in determining the endpoints of an utterance. Coded output words of such a coder, it has been found, exhibit high energy during both voiced and ... beginning of a speech utterance is detected when the code word energy exceeds a predetermined threshold for a fixed interval of time. Likewise, the end of an utterance is detected when the code word energy falls below the threshold for another fixed interval of time.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a prior art ADPCM coder which may be used in the practice of this invention;
FIG. 2 displays the code word sequence for the utterance oh;
FIG. 3 displays the decoded speech waveform corresponding to the code word sequence of FIG. 2;
FIG. 4 is a block diagram of apparatus used in the practice of this invention to determine code word energy;
FIG. 5 is a block diagram of apparatus used in the practice of this invention to determine the beginning and end of a speech utterance;
FIG. 5A is a block diagram depicting the system operation of this invention;
FIG. 6 displays the code word sequence for the beginning of the utterance three;
FIG. 7 displays the code word energy corresponding to the code word sequence of FIG. 6;
FIG. 8 displays the decoded speech waveform corresponding to the code word sequence of FIG. 6;
FIG. 9 displays the energy of the speech waveform of FIG. 8;
FIG. 10 displays the code word sequence for the end of the utterance three; and
FIG. 11 displays the speech waveform corresponding to the code word sequence of FIG. 10.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 depicts a prior art adaptive differential pulse code modulation circuit which is described in detail in the article by P. Cummiskey, N. S. Jayant, and J. L. Flanagan, entitled "Adaptive Quantization in Differential PCM Coding of Speech," Bell System Technical Journal, Vol. 52, pp. 1105-1118, September 1973. In the ADPCM coder of FIG. 1, differential input amplifier or network 11 develops an output signal proportional to the difference between an applied sampled speech signal and a signal which is an estimate of the incoming speech signal. This difference signal is quantized in adaptive quantizer 12 and applied to encoder 13 and to summing amplifier or network 14. Summing amplifier 14, in conjunction with first order prediction network 15, having a transfer function, for example, of $az^{-1}$, is utilized to develop an estimate of the incoming speech signal. If the estimate of the input speech signal is fairly accurate, then the difference signal emanating from network 11 will be small and thus more accurately represented by a fixed number of bits than the input speech samples themselves. The difference signal, although nowhere near as redundant as the original speech signal, still exhibits a wide amplitude range. In order to make efficient use of the available quantization levels of quantizer 12, the peak excursion of the ... level voiced sounds. Accordingly, the need for adaptive quantization is apparent; logic network 16 utilizes the coded speech signals (code words) emanating from encoder 13 to determine optimum quantization steps. That is, logic network 16 monitors the coded output of encoder 13 and provides for adaptation of the step size on the basis of the most recent encoded quantizer output. For example, if the code word corresponds to one of the higher levels, the quantizer is overloaded and the step size is increased. On the other hand, if the code word corresponds to one of the lower levels, the step size is decreased.

Step size adaptation effectively compensates for amplitude variations to the extent that the quantizer treats low level unvoiced speech signals, e.g., fricatives, much the same as high level voiced speech signals. The objective, of course, is that each of the quantizer levels be used a significant portion of the time regardless of the absolute amplitude level of the incoming speech samples. However, when the amplitude of the input speech signal is of the order of the minimum step size, the adaptation logic insures that the step size will seek its minimum value and the difference signal will then fall within the lowest quantization levels. We have discovered that when no speech is present at the input, the code word energy will vary only slightly. It is this feature of ADPCM speech encoding, and of adaptive encoding in general, that is turned to account in the practice of this invention. It is to be understood that the principles of this invention are applicable to all forms of adaptive encoders including ADPCM and adaptive delta modulation.
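
The step size adaptation described above can be illustrated with a toy coder. This is a rough sketch under stated assumptions, not the coder of FIG. 1 or of the cited article: the multiplier table, the step size limits, the predictor coefficient, and the name `adpcm_encode` are all hypothetical. The only details taken from the text are the 4-bit code words (0 for the largest negative level, 15 for the largest positive level), the first order predictor, and the rule that high levels enlarge the step while low levels shrink it.

```python
# Assumed constants; none of these values come from the patent text.
MULTIPLIERS = (0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4)  # per magnitude level 0..7
STEP_MIN, STEP_MAX = 1.0, 512.0
ALPHA = 0.85  # first order predictor coefficient (the "a" in a z^-1)

def adpcm_encode(samples):
    """Return 4-bit code words (0..15) for a sequence of input samples."""
    step = STEP_MIN
    estimate = 0.0
    codes = []
    for x in samples:
        diff = x - estimate                            # difference network 11
        level = min(int(abs(diff) / step), 7)          # adaptive quantizer 12
        code = 8 + level if diff >= 0 else 7 - level   # encoder 13
        codes.append(code)
        quantized = (level + 0.5) * step               # reconstructed difference
        if diff < 0:
            quantized = -quantized
        estimate = ALPHA * (estimate + quantized)      # summer 14, predictor 15
        # Logic network 16: high levels grow the step, low levels shrink it.
        step = min(max(step * MULTIPLIERS[level], STEP_MIN), STEP_MAX)
    return codes
```

With a silent input the step size stays at its minimum and the code words remain in the two innermost levels (7 and 8), which is the low-activity behavior the text attributes to background silence.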
FIG. 2 is an exemplary display of code word activity for the voiced utterance "oh." Each line, A, B, C, D of FIG. 2, corresponds to approximately 256 samples (6 kHz sampling rate) of the applied speech utterance, i.e., approximately 40 milliseconds of the signal. Line B is to be considered a continuation of line A, line C a continuation of line B, etc. It is noted that for line A, and for most of line B, the code words show little activity, remaining for the most part within a limited range of quantization levels. This first part of the code word sequence corresponds to background silence. However, at almost the end of line B, and then for the remainder of lines C and D, the code word sequence fluctuates much more rapidly and with greater amplitude. FIG. 3 illustrates the decoded speech waveform corresponding to the code word sequence of FIG. 2. It is noted in FIG. 3 that voiced speech apparently commences somewhere near the end of line B and continues for lines C and D. This property of code words to indicate the presence of speech activity is more accurately reflected in what we define as adaptation activity or code word energy. The code word energy may be defined as the number of code word adaptations per unit time. In one embodiment of this invention, we used as a measurement of energy the sum of the squares of the code words for one hundred and one samples, or code words, corresponding to a 16 millisecond window centered about a selected sample. That is, the code word energy may be defined as
$$E(n) = \sum_{i=n-50}^{n+50} c^2(i) \qquad (1)$$
where c(i) corresponds to a code word emanating from encoder 13 of FIG. 1. Of course, other equivalent definitions of energy may be utilized.
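The sum of squares over one hundred and one code words described above can be sketched in a few lines of Python. The function name `code_word_energy` and the decision to truncate the window at the ends of the sequence are assumptions made for illustration; the text does not say how the edges are handled.

```python
def code_word_energy(codes, n, half_width=50):
    """Sum of squared code words in a window centered on sample n.

    The window spans n - half_width .. n + half_width (101 words by default)
    and is truncated where it would run past either end of the sequence.
    """
    lo = max(0, n - half_width)
    hi = min(len(codes), n + half_width + 1)
    return sum(c * c for c in codes[lo:hi])
```

For example, a constant sequence of ones gives an energy of 101 at any sample far enough from the edges.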
In the prior art ADPCM implementation of FIG. 1, the largest negative quantization level is represented by the binary code word 0000, while the largest positive quantization level is represented by the binary code word 1111, corresponding to the decimal number 15. Thus, it is necessary, if one is using such a symmetrical coding system, to subtract from the code words a number corresponding to the dc level or average value of the code words to make the average level of the code words equal to zero. Of course, a different coding implementation may be utilized which inherently has a zero average value. Since the number 7.5, corresponding to the average value, may not be conveniently represented in digital form, the following definition of energy may be utilized:
$$E(n) = \sum_{i=n-50}^{n+50} a(i) \qquad (2)$$
where $a(i) = [2c(i) - 15]^2$. By using this definition, the dc level is removed from consideration and the energy content of the code words differs from the definition of Eq. (1) by only a multiplicative constant. It may readily be shown that the energy term defined by Eq. (2) is equivalent to
$$E(n) = E(n-1) + a(n+50) - a(n-51) \qquad (3)$$
The code word energy, in accordance with this invention, is computed at each sample of the speech signal and compared with a threshold which is established at a level intermediate to the measured energy of silence and the average measured energy of the speech utterance. When the code word energy exceeds this threshold for approximately 320 consecutive samples, corresponding to about 50 milliseconds of speech, the word c(n) at which the energy first exceeded the threshold is defined as the beginning of an utterance. The code word energy-threshold comparison is continued, and when the code word energy falls below the threshold for approximately 1,024 consecutive samples, corresponding to about 160 milliseconds of speech, the point at which the energy first fell below the threshold is defined as the end of the utterance. The 160 millisecond criterion insures that a stop consonant within a word or phrase will not be mistaken for the end of the utterance.
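Two statements in this passage, that removing the dc level changes the energy only by a constant factor and that the windowed sum can be updated recursively, follow from a short calculation. The sketch below assumes that Eq. (2) is the 101-term sum of a(i) = [2c(i) - 15]^2 over the window i = n-50, ..., n+50, and that the zero-mean form of Eq. (1) is the same sum taken over [c(i) - 7.5]^2.

```latex
\begin{align*}
a(i) &= [2c(i) - 15]^2 = 4\,[c(i) - 7.5]^2, \\
E(n) &= \sum_{i=n-50}^{n+50} a(i) = 4 \sum_{i=n-50}^{n+50} [c(i) - 7.5]^2, \\
E(n) - E(n-1) &= \sum_{i=n-50}^{n+50} a(i) - \sum_{i=n-51}^{n+49} a(i) = a(n+50) - a(n-51).
\end{align*}
```

The second line shows the multiplicative constant to be 4. The third line is the recursive update: each new energy value requires one addition and one subtraction instead of a fresh sum over 101 terms.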
Apparatus for determining the energy of the code words in accordance with Eq. (3) is illustrated in FIG. 4. A code word, c(i), emanating from encoder 13 of FIG. 1, is applied to digital doubler 17, wherein it is doubled in value to develop a signal 2c(i), which is twice the digital value of the applied code word. Digital doubler 17 may be of any well-known configuration, e.g., a shift left by one bit register will double the value of an applied binary signal. Digital subtractor 18 subtracts from signal 2c(i) a signal supplied by digital reference register 19. The signal stored in register 19 is proportional to the dc level or average of the code words. In a particular embodiment, the digital signal stored in register 19 is equal to fifteen as required by Eq. (2). Digital multiplier 21 multiplies the output signal of subtractor 18 by itself to achieve a squared signal which corresponds to the function a(i) of Eq. (2). Both subtractor 18 and multiplier 21 may be conventional digital arithmetic circuits. The signal output, a(i), of multiplier 21 is applied to shift register 22. Register 22, which preferably has a digital capacity of one hundred and two words, sequentially shifts digital signal a(i) through the register at the system clock rate. It is to be understood that in the circuitry of FIG. 4, and also in that of FIG. 5, all operations are performed in synchronism with the master sampling clock of the coder of FIG. 1, which has not been depicted in order not to obfuscate the operation of the instant invention. At any point of time, the last digital word stored in register 22, i.e., the oldest word in storage, corresponds to a(n-51), and the first word stored in register 22, i.e., the most recently stored word, corresponds to a(n+50).
The first and last words of register 22 are combined in conventional digital subtractor 23 to form a difference signal, a(n+50) - a(n-51). This difference signal is applied to conventional digital adder 24 which, in conjunction with delay network 25, develops a signal representative of the code word energy as defined in Eq. (3). Delay network 25 may be of conventional design and is utilized to delay the output of adder 24 by one clock period.
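The signal path of FIG. 4 can be mimicked in software as a check on the recursive update. The following Python sketch is an illustration only: the function name `fig4_energy` is hypothetical, and the shift register is assumed to start out cleared to zero, which the text does not state.

```python
from collections import deque

def fig4_energy(codes, reference=15, register_len=102):
    """Return the running energy after each code word is clocked in."""
    # Shift register 22, assumed here to start cleared to zero.
    register = deque([0] * register_len, maxlen=register_len)
    energy = 0                          # value held by delay network 25
    out = []
    for c in codes:
        doubled = c << 1                # digital doubler 17: shift left by one bit
        diff = doubled - reference      # subtractor 18 with reference register 19
        squared = diff * diff           # multiplier 21 forms a(i)
        register.appendleft(squared)    # newest word enters, the 103rd-newest is dropped
        # Subtractor 23 and adder 24: add the newest word, remove the oldest.
        energy += register[0] - register[-1]
        out.append(energy)
    return out
```

After the k-th code word has been clocked in, the output equals the sum of the 101 most recent squared words, which is E(n) for n = k - 50; the apparatus therefore reports the energy with a latency of 50 samples.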
The output signal E(n), of adder 24, is applied to digital comparator 26 of FIG. 5. Comparator 26 compares the energy of each code word E(n) with a signal stored in register 27 to determine whether or not the energy of the code word is above or below a predetermined threshold. The threshold is generally empirically determined and may be approximately equal to a point midway between the measured energy of background silence and the average measured energy of the speech signal, which is readily obtained by averaging the output of the apparatus of FIG. 4. As discussed above, when the code word energy exceeds this threshold for approximately 50 milliseconds or 300 consecutive samples, the point at which the energy function first exceeded the threshold is defined as the beginning of an utterance. The apparatus of FIG. 5 is utilized to determine when this has occurred. Also, when an utterance has been determined to have begun, the apparatus of FIG. 5 continues to make a comparison of the energy of subsequent code words with the threshold signal stored in register 27. When the code word energy falls below this threshold for approximately 160 milliseconds or 1,000 consecutive samples, the point at which the energy function first passed below the threshold is recorded as the end of the utterance.
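The threshold choice and the two duration criteria can be expressed as a short behavioral sketch in Python. The function names are hypothetical, the run lengths of 320 and 1,024 samples are taken from the counter values used elsewhere in the description, and the handling of an energy exactly equal to the threshold (treated as neither above nor below) is an assumption.

```python
def midpoint_threshold(silence_energy, speech_energy):
    """Threshold midway between mean silence energy and mean speech energy."""
    mean_silence = sum(silence_energy) / len(silence_energy)
    mean_speech = sum(speech_energy) / len(speech_energy)
    return (mean_silence + mean_speech) / 2

def find_endpoints(energy, threshold, begin_run=320, end_run=1024):
    """Return (begin, end) sample indices, or None where no boundary is found."""
    begin = end = None
    run_start = None                    # index where the current qualifying run began
    for n, e in enumerate(energy):
        # Before the beginning is found, look for energy above the threshold;
        # afterwards, look for energy below it.
        qualifies = e > threshold if begin is None else e < threshold
        if not qualifies:
            run_start = None            # run broken, start over
            continue
        if run_start is None:
            run_start = n
        run_len = n - run_start + 1
        if begin is None and run_len >= begin_run:
            begin = run_start           # point where the energy first exceeded the threshold
            run_start = None
        elif begin is not None and run_len >= end_run:
            end = run_start             # point where the energy first fell below it
            break
    return begin, end
```

In this sketch a dip below the threshold that lasts fewer than 1,024 samples, such as the closure of a stop consonant, resets the run and is not reported as the end of the utterance.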
To understand the operation of the circuit of FIG. 5, it is convenient to assume that speech is not present at the input to the ADPCM coder and, in fact, has not been present long enough so that the last indication encountered was an end of a speech utterance. This is indicated by certain states or levels for particular circuit components. Thus, it may be assumed that output lead 39 of flip-flop 34 is at a logical 0 state and that output lead 41 of flip-flop 34 is at a logical 1 state. It may also be assumed that output lead 43 of digital comparator 26 is at a 0 state and that output lead 45 of digital comparator 26 is at a 1 state. Accordingly, input lead 42 to NAND gate 28 is at a logical 1 state and input lead 44 to NAND gate 29 is at a logical 0 state. In accordance with the well-known logical rules for NAND circuits, input lead 46 to NAND gate 31 is at a logical 1 state and input lead 47 to NAND gate 31 is at a logical 1 state. Thus, lead 48, connecting the output of NAND gate 31 and one of the inputs to NAND gate 32, is at a 0 state and lead 51, one of the inputs to NAND gate 38, is also at a 0 state. Clock input 49 to NAND gate 32 is presumed to enable NAND gate 32 upon the presence of a logical 1 on lead 49. Accordingly, output lead 54 of NAND gate 32 is at a logical 1 state; counter 33 is presumed to be incremented upon the presence of a 0 level input on line 54. Thus, output leads 55, 56 and 57 of counter 33, which correspond to the 10th, 8th and 6th powers, respectively, of the binary base two, are at a logical 0 state. Output lead 58 of NAND gate 35 is thus at a logical 1 state as is output lead 59 of NAND gate 36. Input leads 53 and 52 to NAND gate 38 are also at a logical 1 state, thus establishing output lead 61 of NAND gate 38 at a logical 1 state and output lead 62 of inverter circuit 37 at a logical 0 state. Since this is the clear input to counter 33, a logical 0 state is presumed to clear the counter.
If it is now presumed that the energy signal applied to digital comparator 26 exceeds the output of digital threshold register 27, output lead 43 of comparator 26 assumes a logical 1 state and output lead 45 of comparator 26 assumes a 0 state. Output lead 46 of NAND gate 28 is then at a logical 0 state and output lead 47 of NAND gate 29 is at a logical 1 state. Output lead 48 of NAND gate 31 assumes a logical 1 state as does lead 51, which is one of the inputs to NAND gate 38. Since input leads 52 and 53 are already at a logical 1 state, the output lead 61 of NAND gate 38 assumes a logical 0 state and therefore output lead 62 of inverter 37 assumes a logical 1 state, thereby allowing counter 33 to be incremented. Upon the presence of a logical 1 signal at clock input 49 to NAND gate 32, output lead 54 of NAND gate 32 assumes a logical 0 state and counter 33 is incremented. Assuming that the input energy signal to comparator 26 remains above the predetermined threshold, then with each energy word, counter 33 will be incremented. When counter 33 reaches a level of 320, which corresponds to a 1 output on leads 56 and 57, output lead 59 of NAND gate 36 assumes a logical 0 state indicating the beginning of a speech utterance. The presence of a 0 level signal on output lead 59 resets flip-flop 34 so that a logical 1 signal appears on output lead 39 and a logical 0 signal appears on output lead 41. Output lead 58 of NAND gate 35 remains at a logical 1 state. The resetting of flip-flop 34 causes output lead 59 to return to a logical 1 state and in turn causes input lead 44 to NAND gate 29 to assume a logical 1 state and input lead 42 to NAND gate 28 to assume a logical 0 state. Assuming that the energy signal remains above the threshold, output lead 43 is still at a logical 1 state, but since input lead 42 to NAND gate 28 is now at a logical 0 state, output lead 46 of NAND gate 28 assumes a logical 1 state.
Output lead 45 of comparator 26 is still at a 0 state, but input lead 44 to NAND gate 29 is now at a logical 1 state. Thus, output lead 47 of NAND gate 29 is at a logical 1 state. Accordingly, output lead 48 of NAND gate 31 assumes a logical 0 state as does input lead 51 to NAND gate 38. Input lead 54 to counter 33 assumes a logical 1 state and counter 33 is not incremented. Since input lead 51 is at a 0 state and input leads 52 and 53 of NAND gate 38 are at a logical 1 state, output lead 61 of NAND gate 38 is at a logical 1 state and the clear input to counter 33, lead 62, is at a logical 0 state. Thus, the counter is cleared and output leads 58, 59 remain at a logical 1 state. When the energy of the applied code words to digital comparator 26 decreases to a level below the threshold level established by register 27, output lead 45 of comparator 26 assumes a logical 1 state and output lead 43 assumes a logical 0 state. Since input lead 42 to NAND gate 28 is at a 0 level, output lead 46 of NAND gate 28 assumes a logical 1 state. Similarly, since input lead 44 to NAND gate 29 is at a logical 1 state, output lead 47 of NAND gate 29 assumes a logical 0 state. Thus, output lead 48 of NAND gate 31 is at a logical 1 state as is input lead 51 to NAND gate 38. Upon the occurrence of a 1 level on clock input 49 to NAND gate 32, output lead 54 assumes a logical 0 state and increments counter 33. Assuming the input energy level of the code words remains below the predetermined threshold, counter 33 will be successively incremented but no change in the logic states of the circuit will occur until leads 55, 56, and 57 of counter 33 all assume a logical 1 state. This state corresponds to a count of 1024. Upon the occurrence of this condition, output lead 58 assumes a logical 0 state indicating the end of the speech utterance while output lead 59 remains at a logical 1 state.
The occurrence of a 0 logic state on output lead 58 sets flip-flop 34 back to its original state, i.e., output lead 39 assumes a 0 state and output lead 41 assumes a 1 state. Output lead 58 accordingly returns to a logical 1 state and the apparatus of FIG. 5 has returned to the conditions initially assumed prior to the beginning of the speech utterance. The waveforms appearing at output leads 59 and 58 of the apparatus of FIG. 5 indicate the logic state transition, respectively, at the beginning and end of a speech utterance. The output signals of the apparatus of FIG. 5 may be used in a variety of ways. For example, they may be used to gate a register which temporarily stores the code words of the apparatus of FIG. 1 so that the code words of the speech utterance, determined by the apparatus of FIG. 5, may be conveyed to a permanent store. Or, if so desired, the signals appearing on leads 58 and 59 may be utilized to activate an alarm circuit to indicate to an operator that the beginning and end of a speech utterance has occurred. Many other applications, of course, will be apparent to those skilled in the art.
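The state walk-through above can be followed more easily with a small simulation. The Python sketch below evaluates the NAND gates once per sample, which ignores clock phases and propagation delays. Two points are assumptions inferred from the described lead states rather than read from FIG. 5: that gate 36 (begin) is gated by flip-flop lead 41 and gate 35 (end) by flip-flop lead 39 together with counter lead 55 only. Reporting the boundary as the sample where the run started is also an addition for illustration; the circuit itself only flags the moment the count is reached.

```python
def nand(*inputs):
    """NAND of any number of 0/1 inputs."""
    return 0 if all(inputs) else 1

def fig5_events(energy, threshold):
    """Return a list of ("begin" | "end", boundary_index) events."""
    lead39, lead41 = 0, 1               # flip-flop 34 in the "no speech" state
    count = 0                           # counter 33
    events = []
    for n, e in enumerate(energy):
        lead43 = int(e > threshold)     # comparator 26, energy above threshold
        lead45 = int(e < threshold)     # comparator 26, energy below threshold
        lead46 = nand(lead43, lead41)   # NAND gate 28
        lead47 = nand(lead45, lead39)   # NAND gate 29
        lead48 = nand(lead46, lead47)   # NAND gate 31
        # Gate 32 clocks the counter while lead 48 is 1; gates 38 and 37
        # clear it while lead 48 is 0.
        count = count + 1 if lead48 else 0
        lead55 = (count >> 10) & 1      # 2**10 output of the counter
        lead56 = (count >> 8) & 1       # 2**8 output
        lead57 = (count >> 6) & 1       # 2**6 output
        lead58 = nand(lead55, lead39)           # NAND gate 35, end flag (active 0)
        lead59 = nand(lead56, lead57, lead41)   # NAND gate 36, begin flag (active 0)
        if lead59 == 0:                 # count reached 320 while waiting for speech
            events.append(("begin", n - count + 1))
            lead39, lead41, count = 1, 0, 0
        elif lead58 == 0:               # count reached 1024 while in speech
            events.append(("end", n - count + 1))
            lead39, lead41, count = 0, 1, 0
    return events
```

With an energy sequence that is above the threshold for 400 samples, below it for 200, above it again for 400, and then below it for good, this sketch reports one beginning and one end; the 200-sample dip resets the counter without producing an end flag.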
FIG. 5A is a block diagram depicting the overall operation of this invention, as discussed above. Adaptive encoder 501 corresponds to the encoder shown in FIG. 1, code word energy detector 502 corresponds to the apparatus depicted in FIG. 4, and threshold detector 503 corresponds to the apparatus shown in FIG. 5.
The significant advantages of the instant invention, in determining the beginning and end of a speech utterance, are illustrated by FIGS. 6 through 11. FIG. 6 displays the sequence of code words corresponding to the beginning of the word three. The left-half of line A shows very little code word variation and corresponds to low level noise. The right-half of line A, and the next two lines, B and C, correspond to the initial fricative th of the word three. The code words show markedly greater variation as does the last line, D, which corresponds to the beginning of voicing, i.e., ree. The marker in the middle of line A denotes the beginning point of the speech utterance, as determined by this invention. FIG. 7 displays the energy of the code words of FIG. 6, as determined by this invention. The marker on line A denotes the point at which the energy of the code words exceeded the threshold and remained above the threshold for approximately 50 milliseconds, as discussed above. It is noted that the code word energy is roughly the same for both the voiced and unvoiced segments of the utterance while the energy is significantly lower when no speech is present. FIG. 8 displays the actual speech waveform represented by the code word sequence of FIG. 6. The beginning of the word three is not nearly as evident as in the code word sequence; indeed, it is hardly discernible. FIG. 9, which displays the energy of the speech waveform of FIG. 8, emphasizes the fact that the beginning of a speech utterance is not readily discernible from an examination of the energy of the speech waveform itself. FIG. 10 displays the code word sequence at the end of the word three. The marker on line B indicates the end point of the utterance as determined by the instant invention. FIG. 11 displays the speech waveform corresponding to the code word sequence of FIG. 10. The end point of the utterance is clearly not apparent from an examination of the speech waveform itself.
The instant invention has been tested extensively in determining the beginning and end of speech entries for a voice response system vocabulary, and has proved to be very reliable. Two other aspects of the coded speech signal, i.e., the energy of the difference signal of the coder of FIG. 1, and the energy of the quantizer output, were also studied as possible considerations for use in the instant invention. However, the results based on the coded word samples themselves were found to be far more accurate.
What is claimed is:
1. Apparatus for determining a boundary of an applied speech utterance comprising:
means for adaptive encoding said applied speech utterance to develop coded output signals;
means for developing a signal representative of the energy of said coded output signals; and
means for comparing said representative signal with a predetermined threshold signal.
2. The apparatus defined in claim 1 wherein said signal representative of the energy of said coded output signals is representative of the adaptation activity of said means for adaptive encoding.
3. The apparatus defined in claim 1 wherein said threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
4. Apparatus for determining a boundary of an applied speech utterance comprising:
means for adaptive differential pulse code modulating said applied speech utterance to develop digitally coded output signals;
means for developing a signal representative of the energy of said coded output signals; and
means for comparing said representative signal with a predetermined digital threshold signal.
5. The apparatus defined in claim 4 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
6. The apparatus defined in claim 4 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
7. Apparatus for detecting the beginning of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:
means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and
means responsive to said representative signal for developing an output signal when said representative signal is greater than, for a predetermined interval of time, an applied digital threshold signal, said output signal indicative of the beginning of said speech utterance.
8. The apparatus defined in claim 7 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
9. The apparatus defined in claim 7 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
10. The apparatus defined in claim 7 wherein said means for developing a signal representative of the energy of said digitally coded output signals comprises:
first means for doubling each digitally coded output signal of said modulation circuit;
second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;
third means for squaring each output signal of said second means;
fourth means for sequentially storing a predetermined number of said squared signals;
fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;
sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and
seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.
11. Apparatus for determining the beginning of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:
means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; and
means responsive to said representative signal for developing an indicator signal when said representative signal is greater than, for a predetermined interval of time, an applied threshold signal, said indicator signal indicative of the beginning of said speech signal.
12. Apparatus for detecting the end of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:
means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and
means responsive to said representative signal for developing an output signal when said representative signal is less than, for a predetermined interval of time, an applied digital threshold signal, said output signal indicative of the end of said speech utterance.
13. The apparatus defined in claim 12 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
14. The apparatus defined in claim 12 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
15. The apparatus defined in claim 12 wherein said means for developing a signal representative of the energy of said digitally coded output signals comprises:
first means for doubling each digitally coded output signal of said modulation circuit;
second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;
third means for squaring each output signal of said second means;
fourth means for sequentially storing a predetermined number of said squared signals;
fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;
sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and
seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.
16. Apparatus for determining the end of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:
means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; and
means responsive to said representative signal for developing an indicator signal when said representative signal is less than, for a predetermined interval of time, an applied threshold signal, said indicator signal indicative of the end of said speech signal.
17. Apparatus for detecting the boundaries of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:
code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals;
comparator means for comparing said digital representative signal with an applied digital threshold signal; and
means responsive to said comparator means for developing a signal indicative of the beginning of said speech utterance when said representative signal is greater than, for a first predetermined interval of time, said threshold signal, and for developing a signal indicative of the end of said speech utterance when said representative signal is less than, for a second predetermined interval of time, said threshold signal.
18. Apparatus for determining the boundaries of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising:
code word energy means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals;
comparator means for comparing said representative signal with an applied threshold signal; and
means responsive to said comparator means for developing a signal indicative of the beginning of said speech signal when said representative signal is greater than, for a first predetermined interval of time, said threshold signal, and for developing a signal indicative of the end of said speech signal when said representative signal is less than, for a second predetermined interval of time, said threshold signal.
19. Apparatus for detecting the boundaries of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising:
code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and
means for developing a signal indicative of the beginning of said speech utterance when said representative signal is greater than an applied digital threshold signal for a first predetermined interval of time, and for developing a signal indicative of the end of said speech utterance when said representative signal is less than said applied threshold signal for a second predetermined interval of time.
20. The apparatus defined in claim 19 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
21. The apparatus as defined in claim 19 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
22. The apparatus defined in claim 19 wherein said means for developing a signal representative of the energy of said coded output signals comprises:
first means for doubling each digitally coded output signal of said modulation circuit;
second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal;
third means for squaring each output signal of said second means;
fourth means for sequentially storing a predetermined number of said squared signals;
fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means;
sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and
seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.
23. The apparatus defined in claim 19 wherein said means for developing said indicative signals comprises:
digital comparator means responsive to said signal representative of the energy of said coded output signals and to said applied digital threshold signal for developing a signal at a first output terminal when said representative energy signal is greater than said threshold signal and for developing a signal at a second output terminal when said representative energy signal is less than said threshold signal;
a bistable circuit having first and second output terminals, and set and reset terminals;
a first logic circuit responsive to said comparator first output terminal signal and to the signal at the first output terminal of said bistable circuit;
a second logic circuit responsive to said comparator second output terminal signal and to the signal at the second output terminal of said bistable circuit;
a third logic circuit responsive to the output signals of said first and second logic circuits;
a fourth logic circuit responsive to the output signal of said third logic circuit and to an applied clock signal;
a counter circuit, having a plurality of output terminals, responsive to the output signal of said fourth logic circuit;
a fifth logic circuit, for developing said signal indicative of the end of said speech utterance, responsive to the signal at the second output terminal of said bistable circuit and to the signal at a preselected one of said plurality of counter circuit output terminals;
a sixth logic circuit, for developing said signal indicative of the beginning of said speech utterance, responsive to the signal at the first output terminal of said bistable circuit first and to the signals at the other of said plurality of counter circuit output terminals;
a seventh logic circuit responsive to the output signals of said third, fifth, and sixth logic circuits for developing a control signal for said counter circuit, said control signal returning said counter to a predetermined initial state; and
means for connecting the output terminals of said fifth and sixth logic circuits, respectively, to said set and reset terminals of said bistable circuit.
24. The method of determining a boundary of an applied speech utterance comprising the steps of:
adaptive differential pulse code modulating said applied speech utterance to develop digitally coded output signals;
developing a signal representative of the energy of said coded output signals; and
comparing said representative signal with a predetermined digital threshold signal.
25. The method defined in claim 24 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
26. The method defined in claim 24 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
27. The method of determining a boundary of an applied speech utterance comprising the steps of:
adaptive encoding said applied speech utterance to develop coded output signals;
developing a signal representative of the energy of said coded output signals; and
comparing said representative signal with a predetermined threshold signal.

Claims (27)

1. Apparatus for determining a boundary of an applied speech utterance comprising: means for adaptive encoding said applied speech utterance to develop coded output signals; means for developing a signal representative of the energy of said coded output signals; and means for comparing said representative signal with a predetermined threshold signal.
2. The apparatus defined in claim 1 wherein said signal representative of the energy of said coded output signals is representative of the adaptation activity of said means for adaptive encoding.
3. The apparatus defined in claim 1 wherein said threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
4. Apparatus for determining a boundary of an applied speech utterance comprising: means for adaptive differential pulse code modulating said applied speech utterance to develop digitally coded output signals; means for developing a signal representative of the energy of said coded output signals; and means for comparing said representative signal with a predetermined digital threshold signal.
5. The apparatus defined in claim 4 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
6. The apparatus defined in claim 4 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
7. Apparatus for detecting the beginning of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising: means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and means responsive to said representative signal for developing an output signal when said representative signal is greater than, for a predetermined interval of time, an applied digital threshold signal, said output signal indicative of the beginning of said speech utterance.
8. The apparatus defined in claim 7 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
9. The apparatus defined in claim 7 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
10. The apparatus defined in claim 7 wherein said means for developing a signal representative of the energy of said digitally coded output signals comprises: first means for doubling each digitally coded output signal of said modulation circuit; second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal; third means for squaring each output signal of said second means; fourth means for sequentially storing a predetermined number of said squared signals; fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means; sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.
11. Apparatus for determining the beginning of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising: means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; and means responsive to said representative signal for developing an indicator signal when said representative signal is greater than, for a predetermined interval of time, an applied threshold signal, said indicator signal indicative of the beginning of said speech signal.
12. Apparatus for detecting the end of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising: means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and means responsive to said representative signal for developing an output signal when said representative signal is less than, for a predetermined interval of time, an applied digital threshold signal, said output signal indicative of the end of said speech utterance.
13. The apparatus defined in claim 12 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
14. The apparatus defined in claim 12 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
15. The apparatus defined in claim 12 wherein said means for developing a signal representative of the energy of said digitally coded output signals comprises: first means for doubling each digitally coded output signal of said modulation circuit; second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal; third means for squaring each output signal of said second means; fourth means for sequentially storing a predetermined number of said squared signals; fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means; sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.
16. Apparatus for determining the end of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising: means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; and means responsive to said representative signal for developing an indicator signal when said representative signal is less than, for a predetermined interval of time, an applied threshold signal, said indicator signal indicative of the end of said speech signal.
17. Apparatus for detecting the boundaries of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising: code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; comparator means for comparing said digital representative signal with an applied digital threshold signal; and means responsive to said comparator means for developing a signal indicative of the beginning of said speech utterance when said representative signal is greater than, for a first predetermined interval of time, said threshold signal, and for developing a signal indicative of the end of said speech utterance when said representative signal is less than, for a second predetermined interval of time, said threshold signal.
18. Apparatus for determining the boundaries of a speech signal, including an adaptive differential pulse code modulation circuit responsive to said speech signal, comprising: code word energy means responsive to the output signals of said modulation circuit for developing a signal representative of the energy of said output signals; comparator means for comparing said representative signal with an applied threshold signal; and means responsive to said comparator means for developing a signal indicative of the beginning of said speech signal when said representative signal is greater than, for a first predetermined interval of time, said threshold signal, and for developing a signal indicative of the end of said speech signal when said representative signal is less than, for a second predetermined interval of time, said threshold signal.
19. Apparatus for detecting the boundaries of a speech utterance, including an adaptive differential pulse code modulation circuit responsive to said speech utterance, comprising: code word energy means responsive to the digitally coded output signals of said modulation circuit for developing a digital signal representative of the energy of said coded output signals; and means for developing a signal indicative of the beginning of said speech utterance when said representative signal is greater than an applied digital threshold signal for a first predetermined interval of time, and for developing a signal indicative of the end of said speech utterance when said representative signal is less than said applied threshold signal for a second predetermined interval of time.
20. The apparatus defined in claim 19 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
21. The apparatus as defined in claim 19 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
22. The apparatus defined in claim 19 wherein said means for developing a signal representative of the energy of said coded output signals comprises: first means for doubling each digitally coded output signal of said modulation circuit; second means for subtracting from each of said doubled coded output signals a predetermined digital reference signal; third means for squaring each output signal of said second means; fourth means for sequentially storing a predetermined number of said squared signals; fifth means for subtracting from the most recently stored squared signal in said fourth means the oldest stored squared signal in said fourth means; sixth means for adding the output signal of said fifth means and an applied signal to develop a signal representative of the energy of said coded output signals; and seventh means for applying said representative signal to said sixth means after a predetermined interval of time has elapsed.
23. The apparatus defined in claim 19 wherein said means for developing said indicative signals comprises: digital comparator means responsive to said signal representative of the energy of said coded output signals and to said applied digital threshold signal for developing a signal at a first output terminal when said representative energy signal is greater than said threshold signal and for developing a signal at a second output terminal when said representative energy signal is less than said threshold signal; a bistable circuit having first and second output terminals, and set and reset terminals; a first logic circuit responsive to said comparator first output terminal signal and to the signal at the first output terminal of said bistable circuit; a second logic circuit responsive to said comparator second output terminal signal and to the signal at the second output terminal of said bistable circuit; a third logic circuit responsive to the output signals of said first and second logic circuits; a fourth logic circuit responsive to the output signal of said third logic circuit and to an applied clock signal; a counter circuit, having a plurality of output terminals, responsive to the output signal of said fourth logic circuit; a fifth logic circuit, for developing said signal indicative of the end of said speech utterance, responsive to the signal at the second output terminal of said bistable circuit and to the signal at a preselected one of said plurality of counter circuit output terminals; a sixth logic circuit, for developing said signal indicative of the beginning of said speech utterance, responsive to the signal at the first output terminal of said bistable circuit first and to the signals at the other of said plurality of counter circuit output terminals; a seventh logic circuit responsive to the output signals of said third, fifth, and sixth logic circuits for developing a control signal for said counter circuit, said control signal returning said counter to a predetermined initial state; and means for connecting the output terminals of said fifth and sixth logic circuits, respectively, to said set and reset terminals of said bistable circuit.
24. The method of determining a boundary of an applied speech utterance comprising the steps of: adaptive differential pulse code modulating said applied speech utterance to develop digitally coded output signals; developing a signal representative of the energy of said coded output signals; and comparing said representative signal with a predetermined digital threshold signal.
25. The method defined in claim 24 wherein said signal representative of the energy of said coded output signals is defined as the sum of the squares of a predetermined number of said digitally coded output signals.
26. The method defined in claim 24 wherein said digital threshold signal is representative of an energy level intermediate the energy of background silence and the average energy of said speech utterance.
27. The method of determining a boundary of an applied speech utterance comprising the steps of: adaptive encoding said applied speech utterance to develop coded output signals; developing a signal representative of the energy of said coded output signals; and comparing said representative signal with a predetermined threshold signal.
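The running-sum energy of claims 10, 15, and 22 and the timed threshold test of claims 17 to 19 can be sketched in software as follows. This is a rough illustration only, not the claimed hardware. The reference value of 15, the window length, the hold counts, and the choice to report the index at which a qualifying run started are all assumptions, and every name is hypothetical.

```python
from collections import deque


def running_energy(codes, window, reference=15):
    """Sliding-window energy kept up to date by adding the newest squared
    term and dropping the oldest, instead of re-summing the whole window.

    Each code word is doubled and offset by `reference` before squaring,
    following the steps in claim 10; reference=15 is an assumed value.
    """
    stored = deque()
    energy = 0
    result = []
    for c in codes:
        squared = (2 * c - reference) ** 2
        stored.append(squared)
        energy += squared
        if len(stored) > window:
            energy -= stored.popleft()  # remove the term leaving the window
        result.append(energy)
    return result


def find_boundaries(energies, threshold, begin_hold, end_hold):
    """Report a beginning once the energy stays above the threshold for
    `begin_hold` consecutive values, and an end once it stays below the
    threshold for `end_hold` consecutive values.

    Assumption: each event is tagged with the index where its run started.
    """
    events = []
    in_speech = False
    count = 0
    for i, e in enumerate(energies):
        qualifies = (e < threshold) if in_speech else (e > threshold)
        count = count + 1 if qualifies else 0
        needed = end_hold if in_speech else begin_hold
        if count >= needed:
            events.append(("end" if in_speech else "begin", i - needed + 1))
            in_speech = not in_speech
            count = 0
    return events


# Hypothetical energy contour: a brief dip at index 6 is too short to end speech.
events = find_boundaries([0, 0, 5, 5, 5, 5, 0, 5, 0, 0, 0, 0],
                         threshold=3, begin_hold=3, end_hold=3)
```

Resetting the count whenever the condition fails plays roughly the role of the counter reset described in claim 23, so that a short burst of noise or a short pause does not trigger a boundary.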
US456027A 1974-03-29 1974-03-29 Apparatus and method for determining the beginning and the end of a speech utterance Expired - Lifetime US3909532A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US456027A US3909532A (en) 1974-03-29 1974-03-29 Apparatus and method for determining the beginning and the end of a speech utterance
CA214,938A CA1036271A (en) 1974-03-29 1974-11-29 Apparatus and method for determining the beginning and the end of a speech utterance
AU79266/75A AU495754B2 (en) 1974-03-29 1975-03-19 Determining the boundaries of a speech utterance
GB12245/75A GB1487291A (en) 1974-03-29 1975-03-24 Determining the boundaries of a speech utterance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US456027A US3909532A (en) 1974-03-29 1974-03-29 Apparatus and method for determining the beginning and the end of a speech utterance

Publications (1)

Publication Number Publication Date
US3909532A (en) 1975-09-30

Family

ID=23811146

Family Applications (1)

Application Number Title Priority Date Filing Date
US456027A Expired - Lifetime US3909532A (en) 1974-03-29 1974-03-29 Apparatus and method for determining the beginning and the end of a speech utterance

Country Status (3)

Country Link
US (1) US3909532A (en)
CA (1) CA1036271A (en)
GB (1) GB1487291A (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE2659083A1 (en) * 1975-12-31 1977-07-14 Western Electric Co METHOD AND DEVICE FOR SPEAKER RECOGNITION
US4275270A (en) * 1979-11-29 1981-06-23 The Regents Of The University Of California Speech detector for use in an adaptive hybrid circuit
FR2496951A1 (en) * 1980-12-19 1982-06-25 Western Electric Co METHOD AND DEVICE FOR DETERMINING THE END OF A SPEECH TRANSMISSION
US4351983A (en) * 1979-03-05 1982-09-28 International Business Machines Corp. Speech detector with variable threshold
DE3337353A1 (en) * 1982-10-15 1984-04-19 Western Electric Co., Inc., 10038 New York, N.Y. VOICE ANALYZER BASED ON A HIDDEN MARKOV MODEL
US4454586A (en) * 1981-11-19 1984-06-12 At&T Bell Laboratories Method and apparatus for generating speech pattern templates
USRE32172E (en) * 1980-12-19 1986-06-03 At&T Bell Laboratories Endpoint detector
DE3630518A1 (en) * 1985-09-06 1987-03-19 Ricoh Kk Speech or sound recognition device
US4704696A (en) * 1984-01-26 1987-11-03 Texas Instruments Incorporated Method and apparatus for voice control of a computer
US4802224A (en) * 1985-09-26 1989-01-31 Nippon Telegraph And Telephone Corporation Reference speech pattern generating method
US4821325A (en) * 1984-11-08 1989-04-11 American Telephone And Telegraph Company, At&T Bell Laboratories Endpoint detector
US4829572A (en) * 1987-11-05 1989-05-09 Andrew Ho Chung Speech recognition system
DE3645118A1 (en) * 1985-09-06 1989-08-17
US4989246A (en) * 1989-03-22 1991-01-29 Industrial Technology Research Institute, R.O.C. Adaptive differential, pulse code modulation sound generator
USRE33597E (en) * 1982-10-15 1991-05-28 Hidden Markov model speech recognition arrangement
US5706393A (en) * 1994-04-08 1998-01-06 Matsushita Electric Industrial Co., Ltd. Audio signal transmission apparatus that removes input delayed using time time axis compression
WO1999035639A1 (en) * 1998-01-08 1999-07-15 Art-Advanced Recognition Technologies Ltd. A vocoder-based voice recognizer
EP0945854A2 (en) * 1998-03-24 1999-09-29 Matsushita Electric Industrial Co., Ltd. Speech detection system for noisy conditions
EP1019904A4 (en) * 1997-09-17 2000-07-19 Ameritech Corp Speech reference enrollment method
WO2001056015A1 (en) 2000-01-27 2001-08-02 Koninklijke Philips Electronics N.V. Speech detection device having two switch-off criterions
US20030133423A1 (en) * 2000-05-17 2003-07-17 Wireless Technologies Research Limited Octave pulse data method and apparatus
US20030212548A1 (en) * 2002-05-13 2003-11-13 Petty Norman W. Apparatus and method for improved voice activity detection
US20080288247A1 (en) * 2004-06-28 2008-11-20 Cambridge Silicon Radio Limited Speech Activity Detection
WO2009127014A1 (en) 2008-04-17 2009-10-22 Cochlear Limited Sound processor for a medical implant
USRE44466E1 (en) 1995-12-07 2013-08-27 Koninklijke Philips Electronics N.V. Method and device for packaging audio samples of a non-PCM encoded audio bitstream into a sequence of frames
CN112669880A (en) * 2020-12-16 2021-04-16 北京读我网络技术有限公司 Method and system for adaptively detecting voice termination

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2518765B2 (en) * 1991-05-31 1996-07-31 国際電気株式会社 Speech coding communication system and device thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3750024A (en) * 1971-06-16 1973-07-31 Itt Corp Nutley Narrow band digital speech communication system

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE2659083A1 (en) * 1975-12-31 1977-07-14 Western Electric Co METHOD AND DEVICE FOR SPEAKER RECOGNITION
US4351983A (en) * 1979-03-05 1982-09-28 International Business Machines Corp. Speech detector with variable threshold
US4275270A (en) * 1979-11-29 1981-06-23 The Regents Of The University Of California Speech detector for use in an adaptive hybrid circuit
FR2496951A1 (en) * 1980-12-19 1982-06-25 Western Electric Co METHOD AND DEVICE FOR DETERMINING THE END OF A SPEECH TRANSMISSION
DE3149134A1 (en) * 1980-12-19 1982-07-29 Western Electric Co., Inc., 10038 New York, N.Y. METHOD AND DEVICE FOR DETERMINING LANGUAGE POINTS
US4370521A (en) * 1980-12-19 1983-01-25 Bell Telephone Laboratories, Incorporated Endpoint detector
USRE32172E (en) * 1980-12-19 1986-06-03 At&T Bell Laboratories Endpoint detector
US4454586A (en) * 1981-11-19 1984-06-12 At&T Bell Laboratories Method and apparatus for generating speech pattern templates
US4587670A (en) * 1982-10-15 1986-05-06 At&T Bell Laboratories Hidden Markov model speech recognition arrangement
DE3337353A1 (en) * 1982-10-15 1984-04-19 Western Electric Co., Inc., 10038 New York, N.Y. VOICE ANALYZER BASED ON A HIDDEN MARKOV MODEL
USRE33597E (en) * 1982-10-15 1991-05-28 Hidden Markov model speech recognition arrangement
US4704696A (en) * 1984-01-26 1987-11-03 Texas Instruments Incorporated Method and apparatus for voice control of a computer
US4821325A (en) * 1984-11-08 1989-04-11 American Telephone And Telegraph Company, At&T Bell Laboratories Endpoint detector
DE3630518A1 (en) * 1985-09-06 1987-03-19 Ricoh Kk Speech or sound recognition device
DE3645118A1 (en) * 1985-09-06 1989-08-17
US4802224A (en) * 1985-09-26 1989-01-31 Nippon Telegraph And Telephone Corporation Reference speech pattern generating method
US4829572A (en) * 1987-11-05 1989-05-09 Andrew Ho Chung Speech recognition system
US4989246A (en) * 1989-03-22 1991-01-29 Industrial Technology Research Institute, R.O.C. Adaptive differential, pulse code modulation sound generator
US5706393A (en) * 1994-04-08 1998-01-06 Matsushita Electric Industrial Co., Ltd. Audio signal transmission apparatus that removes input delayed using time time axis compression
USRE44955E1 (en) 1995-12-07 2014-06-17 Koninklijke Philips N.V. Method and device for packaging audio samples of a non-PCM encoded audio bitstream into a sequence of frames
USRE44466E1 (en) 1995-12-07 2013-08-27 Koninklijke Philips Electronics N.V. Method and device for packaging audio samples of a non-PCM encoded audio bitstream into a sequence of frames
EP1019904A1 (en) * 1997-09-09 2000-07-19 Ameritech Corporation Speech reference enrollment method
EP1019904A4 (en) * 1997-09-17 2000-07-19 Ameritech Corp Speech reference enrollment method
US6003004A (en) * 1998-01-08 1999-12-14 Advanced Recognition Technologies, Inc. Speech recognition method and system using compressed speech data
US6377923B1 (en) 1998-01-08 2002-04-23 Advanced Recognition Technologies Inc. Speech recognition method and system using compression speech data
WO1999035639A1 (en) * 1998-01-08 1999-07-15 Art-Advanced Recognition Technologies Ltd. A vocoder-based voice recognizer
EP0945854A2 (en) * 1998-03-24 1999-09-29 Matsushita Electric Industrial Co., Ltd. Speech detection system for noisy conditions
EP0945854A3 (en) * 1998-03-24 1999-12-29 Matsushita Electric Industrial Co., Ltd. Speech detection system for noisy conditions
WO2001056015A1 (en) 2000-01-27 2001-08-02 Koninklijke Philips Electronics N.V. Speech detection device having two switch-off criterions
US7848358B2 (en) * 2000-05-17 2010-12-07 Symstream Technology Holdings Octave pulse data method and apparatus
US20030133423A1 (en) * 2000-05-17 2003-07-17 Wireless Technologies Research Limited Octave pulse data method and apparatus
US7072828B2 (en) * 2002-05-13 2006-07-04 Avaya Technology Corp. Apparatus and method for improved voice activity detection
US20030212548A1 (en) * 2002-05-13 2003-11-13 Petty Norman W. Apparatus and method for improved voice activity detection
US7672839B2 (en) * 2004-06-28 2010-03-02 Cambridge Silicon Radio Limited Detecting audio signal activity in a communications system
US20080288247A1 (en) * 2004-06-28 2008-11-20 Cambridge Silicon Radio Limited Speech Activity Detection
WO2009127014A1 (en) 2008-04-17 2009-10-22 Cochlear Limited Sound processor for a medical implant
US20110093039A1 (en) * 2008-04-17 2011-04-21 Van Den Heuvel Koen Scheduling information delivery to a recipient in a hearing prosthesis
CN112669880A (en) * 2020-12-16 2021-04-16 北京读我网络技术有限公司 Method and system for adaptively detecting voice termination
CN112669880B (en) * 2020-12-16 2023-05-02 北京读我网络技术有限公司 Method and system for adaptively detecting voice ending

Also Published As

Publication number Publication date
CA1036271A (en) 1978-08-08
GB1487291A (en) 1977-09-28
AU7926675A (en) 1976-09-23

Similar Documents

Publication Publication Date Title
US3909532A (en) Apparatus and method for determining the beginning and the end of a speech utterance
US5341456A (en) Method for determining speech encoding rate in a variable rate vocoder
US4449190A (en) Silence editing speech processor
Rabiner et al. An algorithm for determining the endpoints of isolated utterances
US4142066A (en) Suppression of idle channel noise in delta modulation systems
KR100307065B1 (en) Voice detection device
US3832491A (en) Digital voice switch with an adaptive digitally-controlled threshold
US5251261A (en) Device for the digital recording and reproduction of speech signals
US3985956A (en) Method of and means for detecting voice frequencies in telephone system
KR20040004421A (en) Method and apparatus for selecting an encoding rate in a variable rate vocoder
JPH0226901B2 (en)
US4370521A (en) Endpoint detector
US4008375A (en) Digital voice switch for single or multiple channel applications
US6226607B1 (en) Method and apparatus for eighth-rate random number generation for speech coders
JPS6245730B2 (en)
EP0770254A2 (en) Transmission system and method for encoding speech with improved pitch detection
Rosenthal et al. An algorithm for locating the beginning and end of an utterance using ADPCM coded speech
JPH0950288A (en) Device and method for recognizing voice
Rheem et al. A nonuniform sampling method of speech signal and its application to speech coding
Itoh et al. A new artificial speech signal for objective quality evaluation of speech coding systems
Jayant Pitch-adaptive DPCM coding of speech with two-bit quantization and fixed spectrum prediction
JP2602641B2 (en) Audio coding method
Lea Evidence that stressed syllables are the most readily decoded portions of continuous speech
KR0141237B1 (en) Audio signal recording/reproducing method
JPS5834986B2 (en) Adaptive voice detection circuit