US3448216A - Vocoder system - Google Patents

Vocoder system

Info

Publication number
US3448216A
Authority
US
United States
Prior art keywords
signal
speech
voiced
frequency
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
US569898A
Inventor
James M Kelly
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
Bell Telephone Laboratories Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bell Telephone Laboratories Inc
Application granted
Publication of US3448216A
Anticipated expiration
Expired - Lifetime

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders

Description

[Drawings: 2 sheets. Sheet header: J. M. Kelly, Vocoder System, June 3, 1969]
United States Patent Office 3,448,216
VOCODER SYSTEM
James M. Kelly, Morris Plains, N.J., assignor to Bell Telephone Laboratories, Incorporated, Murray Hill and Berkeley Heights, N.J., a corporation of New York
Filed Aug. 3, 1966, Ser. No. 569,898
Int. Cl. H04m 1/24; H04b 1/66
U.S. Cl. 179-1
9 Claims
ABSTRACT OF THE DISCLOSURE
In a channel vocoder, the excitation signals are more accurately characterized as voiced or unvoiced by developing auxiliary control signals responsive to the frequencies at which the speech signals shift from voiced to unvoiced and conversely, and using the control signals to control the pass bands of the channels which transmit the periodic and aperiodic signals, respectively.
This invention relates to the transmission of speech signals and in particular to the transmission of high quality speech signals over a transmission channel of narrow bandwidth. An object of this invention is to improve the quality of speech synthesized by a vocoder by improving the match between the amplitude spectrum of the synthesized speech and the amplitude spectrum of the input speech.
Much attention has been devoted to reducing the transmission channel bandwidth required to transmit speech. One result of this effort is the vocoder. In a vocoder, low frequency control signals representative of an input speech signal are derived at an analyzer. These control signals are then transmitted over a narrow bandwidth transmission channel to a synthesizer where they are used to construct a replica of the input speech signal.
To aid in synthesizing a replica of the input speech, the input speech is usually categorized by the vocoder as either voiced or unvoiced. During voiced speech, most of the speech energy fluctuates periodically due to vocal cord excitation while, during unvoiced speech, most of the speech energy fluctuates aperiodically due to turbulence in the vocal tract. This voiced-unvoiced categorization has long been known to be somewhat arbitrary because, at any given instant, the amplitude spectrum of voiced speech contains some frequency regions characteristic of voiced speech and other frequency regions characteristic of unvoiced speech. Thus voiced speech synthesized by a vocoder on the assumption that input voiced speech contains only periodic or voiced energy is often of poorer quality than the input speech because the amplitude spectrum of the synthesized speech fails to match that of the input speech with respect to voiced and unvoiced characteristics.
This invention overcomes this problem. In this invention the quality of speech synthesized by a vocoder is improved by matching the voiced and unvoiced regions of the amplitude spectrum of the synthesized voiced speech to the voiced and unvoiced regions of the amplitude spectrum of the input voiced speech.
According to one embodiment of this invention, control signals are derived at the vocoder analyzer indicative of the frequencies at which the characteristics of the amplitude spectrum of the input voiced speech shift from those of predominantly voiced speech to those of predominantly unvoiced speech and vice versa. These control signals are then used at the vocoder synthesizer to control the cutoff frequencies of complementary bandstop and bandpass filters which pass periodic and aperiodic excitation signals respectively. Speech synthesized using the sum of these excitation signals possesses a spectrum [...]
FIG. 3 is a graph of the signal generated at the vocoder analyzer to determine the frequencies at which the characteristics of the input voiced speech spectrum shift from those of voiced speech to those of unvoiced speech and vice versa.
Theory
The described embodiment of this invention uses the so-called cepstrum technique to determine the pitch frequency of input voiced speech. This technique involves taking the Fourier transform of the logarithm of the amplitude spectrum of a selected time segment of the input speech signal. Of course, other types of pitch detectors, particularly those capable of detecting the presence of voiced speech despite the absence of the fundamental frequency, are also suitable for use in this invention.
The Fourier transform F(w) of a time-dependent signal f(t) is defined as
$$F(w) = \int_{-\infty}^{\infty} f(t)\, e^{-jwt}\, dt \qquad (1)$$
where w = radian frequency, $j = \sqrt{-1}$, and t = time. Since $e^{-jwt}$ equals the sum of a real part, cos wt, and an imaginary part, -j sin wt, the Fourier transform F(w) is composed of a real part R(w) defined as
$$R(w) = \int_{-\infty}^{\infty} f(t) \cos wt\, dt \qquad (2a)$$
and an imaginary part X(w) defined as
$$X(w) = -j \int_{-\infty}^{\infty} f(t) \sin wt\, dt \qquad (2b)$$
where
$$F(w) = R(w) + X(w) \qquad (3)$$
The amplitude spectrum A(w) of f(t) is given by
$$A(w) = \sqrt{R^2(w) + |X(w)|^2} \qquad (4a)$$
while the phase spectrum Φ(w), where Φ(w) is the phase difference between the modulating signal represented in Equation 1 by $e^{-jwt}$ and the corresponding frequency component of f(t), is given by
[...] (4b)
Of course, it is impossible to obtain the integral of a time function over all time from -∞ to +∞ as required by Equation 1. Thus, as explained in an article by A. Michael Noll entitled Short-Time Spectrum and Cepstrum Techniques for Vocal-Pitch Detection, published in the February 1964 Journal of the Acoustical Society of America, vol. 36, pp. 296 to 302, a spectrum analyzer yields what is called a short-time amplitude or power spectrum. The adjective short-time means the spectrum is derived from a signal segment of finite rather than infinite duration.
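As a rough illustration of a short-time amplitude spectrum, the sketch below replaces the integrals of Equations 2a and 2b with sums over a finite run of samples and combines the cosine and sine sums into a single amplitude. The function name, the 8 kHz sampling rate, and the 200 Hz test tone are assumptions made for the example; they do not come from the patent.

```python
import math

def short_time_amplitude(segment, sample_rate, freq_hz):
    """Amplitude A(w) of a finite segment at one analysis frequency.

    The integrals over all time are replaced by sums over the samples
    of the segment, which is what makes the result "short-time".
    """
    w = 2.0 * math.pi * freq_hz
    dt = 1.0 / sample_rate
    cos_sum = 0.0
    sin_sum = 0.0
    for n, value in enumerate(segment):
        t = n * dt
        cos_sum += value * math.cos(w * t) * dt   # real part, as in Equation 2a
        sin_sum += value * math.sin(w * t) * dt   # magnitude of the imaginary part
    return math.hypot(cos_sum, sin_sum)

# Hypothetical test signal: 40 ms of a 200 Hz cosine sampled at 8 kHz.
SAMPLE_RATE = 8000
segment = [math.cos(2.0 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(320)]
on_tone = short_time_amplitude(segment, SAMPLE_RATE, 200.0)
off_tone = short_time_amplitude(segment, SAMPLE_RATE, 500.0)
```

With these numbers the segment holds a whole number of cycles of both analysis frequencies, so the amplitude at 200 Hz is large while the amplitude at 500 Hz is essentially zero.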
Further, as stated in the above cited Noll article, the short-time amplitude spectrum of speech A(w) equals the product of the amplitude spectra of the speech excitation |S(w)| and of the vocal tract |V(w)|. Thus
$$A(w) = |S(w)| \cdot |V(w)| \qquad (5)$$
The speech excitation spectrum S(w) represents the frequency characteristics of the vocal cords during periodic speech and of the vocal tract noise sources during aperiodic speech. The vocal tract, a series of ducts of variable area which resonate at certain frequencies called formant frequencies, has an impulse response with frequency characteristics given by V(w). The vocal tract usually amplifies the speech frequency components at or near the formant frequencies, and attenuates all other speech frequency components. As a result, the amplitude spectrum A(w) of the speech heard by a listener contains a slow oscillation as a function of frequency representing the contribution of the vocal tract resonances. In addition, during voiced speech, the amplitude spectrum A(w) also contains a more rapid fluctuation with frequency, representing the fundamental vocal cord frequency and its harmonics.
By taking the logarithm of the short-time speech spectrum A(w), the contributions of the speech excitation spectrum |S(w)| and the vocal tract transfer function |V(w)| become separate terms in a new logarithmic signal. Thus
$$\log A(w) = \log |S(w)| + \log |V(w)| \qquad (6)$$
During periodic speech, the absolute value of S(w) is just
$$|S(w)| = \left| S_T(w) \sum_{l=0}^{N} e^{-jlTw} \right| \qquad (7)$$
where N is a positive integer equal to one less than the number of pitch periods in the short-time speech segment from which |S(w)| is derived, $|S_T(w)|$ is the absolute value of the speech excitation spectrum $S_T(w)$ derived from one (1) pitch period of speech, T is the pitch period and $e^{-jlTw}$ represents the phase shift with frequency of the spectrum $S_T(w)$ representing the lth repeated pitch period. Equation 7 can be rewritten in closed form as
$$\log |S(w)| = \log |S_T(w)| + \tfrac{1}{2} \log \frac{1 - \cos (N+1)Tw}{1 - \cos Tw} \qquad (8)$$
The second term on the right-hand side of Equation 8 represents the effect on log |S(w)| of including N+1 pitch periods in the finite time segment of the speech signal f(t) used to obtain S(w). If N = 1, then
$$\log |S(w)| = \log |S_T(w)| + \tfrac{1}{2} \log 2(1 + \cos Tw) \qquad (9)$$
and log A(w) clearly contains a ripple,
$$\tfrac{1}{2} \log (1 + \cos Tw),$$
which oscillates as a function of frequency w, with a repetition rate of T, the pitch period. The repetition rate T of this pitch ripple has units of radians per radian per second, or seconds, and is usually called quefrency (a paraphrase of frequency) to avoid confusing this ripple with a time-dependent cosinusoidally varying component of the speech signal. For N > 1, the peaks of this pitch ripple become more pronounced but still occur at a quefrency T.
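The step from the sum over repeated pitch periods to the two-period ripple can be checked numerically. The sketch below compares the magnitude of the sum of N+1 phase terms exp(-j l T w) with the closed form sqrt(2 (1 + cos Tw)) that applies when N = 1. The pitch period and the two test frequencies are hypothetical values chosen only for the check.

```python
import cmath
import math

def ripple_direct(n_extra, pitch_period, w):
    """|sum over l = 0..N of exp(-j l T w)|, the factor that multiplies |S_T(w)|."""
    total = sum(cmath.exp(-1j * l * pitch_period * w) for l in range(n_extra + 1))
    return abs(total)

def ripple_two_periods(pitch_period, w):
    """Closed form for N = 1: sqrt(2 (1 + cos T w)); its logarithm is the pitch ripple."""
    return math.sqrt(2.0 * (1.0 + math.cos(pitch_period * w)))

T = 0.008                           # hypothetical pitch period in seconds (125 Hz pitch)
w_peak = 2.0 * math.pi * 250.0      # a harmonic of 125 Hz, where the ripple peaks
w_mid = 2.0 * math.pi * 280.0       # a frequency between harmonics

direct_mid = ripple_direct(1, T, w_mid)
closed_mid = ripple_two_periods(T, w_mid)
peak_two_periods = ripple_direct(1, T, w_peak)
peak_four_periods = ripple_direct(3, T, w_peak)
```

At a harmonic the factor equals the number of pitch periods in the segment (2 for N = 1, 4 for N = 3), which is one way to see why the ripple peaks become more pronounced as N grows.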
Thus one technique for detecting the presence of a periodic speech signal is to detect the pitch ripple on the logarithm of the short-time amplitude spectrum A(w) of selected segments of the speech signal. This is often done by calculating the short-time spectrum of the logarithm of the short-time amplitude spectrum of the speech signal, called for short the cepstrum (a paraphrase of spectrum) of the speech signal. To obtain the cepstrum, the logarithm of the short-time amplitude spectrum A(w) is usually converted into a time-dependent waveform, with frequency in the original amplitude spectrum proportional to time. The Fourier transform of this time-dependent waveform, namely the cepstrum, is obtained by conventional methods. The resulting cepstrum is defined in terms of time, rather than frequency as in the case of a spectrum. If the pitch ripple is present in the time-dependent waveform representing log A(w), the cepstrum exhibits a distinct maximum at a time T corresponding to the pitch period. The presence of several different pitch periods in a selected time segment of speech is indicated by a corresponding number of distinct maximums in the cepstrum.
Thus by analogy to Equation 1, the cepstrum C(τ) is defined as
$$C(\tau) = \int_{-w_0}^{w_0} \log A(w)\, e^{-j\tau w}\, dw \qquad (10)$$
where $w_0$ is the maximum significant frequency component in the short-time speech spectrum, and τ is called quefrency. Note the similarity between τ in the operation indicated by Equation 10 and frequency w in the transformation to the frequency domain of a time-dependent signal as indicated by Equation 1.
An examination of Equation 4a shows that the short-time amplitude spectrum is an even function; that is, A(w) = A(-w). Therefore, Equation 10 can be rewritten, using only the even cosine component of $e^{-j\tau w}$, as
$$C(\tau) = 2 \int_{0}^{w_0} \log A(w) \cos \tau w\, dw \qquad (11)$$
because the integral of the product log A(w) · sin τw from $-w_0$ to $+w_0$ is zero.
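A minimal sketch of the cosine-transform form of the cepstrum follows, with the integral approximated by a rectangle-rule sum over equally spaced spectrum samples. The synthetic log spectrum (a slow envelope plus a unit ripple of period T), the 4000 Hz upper limit, and the grid of candidate quefrencies are all assumptions made for illustration.

```python
import math

def cepstrum_at(log_spectrum, dw, quefrency):
    """Rectangle-rule version of C(tau) = 2 * integral of log A(w) cos(tau w) dw."""
    return 2.0 * dw * sum(
        value * math.cos(quefrency * k * dw) for k, value in enumerate(log_spectrum)
    )

def pick_pitch_period(log_spectrum, dw, candidates):
    """Return the candidate quefrency with the largest cepstrum value."""
    return max(candidates, key=lambda tau: cepstrum_at(log_spectrum, dw, tau))

# Hypothetical log spectrum: a slow formant-like envelope plus a pitch ripple of period T.
T = 0.008                               # assumed pitch period, seconds
W0 = 2.0 * math.pi * 4000.0             # assumed highest significant frequency, rad/s
DW = W0 / 400                           # spacing of the spectrum samples, rad/s
log_spectrum = [
    1.0 + 0.3 * math.cos(0.0005 * k * DW) + math.cos(T * k * DW) for k in range(400)
]
candidates = [0.002 + 0.0001 * i for i in range(131)]   # 2 ms to 15 ms
estimated_period = pick_pitch_period(log_spectrum, DW, candidates)
```

The search range starts at 2 ms so that the slowly varying envelope, which concentrates at low quefrency, does not compete with the pitch peak.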
In this invention, the frequencies at which the characteristics of the short-time amplitude spectrum A(w) shift from those of voiced speech to those of unvoiced speech and vice versa are determined in the following manner. The pitch period T of a selected segment of the input voiced speech signal is determined from the cepstrum of this segment. A voltage proportional to the pitch frequency is derived from the pitch period T and is used to control the relative delay between two identical time-dependent waveforms representing the logarithm of the short-time amplitude spectrum of the speech signal. The integral of the product of these two waveforms over a time proportional to a selected frequency band Δw gives, as a function of time, the short-time autocorrelation function Φ(2π/T, w - Δw/2) of the logarithm of the short-time amplitude spectrum of the speech signal. That is
$$\Phi\left(\frac{2\pi}{T},\, w - \frac{\Delta w}{2}\right) = \int_{w - \Delta w}^{w} \log A(x)\, \log A\left(x - \frac{2\pi}{T}\right) dx \qquad (12)$$
When this integral is above a selected threshold, the portions of the amplitude spectrum contributing to this integral are periodic and thus highly correlated, indicating the presence of voiced speech characteristics. When, however, this integral falls below the selected threshold, the portions of the amplitude spectrum contributing to this integral are aperiodic and thus relatively uncorrelated, indicating the presence of unvoiced speech characteristics. The time at which the autocorrelation integral crosses the threshold indicates the frequency at which the characteristics of the amplitude spectrum change from those of voiced speech to those of unvoiced speech or vice versa. The direction of the crossing indicates whether the change is from voiced to unvoiced characteristics or vice versa.
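The running correlation described above can be sketched with sampled data as follows. The example assumes a log spectrum whose mean has already been removed, models the unvoiced band as simply ripple-free so that the result is deterministic, and expresses the lag and the band Δw in spectrum samples; the threshold value is likewise an assumption.

```python
import math

def short_time_autocorrelation(log_spectrum, lag, window):
    """Running sum of x[i] * x[i - lag] over the last `window` samples ending at each index.

    The index i stands for frequency; `lag` is the pitch frequency and `window`
    the band delta-w, both expressed in spectrum samples.
    """
    products = [
        log_spectrum[i] * log_spectrum[i - lag] if i >= lag else 0.0
        for i in range(len(log_spectrum))
    ]
    result = []
    for k in range(len(products)):
        start = max(0, k - window + 1)
        result.append(sum(products[start:k + 1]))
    return result

# Hypothetical spectrum sampled every 10 Hz up to 4000 Hz with a 100 Hz pitch ripple,
# except for a ripple-free band from 1000 to 2000 Hz standing in for an unvoiced region.
STEP_HZ = 10
ripple = [
    0.0 if 1000 <= k * STEP_HZ < 2000 else math.cos(2.0 * math.pi * k * STEP_HZ / 100.0)
    for k in range(400)
]
correlation = short_time_autocorrelation(ripple, lag=10, window=50)
THRESHOLD = 12.0   # assumed; roughly half the fully voiced value of 25
voiced_like = [value > THRESHOLD for value in correlation]
```

Because each output value summarizes the preceding 500 Hz of spectrum, the threshold crossings trail the true band edges; this is consistent with the correlation being attributed to the middle of the band, w - Δw/2.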
Apparatus
In FIG. 1, an input speech signal f(t) is detected by a transducer and converted into an electrical signal. This electrical signal is operated upon by short-time spectrum analyzer 101 to derive a waveform which varies in amplitude as a function of time and which represents the short-time amplitude spectrum A(w) of the input speech signal.
In analyzer 101 a selected time segment of the electrical signal representing the input speech signal is compressed in time and stored in a recirculating storage device. This stored signal is continuously updated and repeatedly modulated by sinusoidal and cosinusoidal output signals from a sweep oscillator contained in analyzer 101. As dictated by Equations 2a, 2b, and 3, the sum of the integrals of the modulation products over the duration of the stored, time-compressed segment of the electrical signal is proportional to the short-time frequency spectrum F(w) of the stored signal f(t) at a frequency corresponding to the average modulating frequency w over this time duration. The stored electrical signal f(t) is modulated many times during one sweep of the modulating oscillator over its frequency range. A time series composed of samples proportional to the amplitudes of the resulting integrals is used to generate a waveform A(t) representing the short-time amplitude spectrum A(w) of the input speech signal.
The waveform A(t) representing A(w) is amplified by logarithmic amplifier 103 to generate a time-dependent signal log A(t) proportional to log A(w), the logarithm of the short-time amplitude spectrum of the input speech signal. This logarithmic signal log A(t) is in turn compressed in time and stored in a second recirculating storage device constituting part of short-time spectrum analyzer 108 where it is used to obtain the cepstrum C(τ) of the input speech signal. It remains stored until replaced by a new logarithmic signal representing the short-time amplitude spectrum derived from the next following time segment of the input speech signal.
Spectrum analyzer 108 operates upon log A(t) just as though this logarithmic signal were another time-dependent signal similar to the input speech signal. Time of course is really a dummy variable representing frequency since log A(t) represents the variation with frequency of log A(w).
The time-compressed logarithmic signal, log A(t), is, as shown above, an even function. Thus this signal is modulated repeatedly by a cosinusoidal output signal from a sweep oscillator contained in analyzer 108 to obtain the short-time cepstrum of the input speech signal. As required by Equation 11, this cosinusoidal modulating signal must be carefully synchronized with the waveform representing log A(w) so that the cosinusoidal modulating signal has a maximum value precisely at the beginning of the waveform representing log A(w), that is, when w = 0.
The integral of the modulation product over the duration of the stored waveform is proportional to the magnitude of the amplitude spectrum of the stored waveform at the average modulating frequency except now real time t is actually proportional to frequency w. The frequency of the modulating signal in turn is proportional to quefrency τ as indicated by Equations 10 and 11. The resulting integral reaches a maximum value when the quefrency τ of the modulating signal is equal to the pitch period T of the speech signal. A plot of the amplitude of this integral versus the quefrency of the modulating signal is the cepstrum C(τ) of the input speech signal.
The quefrency τ at which the peak of the cepstrum occurs is a direct measure of the pitch period T of the input speech signal. Thus, during voiced speech, cepstrum peak picker 110 generates an output voltage proportional to the pitch frequency.
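One way to picture the peak picker is as a function that takes sampled cepstrum values and returns the pitch frequency 1/T for the largest peak. In the sketch below, the minimum-peak test used to decide that no pitch is present, the 0.0 return value for that case, and all of the sample data are assumptions of the example rather than details of the referenced apparatus.

```python
def pitch_frequency_from_cepstrum(quefrencies_s, cepstrum, min_peak):
    """Return 1/T for the largest cepstral peak, or 0.0 when no peak reaches min_peak.

    The 0.0 return stands for a constant "unvoiced" output; the min_peak test is
    an assumption of this sketch.
    """
    best_index = max(range(len(cepstrum)), key=lambda i: cepstrum[i])
    if cepstrum[best_index] < min_peak:
        return 0.0
    return 1.0 / quefrencies_s[best_index]

quefrencies = [0.004, 0.005, 0.008, 0.010, 0.0125]      # seconds
voiced_cepstrum = [0.2, 0.1, 3.0, 0.4, 0.3]             # hypothetical values with a clear peak
unvoiced_cepstrum = [0.2, 0.1, 0.3, 0.4, 0.3]           # hypothetical values with no clear peak
voiced_pitch = pitch_frequency_from_cepstrum(quefrencies, voiced_cepstrum, min_peak=1.0)
unvoiced_pitch = pitch_frequency_from_cepstrum(quefrencies, unvoiced_cepstrum, min_peak=1.0)
```

Here the voiced example peaks at a quefrency of 8 ms, giving a pitch frequency of about 125 cycles per second.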
A vocoder system utilizing this cepstrum technique to detect the pitch of an input speech signal is described in copending application Ser. No. 420,362 filed Dec. 22, 1964, by A. M. Noll and M. R. Schroeder. Thus spectrum analyzer 101, logarithmic amplifier 103, analyzer 108, and cepstrum peak picker 110 will not be described here in further detail.
The output voltage from peak picker 110 is used to control the operation of a short-time autocorrelation function generator consisting of delay 105, product modulator 106, and filter 109. In addition, this voltage is transmitted in coded form to the vocoder synthesizer (FIG. 2) where it is used during voiced speech to generate an excitation signal.
Control signals indicative of the frequencies at which the characteristics of the amplitude spectrum of the input voiced speech signal change from those of voiced speech to those of unvoiced speech and vice versa are derived in the autocorrelation function generator. The waveform log A(t), obtained from the output lead of amplifier 103, is passed through synchronizing delay 104 to compensate for delays in the derivation of the output signal from peak picker 110, and is then sent along two paths. The signal sent along one path is delayed in variable delay 105 by a time proportional to the fundamental frequency 2π/T of the input speech signal. The other signal is transferred undelayed to product modulator 106 where a signal proportional to the instantaneous product log A(w) · log A(w - 2π/T) is obtained. This product signal is passed through filter 109 which essentially integrates the instantaneous value of this product signal over a time period selected to correspond to the frequency band Δw. Thus, the continuous output signal from filter 109, given mathematically by Equation 12, represents the short-time autocorrelation of the last Δw cycles per second of the waveform representing log A(w) for a delay time proportional to the pitch frequency. The magnitude of Δw is selected so that the section of log A(w) being correlated will usually contain at least two harmonics during voiced speech. For example, Δw can represent a 500 cycle per second band of log A(w).
Variable delay 105 is controlled by the output signal from cepstrum peak picker 110. As the pitch period changes, the output voltage from cepstrum peak picker 110 varies thereby varying delay 105. Variable delay 105 is always controlled so that the instantaneous value of the output signal from modulator 106 is the maximum possible value during voiced speech.
A variable delay suitable for use in this invention is described in copending application Ser. No. 538,676, filed Mar. 30, 1966, by R. N. Kennedy. The delay disclosed in the Kennedy application is essentially a domain wall shift register. In this delay, an analog signal is converted into digital form, the bits representing each sample of the signal are driven at a selected rate along magnetic wires, and, upon leaving the wires, are converted back into an analog signal. The delay time is a function of the rate at which the signals are driven along the wires, and this rate in turn is controlled by the signal from peak picker 110.
A slight error in the magnitude of the autocorrelation signal from filter 109 occurs when variable delay 105 is rapidly changed from one value of delay to another. All bits on the magnetic wires while the delay time is changing emerge from the wires with a slightly erroneous delay. This effect, while small, can be eliminated, if desired, by using a tapped delay line. Tapped delay lines are well known and thus will not be described here in detail.
Thus variable delay 105, product modulator 106, and filter 109 work in such a manner that the output signal from filter 109 indicates which sections of the amplitude spectrum of the input voiced speech are characteristic of voiced speech and which sections are characteristic of unvoiced speech. At all frequencies for which the amplitude spectrum of the input speech is characteristic of voiced speech, the output signal from filter 109 is greater than some selected threshold. However, when the amplitude spectrum of the input speech signal resembles more closely the amplitude spectrum of unvoiced rather than voiced speech, the output signal from filter 109 drops below the selected threshold value. FIG. 3 shows the shape of the output signal from filter 109 when the short-time amplitude spectrum of input voiced speech has a region which corresponds to unvoiced speech between 1000 and 2000 cycles per second. The output signal is seen to dip below the threshold at a time corresponding to a frequency of 1000 cycles per second and to rise above the threshold at a time corresponding to a frequency of 2000 cycles per second.
One-zero quantizer 111 generates an output voltage of one during the time the output signal from filter 109 corresponds to voiced characteristics (that is, is above the selected threshold), and an output signal of zero during the time the output signal from filter 109 corresponds to unvoiced characteristics (that is, is below the selected threshold). Quantizer 111 is controlled by a normalizing signal from syllabic energy detector 107 to ensure that low level voiced speech signals are not erroneously categorized as unvoiced signals. Energy detector 107 can in one embodiment comprise a rectifier in series with a low pass filter.
Differentiator 112 differentiates the output signal from quantizer 111. Differentiator 112 produces a negative pulse when the characteristics of the amplitude spectrum of the input speech signal shift from those of voiced to those of unvoiced speech. It produces a positive pulse when the characteristics of the amplitude spectrum of the input voiced speech shift from those of unvoiced speech to those of voiced speech.
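The quantizer and differentiator stages can be pictured with sampled data as a threshold comparison followed by a first difference. The sketch below is a simplified discrete model; the sample values and the threshold are hypothetical, and the normalization by the syllabic energy signal is not modelled.

```python
def one_zero_quantize(correlation, threshold):
    """1 where the correlation signal is above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in correlation]

def differentiate(bits):
    """First difference: -1 marks voiced to unvoiced, +1 marks unvoiced to voiced."""
    return [bits[i] - bits[i - 1] for i in range(1, len(bits))]

# Hypothetical filter output sampled along the frequency sweep.
filter_output = [25.0, 24.0, 6.0, 1.0, 0.5, 9.0, 23.0, 25.0]
bits = one_zero_quantize(filter_output, threshold=12.0)
pulses = differentiate(bits)
```

In this example the single negative pulse and the single positive pulse bracket the stretch of the sweep in which the correlation stays below the threshold.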
Ramp generator 113 generates a linearly increasing voltage, the amplitude of which is proportional to the time elapsed from the beginning of the signal representing log A(w). The output voltage from ramp generator 113 is set to zero at the beginning of the signal representing log A(w). If, for example, each signal representing log A(w) has a duration of 10 milliseconds, and represents a frequency range of, for example, 0 to 10,000 cycles per second, generator 113 is reset to zero every 10 milliseconds.
A possible ambiguity occurs during the start of correlation while the output signal from filter 109 is building up to a value representative of the short-time autocorrelation function of log A(w). To prevent the low value of the output signal from filter 109 (see FIG. 3) from being interpreted as representing unvoiced speech characteristics, sample and hold circuits 114 and 117 are inhibited from sampling the voltage produced by ramp generator 113 until a time corresponding to the frequency bandwidth Δw has elapsed. If Δw = 500 cycles per second, as suggested above, this time is just 0.5 millisecond, using the numbers given in the above example. Thus any pulse from differentiator 112 before 0.5 millisecond has elapsed from the beginning of log A(w) fails to actuate sample and hold circuits 114 and 117.
On the generation of the first pulse by differentiator 112 after, for example, 0.5 millisecond (a negative pulse during voiced speech), sample and hold circuit 114 is activated through the corresponding diode 115 to sample the voltage generated by ramp generator 113 and to hold this voltage for transmission to the vocoder synthesizer. This voltage is proportional to the frequency at which the characteristics of the amplitude spectrum of the input voiced speech shift from those of voiced speech to those of unvoiced speech. On the next pulse from differentiator 112 (a positive pulse during voiced speech), the ramp voltage is sampled again, this time by sample and hold circuit 117 activated through diode 116. This voltage is proportional to the frequency at which the characteristics of the amplitude spectrum of the input voiced speech shift from those of unvoiced speech to those of voiced speech. The voltages held by the sample and hold circuits 114 and 117 are periodically sampled by sampler 118 along with the voltage representing the pitch frequency from peak picker 110.
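The interplay of the ramp, the inhibit interval, and the two sample and hold circuits can be sketched as follows, using the example numbers from the text (a 10 millisecond sweep representing 10,000 cycles per second and a 500 cycle per second correlation band). The function names, the representation of pulses as (time, polarity) pairs, and the use of None for a value that was never sampled are assumptions of this sketch.

```python
SWEEP_MS = 10.0          # duration of one log A(w) waveform (example value from the text)
SWEEP_HZ = 10000.0       # frequency range represented by one sweep
BAND_HZ = 500.0          # correlation band delta-w

def ramp_hz(time_ms):
    """Ramp generator output, scaled so that it reads directly in cycles per second."""
    return time_ms * SWEEP_HZ / SWEEP_MS

def sample_transitions(pulses):
    """Hold the ramp value at the first negative pulse and at the next positive pulse.

    `pulses` is a list of (time_ms, polarity) pairs within one sweep. Pulses that
    arrive before the band delta-w has elapsed are ignored.
    """
    inhibit_ms = BAND_HZ * SWEEP_MS / SWEEP_HZ
    voiced_to_unvoiced = None
    unvoiced_to_voiced = None
    for time_ms, polarity in pulses:
        if time_ms < inhibit_ms:
            continue
        if polarity < 0 and voiced_to_unvoiced is None:
            voiced_to_unvoiced = ramp_hz(time_ms)
        elif polarity > 0 and voiced_to_unvoiced is not None and unvoiced_to_voiced is None:
            unvoiced_to_voiced = ramp_hz(time_ms)
    return voiced_to_unvoiced, unvoiced_to_voiced

# A start-up pulse at 0.2 ms is ignored; pulses at 1 ms and 2 ms are sampled.
held = sample_transitions([(0.2, 1), (1.0, -1), (2.0, 1)])
```

With these inputs the held values correspond to 1000 and 2000 cycles per second, the band edges used in the FIG. 3 example.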
In addition, samples of the amplitude spectrum of the input speech signal are periodically and repetitiously obtained by spectrum sampler 102. The operation of spectrum sampler 102 is described in copending application Ser. No. 557,682, filed June 15, 1966, by J. M. Kelly, A. M. Noll and M. R. Schroeder. The samples obtained by sampler 102 represent the amplitudes of signals in contiguous frequency bands of the input speech signal. These samples, together with the samples from sampler 118, are converted into digital code words and transmitted to the vocoder synthesizer by means of encoding and transmission apparatus 119. Sampler 118 and transmission apparatus 119 are well known and thus will not be described in detail.
At the synthesizer (FIG. 2), all the transmitted signals are decoded in decoder 201. Decoder 201 converts the samples representing the amplitudes of signals in selected contiguous frequency bands of the input speech signal into low frequency control signals identical to those obtained in prior art vocoders. Decoder 201 also converts the samples representing the pitch frequency into a replica of the input pitch signal. This pitch signal is used in excitation source 202 during voiced speech to generate an excitation signal containing a fundamental frequency and harmonics similar to those of the input voiced speech signal.
Decoder 201 also produces two voltages, one proportional to the frequency at which the characteristics of the amplitude spectrum of the input voiced speech signal change from those of voiced speech to those of unvoiced speech and the other proportional to the frequency at which the characteristics of the input voiced speech spectrum shift from those of unvoiced speech to those of voiced speech. The voiced-unvoiced control signal is used to set the lower cutoff frequency of bandstop filter 204 and bandpass filter 205 to the frequency at which the characteristics of the amplitude spectrum A(w) of the input voiced speech signal shift from those of voiced speech to those of unvoiced speech. The unvoiced-voiced control signal is used to set the upper cutoff frequencies of these two filters to the frequency at which these characteristics shift from those of unvoiced speech to those of voiced speech.
Bandstop filter 204 passes the periodic excitation signal generated by excitation source 202 with the exception of that portion of the excitation signal between the two cutoff frequencies. On the other hand, bandpass filter 205 passes that portion of a noise signal between the same two cutoff frequencies. Thus, the sum of the two signals passed by the two filters, obtained in summing network 206, has a frequency spectrum which closely matches the frequency spectrum of the input voiced speech signal with respect to voiced and unvoiced characteristics.
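The effect of the complementary filters and the summing network can be illustrated in the frequency domain by choosing, at each frequency, either the periodic excitation or the noise. The sketch below assumes ideal brick-wall filters and made-up amplitude values; real filters would have gradual transitions at the two cutoff frequencies.

```python
def mixed_excitation(freqs_hz, periodic, noise, lower_hz, upper_hz):
    """Per-frequency excitation: noise inside [lower_hz, upper_hz), periodic elsewhere.

    Models ideal complementary bandstop (periodic path) and bandpass (noise path)
    filters followed by a summing network.
    """
    mixed = []
    for f, p, n in zip(freqs_hz, periodic, noise):
        in_band = lower_hz <= f < upper_hz
        bandstop_out = 0.0 if in_band else p
        bandpass_out = n if in_band else 0.0
        mixed.append(bandstop_out + bandpass_out)
    return mixed

freqs = [500.0, 1000.0, 1500.0, 2000.0, 2500.0]
periodic = [1.0, 1.0, 1.0, 1.0, 1.0]        # hypothetical harmonic amplitudes
noise = [0.3, 0.3, 0.3, 0.3, 0.3]           # hypothetical noise amplitudes
excitation = mixed_excitation(freqs, periodic, noise, lower_hz=1000.0, upper_hz=2000.0)
```

Setting the lower cutoff above the highest frequency of interest leaves the band empty, so the excitation is then entirely periodic.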
The remainder of the vocoder circuit is well known. The signal from summing network 206 is separated into subsignals occupying contiguous frequency bands by bandpass filters 208. Spectrum flatteners 209 operate on these subsignals to smooth their amplitude spectrums. Modulators 210 each generate an output signal in response to the simultaneous presence of a subsignal and a low frequency control signal from decoder 201. The signals from the modulators 210 are filtered by corresponding bandpass filters 211 to remove undesired frequency components. The resulting filtered signals are summed in summing network 212 and converted into an acoustic speech signal by loudspeaker 200.
The amplitude spectrum of this acoustic speech signal faithfully resembles the amplitude spectrum of the input voiced speech signal with respect to voiced and unvoiced characteristics. As a result, the quality of the synthesized voiced speech is improved over the quality of voiced speech synthesized in prior art vocoders.
During unvoiced speech, voiced-unvoiced detector 213 (FIG. 2) generates an output signal which holds the cutoff frequencies of bandstop and bandpass filters 204 and 205 at selected values regardless of any signals which might be generated by sample and hold circuits 114 and 117 (FIG. 1). Excitation source 202 generates an aperiodic signal in response to a signal of constant voltage, for example zero volts, indicating unvoiced speech from cepstrum peak picker 110. Noise source 203 likewise generates an aperiodic signal. Thus the composite signal from summing network 206 is completely aperiodic, and is characteristic of unvoiced speech.
If during voiced speech no portion of the amplitude spectrum of the input speech is characteristic of unvoiced speech, as indicated by no negative pulse from differentiator 112 (FIG. 1), sample and hold circuit 114 generates a maximum voltage. This voltage drives the lower cutoff frequencies of bandstop filter 204 and of bandpass filter 205 above a selected value, thereby ensuring that the portion of the excitation signal passed by bandpass filters 208 contains only periodic energy components characteristic of voiced speech.
While this invention has been described for the case where the amplitude spectrum of voiced speech contains only one region characteristic of unvoiced speech, the principles of this invention can be extended by one skilled in the art to include cases where the amplitude spectrum of voiced speech contains several noncontiguous regions characteristic of voiced speech.
What is claimed is:
1. In apparatus in which a replica of an input speech signal is produced by modulating an excitation signal with low frequency control signals, that improvement which comprises:
means for analyzing the spectrum of an input speech signal to derive first control signals indicative of the frequencies at which the characteristics of the spectrum shift from those of voiced speech to those of unvoiced speech and second control signals indicative of the frequencies at which the characteristics of the spectrum shift from those of unvoiced speech to those of voiced speech;
means for transmitting said indicative control signals to a speech synthesizer;
and at said synthesizer, means responsive to said indicative control signals for producing an excitation signal containing selected frequency components characteristic of voiced speech and other frequency components characteristic of unvoiced speech.
2. In a vocoder, that improvement which comprises:
means for analyzing the amplitude spectrum of input voiced speech to derive a first control signal indicative of the frequency at which the characteristics of the amplitude spectrum shift from those of voiced speech to those of unvoiced speech and a second control signal indicative of the frequency at which the characteristics of the amplitude spectrum shift from those of unvoiced speech to those of voiced speech;
means for transmitting said control signals to a speech synthesizer;
and at said synthesizer, means responsive to said control signals for producing a periodic excitation signal during intervals of voiced speech and an aperiodic excitation signal during intervals of unvoiced speech.
3. Apparatus as in claim 2 in which said analyzing means comprises:
means for generating a first signal proportional to the logarithm of the amplitude spectrum of said input voiced speech;
means for generating a second signal proportional to the pitch frequency of said input voiced speech;
autocorrelation means responsive to said second signal for producing a third signal proportional to the maximum short time autocorrelation function of said first signal;
means responsive to said third signal for producing said first control signal to indicate the frequency at which the characteristics of said amplitude spectrum shift from those of voiced speech to those of unvoiced speech, and said second control signal to indicate the frequency at which said characteristics shift from those of unvoiced speech to those of voiced speech.
4. Apparatus as in claim 3 in which said autocorrelation means comprises:
means controlled by said second signal to delay said first signal an amount proportional to said pitch frequency;
means for obtaining the product of said first signal and said delayed first signal;
and means for integrating said product over a selected time to produce a third signal proportional to the maximum short-time autocorrelation function of said first signal.
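The delay, multiply and integrate chain recited in claim 4 can be illustrated in discrete form. The following is a minimal sketch under stated assumptions, not the patented circuit: it assumes the log amplitude spectrum is available as a list of samples on a uniform frequency grid, that the pitch frequency has been converted to a whole number of grid steps (`lag_bins`), and that the integration is a sliding sum over `window` products. The function name, its parameters, and the mean-removal step are illustrative choices that do not appear in the patent text.

```python
def short_time_autocorr(log_spectrum, lag_bins, window):
    """Sliding sum of the product of a sequence with a delayed copy of itself.

    log_spectrum: samples of the log amplitude spectrum (uniform frequency grid)
    lag_bins:     delay in grid steps, assumed to correspond to the pitch frequency
    window:       number of products summed per output sample (the "selected time")
    """
    if not log_spectrum:
        raise ValueError("log_spectrum must not be empty")
    if lag_bins <= 0 or window <= 0:
        raise ValueError("lag_bins and window must be positive")

    # Assumption of this sketch: remove the mean so that a constant offset
    # in the log spectrum does not register as correlation.
    mean = sum(log_spectrum) / len(log_spectrum)
    x = [v - mean for v in log_spectrum]

    # Product of the signal and its delayed copy.
    products = [x[k] * x[k - lag_bins] for k in range(lag_bins, len(x))]

    # Integrate the product over a sliding window of `window` samples.
    out = []
    running = 0.0
    for i, p in enumerate(products):
        running += p
        if i >= window:
            running -= products[i - window]
        out.append(running)
    return out
```

When the delay matches the spacing of a regular ripple in the sequence, the output stays positive; when the delay is half that spacing, the products change sign and the output goes negative. This is the kind of contrast a threshold comparison can then act on.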
5. Apparatus as in claim 3 in which said means responsive to said third signal comprises:
a comparator for comparing the amplitude of said third signal to a selected threshold;
means for generating a constant positive voltage during the time said third signal exceeds said threshold and a zero voltage during the time said third signal is below said threshold;
a differentiator for providing a negative pulse at the instant the output voltage from said generating means drops from said positive voltage to zero voltage and a positive pulse at the instant said output voltage rises from zero voltage to said positive voltage;
a ramp generator for producing a linearly increasing voltage with time;
first sample and hold means responsive to said negative pulse for sampling the voltage produced by said ramp generator to produce said first control signal;
and second sample and hold means responsive to said positive pulse for sampling the voltage produced by said ramp generator to produce said second control signal.
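The comparator, differentiator, ramp generator and sample and hold elements of claim 5 can be mimicked on sampled data. The sketch below is illustrative only and rests on several assumptions: the third signal is given as a list with one value per frequency step, the ramp is modeled as the step index multiplied by a hypothetical step size `bin_hz`, and `None` is returned for a control value whose pulse never occurred. The function name and parameters are not taken from the patent.

```python
def transition_frequencies(third_signal, threshold, bin_hz):
    """Return (first, second) control values, or None where no pulse occurred.

    first:  ramp value sampled at the most recent high-to-low crossing
            (negative pulse, voiced to unvoiced)
    second: ramp value sampled at the most recent low-to-high crossing
            (positive pulse, unvoiced to voiced)
    """
    first = None
    second = None
    previous_gate = None
    for k, value in enumerate(third_signal):
        gate = 1 if value > threshold else 0   # comparator and voltage generator
        ramp = k * bin_hz                      # ramp rises linearly with the scan
        if previous_gate is not None:
            pulse = gate - previous_gate       # differentiator: -1, 0 or +1
            if pulse < 0:
                first = ramp                   # first sample and hold
            elif pulse > 0:
                second = ramp                  # second sample and hold
        previous_gate = gate
    return first, second
```

For example, with `third_signal = [5, 5, 5, 1, 1, 1, 5, 5]`, a threshold of 3 and a step of 100.0, the function returns `(300.0, 600.0)`, marking the start and end of the region where the signal is below the threshold.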
6. Apparatus as in claim 2 in which said means responsive to said control signals comprises:
an excitation source for producing a periodic excitation signal;
a noise source for producing an aperiodic excitation signal;
a bandstop filter with cutoff frequencies controlled by said control signals for passing selected frequency components of said periodic excitation signal;
a bandpass filter with cutoff frequencies controlled by said control signals for passing selected frequency components of said aperiodic excitation signal;
and a summing network for combining said filtered periodic and aperiodic excitation signals to produce an excitation signal with an amplitude spectrum containing regions matching the voiced and unvoiced regions of the amplitude spectrum of said input voiced speech.
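The complementary bandstop and bandpass filtering of claim 6 can be pictured in the frequency domain. The following is a rough sketch that assumes ideal (brick-wall) filters acting on amplitude spectra sampled at the same frequencies; real filters have finite transition bands, and the patent does not specify this representation. The function name and the parameters `lower_hz` and `upper_hz` (standing in for the two control signals) are hypothetical.

```python
def mixed_excitation_spectrum(freqs_hz, periodic, noise, lower_hz, upper_hz):
    """Combine a periodic and an aperiodic spectrum with complementary ideal filters.

    freqs_hz: frequency of each spectrum sample
    periodic: amplitude spectrum of the periodic excitation source
    noise:    amplitude spectrum of the noise source
    lower_hz, upper_hz: edges of the band treated as unvoiced
    """
    if not (len(freqs_hz) == len(periodic) == len(noise)):
        raise ValueError("all inputs must have the same length")
    combined = []
    for f, p, n in zip(freqs_hz, periodic, noise):
        in_band = lower_hz <= f < upper_hz
        stopped = 0.0 if in_band else p    # bandstop filter on the periodic source
        passed = n if in_band else 0.0     # bandpass filter on the noise source
        combined.append(stopped + passed)  # summing network
    return combined
```

If `lower_hz` is set above the highest frequency in `freqs_hz`, no sample falls inside the band and the result is the periodic spectrum unchanged, which corresponds to the case described above in which sample and hold circuit 114 generates a maximum voltage.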
7. In combination:
means for generating a first signal proportional to the amplitude spectrum of an input speech signal;
means for generating from said amplitude spectrum a second signal proportional to the pitch frequency of said input speech signal during voiced speech;
means controlled by said second signal for generating the maximum autocorrelation function of said first signal;
means for detecting the times at which said maximum autocorrelation function crosses a selected threshold value;
means responsive to said detecting means for generating pulses when said autocorrelation function crosses said threshold;
means responsive to said pulses for generating a first set of control signals proportional to the frequencies at which said autocorrelation function crosses said threshold;
means for generating from said first signal a second set of control signals proportional to the energy in selected frequency bands of said input speech signal;
means for transmitting in coded form to a synthesizer, said second signal, said first set of control signals, and said second set of control signals; and
at said synthesizer:
means for converting to analog form said transmitted signals;
voiced-unvoiced detection means responsive to said decoded second signal;
means responsive to said decoded second signal for generating a periodic excitation signal during voiced speech and an aperiodic excitation signal during unvoiced speech;
means for continuously generating an aperiodic noise signal;
bandstop filter means with cutoff frequencies controlled by said first set of control signals during voiced speech and by said voiced-unvoiced detector during unvoiced speech for passing selected frequency components of said periodic excitation signal during voiced speech and said aperiodic excitation signal during unvoiced speech;
bandpass filter means with cutoff frequencies controlled by said first set of control signals during voiced speech and by said voiced-unvoiced detector during unvoiced speech for passing selected frequency components of said aperiodic noise signal;
means for combining said filtered excitation and noise signals;
and means for generating a replica of said input speech signal from said combined excitation and noise signals and said second set of control signals.
8. In combination:
means for generating a first set of control signals proportional to the energy in contiguous frequency bands of an input speech signal;
means for generating a first signal proportional to the logarithm of the amplitude spectrum of said input speech signal;
means for producing from said first signal a second signal proportional to the pitch frequency of said input speech signal during voiced speech and equal to a constant during unvoiced speech;
means responsive to said second signal for generating a third signal proportional to the maximum autocorrelation function of said first signal;
means responsive to said third signal for generating a second set of control signals indicative of the frequencies at which the characteristics of the amplitude spectrum of said input speech signal shift from those of voiced speech to those of unvoiced speech and from those of unvoiced speech to those of voiced speech;
means for transmitting said first set of control signals, said second signal and said second set of control signals to a synthesizer in coded form;
and at said synthesizer:
means for decoding said transmitted signals;
means responsive to said second signal and said second set of control signals for generating an excitation signal, the amplitude spectrum of which possesses characteristics which closely match those of the amplitude spectrum of said input speech signal with respect to voiced and unvoiced regions;
and means responsive to said excitation signal and said first set of control signals for generating a replica of said input speech signal.
9. In combination:
means for generating a first set of control signals proportional to the energy in selected frequency bands of said input speech signal;
means for generating a first signal proportional to the logarithm of the amplitude spectrum of said input speech signal;
means for generating from said first signal a second signal proportional to the pitch frequency of said input speech signal;
means responsive to said first and second signals for generating a second set of control signals during voiced speech proportional to the frequencies at which the characteristics of the amplitude spectrum of said input speech signal change from voiced to unvoiced and from unvoiced to voiced;
means for transmitting in coded form to a synthesizer said first and second sets of control signals and said second signal;
and at said synthesizer:
means for decoding said transmitted signals;
means responsive to said second signal and said second set of control signals for generating an excitation signal which during voiced speech possesses an amplitude spectrum closely matching the amplitude spectrum of said input voiced speech signal with respect to voiced and unvoiced characteristics;
and means responsive to said excitation signal and said first set of control signals for generating a replica of said input speech signal.
References Cited
UNITED STATES PATENTS
3,030,450 4/1962 Schroeder 179-15.55
3,321,582 5/1967 Schroeder 179-1
3,328,525 6/1967 Kelly 179-1
KATHLEEN H. CLAFFY, Primary Examiner.
ROBERT P. TAYLOR, Assistant Examiner.
U.S. Cl. X.R. 179-15.55
US569898A 1966-08-03 1966-08-03 Vocoder system Expired - Lifetime US3448216A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US56989866A 1966-08-03 1966-08-03

Publications (1)

Publication Number Publication Date
US3448216A true US3448216A (en) 1969-06-03

Family

ID=24277354

Family Applications (1)

Application Number Title Priority Date Filing Date
US569898A Expired - Lifetime US3448216A (en) 1966-08-03 1966-08-03 Vocoder system

Country Status (1)

Country Link
US (1) US3448216A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3627920A (en) * 1969-04-03 1971-12-14 Bell Telephone Labor Inc Restoration of degraded photographic images
US3649765A (en) * 1969-10-29 1972-03-14 Bell Telephone Labor Inc Speech analyzer-synthesizer system employing improved formant extractor
US3681530A (en) * 1970-06-15 1972-08-01 Gte Sylvania Inc Method and apparatus for signal bandwidth compression utilizing the fourier transform of the logarithm of the frequency spectrum magnitude
US4124773A (en) * 1976-11-26 1978-11-07 Robin Elkins Audio storage and distribution system
US4417102A (en) * 1981-06-04 1983-11-22 Bell Telephone Laboratories, Incorporated Noise and bit rate reduction arrangements

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3030450A (en) * 1958-11-17 1962-04-17 Bell Telephone Labor Inc Band compression system
US3321582A (en) * 1965-12-09 1967-05-23 Bell Telephone Labor Inc Wave analyzer
US3328525A (en) * 1963-12-30 1967-06-27 Bell Telephone Labor Inc Speech synthesizer

Similar Documents

Publication Publication Date Title
Noll Short‐time spectrum and “cepstrum” techniques for vocal‐pitch detection
Holmes The JSRU channel vocoder
US5621854A (en) Method and apparatus for objective speech quality measurements of telecommunication equipment
US3566035A (en) Real time cepstrum analyzer
Gold et al. The channel vocoder
CA1065490A (en) Emphasis controlled speech synthesizer
US3471648A (en) Vocoder utilizing companding to reduce background noise caused by quantizing errors
US4991215A (en) Multi-pulse coding apparatus with a reduced bit rate
US3448216A (en) Vocoder system
US3431362A (en) Voice-excited,bandwidth reduction system employing pitch frequency pulses generated by unencoded baseband signal
US3069507A (en) Autocorrelation vocoder
US3109070A (en) Pitch synchronous autocorrelation vocoder
US2928902A (en) Signal transmission
US20060184359A1 (en) Method and system for low bit rate voice encoding and decoding applicable for any reduced bandwith rquirements including wireless
Robinson Speech analysis
US3078345A (en) Speech compression systems
US3321582A (en) Wave analyzer
US3083266A (en) Vocoder apparatus
US3330910A (en) Formant analysis and speech reconstruction
US3091665A (en) Autocorrelation vocoder equalizer
US3493684A (en) Vocoder employing composite spectrum-channel and pitch analyzer
US3555191A (en) Pitch detector
US3325596A (en) Speech compression system
KR0128851B1 (en) Pitch detecting method by spectrum harmonics matching of variable length dual impulse having different polarity
Edwards et al. Better vocoders are coming