US3328525A - Speech synthesizer

Info

Publication number: US3328525A
Application number: US334354A
Authority: US
Inventors: John L. Kelly, Jr.
Original assignee: Bell Telephone Laboratories Inc
Current assignee: AT&T Corp
Priority date: 1963-12-30
Filing date: 1963-12-30
Publication date: 1967-06-27
Anticipated expiration: 1984-06-27

Description

June 27, 1967. J. L. KELLY, JR. SPEECH SYNTHESIZER. Filed Dec. 30, 1963. 2 Sheets.

United States Patent 3,328,525. SPEECH SYNTHESIZER. John L. Kelly, Jr., Berkeley Heights, N.J., assignor to Bell Telephone Laboratories, Incorporated, New York, N.Y., a corporation of New York. Filed Dec. 30, 1963, Ser. No. 334,354. 4 Claims. (Cl. 179-1)

This invention relates to the synthesis of speech, and in particular to the synthesis of natural sounding speech in bandwidth compression systems.

Conventional speech communication systems, for example, commercial telephone systems, typically convey human speech by transmitting an electrical facsimile of the acoustic waveform produced by a human talker. Because of the redundancy of human speech, however, facsimile transmission is a relatively inefficient way to transmit speech information, and it is well known that the information contained in a typical speech sound may be transmitted over a channel of substantially narrower bandwidth than that required for facsimile transmission of the speech waveform.

A number of arrangements for compressing or reducing the amount of bandwidth employed in the transmission of speech information have been proposed, one of the best known of these arrangements being the so-called resonance vocoder. A specific version of the resonance vocoder is described in J. C. Steinberg Patent 2,635,146, issued Apr. 14, 1953.

The distinctive feature of resonance vocoder systems is the transmission of speech information in terms of narrow bandwidth control signals representative of the frequency locations of selected peaks or maxima in the speech amplitude spectrum which correspond to the principal formants or resonances of the human vocal tract. A typical resonance vocoder system includes at a transmitter station an analyzer for deriving from an incoming speech wave a group of narrow bandwidth control signals including formant control signals representative of the frequencies of selected formant peaks in the speech spectrum. After transmission to a receiver station, the control signals are applied to a synthesizer that is provided with controllable resonant circuits for shaping an artificial spectrum to have peaks at frequencies specified by the formant control signals, thereby reconstructing a replica of the spectrum of the original speech wave.

From the standpoint of efficiency, it is of course desirable to transmit the smallest number of control signals consistent with a desired level of intelligibility and naturalness in the reconstructed speech wave. It is well known that the first three formants in order of frequency contribute most to the intelligibility of speech; accordingly, it is common practice to transmit three formant control signals representative of the locations of these three principal formants. From the standpoint of speech quality, however, it has been found that higher frequency formants contribute significantly to the naturalness of reconstructed speech, but of course the transmission of additional formant control signals requires a greater transmission channel bandwidth, thereby decreasing the bandwidth efficiency of a resonance vocoder system.

One arrangement that improves the quality of reconstructed speech without decreasing bandwidth efficiency is described in an article by J. L. Flanagan and A. S. House, Development and Testing of a Formant-Coding Speech Compression System, volume 28, Journal of the Acoustical Society of America, page 1099, (1957). In the Flanagan-House system, the three principal lower order formants are represented by control signals that are transmitted from an analyzer to a synthesizer, as in a conventional resonance vocoder, but in addition to shaping an artificial spectrum to have three peaks at frequencies specified by the three transmitted formant control signals, the synthesizer is provided with a separate fixed frequency resonant circuit that shapes the artificial spectrum to have a fourth peak at a fixed frequency corresponding to the average location of a fourth, high frequency formant. However, the human vocal tract is characterized by an infinite number of resonances or formants, and therefore the Flanagan-House arrangement does not specify completely the higher order speech formants.

The present invention improves the quality of speech reconstructed in a resonance vocoder without decreasing bandwidth efficiency by providing at a resonance vocoder receiver station a novel arrangement for shaping an artificial spectrum to have an infinite number of peaks at selected fixed frequencies corresponding to the locations of higher order speech formants. In the apparatus of this invention there is provided a novel resonant circuit having an infinite number of resonances at selected fixed frequencies, the higher frequency resonances of this resonant circuit corresponding to the frequencies of higher order speech formants. One or more of the lower frequency resonances of this circuit may lie within the frequency range of the formants represented by transmitted control signals, hence these lower frequency resonances are canceled or removed by separate antiresonant circuits having antiresonances at fixed frequencies corresponding to the unwanted lower frequency resonances of the resonant circuit. In the present invention an artificial spectrum is first shaped in conventional fashion by adjustable resonant circuits to have lower order formant peaks at frequencies specified by transmitted formant control signals, following which the artificial spectrum is further shaped by the apparatus of the present invention to have higher order formant peaks at fixed frequencies specified by the uncanceled resonances of the resonant circuit provided in this invention.

The invention will be fully understood from the following detailed description of illustrative embodiments thereof, taken in connection with the appended drawings, in which:

FIG. 1 is a block diagram showing a complete resonance vocoder system embodying the principles of this invention;

FIG. 2 is a circuit diagram showing in detail a specific embodiment of an antiresonant circuit employed in this invention; and

FIGS. 3A and 3B are diagrams of assistance in explaining the features of this invention.

Referring first to FIG. 1, elements 11 and 12 respectively represent the analyzer and synthesizer portions of a typical formant vocoder system. Formant vocoder analyzer 11, which is ordinarily located at a transmitter station, includes a pitch detector 111, a voiced amplitude detector 112, and a formant frequency detector 113, which respectively derive from a speech wave from source 10, for example, a conventional microphone, a group of narrow bandwidth control signals respectively representative of the fundamental glottal excitation frequency, F0, the amplitude of the glottal excitation, AV, and the frequencies F1, ..., FN, N = 2, 3, ..., of selected maxima in the spectrum of the incoming speech wave. The frequency locations of these maxima in the speech spectrum correspond to natural frequencies of the formants or normal modes of vibration of the human vocal tract, and as the shape of the vocal tract is deformed during the articulation of different speech sounds, the natural or formant frequencies and the corresponding locations of spectral maxima also change. It is generally accepted that the most important formants are the three having the lowest frequencies, and in most formant vocoders, the formant control signals derived by an analyzer represent the first three formants; that is, three is a suitable value for N in the apparatus of FIG. 1.

The control signals derived by analyzer 11 are delivered by way of a suitable transmission medium, indicated by broken lines, to a synthesizer 12 located at a receiver station. In synthesizer 12 there is provided a buzz source 121 that generates a relatively flat artificial amplitude spectrum comprising a plurality of relatively uniform amplitude harmonics of the fundamental frequency F0, and an amplitude modulator 122 that adjusts the uniform amplitude of the harmonics of the artificial spectrum from source 121 to represent the glottal excitation amplitude AV. Synthesizer 12 is further provided with a cascade of uncoupled resonant circuits 123-1 through 123-N, each of which has an adjustable resonance that is individually tuned by a corresponding one of the N formant control signals transmitted from analyzer 11. By successively passing the artificial spectrum of uniform amplitude harmonics from modulator 122 through circuits 123-1 through 123-N, the spectrum is shaped by the resonances of circuits 123-1 through 123-N to resemble the spectrum of the original speech wave from source 10. Thus the adjustable resonances of circuits 123-1 through 123-N shape the spectrum from modulator 122 to have N maxima at frequencies F1 through FN corresponding to the locations of the N formant peaks in the original speech spectrum which are represented by the N transmitted formant control signals.
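The signal path through synthesizer 12 can be illustrated numerically. The following Python sketch is an illustration under stated assumptions and not the circuit of FIG. 1: each resonant circuit 123-k is modeled as a two-pole resonance normalized to unity gain at zero frequency, every formant is given the same assumed bandwidth of 100 cycles per second, and the function names, the 100 cycle per second pitch, and the formant frequencies 500, 1500 and 2500 are hypothetical example values.

```python
import numpy as np

def resonator_gain(f, fc, bw):
    """Magnitude of one two-pole resonance, unity gain at zero frequency.

    Assumption: poles at s = -sigma +/- j*2*pi*fc with sigma = pi * bw.
    """
    sigma = np.pi * bw
    pole = complex(-sigma, 2 * np.pi * fc)
    s = 1j * 2 * np.pi * np.asarray(f, dtype=float)
    return np.abs(abs(pole) ** 2 / ((s - pole) * (s - np.conj(pole))))

def synthesize_spectrum(f0, av, formants, bw=100.0, fmax=4000.0):
    """Flat harmonic spectrum (buzz source), scaled by AV, then shaped
    by a cascade of resonances, one per formant control signal."""
    harmonics = np.arange(f0, fmax, f0)       # harmonics of F0 below fmax
    amps = np.full(harmonics.shape, av)       # amplitude modulator output
    for fc in formants:                       # cascade of resonant circuits
        amps = amps * resonator_gain(harmonics, fc, bw)
    return harmonics, amps

harmonics, amps = synthesize_spectrum(100.0, 1.0, [500.0, 1500.0, 2500.0])
```

In this example the harmonic amplitudes show exactly one local maximum per formant control signal, which is the limitation that higher order formant synthesizer 13 is intended to remove.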

It is evident that the synthesized spectrum developed at the output terminal of synthesizer 12 is limited in its resemblance to the original speech spectrum in that the number of peaks in the synthesized spectrum is determined by the number of transmitted formant control signals. Thus in the usual situation where three formant control signals representative of the three principal speech formants are transmitted, the synthesized spectrum developed by synthesizer 12 has only three maxima. A closer resemblance to the original speech spectrum is obtained in the present invention by passing the synthesized spectrum from synthesizer 12 through higher order formant synthesizer 13 in order to shape further the synthesized spectrum to have additional maxima at frequencies corresponding to the locations of maxima in the original speech spectrum which are not represented by transmitted control signals.

Within synthesizer 13, the synthesized spectrum from synthesizer 12 is passed through M series-connected antiresonant circuits 131-1 through 131-M to resonant circuit 133, which comprises a delay line 133a of length $\tau$ seconds in negative feedback relation through subtractor 132 with an amplifier 133b having a gain $e^{-\sigma\tau}$ less than unity. As described in detail below, resonant circuit 133 theoretically has an infinite number of resonances at fixed frequencies dependent upon the length $\tau$ of delay line 133a. The fixed resonances of feedback circuit 133 shape the incoming synthesized spectrum from synthesizer 12 to have maxima which correspond to desired higher order formant locations not specified by the transmitted formant control signals. Since feedback circuit 133 has fixed resonances at low frequencies as well as at high frequencies, and since the synthesized spectrum from synthesizer 12 has already been shaped to have N maxima corresponding to N low frequency formants, one or more of the low frequency resonances of feedback circuit 133 must be canceled. Cancellation of M selected low frequency resonances of feedback circuit 133 is accomplished by the M antiresonant circuits 131-1 through 131-M, M = 1, 2, ..., a detailed illustration of a suitable antiresonant circuit being shown in FIG. 2 and explained below. After the synthesized spectrum from synthesizer 12 has been further shaped by synthesizer 13, the spectrum may be converted into intelligible speech sounds by a suitable transducer 14, for example, a conventional loudspeaker.
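The behavior of a delay line in negative feedback with an amplifier of gain less than unity can be sketched in discrete time. The Python example below is a simulation under assumptions that do not appear in the specification: a sampling rate of 8000 samples per second, a delay of one millisecond (8 samples), and a loop gain of 2/3. The function `feedback_loop` and all constant names are hypothetical. The simulated signal is the subtractor output; the signal at the delay output is the same sequence delayed by one millisecond, which leaves the magnitude spectrum unchanged.

```python
import numpy as np

def feedback_loop(x, delay_samples, gain):
    """Subtractor output e[n] = x[n] - gain * e[n - delay_samples]."""
    e = np.zeros(len(x))
    for n in range(len(x)):
        fed_back = gain * e[n - delay_samples] if n >= delay_samples else 0.0
        e[n] = x[n] - fed_back
    return e

FS = 8000                   # assumed sampling rate, samples per second
TAU = 1e-3                  # assumed delay line length, seconds
DELAY = round(TAU * FS)     # delay expressed in samples
GAIN = 2.0 / 3.0            # assumed amplifier gain, less than unity

impulse = np.zeros(800)
impulse[0] = 1.0
response = feedback_loop(impulse, DELAY, GAIN)

# Magnitude spectrum of the impulse response and its lowest resonance.
magnitude = np.abs(np.fft.rfft(response))
freqs = np.fft.rfftfreq(len(response), d=1.0 / FS)
first_peak_hz = freqs[np.argmax(magnitude[:101])]   # search 0 to 1000 Hz
```

The impulse response is a decaying train of alternating-sign echoes spaced one delay apart, and under these assumptions the lowest spectral peak falls at half the reciprocal of the delay.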

Before describing the theoretical considerations underlying the construction of resonant circuit 133, it will be helpful at this point to describe the properties of the human vocal tract during vowel production in terms of Laplace transform notation. The ratio of the Laplace transform, $U_2(s)$, of the volume velocity of air through the lips, $u_2(t)$, to the Laplace transform, $U_1(s)$, of the volume velocity of air through the glottis, $u_1(t)$, is commonly known as the transfer characteristic of the vocal tract, and can be approximated by a rational transfer function having the following form:

$$\frac{U_2(s)}{U_1(s)} = \prod_{k} \frac{s_k s_k^*}{(s - s_k)(s - s_k^*)} \qquad (1)$$

where $s$ is the complex frequency variable; $s_k = (-\sigma_k + j\omega_k)$ is a complex number representing a formant or normal mode of vibration of the vocal tract; and $s_k^*$ is the complex conjugate of $s_k$. Equation 1 indicates that the glottis-to-mouth transfer function for vowel sounds has only poles, denoted $s_k$, and no zeros, and that the poles coincide with the normal modes of vibration. The locations of these poles are illustrated graphically in the pole diagram of FIG. 3A, where the X's indicate the locations in the s-plane of the complex numbers $s_1$, $s_2$, $s_3$, representing the first three poles or formants, and the dashed lines indicate higher frequency poles.
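As a worked illustration of what a single conjugate pole pair contributes, assume a pole $s_k = -\sigma_k + j\omega_k$ and a factor normalized to unity at $s = 0$ (the normalization is an assumption made here for convenience). Expanding the factor and evaluating it on the frequency axis at $s = j\omega_k$ gives:

```latex
\frac{s_k s_k^*}{(s - s_k)(s - s_k^*)}
  = \frac{\sigma_k^2 + \omega_k^2}{s^2 + 2\sigma_k s + \sigma_k^2 + \omega_k^2},
\qquad
\left|\frac{s_k s_k^*}{(s - s_k)(s - s_k^*)}\right|_{s = j\omega_k}
  = \frac{\sigma_k^2 + \omega_k^2}{\sigma_k\sqrt{\sigma_k^2 + 4\omega_k^2}}
  \approx \frac{\omega_k}{2\sigma_k}
  \quad (\sigma_k \ll \omega_k).
```

The height of each spectral peak therefore grows as the damping $\sigma_k$ shrinks, while the imaginary part $\omega_k$ fixes where the peak falls.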

Since the vocal tract is a distributed acoustic system, it has in theory an infinite number of natural frequencies which change in value with time as the vocal tract is deformed during the articulation of different speech sounds, and correspondingly, the poles of the transfer characteristic in Equation 1 also change with time. However, only at relatively low frequencies, usually including no more than the first three natural frequencies, do the natural frequencies change substantially in value with deformations of the vocal tract, whereas at high frequencies, the natural frequencies asymptotically approach a uniform spacing in frequency.

Synthesizer 13 of the present invention is provided with an infinite number of natural resonant frequencies having a uniform spacing in frequency corresponding to that of the higher order natural frequencies of the vocal tract by constructing resonant circuit 133 and antiresonant circuits 131-1 through 131-M in the following manner. As shown in FIG. 1, resonant circuit 133 comprises delay line 133a of length $\tau$ in feedback relation through subtractor 132 with amplifier 133b having gain $e^{-\sigma\tau}$, where a suitable value for $e^{-\sigma\tau}$ may be on the order of 2/3. In Laplace transform notation, the transfer characteristic of delay line 133a is $e^{-s\tau}$, hence the combined transfer characteristic of delay line 133a and amplifier 133b is the product

$$e^{-s\tau}\, e^{-\sigma\tau} = e^{-(s+\sigma)\tau}. \qquad (2)$$

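Since the elements 133a and 133b are in negative feedback relation with each other through subtractor 132, the product in Equation 2 corresponds to the familiar $\mu\beta$ in the transfer characteristic relating the incoming signal $F_1$ applied to the minuend terminal of subtractor 132 to the outgoing signal $F_2$ developed at the output terminal of circuit 133,

$$\frac{F_2}{F_1} = \frac{e^{-s\tau}}{1 + e^{-(s+\sigma)\tau}}. \qquad (3)$$

The poles of $F_2/F_1$ occur at those values of $s$ for which the denominator of Equation 3 vanishes, that is, for which

$$e^{-(s+\sigma)\tau} = -1, \qquad (4)$$

and since $e^{j\pi} = -1$ is periodic, the positive frequency poles of $F_2/F_1$ occur at odd integral multiples of $\pi$,

$$\tau(s_n + \sigma) = \pm j\,n\pi, \quad n = 1, 3, 5, \ldots \qquad (5)$$

$$s_n = -\sigma \pm j\,\frac{n\pi}{\tau}, \quad n = 1, 3, 5, \ldots \qquad (5a)$$

From Equation 5a it is evident that $F_2/F_1$ has an infinite number of poles, the radian frequency of the nth pole being

$$\omega_n = \frac{n\pi}{\tau},$$

or, since $\omega_n = 2\pi f_n$, where $f_n$ denotes frequency in cycles per second,

$$f_n = \frac{n}{2\tau}.$$

The spacing in frequency between poles is uniform, being $2\pi/\tau$ in radians per second and $1/\tau$ in cycles per second. Further, each pole has the same constant real part, $-\sigma$, which represents the so-called formant damping, and which is manifested in the speech spectrum by the bandwidth of the formant peaks.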
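The pole locations can be checked numerically. The short Python sketch below assumes that the poles are the roots of $1 + e^{-(s+\sigma)\tau} = 0$ and lie at $s = -\sigma \pm j(2k-1)\pi/\tau$ for $k = 1, 2, \ldots$; the example values of 400 nepers per second and one millisecond, and the function names, are illustrative choices rather than part of the specification.

```python
import cmath
import math

SIGMA = 400.0   # assumed damping, nepers per second
TAU = 1e-3      # assumed delay, seconds

def comb_poles(sigma, tau, count):
    """First `count` positive-frequency poles, s = -sigma + j(2k-1)pi/tau."""
    return [complex(-sigma, (2 * k - 1) * math.pi / tau)
            for k in range(1, count + 1)]

def loop_denominator(s, sigma, tau):
    """Denominator 1 + exp(-(s + sigma) * tau) of the feedback loop."""
    return 1 + cmath.exp(-(s + sigma) * tau)

poles = comb_poles(SIGMA, TAU, 5)
residuals = [abs(loop_denominator(p, SIGMA, TAU)) for p in poles]
pole_freqs_hz = [p.imag / (2 * math.pi) for p in poles]
```

Each residual is zero to within rounding error, and the pole frequencies are spaced by the reciprocal of the delay, as the text states.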

FIG. 3B illustrates graphically the locations of the poles of $F_2/F_1$, in which it is noted that a particular uniform spacing in frequency may be obtained by suitably choosing the length $\tau$ of delay line 133a and the corresponding factor $\tau$ in the gain of amplifier 133b. A suitable value for the length $\tau$ of delay line 133a may be on the order of one millisecond, which corresponds to the round trip delay of the human vocal tract, thereby placing the poles of feedback circuit 133 at frequencies of approximately 500, 1500, 2500, ... cycles per second. Similarly, a suitable value for the formant bandwidth factor $\sigma$ in the gain of amplifier 133b may be on the order of 400 nepers per second, corresponding to a formant bandwidth of about 130 cycles per second.
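The numerical values quoted above can be related to one another with a few lines of arithmetic. In the Python sketch below, the vocal tract length of 0.17 meter and the sound speed of 340 meters per second are assumed textbook figures that do not appear in the specification, and the bandwidth is taken as the half-power width $\sigma/\pi$ of a pole whose real part is $-\sigma$.

```python
import math

SIGMA = 400.0          # formant bandwidth factor, nepers per second
TAU = 1e-3             # delay line length, seconds

# Assumed physical figures, used only to illustrate the round trip delay.
TRACT_LENGTH_M = 0.17
SOUND_SPEED_M_PER_S = 340.0

round_trip_s = 2 * TRACT_LENGTH_M / SOUND_SPEED_M_PER_S  # glottis to lips and back
loop_gain = math.exp(-SIGMA * TAU)                       # amplifier gain
bandwidth_hz = SIGMA / math.pi                           # half-power bandwidth
```

Under these assumptions the round trip delay is one millisecond, the amplifier gain is close to 2/3, and the bandwidth is a little under 130 cycles per second.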

Depending upon the value selected for the length of delay line 133a, one or more of the fixed poles or resonances of resonant circuit 133 may occur within the frequency range of formants represented by the control signals transmitted from analyzer 11 to synthesizer 12. For example, by selecting the length of delay line 133a to be one millisecond, the first three fixed poles of circuit 133 occur at 500, 1500, and 2500 cycles per second, which respectively lie within the frequency ranges of the first three speech formants typically represented by transmitted control signals. In this situation it is therefore necessary to remove or cancel the three lower order poles of resonant circuit 133 which lie within the frequency range of the formants represented by transmitted control signals in order to prevent interference with the maxima previously synthesized in the artificial spectrum from synthesizer 12. FIG. 3B illustrates the situation in which the first three poles of circuit 133 are to be canceled, as indicated by the three X's enclosed in circles.

Antiresonant circuits 131-1 through 131-M, which precede feedback circuit 133, are designed to cancel M = 1, 2, ... unwanted lower order poles of feedback circuit 133 in the following manner. In order for a particular antiresonant circuit, say 131-1, to cancel a corresponding one of the poles of resonant circuit 133, say the pole denoted $s_1 = (-\sigma + j\omega_1)$, it is necessary for the transfer characteristic of circuit 131-1 to have a single zero at $s_1 = (-\sigma + j\omega_1)$ and no other poles or zeros. That is, if $Z_1$ denotes the transfer characteristic of circuit 131-1, then $Z_1$ must be proportional to $(s - s_1)(s - s_1^*)$,

$$Z_1 = A\,(s - s_1)(s - s_1^*), \qquad (6)$$

where A is a constant. A suitable realization of a circuit having a transfer characteristic of the type specified by Equation 6 is shown in detail in FIG. 2, it being understood that other antiresonant circuits having a suitable transfer characteristic may be employed, if desired. Further, it is understood that the transfer characteristic required for the cancellation of other unwanted poles by antiresonant circuits 131-2, ..., 131-M may be obtained by substituting other poles $s_2, s_2^*; \ldots; s_M, s_M^*$ for the quantities $s_1, s_1^*$ in Equation 6.
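The effect of preceding the feedback circuit with antiresonant circuits can be illustrated numerically. The Python sketch below is a model under stated assumptions and not a description of the hardware: the feedback circuit is represented by the magnitude of $1/(1 + e^{-(s+\sigma)\tau})$ on the frequency axis (a pure delay in the numerator would not change the magnitude), each antiresonant circuit is represented by $(s - s_m)(s - s_m^*)$ scaled to unity at zero frequency, and the values of 400 nepers per second and one millisecond, together with all function names, are illustrative.

```python
import numpy as np

SIGMA = 400.0   # assumed damping, nepers per second
TAU = 1e-3      # assumed delay, seconds

def comb_response(s):
    """Feedback circuit response, governed by 1 + exp(-(s + sigma) * tau)."""
    return 1.0 / (1.0 + np.exp(-(s + SIGMA) * TAU))

def antiresonance(s, pole):
    """Conjugate pair of zeros at `pole`, scaled to unity gain at s = 0."""
    return (s - pole) * (s - np.conj(pole)) / abs(pole) ** 2

def shaped_response(freqs_hz, cancelled=3):
    """Magnitude of the antiresonant cascade followed by the feedback circuit."""
    s = 1j * 2 * np.pi * np.asarray(freqs_hz, dtype=float)
    total = comb_response(s)
    for m in range(1, cancelled + 1):
        pole = complex(-SIGMA, (2 * m - 1) * np.pi / TAU)
        total = total * antiresonance(s, pole)
    return np.abs(total)

def peak_frequencies(freqs_hz, magnitude):
    """Frequencies at which the magnitude has a strict local maximum."""
    inner = (magnitude[1:-1] > magnitude[:-2]) & (magnitude[1:-1] > magnitude[2:])
    return freqs_hz[1:-1][inner]

freqs = np.arange(0.0, 5001.0, 10.0)
peaks_uncancelled = peak_frequencies(freqs, shaped_response(freqs, cancelled=0))
peaks_cancelled = peak_frequencies(freqs, shaped_response(freqs, cancelled=3))
```

With no cancellation the model shows peaks at every odd multiple of 500 cycles per second; with three antiresonant sections only the peaks near 3500 and 4500 cycles per second remain in the range examined, displaced slightly upward by the rising response of the zeros.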

Turning now to FIG. 2, the synthesized speech spectrum from synthesizer 12 is applied through a sufficiently high resistance R3 to pass a constant current through the series connected inductance, resistance and capacitance elements respectively denoted L1, R1, and C1, and to apply a constant voltage to cathode follower V1. The output voltage of cathode follower V1 is differentiated by capacitance element C2 and resistance element R2, and the differentiated output signal is passed to the next antiresonant circuit.

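The impedance of elements L1, R1, C1, in Laplace transform notation, may be written

$$L_1 s + R_1 + \frac{1}{C_1 s} = \frac{L_1}{s}\left(s^2 + \frac{R_1}{L_1}\,s + \frac{1}{L_1 C_1}\right), \qquad (7)$$

and differentiating elements R2 and C2 change this impedance by a multiplicative factor s, so that the transfer characteristic of circuit 131-1 may be written

$$Z_1 = A\left(s^2 + \frac{R_1}{L_1}\,s + \frac{1}{L_1 C_1}\right), \qquad (8)$$

where A is a constant. From Equation 8 it is evident that the values of the inductance, resistance, and capacitance elements L1, R1, and C1 may be determined from the predetermined values of $\sigma$ and $\omega_1$ according to the following relations, obtained by equating the coefficients in Equation 8 with those of $(s - s_1)(s - s_1^*) = s^2 + 2\sigma s + \sigma^2 + \omega_1^2$:

$$\frac{R_1}{L_1} = 2\sigma, \qquad \frac{1}{L_1 C_1} = \sigma^2 + \omega_1^2.$$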
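As a hedged numerical illustration of choosing the series elements, the Python sketch below assumes that a series inductance, resistance, and capacitance must place zeros at $s = -\sigma \pm j\omega_1$, so that $R_1/L_1 = 2\sigma$ and $1/(L_1 C_1) = \sigma^2 + \omega_1^2$. The inductance is a free choice; the value of one henry, the function name, and the example pole at 500 cycles per second with 400 nepers per second damping are assumptions made only for this example.

```python
import math

def antiresonant_components(sigma, freq_hz, inductance):
    """Return (resistance, capacitance) for a series L-R-C branch whose
    impedance has zeros at s = -sigma +/- j*2*pi*freq_hz."""
    omega = 2 * math.pi * freq_hz
    resistance = 2 * sigma * inductance                       # R/L = 2*sigma
    capacitance = 1 / (inductance * (sigma ** 2 + omega ** 2))  # 1/(LC) = sigma^2 + omega^2
    return resistance, capacitance

L1 = 1.0   # assumed inductance, henries
R1, C1 = antiresonant_components(400.0, 500.0, L1)
```

For these example values the resistance comes to 800 ohms and the capacitance to roughly 0.1 microfarad; a different choice of inductance scales the resistance in proportion and the capacitance inversely.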

Although this invention has been described in terms of a resonance vocoder system of the type shown in FIG. 1, it is to be understood that applications of the principles of this invention are not limited to this particular system, but include other resonance vocoder systems as well as various kinds of speech processing equipment in which speech formants are synthesized. In addition, it is to be understood that the above-described embodiments are merely illustrative of the numerous arrangements that may be devised for the principles of this invention by those skilled in the art without departing from the spirit and scope of the invention.

What is claimed is:

1. In a resonance vocoder synthesizer,

a source of a plurality of control signals including a pitch control signal, an amplitude control signal, and a group of formant control signals representative of the frequencies of selected low frequency formant peaks in the spectrum of an original speech wave,

first synthesizing means responsive to said plurality of control signals for developing an artificial speech spectrum having a first group of peaks at frequencies represented by said formant control signals, and

second synthesizing means for shaping said artificial speech spectrum to have a second group of peaks at selected fixed frequencies representative of high frequency speech formants, said second synthesizing means including

a resonant circuit having a transfer characteristic with no zeroes and an infinite number of poles at equally spaced predetermined frequencies, wherein the higher frequency poles of said transfer characteristic correspond in frequency to high frequency speech formants, and

a plurality of series-connected antiresonant circuits preceding said resonant circuit, each of said antiresonant circuits having a transfer characteristic with no poles and a zero at a predetermined frequency corresponding to an unwanted pole in said transfer characteristic of said resonant circuit, and

means for applying said artificial speech spectrum to said second synthesizing means.

2. In combination with a resonance vocoder synthesizer that generates an artificial speech spectrum having peaks at selected low frequencies corresponding to selected low frequency formant peaks in the spectrum of an original speech wave, apparatus for introducing additional peaks into said artificial speech spectrum at selected frequencies corresponding to selected high frequency formants which comprises

a resonant circuit having a transfer characteristic with no zeros and an infinite number of poles at equally spaced fixed frequencies, wherein the higher frequency poles of said transfer characteristic correspond in frequency to selected high frequency speech formants, and

a plurality of series-connected antiresonant circuits in preceding circuit relation with said resonant circuit for cancelling a corresponding plurality of unwanted poles in the transfer characteristic of said resonant circuit.

3. Apparatus for synthesizing a plurality of peaks of predetermined width at selected fixed frequencies in an incoming amplitude spectrum which comprises a plurality of series-connected antiresonant circuits each provided with a transfer characteristic having a single zero at $s = -\sigma + j\,n\pi/\tau$, ($n = 1, 3, \ldots$),

means for applying said incoming amplitude spectrum to said input point of said plurality of antiresonant circuits, and

means for connecting the output point of said plurality of antiresonant circuits to said resonant circuit.

4. Apparatus for synthesizing an artificial spectrum having a plurality of peaks at selected high frequency locations so that said peaks in said artificial spectrum closely resemble the formant peaks at high frequency locations in the spectrum of an original speech wave, which comprises means for developing an incoming artificial spectrum having peaks at selected low frequency locations corresponding to formant peaks at selected low frequency locations in said spectrum of said original speech wave,

a plurality M (M = 1, 2, ...) of series-connected antiresonant circuits for preventing the occurrence of peaks at a corresponding plurality of unwanted high frequency locations, wherein the nth of said antiresonant circuits, n = 1, 2, ..., M, comprises

an input terminal,

a cathode follower provided with an input point and an output point,

a first resistance element connected between said input terminal and said input point of said cathode follower,

an inductance element $L_n$, a second resistance element $R_n$, and a first capacitance element $C_n$ connected in series between said input point of said cathode follower and ground,

an output terminal,

a second capacitance element connected between said output point of said cathode follower and said output terminal, and

a third resistance element connected between said output terminal and ground,

wherein the values of said inductance element $L_n$, said second resistance element $R_n$, and said first capacitance element $C_n$ are determined by the bandwidth $\sigma$ and the frequency $\omega_n$ of said unwanted peak at said high frequency location according to the following relationships:

$$\frac{R_n}{L_n} = 2\sigma, \qquad \frac{1}{L_n C_n} = \sigma^2 + \omega_n^2,$$

means for applying said incoming artificial spectrum to said input terminal of the first of said antiresonant circuits, n = 1,

a resonant circuit including subtracting means provided with a minuend terminal,

a subtrahend terminal and an output terminal,

a delay element having a delay time of $\tau$ seconds and provided with an input terminal and an output terminal,

an amplifier having a gain $e^{-\sigma\tau}$ so that said resonant circuit has an infinite number of resonances at frequencies

$$\omega_i = \frac{(2i - 1)\pi}{\tau}, \quad i = 1, 2, \ldots,$$

said amplifier being provided with an input terminal and an output terminal,

means for connecting said output terminal of the last of said antiresonant circuits, n=M, to said minuend terminal of said subtracting means,

means for connecting said output terminal of said subtracting means to said input terminal of said delay element,

means for connecting said output terminal of said delay element to said input terminal of said amplifier, and

means for connecting said output terminal of said amplifier to said subtrahend terminal of said subtracting means,

thereby developing from said incoming artificial spectrum an outgoing artificial spectrum at said output terminal of said delay element, wherein said outgoing artificial spectrum has peaks at selected low frequency locations corresponding to said peaks of said incoming artificial spectrum and peaks at selected high frequency locations corresponding to resonances of said resonant circuit at frequencies $\omega_i$, for i greater than M.

No references cited.

KATHLEEN H. CLAFFY, Primary Examiner.

R. MURRAY, Assistant Examiner.

Claims

1. IN A RESONANCE VOCODER SYNTHESIZER, A SOURCE OF A PLURALITY OF CONTROL SIGNALS INCLUDING A PITCH CONTROL SIGNAL, AN AMPLITUDE CONTROL SIGNAL, AND A GROUP OF FORMANT CONTROL SIGNALS REPRESENTATIVE OF THE FREQUENCIES OF SELECTED LOW FREQUENCY FORMANT PEAKS IN THE SPECTRUM OF AN ORIGINAL SPEECH WAVE, FIRST SYNTHESIZING MEANS RESPONSIVE TO SAID PLURALITY OF CONTROL SIGNALS FOR DEVELOPING AN ARTIFICIAL SPEECH SPECTRUM HAVING A FIRST GROUP OF PEAKS AT FREQUENCIES REPRESENTED BY SAID FORMANT CONTROL SIGNALS, AND SECOND SYNTHESIZING MEANS FOR SHAPING SAID ARTIFICIAL SPEECH SPECTRUM TO HAVE A SECOND GROUP OF PEAKS AT SELECTED FIXED FREQUENCIES REPRESENTATIVE OF HIGH FREQUENCY SPEECH FORMANTS, SAID SECOND SYNTHESIZING MEANS INCLUDING A RESONANT CIRCUIT HAVING A TRANSFER CHARACTERISTIC WITH NO ZEROES AND AN INFINITE NUMBER OF POLES AT EQUALLY SPACED PREDETERMINED FREQUENCIES, WHEREIN THE HIGHER FREQUENCY POLES OF SAID TRANSFER CHARACTERISTIC CORRESPOND IN FREQUENCY TO HIGH FREQUENCY SPEECH FORMANTS, AND A PLURALITY OF SERIES-CONNECTED ANTIRESONANT CIRCUITS PRECEDING SAID RESONANT CIRCUIT, EACH OF SAID ANTIRESONANT CIRCUITS HAVING A TRANSFER CHARACTERISTIC WITH NO POLES AND A ZERO AT A PREDETERMINED FREQUENCY CORRESPONDING TO AN UNWANTED POLE IN SAID TRANSFER CHARACTERISTIC OF SAID RESONANT CIRCUIT, AND MEANS FOR APPLYING SAID ARTIFICIAL SPEECH SPECTRUM TO SAID SECOND SYNTHESIZING MEANS.