CA1181859A

CA1181859A - Variable rate speech synthesizer

Info

Publication number: CA1181859A
Application number: CA000425276A
Authority: CA
Inventors: Forrest S. Mozer
Original assignee: Individual
Current assignee: Individual
Priority date: 1982-07-12
Filing date: 1983-04-06
Publication date: 1985-01-29
Also published as: DE3314674A1; JPS5977496A; GB8313384D0; GB2124455A

Abstract

VARIABLE RATE SPEECH SYNTHESIZER

ABSTRACT OF THE DISCLOSURE
In a speech synthesizer, a method and apparatus for varying the intonation of a word or a phrase relies on the generation of pitch frequencies independent of the formant frequencies of synthesized speech. Pitch frequency is controlled and varied in successive repetitions of a phrase by reference to an external timing reference, such as stored tables of pitch frequencies as functions of time.
At each repetition of a phrase, a different repetition rate produces successive repetitions of the same phrase with desired variations in intonation. The invention finds application in both time domain speech synthesis and frequency domain speech synthesis.

Description

VARIABLE RATE SPEECH SYNTHESIZER

This invention relates to the synthesis of speech and like audible information. More particularly, the invention relates to methods and apparatus for varying the pitch frequency (or intonation) of a synthesized word or phrase from one repetition to the next without altering the recognizability of the word or phrase and without the requirement of significant additional information beyond that required to synthesize the word or phrase.
Normal speech and like audible sounds contain about 100,000 bits of information per second. The difficulties of storage and transmission of large quantities of such information can be prohibitive in cost and storage space. Thus, economical speech synthesizers require compression of the speech data prior to its storage and synthesis.
Compression and synthesis techniques are generally divided into two classes, frequency domain techniques and time domain techniques. These techniques are distinguished in terms of the type of data stored and the method of utilization of the data. The frequency domain synthesis techniques achieve compression by storing information on the important frequencies in each segment or pitch period.
These frequencies, called formants, are resonances of the mechanical system consisting of the throat, mouth, lips, tongue, nasal passages, and the like. These resonant frequencies change slowly enough with time that information compression can be achieved by assigning a power spectrum label to successive time segments of speech.
The frequency domain speech synthesizers operate by passing a noise waveform through digital or analog filters whose parameters are controlled by the label information stored in memory in order to produce a waveform having peaks in its power spectrum corresponding to those of the desired waveform. Time domain synthesis techniques, in contrast, employ a compressed version of the amplitude of a waveform as a function of time for information storage and reproduction.
Digital speech synthesizers are capable of producing artificial speech and like audio sound utilizing an amount of information several orders of magnitude less than that of the original or source sound. A great premium has been placed on the amount of storage space required to store speech information. Because of the premium on information storage, artificial speech is generally synthesized in exactly the same manner in each repetition.
Precise mechanical repetition of words or phrases is annoying and mechanical to the human ear. What is needed is a technique for producing artificial speech with a pleasing variation from one repetition of a given message to the next.

Compression and synthesis of speech signals and the like have been studied for several decades. (See, for example, Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972.) Interest in the topic has accelerated with the increased technical ability to fabricate complex electronic circuits in a single integrated circuit through techniques of large-scale integration.
Examples of frequency domain synthesizers are given in U.S. Patent Nos. 3,575,555 and 3,588,353, and devices incorporating these techniques have been sold by Texas Instruments, General Instruments, and a number of Japanese companies. Selected digital time domain compression techniques have been described in U.S. Patent No. 3,641,496 and U.S. Patent No. 4,214,125, and devices incorporating time domain compression techniques have been developed and sold by Telesensory Systems, Inc., National Semiconductor, and Sharp Electronics.

Present technology is able to overcome the problem of exact repetition of the intonation in identical words or phrases by storing in the synthesizer memory information required to produce a word or phrase with two or more different intonations. Unfortunately, there is generally a substantial increase in memory size and a similar increase in device cost.
What is needed is a speech compression technique which is capable of varying intonation of a repeated word or phrase without substantially increasing the size of memory and therefore the cost of data storage.

According to the invention, it has been recognized that pitch frequency is sufficiently unrelated to the formant frequencies of a word or phrase to permit independent control of intonation by control of pitch separate from the reproduction of a word or phrase.
Specifically, words or phrases already broken down into pitch periods by either time domain or frequency domain techniques are reproduced at a variable rate. The rate may be controlled by reference to a set of tables for controlling the commencement of each pitch period or by reference to a pseudorandom clock signal. Intonation may be raised by commencing the generation of a subsequent pitch period of data prior to the completion of the generation of the current pitch period. Intonation may be lowered by inserting an extra number of short constant amplitude time segments between consecutive pitch periods during audio signal generation.
The intonation control tables can be referenced pseudorandomly in order to produce variation of intonation from one repetition of a given message to the next with relatively smooth transition. A minimum amount of additional storage space is required for added control tables. The basic vocabulary is stored essentially without information respecting intonation variation.

The invention will be better understood by reference to the following detailed description of the invention taken in conjunction with the accompanying drawings, in which:

Fig. 1A is a plot of the amplitude versus time of a male speaker saying "Ah" at a pitch frequency of 80 Hz.
Fig. 1B is a plot of the amplitude versus time of a male speaker saying "Ah" at a pitch frequency of 120 Hz.
Fig. 2A is a plot of the amplitude versus time of a single pitch period of speech.
Fig. 2B is a computer generated plot of the power spectra of various segments of the time domain waveform of Fig. 2A.
Fig. 3, on the first sheet, is a block diagram of a time domain speech synthesizer according to the invention.
Fig. 4 is a block diagram of a frequency domain speech synthesizer according to the invention.

In order to understand the present invention, it is important to recognize that the pitch frequency of a voiced waveform is unrelated to the formant frequencies of the power spectrum. The pitch, or oscillation frequency of the vocal cords, imparts intonation and meaning to a spoken phrase, but its variation in successive repetitions of the same phrase does not alter the interpretation of the perceived sounds as the same set of words.
This phenomenon may be better understood by observation of waveform 10 of Fig. 1A and waveform 12 of Fig. 1B, which correspond to the phoneme "ah" spoken at pitch frequencies of about 80 Hz and about 120 Hz. The two waveforms are seen to be quasi-periodic. However, the repetition rates of the waveforms are different. Nevertheless, a comparison of the two waveforms 10 and 12 reveals that each is essentially identical in wave shape, taking into account the difference in time base. The two waveforms 10 and 12 have essentially the same shaped power spectrum, so both sound like "ah" even though waveform 12 is spoken at a higher pitch.
The independence of pitch and formant frequencies is illustrated more clearly in Figs. 2A and 2B, in which the waveform 14 is one pitch period of a voiced phoneme. The points labeled "3" in power spectrum 16 of Fig. 2B are the power spectrum of the entire waveform 14, while those points labeled "2", "1", and "0" give the power spectra of the first 75%, 50%, and 25% of waveform 14, respectively. Since the peaks of all the power spectra are at the same frequencies, it is shown that the formant frequencies of waveform 14 are independent of the duration of the analyzed segment. Thus, for example, one could reproduce the first half of waveform 14, then the first half of the first subsequent pitch period, then the first half of the next subsequent pitch period, and so on, in order to produce a sound which would have twice the pitch frequency of the original waveform yet would be interpreted the same as the original phoneme.
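The half-period example above may be illustrated by the following minimal sketch. It is not part of the disclosed apparatus: the function names, the 8000 Hz sample rate, and the damped sinusoid standing in for one pitch period are all assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate in Hz, for illustration only


def make_period(n_samples):
    """Hypothetical stand-in for one stored pitch period: a damped sinusoid."""
    return [
        math.exp(-3.0 * i / n_samples) * math.sin(2 * math.pi * 5 * i / n_samples)
        for i in range(n_samples)
    ]


def truncate_periods(periods, fraction):
    """Reproduce only the leading `fraction` of every pitch period, in order."""
    output = []
    for period in periods:
        keep = max(1, int(len(period) * fraction))
        output.extend(period[:keep])
    return output


# Four identical 100-sample periods correspond to an 80 Hz pitch at 8000 Hz.
periods = [make_period(100) for _ in range(4)]
halved = truncate_periods(periods, 0.5)

original_pitch = SAMPLE_RATE / len(periods[0])            # 80.0 Hz
doubled_pitch = SAMPLE_RATE / (len(halved) / len(periods))  # 160.0 Hz
```

Each retained segment starts at the beginning of its pitch period, so the shape that carries the formant information is kept while the repetition rate doubles.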
Referring to Fig. 3, there is shown a time domain speech synthesizer 21 according to the invention comprising a memory device 18 coupled to an intermediate controller 20 coupled to a digital-to-analog converter 22 driving a loudspeaker 24. Control circuitry 26 supervises the operation of the memory device 18 and the intermediate controller 20 in response to word select and start commands.
The details of the structure of the speech synthesizer 21 are of no direct concern. The memory device 18 is operative to store compressed time domain waveforms. The intermediate controller 20 is operative to expand into a digital wave train the compressed time domain waveform stored in the memory device 18 under the action of the control circuitry 26. The digital wave train out of the intermediate controller 20 is converted to an analog signal by the digital-to-analog converter 22 and reproduced by the loudspeaker 24 as an audible waveform.
In one embodiment of the invention, the control circuitry 26 causes the intermediate processor to stop generating the current pitch period of data and to start generating the next subsequent pitch period at a random or pseudorandom time before the nominal termination of the current pitch period. The time of starting the generation of each pitch period may be varied smoothly from pitch period to pitch period to produce a higher than nominal pitch frequency to the reproduced speech. Since the information used for reproducing each pitch period is essentially unchanged, the words in the message would be recognized as merely having an increased intonation.
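The early-termination scheme of this embodiment may be sketched as follows. This is only an illustrative sketch under stated assumptions: the function `raise_pitch`, its parameters `max_cut` and `max_step`, and the bounded random walk used to keep the variation smooth are hypothetical choices, not details taken from the disclosure.

```python
import random


def raise_pitch(periods, max_cut, max_step, seed=None):
    """Cut each pitch period short by a pseudorandom, slowly varying amount.

    max_cut  -- largest number of samples removed from the end of a period
    max_step -- largest change in the cut from one period to the next
    Returns the output samples and the reproduced length of each period.
    """
    rng = random.Random(seed)
    cut = 0
    output, lengths = [], []
    for period in periods:
        # Bounded random walk: the cut drifts smoothly instead of jumping.
        cut += rng.randint(-max_step, max_step)
        cut = min(max_cut, max(0, cut))
        keep = max(1, len(period) - cut)
        output.extend(period[:keep])  # next period starts before nominal end
        lengths.append(keep)
    return output, lengths


# Eight nominal 100-sample periods (placeholder sample values).
nominal = [[0.0] * 100 for _ in range(8)]
samples, lengths = raise_pitch(nominal, max_cut=20, max_step=3, seed=1)
```

Because only the trailing samples of each period are dropped, no stored period is altered and no extra memory is consumed; the seed merely makes the pseudorandom sequence repeatable.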
Alternatively, the control circuit 26 can cause the intermediate processor to append an increasing number of constant amplitude time segments at the end of each pitch period, thereby to cause the resultant output waveform to be lower in pitch and intonation than the nominal waveform.
Nevertheless, intelligibility would be unchanged. The combination of these two techniques is used to generate a phrase with the same nominal pitch and a variation in intonation between successive repetitions of a message. The control circuitry 26 may incorporate a pseudorandom number generator to create a control signal to vary intonation. No additional memory of any kind is required to vary intonation according to such an embodiment.
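The combination of the two techniques may be sketched as a single signed adjustment per pitch period. The following is a minimal sketch under assumptions: the name `retime_period` is hypothetical, and holding the last sample value as the "constant amplitude" filler is one possible choice that the description does not specify.

```python
def retime_period(period, offset):
    """Shorten (offset < 0) or lengthen (offset > 0) one pitch period.

    A negative offset starts the next period early, raising the pitch.
    A positive offset appends `offset` constant-amplitude samples, lowering
    the pitch. The filler value (last sample held) is an assumption.
    """
    if offset < 0:
        return period[: max(1, len(period) + offset)]
    return period + [period[-1]] * offset


def retime_phrase(periods, offsets):
    """Apply one offset to each pitch period and concatenate the results."""
    output = []
    for period, offset in zip(periods, offsets):
        output.extend(retime_period(period, offset))
    return output


period = [float(i) for i in range(100)]
raised = retime_period(period, -10)   # 90 samples: higher pitch
lowered = retime_period(period, 10)   # 110 samples: lower pitch

# Offsets that sum to zero leave the total duration, and hence the
# average pitch of the phrase, at its nominal value.
phrase = retime_phrase([period] * 3, [-10, 0, 10])
```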
Controlled pitch frequency variations in successive repetitions of a phrase may, however, be produced by storing tables of pitch frequency as a function of time in the memory device 18. At each repetition, a different table may be called by the control circuitry 26, which in response instructs the intermediate controller 20 to produce successive repetitions of the same phrase with precisely the desired variations in intonation. The use of a table for programming pitch frequency variations requires a small amount of memory in addition to that required for storage of the speech data.
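A table-driven variant may be sketched as follows. The table contents, the 8000 Hz sample rate, and the round-robin selection of a table at each repetition are assumptions for illustration; the description states only that a different table may be called at each repetition.

```python
SAMPLE_RATE = 8000  # assumed sample rate in Hz

# Hypothetical pitch tables: one pitch frequency (Hz) per pitch period.
PITCH_TABLES = [
    [80, 85, 90, 85],  # rising then falling intonation
    [80, 78, 75, 72],  # falling intonation
]


def period_lengths(table, sample_rate=SAMPLE_RATE):
    """Convert a table of pitch frequencies into period lengths in samples."""
    return [round(sample_rate / freq) for freq in table]


def table_for_repetition(repetition):
    """Pick a table for the n-th repetition (round-robin is an assumption)."""
    return PITCH_TABLES[repetition % len(PITCH_TABLES)]


# Each repetition of the same phrase uses a different intonation contour.
first = period_lengths(table_for_repetition(0))
second = period_lengths(table_for_repetition(1))
```

Each table holds one small number per pitch period, which is consistent with the small memory overhead noted above.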
A frequency domain synthesizer 31 according to the invention is shown in Fig. 4, the exact details of which are not important. Generally, however, the frequency domain synthesizer comprises a voiced excitation source 28 under control of an intermediate controller 38, an unvoiced excitation source 30, and a digital filter 32 having as an input either the output of the voiced excitation source 28 or the unvoiced excitation source 30 under control of the intermediate controller 38 through control line 44 to a switch 33. The digital filter is programmable by a control line 42 from the intermediate controller 38. A memory device 40 is coupled to the intermediate controller 38. The digital filter 32 is coupled to a digital-to-analog converter 34 which in turn is coupled to a loudspeaker 36.
In a digital frequency domain speech synthesizer 31, the voiced excitation source 28 produces a periodic pulse at a pitch frequency which is controlled by signal line 46 from the intermediate controller 38. The intermediate controller 38 determines the pitch frequency by use of data from the memory device 40. The appropriate source of excitation, either unvoiced or voiced, is provided to the digital filter 32 under control of control line 44, and the filter parameters of the digital filter 32 are determined by signals applied through control line 42. The intermediate controller 38 specifies the filter parameters of the digital filter 32 as a function of time in accordance with the stored data from the memory device 40. The output of the digital filter 32 is directed to the digital-to-analog converter 34 whose output is converted to an audible signal by the loudspeaker 36.
In one segment of the memory device 40, there is stored information on the formant frequencies of the phrase to be synthesized. Pitch frequency is determined by completely independent data in another segment of the same memory device 40. Pitch frequency and thus intonation may be varied randomly or as specified by data in the memory 40 through the intermediate controller 38. The intermediate controller 38 may vary the rate of pulsing the voiced excitation source 28 from pitch period to pitch period. The variation may be pseudorandom or preprogrammed in accordance with a set of tables storing data on the desired repetition rate.
The invention has now been explained with reference to specific embodiments. Other embodiments will be apparent to those of ordinary skill in the art upon reference to this description. It is therefore not intended that this invention be limited except as indicated by the appended claims.
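The separation of pitch data from formant data in the frequency domain synthesizer may be sketched as follows. This is an illustrative sketch only: a single two-pole resonator stands in for the digital filter 32, and the function names, the 700 Hz formant, the 100 Hz bandwidth, and the 8000 Hz sample rate are assumptions not found in the description.

```python
import math


def pulse_train(period_lengths):
    """Voiced excitation: one unit pulse at the start of each pitch period."""
    samples = []
    for n in period_lengths:
        samples.append(1.0)
        samples.extend([0.0] * (n - 1))
    return samples


def resonator(excitation, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonant filter standing in for one formant of the filter."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    y1 = y2 = 0.0
    output = []
    for x in excitation:
        y = x + a1 * y1 + a2 * y2
        output.append(y)
        y2, y1 = y1, y
    return output


SAMPLE_RATE = 8000  # assumed
# The formant data are fixed; only the pulse spacing changes per repetition.
low = resonator(pulse_train([100, 100, 100]), 700.0, 100.0, SAMPLE_RATE)
high = resonator(pulse_train([80, 80, 80]), 700.0, 100.0, SAMPLE_RATE)
```

The filter parameters are identical in both calls, so the resonance stays at the same frequency while the spacing of the excitation pulses, and hence the pitch, differs.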

Claims

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:

1. For use in a speech synthesizer, a method for varying the intonation of segments of speech wherein said segments of speech comprise sets of consecutive pitch periods registered in a storage device, said method comprising the steps of:
generating signals representing a plurality of nominal pitch periods from which said segment of speech can be reproduced; and audibly reproducing each related nominal pitch period with a duration that is controlled independently of the duration of said nominal pitch periods to produce synthesized speech.

2. A method according to claim 1 wherein each said nominal pitch period is reproduced at a repetition rate which is varied in a pseudorandom manner to produce variations in pitch frequency.

3. The method according to claim 1 wherein each said nominal pitch period is reproduced at a repetition rate which is varied smoothly in a preprogrammed manner to produce desired variations in pitch frequency.

4. In a time domain speech synthesizer having means for storing nominal pitch periods, the improvement comprising:
means for establishing information regarding variation in pitch period repetition rate independently of said pitch period; and means coupled to and responsive to said establishing means for varying repetition rates of each successive pitch period in accordance with said pitch period repetition rate information.

5. In the apparatus according to claim 4 the improvement wherein said establishing means is a pseudo random number generator.

6. In the apparatus according to claim 4 the improvement wherein said establishing means is means for storing tables of successive pitch frequencies.

7. In a frequency domain synthesizer, the improvement comprising:
means for establishing pitch frequencies of synthesized speech independent of formant frequencies of pitch periods of synthesized speech; and means for reproducing said formant frequencies at a rate in accordance with said pitch frequencies.

8. In the apparatus according to claim 7 the improvement wherein said establishing means is a pseudo-random number generator.

9. In the apparatus according to claim 7 the improvement wherein said establishing means is a means for storing a table specifying successive pitch frequencies of pitch periods of selected formant frequencies.