US3564142A

US3564142A - Method of multiplex speech synthesis

Info

Publication number: US3564142A
Application number: US748745A
Authority: US
Inventors: Ernst H Rothauser; Kurt F Bandat
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1967-08-03
Filing date: 1968-07-30
Publication date: 1971-02-16
Anticipated expiration: 1988-02-16
Also published as: GB1227578A; DE1762677A1; AT276495B; JPS5211161B1; FR1577550A

Abstract

The invention relates to a method of channel vocoder speech synthesis. For representing the speech signals for several independent speech lines, digital speech data stored in data processors are used. In addition, the method provides for the storage of the time-sampled digital description of the transient behaviors of n spectrum channel band-pass filters. Only one such description is needed for synthesizing the speech signals for m speech lines. For a given speech line the transient responses of the band-pass filters are modulated by the frequency function for the given line. The modulated transient values are added for corresponding time samples and stored in a delay line for the given speech line. The stored value of a speech line is released at points in time defined by the excitation function, thus releasing a digital description of the transient response of the set of band-pass filters as if they were excited by a unit pulse and modulated by the frequency function of the given speech line. The digital description is demodulated to an analogue form by conventional means.

Description

United States Patent [72] Inventors: Ernst H. Rothauser, Wollerau, Switzerland; Kurt F. Bandat, Beacon, N.Y. [21] Appl. No. 748,745 [22] Filed July 30, 1968 [45] Patented Feb. 16, 1971 [73] Assignee: International Business Machines Corporation, Armonk, N.Y. [32] Priority Aug. 3, 1967 [33] Austria [31] A7231 [54] METHOD OF MULTIPLEX SPEECH SYNTHESIS 2 Claims, 3 Drawing Figs.

[52] U.S. Cl. 179/1, 179/15.55 [51] Int. Cl. H04b 1/66 [50] Field of Search 179/1 (AS), 15.55; 340/15.5 (RSC) [56] References Cited UNITED STATES PATENTS 3,303,335 2/1967 Pryor 340/15.5 Primary Examiner: Rodney D. Bennett Assistant Examiner: William T. Rifkin Attorney: Hanifin and Jancin

The invention relates to a method of channel vocoder multiplex speech synthesis of speech data stored in a data processor for a number of m speech lines.

The known pulse-excited channel vocoder permits the ready derivation of signals for natural speech generation from data stored in a computer, utilizing in an efficient manner the storage available. In accordance with the known principle, the speech signals, by means of filters, are divided into a number of frequency channels (aggregate or spectrum channels) and an excitation channel carrying the information relating to the basic speech wave. In pulse-excited channel vocoders, pulses are generated in the excitation channel of the speech analyzer, the time spacing of which is equivalent to the period of the basic speech wave just analyzed. This, in a strict sense, applies merely to voiced speech segments. For unvoiced speech segments, the output signals of a noise generator are either applied to an excitation channel or a method is used which does not distinguish between voiced and unvoiced sounds. In the case of the latter method, the speech signal of the excitation channel, which is limited to a range up to 500 cps, is nonlinearly distorted due to the nonlinear characteristics of the elements used in the circuit, which consists, in the main, of diodes. In addition to sum and harmonic frequencies, difference frequencies occur. In the case of vowels, that means the voiced speech segments, these difference frequencies in the transient state result in the fundamental frequency of the speech segment just analyzed. For unvoiced sounds, the main energy component lies within a frequency range exceeding 3,000 cps, and difference frequencies occur which, behind the diodes, contain a distorted energy component in the range from some 20 to 500 cps, resulting in noiselike sound characteristics.

The value of the speech energy present in the individual lines can, in a known manner, be transmitted in analogue or digital form or be stored for synthesizing the divided speech signal.

The known method of speech signal synthesis in pulse-excited vocoders invariably starts from the concept that at certain times, for example initiated by the excitation pulses, the aggregate channel values are transmitted in the form of amplitude-modulated pulses to the corresponding channel filters of the synthesizer.

These arrangements have the disadvantage that one filter set is required for each speech line, resulting in the technical means used increasing linearly in relation to the number of speech lines.

Apart from this, subsequent changes in the dimensions of the vocoder filter sets are difficult and expensive.

Therefore, it is the object of the invention to provide a solution which is both technically and economically advantageous and which is suitable for the multiple connection of speech signal lines to a speech output unit linked with a computer.

For a method of channel vocoder multiplex speech synthesis of speech data stored in data processors for a number of m speech lines, the invention is characterized in that the description of the transient behavior of n aggregate channel filters is stored, that the values of this description of each aggregate channel filter are separately modulated with the frequency function of the same aggregate channel filter, added and subsequently stored, and that finally, at the times given by the speech excitation, the stored modulated values for each speech channel are separately called and demodulated.

The method can in an advantageous manner, with the help of digital means, be performed so that the description of the transient behavior of the aggregate channel filters, as a digital representation of the values of k scanning points, is stored in a delay-line storage. From the data processor the digital values of the frequency function for all scanning points of all n aggregate channels and all m speech channels are transmitted to another delay-line storage, at such times that the values associated with the two delay-line storages, without additional synchronization, are multiplied, added and, subsequently, through a distributor, separately transmitted to delay-line storages associated with each speech line. Finally, from the data processor the digital data relating to speech excitation are transferred to a further delay-line storage which, through another distributor, separately controls the synchronous calling of the data for each speech line from the delay-line storages for transmission to the decoders.

Another advantageous embodiment is characterized in that during each cycle of the delay-line storage, a line value in a counter is incremented by one until the counter has reached a predetermined value, thus causing a signal to be emitted from the corresponding delay-line storage to the associated decoder.

The arrangement of the invention reduces in an advantageous manner the means required for each speech line, permits the dimensions of the vocoder filter set to be readily changed and, in addition, handles the conversion of a major portion of the speech description stored in the computer, particularly coordinating in time the transmission of the speech description to the speech synthesizer.

The invention is hereafter described by means of an embodiment shown in the following drawings:

FIG. 1 is a block diagram of typical operation of the method explained.

FIG. 2 is a detailed representation of the block diagram of FIG. 1.

FIG. 3 is a block diagram showing the excitation-controlled calling of information groups from the delay-line storages.

GENERAL DESCRIPTION

The arrangement is based on the known concept of band-pass filters being essentially characterized by their transient behavior. Moreover, it is based on the concept that a description of the filter set in the form of transients allows the use of one such description for a multitude of speech lines.

In such a case the filter description permits the generation of a pulse code modulated (PCM) description, which in its turn can easily and simply be decoded in a known manner for analogue speech representation, by multiplying the time description of the filters by the applicable amplitude values of the aggregate function and by subsequently adding the filter channel values.
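The multiply-and-add step described above can be illustrated with a minimal Python sketch. The function name `synthesize_response` and the toy values are illustrative assumptions, not taken from the patent; the sketch only shows how one stored set of filter transients can serve several speech lines.

```python
from typing import List


def synthesize_response(filter_transients: List[List[int]],
                        amplitudes: List[int]) -> List[int]:
    """Weighted sum of stored filter transients for one speech line.

    filter_transients[ch][s] is the sampled transient of channel filter ch
    at scanning point s; amplitudes[ch] is the aggregate value of that
    channel for the line being synthesized (hypothetical data layout).
    """
    if len(amplitudes) != len(filter_transients):
        raise ValueError("one amplitude per filter is required")
    k = len(filter_transients[0])
    return [
        sum(transient[s] * amp
            for transient, amp in zip(filter_transients, amplitudes))
        for s in range(k)
    ]


# One shared filter description (2 filters, 3 scanning points) ...
transients = [[1, 2, 1], [0, 3, 2]]
# ... reused for two independent speech lines.
line_a = synthesize_response(transients, [2, 1])
line_b = synthesize_response(transients, [0, 3])
```

Only the amplitude vector differs between the two calls, which is the economy the arrangement aims at.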

As shown in FIG. 1, the concept of the arrangement provides a block comprising 50 speech channels, having delay lines VOL, AGL, EXL and VL which allow a pulse frequency of 4.5 Mcps. Lower frequencies necessitate a different design of the system such as, for example, a parallel arrangement of the delay lines. For a greater number of speech lines additional blocks comprising 50 speech channels can be connected to the existing vocoder description (stored in VOL).

In the multiplex speech synthesis arrangement hereafter explained, the transient behavior of the filter set is described in coded form, and this description is dynamically stored in the delay line VOL. For the purpose of building up the output signal of a speech line L1 to Lm, the transient behavior of the filters must be multiplied with the frequency function of the corresponding speech line. The changes in the frequency functions are low-frequency ones and can be described with adequate accuracy by a 25 cps wide frequency band. The frequency or aggregate information for a number of speech lines can be stored in a single delay-line storage AGL. The values stored in the delay-line storages VOL and AGL are multiplied by each other. The values of one filter and the factor of the frequency channel occur simultaneously on the multiplier arrangement MULT. For generating a scanning value in accordance with the transient behavior of the filter set, the results of all frequency channels (generally 16 frequency channels are used) must be added. The result after this addition in the adder AD consists of a number of digits indicating the impulse response of the filter set multiplied by the current frequency function of the line, provided the filter set is excited by an individually selected pulse magnitude.

As the generation of these results is not synchronized with the rest of the arrangement, the coded representation of the speech must be stored in the delay-line storages VL1 to VLm. The information groups circulate in these storages, being emitted on the output at the times quantized by means of the 10 kcps quantizing frequency of the speech excitation. The excitation of the filter set, that is, the calling of the contents of the delay-line storages VL1 to VLm, is controlled by means of the excitation information which, in coded form for all speech lines, is stored in the delay-line storage EXL.

During each cycle of the delay-line storage EXL, a line value in the counter is incremented by one until the counter has reached a predetermined value, initiating the calling of a value in the corresponding delay-line storage for transmission to the associated decoder D. This line must be so designed that it provides the scanning values in a delayed fashion.
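The counter mechanism for releasing a line can be sketched as follows. This is a simplified model under stated assumptions: the counter is assumed to reset after it fires, and the function name `release_cycles` is hypothetical; the patent does not spell out these details.

```python
from typing import List


def release_cycles(predetermined_value: int, total_cycles: int) -> List[int]:
    """Return the EXL cycle numbers at which a release signal is emitted.

    The counter is incremented by one per cycle; when it reaches
    predetermined_value a release is recorded. Resetting the counter to
    zero afterwards is an assumption made for this sketch.
    """
    if predetermined_value < 1:
        raise ValueError("predetermined_value must be at least 1")
    counter = 0
    fired = []
    for cycle in range(1, total_cycles + 1):
        counter += 1
        if counter == predetermined_value:
            fired.append(cycle)
            counter = 0  # assumed reset, ready for the next excitation period
    return fired


# A predetermined value of 50 fires on every 50th cycle.
events = release_cycles(50, 200)
```

If one cycle corresponded to 0.1 msec (again an assumption), a predetermined value of 50 would give the 5 msec excitation spacing mentioned in the text.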

The pulse code modulated speech signal on the output of a delay-line storage VLi is subsequently converted in the associated decoder Di into an analogue speech signal.

Furthermore, it must be assumed, as is known from the state of the art, that for obtaining a good speech quality the channel values of the aggregate function must be scanned every 50 secs.

The excitation pulses occurring at shorter intervals, every 5 msecs. (according to a max. fundamental frequency of less than 200 cps for the average male voice), have to be described accurately to 0.1 msec.

Another prerequisite for a good speech quality in the PCM representation consists in 8 bits every 0.1 msec. being provided as a description.
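As an illustration of these requirements, the following sketch places a stored response on a 0.1 msec grid at each excitation time. The function `render` and its arguments are hypothetical, and summing overlapping responses is a simplifying assumption; the text describes excitation pulses spaced by the length of one stored response, in which case no overlap occurs.

```python
from typing import List


def render(response: List[int], pulse_times_ms: List[float],
           duration_ms: float, step_ms: float = 0.1) -> List[int]:
    """Lay a stored response onto a sample grid at each excitation time.

    Excitation times are rounded to the nearest grid step, mirroring the
    0.1 msec accuracy stated in the text. Overlaps are summed (assumption).
    """
    n = round(duration_ms / step_ms)
    out = [0] * n
    for t in pulse_times_ms:
        start = round(t / step_ms)
        for i, value in enumerate(response):
            if 0 <= start + i < n:
                out[start + i] += value
    return out


# Two excitation pulses 0.5 msec apart, each releasing a 2-sample response.
signal = render([3, 1], [0.0, 0.5], 1.0)
```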

The longest time interval to be considered is the interval at which the description of the aggregate functions of the 50 speech lines is transmitted from the data processor EDP to the multiplex speech synthesizer, that means 40.1 msecs. or 180,450 t, where t is the period time of one pulse in the delay lines. At a repetition frequency of 4.5 Mcps, one pulse period is 0.22 μsec. The time interval of 40.1 msecs., in its turn, is divided into 50 periods of 3,609 bits each, the individual bit times being referred to as t1 to t3609. The time t1 is the time at which the first information pulse is available on the lines A and B (FIG. 2).
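The timing figures quoted here can be checked with a few lines of arithmetic. The constant names are illustrative; the numbers are those given in the text.

```python
from fractions import Fraction

PULSE_RATE_HZ = 4_500_000   # 4.5 Mcps repetition frequency
BITS_PER_PERIOD = 3609      # bit times t1 ... t3609
PERIODS_PER_FRAME = 50      # periods per aggregate-function update

# Exact pulse period in microseconds: 1 / 4.5 MHz = 2/9 μsec (about 0.22).
pulse_period_us = Fraction(1_000_000, PULSE_RATE_HZ)

# Total number of pulse periods between aggregate-function updates.
total_bits = PERIODS_PER_FRAME * BITS_PER_PERIOD

# Duration of that interval in milliseconds.
frame_ms = float(total_bits * pulse_period_us / 1000)
```

With these values `total_bits` is 180,450 and `frame_ms` is 40.1, matching the figures in the description.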

The description of the transients of 16 channel filters, according to the division of the speech band into 16 aggregate channels, is dynamically stored in a delay-line arrangement VOL, fifty scanning points of 4 bits each describing one filter. The filter information is stored once and circulates in the delay-line storage VOL, unless a fault occurs, causing the circuit Q for the sum of all digits to respond, thus signalling the need for the vocoder description to be written in anew. After each 64-bit scanning time 8 blanks are provided, enabling synchronization with the individual speech line delay lines VL, which have a 9-bit group length. The delay-line storage VOL is so designed that at t1, every 3,609 t, the values of a succeeding scanning point occur on the output line A. This is necessary so that all aggregate channel values of the 50 speech lines can be multiplied by the 50 scanning points of the filter description. An additional shift of 9 bits transfers the head of the information (t1) to the next group position in the delay-line storage VL.

For the aggregate function it is assumed that every 40.1 msecs. the description for the 50 speech lines is transmitted from the computer to the speech synthesizer. The information relating to the aggregate function is dynamically stored in the delay-line storage AGL so that the aggregate values of a speech line (16 × 4 = 64 bits) occur one after the other, the blocks of the speech channels being separated by 8 blanks (altogether 72 bits). The fifty blocks are succeeded by 9 further blanks, causing the aggregate function to be shifted by one group length of 9 bits in the delay-line storage VL. The information arriving on the lines A and B is the corresponding information of the filter description, that means 16 channels described by 4 bits each. In the multiplier circuit MULT the binary product 4 bits × 4 bits is formed. The sixteen values, which in their entirety yield one time value of the final speech signal, are added in the adder AD and fed to the corresponding speech line delay line VLi through a switch S. The results of the adder are thus written into the delay lines VLi.
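The AGL layout described above (50 blocks of 64 data bits plus 8 blanks, followed by 9 blanks) can be sketched as a bit-packing routine. The function name `agl_frame`, the most-significant-bit-first order and the use of 0 for blanks are assumptions made for illustration.

```python
from typing import List

CHANNELS = 16        # aggregate channels per speech line
BITS_PER_VALUE = 4   # bits per aggregate value
BLOCK_BLANKS = 8     # blanks after each speech-line block
TRAILING_BLANKS = 9  # blanks after the last block


def agl_frame(lines: List[List[int]]) -> List[int]:
    """Pack the aggregate values of all speech lines into one bit frame."""
    bits: List[int] = []
    for values in lines:
        if len(values) != CHANNELS:
            raise ValueError("each line needs 16 aggregate values")
        for v in values:
            if not 0 <= v < 2 ** BITS_PER_VALUE:
                raise ValueError("aggregate values must fit in 4 bits")
            # Assumed bit order: most significant bit first.
            bits.extend((v >> shift) & 1 for shift in (3, 2, 1, 0))
        bits.extend([0] * BLOCK_BLANKS)
    bits.extend([0] * TRAILING_BLANKS)
    return bits


frame = agl_frame([[5] * CHANNELS for _ in range(50)])
```

For 50 speech lines the frame length comes to 50 × 72 + 9 = 3609 bits, the capacity stated for the delay-line storage.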

The speech channel delay lines VL1 to VL50 are each 450 bits long, that means they can accommodate 50 groups of (8 + 1) bits, the positions of which are referred to as vl n/m.
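A small sketch of the position bookkeeping in one speech channel delay line follows. It assumes that in the notation vl n/m the first number is the group (1 to 50) and the second the bit within the group (1 to 9); this reading, and the helper name `vl_index`, are assumptions rather than statements from the patent.

```python
GROUPS = 50          # pulse groups per delay line
BITS_PER_GROUP = 9   # 8 data bits plus 1 control position
LINE_LENGTH = GROUPS * BITS_PER_GROUP


def vl_index(group: int, bit: int) -> int:
    """Map an assumed (group, bit) position to a zero-based bit offset."""
    if not (1 <= group <= GROUPS and 1 <= bit <= BITS_PER_GROUP):
        raise ValueError("position outside the delay line")
    return (group - 1) * BITS_PER_GROUP + (bit - 1)


first = vl_index(1, 1)
last = vl_index(50, 9)
```

`LINE_LENGTH` evaluates to 450, so the last position has offset 449.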

It shall be assumed that the first adding result is emitted for the speech line 1 and is stored in VL1 in position vl 1/1. Subsequently, after 72 t, a signal corresponding to the second of the 50 scanning values of VOL occurs in VL position vl 2/9. The first scanning value, upon completion of the writing process, can be found in VL position vl 2/8.

The following table 1 is a survey of the division of the speech lines 1 to 50 and shows the first group into which a scanning value is entered and to which the first group corresponds. Apart from this, it shows the position in which the first scanning value of the transient, which is derived from the output function, can be found.

TABLE 1. DIVISION OF SPEECH LINES
Columns: Speech line; First group position written in (no.); Scanning value (no.); First scanning value in position. [Table entries not reproduced in this text.]

Consequently, there is a defined initial time of a time function for each of the 50 speech channel delay lines VL1 to VL50. The response of all filters to the standard excitation pulse, which for each line is multiplied by the value of the aggregate function, the individual results being added, corresponds in the channel vocoder to the time function stored in a delay-line storage VLi. The sum of these time functions represents the speech signal.

For each speech line, quantized at 450 t, that means at 0.1 msec., a signal is available which in the channel vocoder is released by an excitation pulse.

Every 22,500 t, that means every 5 msec., a set of values is transmitted by the data processor EDP for the next excitation.

DETAILED DESCRIPTION

It is assumed that at a certain time the filter information, through line C and the AND gates U1, U2, U3, U4, the OR gates O1, O2 and O3 and the delay lines VZ1 and VZ2, has been stored in the delay-line storage VOL and is in a phase position in which, at the times t1 to t4 (timing pulses), the first 4 bits of the VOL information, referred to as tv1 to tv4, are transmitted to the multiplier circuit MULT through line A.

The information of the filter set is so stored in the delay-line storage VOL that 50 scanning points are described by 16 frequency values each (16 aggregate channels) of 4 bits. The information arriving first is the frequency value f1 of the scanning point τ1 (tv1 to tv4), followed by f2 to f16 of the scanning point τ1. Then the frequency value f1 of the scanning point τ2 occurs, and finally f16 of the scanning point τ50 (tv3589 to tv3592). Every 64 bit values are followed by 8 blanks, enabling a joint time pattern with the delay-line storages VL1 to VL50.

FIG. 2 shows that the delay-line storage VOL, in which the filter information is stored, consists in the main of three partial delay-line storages VZ1 to VZ3 connected in series. The total number of bits which can be stored in this arrangement is identical to that of the delay-line storage AGL, in which the values of 16 aggregate channels for 50 speech lines are stored. The two delay-line storages have a capacity of 3609 bits.

The bits are circulated in the delay-line storage arrangement VOL as hereafter described.

The bit tv at the time t is on the output of the delay line VZ2, circulating in the latter in the same manner as the succeeding bits tv to tv. The bit tv at the time t is written into the delay line VZ3 as the first information, appearing on the output of this line, that means on line A, at the time t. The last bit of the filter description, at the time t, arrives on the input of the delay line VZ3. This bit is immediately succeeded, at the time t, by the bit tv from the delay line VZ2. In between the times t and t no bits are transferred to the input of the delay line VZ3 (9 blanks). [The subscripts of the bit names tv and of the times t in this paragraph are not legible in the source text.]

These processes are controlled by the AND gates U2 to U4 and the OR gates O1 to O3. The timing signals T required for controlling the processes are shown in detail in FIG. 2.

After a period of 3609 t the information appears in the storage arrangement VOL shifted by 72 bits, that means on line A from the time t1 the frequency value f1 of the scanning point τ2 occurs, whereas the last bits of the series, that means tv3589 to tv3592, describe the frequency value f16 of the scanning point τ1.

Although for the purpose of obtaining this shift 3600 delay elements would be sufficient, 9 blanks are inserted so that the shortest joint period of VOL and AGL is 180,450 t. The delay line VZ1, comprising 9 bits, is for clarity's sake shown separate from the delay line VZ3. When changing the time pattern, it can be realized as an extension of the 3528-bit line to 3537 bits.
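The effect of the 72-bit shift can be modelled abstractly: in each 3609 t cycle the unshifted AGL presents the 50 speech lines in order, while VOL starts one scanning point later than in the previous cycle. The sketch below assumes that the block slot alone determines the speech line and that the shift advances by exactly one scanning point per cycle; the function name `pairings` is hypothetical.

```python
from typing import List, Tuple

LINES = 50
POINTS = 50


def pairings(cycle: int) -> List[Tuple[int, int]]:
    """(speech line, scanning point) pairs multiplied during one cycle.

    cycle counts 3609 t periods from 0. In cycle 0, line j meets scanning
    point j; each later cycle advances VOL by one scanning point (assumed).
    """
    return [(line, (line - 1 + cycle) % POINTS + 1)
            for line in range(1, LINES + 1)]


# Over 50 cycles (180,450 t) every line meets every scanning point once.
seen = {pair for cycle in range(POINTS) for pair in pairings(cycle)}
```

Under these assumptions `seen` contains all 2,500 combinations, which is consistent with the joint period of 50 × 3609 t = 180,450 t given in the text.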

Similar to the delay-line storage VOL, the aggregate function for 50 speech lines circulates in the delay-line storage AGL, each speech line being described by 16 frequency values of 4 bits each. This information, in contrast to the storage VOL, is not shifted. At the times t1 to t4 the information concerning the frequency value f1 of the first speech line reaches line B and, consequently, through the series-to-parallel converter SPW2, the multiplier circuit MULT. Every 3609 t, that means every 0.8 msec., the full description of a value of the aggregate function of 50 speech lines is available. After 40.1 msecs., in accordance with the slow change in the aggregate function, it is replaced by new values. These new values are transmitted, through the line D, from the data processor, via the AND gate U6 and the OR gate O4, to the delay-line storage AGL. This information is enabled to circulate with the help of the AND circuit U7 and the OR circuit O4 in conjunction with the timing pulse T. As shown in FIG. 1, every 40.1 msecs. the values of 50 speech channels of 16 × 4 bits each, separated by 8 blanks, are transmitted from the data processor EDP.

The description circulating in the storage VOL represents, as already mentioned, the scanning values of the transient, this means the response to the standard excitation pulse sampled at 50 instants of time for the 16 filters of a vocoder aggregate filter set.

In an orthodox vocoder a standard excitation pulse having the same magnitude is applied to all 16 filters, their output functions being multiplied by the respective amplitude values of the aggregate function for a speech line and added. In the arrangement of the invention the same effect is obtained by means of the time multiplex method for 50 speech lines. The sum of the filter responses, as in the case of a simple vocoder, is added. However, the sum for each line is described quantized by 50 scanning values of 8 bits each.

TABLE 2 [Values for (a) the speech line 1, (b) the filters 1 to 6, and the sum of the products a × b for speech line 1/1]
Rows: Filter f1 to f6; Line 1 f1 to f6; Product. Sum Σ = 27, channel 1/1. [Individual table entries not reproduced in this text.]

TABLE 3 [Values for (a) the speech line 2, (b) the filters 1 to 6, and the sum of a × b for line 2/12]
Rows: Filter f1 to f6; Line 2 f1 to f6; Product. Sum Σ = 36, channel 2/12. [Individual table entries not reproduced in this text.]

Tables 2 and 3, by way of an example, show the distribution of the values for the first and second speech line and for the first 6 filters, as well as the sum of the values for the speech line 1 and for the speech line 2 at their respective scanning points (channels 1/1 and 2/12). In the examples of the two tables, 4 bits each of the filter description stored in VOL and of the aggregate function stored in AGL are represented by a decimal digit 0-7. Apart from this, for simplifying the examples and for rendering them more readily understandable, it is assumed that instead of 16 frequency values of a channel only 6 are described. It is, furthermore, assumed that the description includes no blanks.

For each scanning point τν (ν = 1 ... 50) a result (Σ) comprising 8 bits is formed on the output of the adder AD within the time pattern T = (ν - 1) × 72 t.

Each result constitutes a scanning value at the scanning time τν of a line and is transferred to the delay-line storage VLi corresponding to the line.

The output signal of the delay line VLi, which is transferred to the associated decoder DECi, is shown in FIG. 3, which explains in particular the conditions for the line 1. Parallel scanning values occur on the input AG1 on 8 lines at the times t(65 m × 50 × 72). These values are taken up by a static storage BR and, shifted by 9 t, written into the delay line VL1 through the OR gate O5. The signal written in circulates in the loop formed by VL1 and VL'1 at a period of 450 t, the control signal TIN so controlling the processes that during the writing of new information the information in the loop is suppressed. In this manner the AND gate U8 prevents the first pulse of a group of 9 pulses from being transmitted. This position is reserved for a control pulse.

At the time determined by the excitation function, a signal appears on the input EX1, and a scanning process is initiated during which the pulse of DU initially written into the first position of VL1 is shifted by 9 t at a period of 450 t. This pulse, rather than being transferred through the partial delay line VL'1, is written back through O6, U9, DLY, FF, U and O5 via a loop shorter by 9 bits. After the pulse has been shifted by 50 blocks and prior to the pulse being rewritten into block 1, this process is terminated by a delay arrangement TF. The times at which a pulse occurs on EX1, every 5 msec. or at times tt (1 + n × 450), coincide with the occurrence of the first of the 50 pulse groups (tt1, period 450 t), so that EX1 invariably releases the same series of pulse groups on the output of the delay-line storage VL1.

The succeeding table 4 shows the distribution of the pulse groups and is a survey of the principal timing pulses required for controlling the delay-line storages.

TABLE 4. Period of VL1 to VL50 = 450 t = 0.1 msec. Spacing of pulse groups = 0.1 msec.

The control switch STS so controls the arrangement that the pulse groups reach the output of the delay-line storage only when a pulse from EX1 circulates in the line VL1 and when this pulse, bypassing VL'1, is shifted from block to block.

The synchronization in time of AG and EX for all 50 lines is ensured by the pulse on EXm occurring at the time at which the first scanning value for line m appears on the line input. As can be seen from the division of the speech lines of table 1, a shift by 7 positions of the first scanning value, that means 56 t, occurs between two neighboring lines. For maintaining the synchronization with EX, the selective switch S in FIG. 1 is so designed that the channel sequence corresponds to the sequence of the first scanning value in table 1 (that means EX1, EX8, EX15 etc.). Similarly, the time function TIN for controlling the writing into the inputs AGn of the lines n must be staggered by 72 t, switch S of FIG. 1 advancing according to the order of speech lines 1, 2 ... 50.
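The channel sequence EX1, EX8, EX15 etc. can be generated by stepping through the 50 lines in increments of 7. The sketch assumes that the sequence wraps around modulo 50 after the last line, which the text does not state explicitly; the function name `excitation_order` is hypothetical.

```python
from typing import List


def excitation_order(lines: int = 50, step: int = 7) -> List[int]:
    """Order in which the excitation channels are served (assumed wraparound)."""
    return [(i * step) % lines + 1 for i in range(lines)]


order = excitation_order()
```

Because 7 and 50 share no common factor, this sequence visits each of the 50 channels exactly once before repeating.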

While the invention has been particularly shown and described with reference to a preferred embodiment thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. Method of vocoder speech synthesis of speech data, said speech data derived from n filters and includes speech excitation signals, stored in a data processor, comprising the steps of: storing the description of the transient behavior of each of said n filters for simulating said n filters, said transient behavior for each of said n filters being described by the discrete values at k scanning points of said transient behavior; multiplying said description of the transient behavior of each of said n filters, said description consisting of said discrete values of k scanning points, by the speech data associated with the corresponding filter of said n filters; summing the products associated with each scanning point across all said n filters; storing said summations; calling and decoding said summations at a time dictated by said speech excitation signals; and for converting the speech data into a form capable of synthesizing speech.

2. Method of vocoder multiplex speech synthesis of speech data stored in a data processor, the speech data representing a plurality of speech lines, where the speech data for each speech line is derived from n filters and includes speech excitation signals, comprising the steps of: storing only once the description of the transient behavior of each of said n filters for simulating said n filters, said transient behavior for each of said n filters being described by the discrete values at k scanning points of said transient behavior; multiplying said description of the transient behavior of each of said n filters, said description consisting of said discrete values at k scanning points, by the speech data associated with the corresponding filter of said n filters for each of the plurality of speech lines; summing the product associated with the same scanning point across all said n filters for each of said plurality of speech lines; distributing said summation to a plurality of storage means, one storage means as associated with each speech line; storing said summation associated with each speech line in the storage means of said plurality of storage means associated with that speech line; calling and decoding said summations at a time dictated by said speech excitation signals associated with said plurality of speech lines; and for converting the speech data associated with each speech line into a form capable of synthesizing speech and presenting said converted speech data on corresponding output speech lines.