GB2130852A

GB2130852A - Speech signal reproducing systems

Info

Publication number: GB2130852A
Application number: GB8330820A
Authority: GB
Inventors: Tad Weng Chong; Angela Druckman; Michael John Shearme
Original assignee: General Electric Co PLC
Current assignee: General Electric Co PLC
Priority date: 1982-11-19
Filing date: 1983-11-18
Publication date: 1984-06-06
Also published as: GB2130852B; GB8330820D0

Abstract

In an L.P.C. speech synthesizer arrangement in which a reproduced-speech pitch period may be longer than the frame period, the filter coefficients are arranged to be updated either pitch-synchronously or at least a predetermined time after the commencement of a pitch period, to avoid "thumping" in the reproduced speech.

Description

SPECIFICATION

Speech signal reproducing systems

The present invention relates to speech signal reproducing systems and to speech signal synthesizers for such systems.

Speech signal reproducing systems are presently being developed in which electric signals representing a speech utterance are analysed, in respect of each of a succession of time intervals or frames of, say, 20 milliseconds duration, to derive successive sets of parameters representing a model of the vocal tract of the speaker producing that utterance.

The parameters in respect of each frame may comprise for example the pitch period, a voiced/unvoiced apportionment or mixing parameter, a measure of the total energy of the speech signal during the frame, and a set of twelve coefficients, all these parameters being conveyed by a total of, say, eighty binary digits.

The speech signals may then be fairly reproduced by means of a synthesizer arrangement comprising a recursive filter excited by a series of pulses separated by the pitch period for voiced sounds and by pseudo-random noise for unvoiced sounds, or more generally by a mix of the two, the set of twelve coefficients being utilised to determine, say, the reflection coefficients of the recursive filter. The synthesizer arrangement is commonly a digital arrangement, the initial and reproduced speech signals being of the form of linear or compression-law coded PCM speech.
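By way of illustration only, a recursive filter controlled by reflection coefficients may be sketched as an all-pole lattice. The function name `lattice_synthesize`, the sign convention and the zero initial state are assumptions of this sketch and are not taken from the specification, which does not detail the filter structure.

```python
from typing import List, Sequence


def lattice_synthesize(excitation: Sequence[float],
                       k: Sequence[float]) -> List[float]:
    """All-pole lattice synthesis filter driven by reflection coefficients k.

    Assumed convention (hypothetical, one of several in use):
        f[m-1] = f[m] - k[m] * b[m-1] (previous sample)
        b[m]   = k[m] * f[m-1] + b[m-1] (previous sample)
    """
    order = len(k)
    # b[m] holds the backward signal of stage m from the previous sample.
    b = [0.0] * (order + 1)
    output = []
    for sample in excitation:
        f = sample
        # Work from the last stage down to the first so that each b[m-1]
        # is read before it is overwritten for the current sample.
        for m in range(order, 0, -1):
            f = f - k[m - 1] * b[m - 1]
            b[m] = k[m - 1] * f + b[m - 1]
        b[0] = f
        output.append(f)
    return output


# First-order example: y[n] = e[n] - 0.5 * y[n-1]
impulse_response = lattice_synthesize([1.0, 0.0, 0.0], [0.5])
```

In the arrangement described, `k` would hold the twelve coefficients of the current parameter set; the filter is stable under this convention when every coefficient has magnitude less than one.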

This "parametric" form of coding, known as linear prediction coding, enables speech signals to be represented with acceptable quality on reproduction by a series of binary digits generated at bit rates of 4 Kbits/sec or less. Where better quality reproduction is required and higher bit rates are tolerable, the original speech signals may be analysed in overlapping 20 msecond blocks to derive updated parameters, say, every 10 mseconds.

Since the pitch period of a speech signal typically may vary between 2 and 17 mseconds a timing problem exists at the synthesizer, particularly when the parameters are made available for updating every 10 milliseconds.

According to one aspect of the present invention in a speech signal reproducing system in which a speech signal synthesizer includes means for substantially reproducing at least the pitch and the vocal tract characterisation of an original speech signal, during each of a succession of regular time intervals or frames of predetermined duration, in dependence upon respective pulse coded signals which represent said pitch and said characterisation and which are made available to said synthesizer at the commencement of each said interval, the pulse coded signals utilised by the pitch reproducing means during a time interval are updated to match those made available in respect of that interval at the first commencement of a pitch period after the commencement of said interval, and the pulse coded signals utilised by the vocal tract characterisation means during that interval are updated either contemporaneously with the updating of the pitch-representing signals or after a predetermined delay time from the commencement of said time interval, whichever is sooner.

According to another aspect of the present invention in a speech signal reproducing system in which a speech signal synthesizer includes means for substantially reproducing at least the pitch and the vocal tract characterisation of an original speech signal, during each of a succession of regular time intervals or frames of predetermined duration, in dependence upon respective pulse coded signals which represent said pitch and said characterisation and which are made available to said synthesizer at the commencement of each said interval, there are provided means to generate a signal to mark a predetermined delay time after the commencement of each pitch period of the reproduced signal, the pulse coded signals utilised by the pitch reproducing means during a time interval being updated to match those made available in respect of that interval at the first commencement of a pitch period after the commencement of said interval, and the pulse coded signals utilised by the vocal tract characterisation means being updated either at the commencement of the respective interval in the absence of a delay time marking signal, contemporaneously with the updating of the pitch-representing signals if said updating occurs in the presence of said delay time marking signal, or at the termination of said delay time marking signal if no pitch updating has yet taken place in that interval.

The pulse coded signals which represent the pitch and vocal tract characterisation of the original speech signal and which are made available at the commencement of each time interval may themselves be updated at different rates. For example the pitch-representing signals may be updated at the commencement of each time interval while the vocal tract characterisation signals may be updated at the commencement of alternate time intervals. The time intervals may be of 10 msec. duration.

A speech signal reproducing system in accordance with the present invention will now be described with reference to the accompanying drawings, of which Figures 1A and 1B and Figures 2A and 2B show timing diagrams illustrating respective modes of operation of the system.

In the established method of analysing speech signals using linear prediction coding (LPC) each 20 msecond segment of speech signal, represented, say, by 160 PCM coded amplitude samples, is analysed to derive values for fifteen parameters represented by a total of some 80 binary digits. These LPC parameter values can be conveyed to a synthesizer arrangement as a stream of binary digits at a bit rate of 4 Kbits/sec.

For better quality speech reproduction the analysis can be carried out on "overlapping" 20 msecond segments with all parameters being updated and transmitted afresh every 10 msecs, giving a bit rate of 8 Kbits/sec.
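The bit rates quoted above follow directly from the figure of 80 binary digits per parameter set. The short sketch below reproduces that arithmetic; the function name `bit_rate_bps` is hypothetical, and the 8 kHz sampling rate is taken from the sample interval referred to elsewhere in this description.

```python
def bit_rate_bps(bits_per_set: int, update_interval_ms: float) -> float:
    """Bits per second when one parameter set is sent every update interval."""
    return bits_per_set * 1000.0 / update_interval_ms


# One 80-bit parameter set per 20 ms frame.
rate_20ms = bit_rate_bps(80, 20)

# Overlapping analysis: one 80-bit parameter set every 10 ms.
rate_10ms = bit_rate_bps(80, 10)

# Number of PCM samples in a 20 ms segment at 8 kHz sampling.
samples_per_segment = 8000 * 20 // 1000
```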

At the synthesizer twelve of the parameter values of each set of fifteen are applied as coefficient values in a recursive filter. One of the remaining three values, representing the pitch period in respect of a voiced sound, is arranged to give a unit positive pulse excitation to the input of the filter at the beginning of each pitch period and small negative pulses at subsequent 8 KHz sample intervals in order to give a zero mean excitation value. Thus at the beginning of a pitch period the excitation values in the filter are large but subsequently tend to decrease. At the same time, when the twelve coefficients are updated the excitation values within the filter do not in general correspond with the new filter coefficients, since these values have been generated in accordance with the preceding coefficients, and a number of sample intervals have to pass before the change is completed.
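The zero mean voiced excitation described above may be sketched as follows. The specification states only that the negative pulses are small and that the mean is zero; making all negative pulses equal is an assumption of this sketch, as is the function name `voiced_excitation`.

```python
from typing import List


def voiced_excitation(pitch_period_samples: int) -> List[float]:
    """One pitch period of excitation: a unit pulse, then small negative pulses.

    Assumption: the negative pulses are all equal, each -1/(N-1), so that the
    N samples of the period sum to zero.
    """
    if pitch_period_samples < 2:
        raise ValueError("pitch period must span at least two samples")
    negative = -1.0 / (pitch_period_samples - 1)
    return [1.0] + [negative] * (pitch_period_samples - 1)


# A 10 ms pitch period at 8 kHz sampling is 80 samples long.
period = voiced_excitation(80)
```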

If the excitation values within the filter are large when the coefficients are updated the changeover will be audible in the resulting synthesized speech, and for this reason the updating is best carried out when the excitation values are small, preferably at the end of each pitch period. With this method of updating, which is referred to as pitch-synchronous updating, where the pitch period exceeds the frame interval, as in the case of a low-pitched voice with parameters being updated every 10 mseconds, a whole set of parameter values has to be discarded whenever a pitch period extends from one frame interval into the next but one, and the quality of the reproduced speech suffers accordingly.

In order to overcome this loss of quality, in the present arrangement the twelve coefficients are arranged to be updated within each frame interval either pitch-synchronously if the pitch period is less than a predetermined time or at least that predetermined time after the commencement of a pitch period if that pitch period is greater than the predetermined time. In the latter case the twelve coefficients may be updated either at the commencement of a frame interval or at the predetermined time after the commencement of a frame interval. The predetermined time may be, for example, of the order of 5 mseconds.

The three remaining parameters, determining the pitch period, gain and the voiced/unvoiced mix are updated pitch-synchronously at the commencement of the first pitch period in a frame interval.

Referring now to Figures 1A and 1B, in a first method of carrying out this conditional updating a counter (not shown) is set at the commencement of each frame to count 8 KHz sample intervals up to a total count x. If within this count period a pitch boundary occurs then the twelve coefficients are updated pitch-synchronously, as shown in Figure 1A, where PA and CA represent the pitch and coefficient values in respect of frame A and so on.

If a pitch boundary does not occur within the count period x, as shown in respect of frames B, C and E of Figure 1B, then the updating of the twelve coefficients is arranged to take place at the end of the count period.
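The first method may be sketched as a function returning the sample instant at which the twelve coefficients are updated within a frame. The function name, the representation of pitch boundaries as a list of sample indices and the default count of 40 are assumptions made for illustration.

```python
from typing import Iterable


def coefficient_update_method1(frame_start: int,
                               pitch_boundaries: Iterable[int],
                               x: int = 40) -> int:
    """Sample index at which the coefficients are updated (first method).

    A counter started at the frame boundary counts sample intervals up to x.
    The update is pitch-synchronous if a pitch boundary falls within the
    count period, and otherwise occurs when the count reaches x.
    """
    deadline = frame_start + x
    for boundary in sorted(pitch_boundaries):
        if frame_start <= boundary < deadline:
            return boundary  # pitch-synchronous update
    return deadline  # no pitch boundary within the count period


# Frame starts at sample 80; a pitch boundary at sample 100 lies within x.
synchronous = coefficient_update_method1(80, [60, 100, 140])

# Long pitch period: no boundary between samples 80 and 120.
delayed = coefficient_update_method1(80, [70, 190])
```

In both cases the pitch, gain and voiced/unvoiced mix parameters would still be updated at the first pitch boundary in the frame; only the coefficient update instant is modelled here.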

In an alternative method indicated in Figures 2A and 2B a counter (not shown) is set at each pitch boundary to count 8 KHz sample intervals up to a total count y. If a frame boundary occurs during this count then updating of the twelve coefficients takes place when the counter is reset, either when the next pitch boundary occurs, as shown in Figure 2A, or when the count y is reached, as shown in Figure 2B, frames B and C. If at a frame boundary the counter is not running, as in Figure 2B, frames D and E, then the twelve coefficients are updated at the frame boundary, that is, frame-synchronously.
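The alternative method may be sketched in the same way. Here the counter is tied to pitch boundaries rather than frame boundaries. The function name, the list representation of pitch boundaries, the default count of 40 and the treatment of a pitch boundary that coincides exactly with the frame boundary (taken as an immediate update) are assumptions of this sketch.

```python
from typing import Iterable


def coefficient_update_method2(frame_start: int,
                               pitch_boundaries: Iterable[int],
                               y: int = 40) -> int:
    """Sample index at which the coefficients are updated (second method).

    A counter is set at each pitch boundary and counts sample intervals up
    to y. If it is running at the frame boundary, the update waits until the
    counter is reset, at the next pitch boundary or at count y, whichever
    comes first. If it is not running, the update is frame-synchronous.
    """
    boundaries = sorted(pitch_boundaries)
    earlier = [b for b in boundaries if b < frame_start]
    if not earlier:
        return frame_start  # counter has never been started
    count_end = earlier[-1] + y
    if frame_start >= count_end:
        return frame_start  # counter not running: frame-synchronous
    later = [b for b in boundaries if b >= frame_start]
    if later and later[0] < count_end:
        return later[0]  # counter reset by the next pitch boundary
    return count_end  # counter reaches y first


at_next_pitch = coefficient_update_method2(80, [60, 90])
at_count_y = coefficient_update_method2(80, [60, 200])
at_frame_boundary = coefficient_update_method2(80, [20, 200])
```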

Both of these methods ensure that each set of twelve coefficients is actually used for most of the frame period for which it is intended, with the updating pitch-synchronous where possible but where this is not possible at lower than peak excitation levels. In general the counts x and y are expected to be the same, and of the order of 40, that is, equivalent to a time period of 5 mseconds.
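The relation between counts and time periods at the 8 KHz sample rate may be checked with a short calculation. The helper names `count_to_ms` and `ms_to_count` are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # sample rate referred to in the description


def count_to_ms(count: int) -> float:
    """Duration in milliseconds of a given number of sample intervals."""
    return count * 1000.0 / SAMPLE_RATE_HZ


def ms_to_count(duration_ms: float) -> int:
    """Number of sample intervals in a given duration."""
    return round(duration_ms * SAMPLE_RATE_HZ / 1000.0)


delay_ms = count_to_ms(40)          # the count x or y expressed as a time
frame_samples = ms_to_count(10)     # a 10 ms frame interval
longest_pitch = ms_to_count(17)     # a 17 ms pitch period
```

A 17 msecond pitch period spans more sample intervals than a 10 msecond frame, which is the situation the conditional updating is intended to handle.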

Where the channel capacity between the analyser and synthesizer is limited it has been found that compared with the results obtained by transmitting all fifteen parameters every 20 mseconds a significant improvement in reproduced speech quality can be obtained by transmitting the pitch, gain and voiced/unvoiced mix parameters every 10 mseconds while transmitting the twelve coefficients only every 20 mseconds. The methods described above for conditional updating at the synthesizer ensure that the best possible use is made of the coefficient values once they are received.

Speech synthesizers to which this updating technique may be applied were described, for example, by Atal and Hanauer in The Journal of the Acoustical Society of America, Volume 50, 1971, pages 637 to 655.

Claims

1. A speech signal reproducing system in which a speech signal synthesizer includes means for substantially reproducing at least the pitch and the vocal tract characterisation of an original speech signal, during each of a succession of regular time intervals or frames of predetermined duration, in dependence upon respective pulse coded signals which represent said pitch and said characterisation and which are made available to said synthesizer at the commencement of each said interval, wherein the pulse coded signals utilised by the pitch reproducing means during a time interval are updated to match those made available in respect of that interval at the first commencement of a pitch period after the commencement of said interval, and the pulse coded signals utilised by the vocal tract characterisation means during that interval are updated either contemporaneously with the updating of the pitch-representing signals or after a predetermined delay time from the commencement of said time interval, whichever is sooner.

2. A speech signal reproducing system in which a speech signal synthesizer includes means for substantially reproducing at least the pitch and the vocal tract characterisation of an original speech signal, during each of a succession of regular time intervals or frames of predetermined duration, in dependence upon respective pulse coded signals which represent said pitch and said characterisation and which are made available to said synthesizer at the commencement of each said interval, wherein there are provided means to generate a signal to mark a predetermined delay time after the commencement of each pitch period of the reproduced signal, the pulse coded signals utilised by the pitch reproducing means during a time interval being updated to match those made available in respect of that interval at the first commencement of a pitch period after the commencement of said interval, and the pulse coded signals utilised by the vocal tract characterisation means being updated either at the commencement of the respective interval in the absence of a delay time marking signal, contemporaneously with the updating of the pitch-representing signals if said updating occurs in the presence of said delay time marking signal, or at the termination of said delay time marking signal if no pitch updating has yet taken place in that interval.

3. A speech signal reproducing system in accordance with Claim 2 wherein the pulse coded signals which represent the pitch and vocal tract characterisation of the original speech signal and which are made available at the commencement of each time interval are themselves updated at different rates.

4. A speech signal reproducing system in accordance with Claim 3 wherein the pitch-representing signals are updated at the commencement of each time interval while the vocal tract characterisation signals may be updated at the commencement of alternate time intervals.

5. A speech signal reproducing system in accordance with any preceding claim wherein the time intervals are of 10 milliseconds duration.

6. A speech signal reproducing system substantially as hereinbefore described with reference to Figures 1A and 1B of the accompanying drawings.

7. A speech signal reproducing system substantially as hereinbefore described with reference to Figures 2A and 2B of the accompanying drawings.