GB2266213A - Digital signal coding - Google Patents
- Publication number
- GB2266213A (application GB9307485A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- segment
- frame
- variable duration
- speech
- signal
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/032—Quantisation or dequantisation of spectral components
- G10L19/038—Vector quantisation, e.g. TwinVQ audio
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/06—Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
Abstract
The present invention seeks to use a novel combination of techniques - primarily vector quantisation and variable duration segments with sophisticated segment models - to provide acceptable speech quality even at data rates as low as 600 bpss. It proposes that the signal be described in terms of a sequence of concatenated variable duration segments, where each segment is described by part of the digital data sequence, giving advantages based on better descriptions of variable duration segments that are longer on average, and on considering very large numbers of possible sequences of segments of various different lengths and choosing the best sequence from all those considered.
Description
Digital Signal Coding

The present invention relates to digital signal coding, and concerns in particular methods of deriving a digital stream of data that represents an analogue signal waveform.
Digitising an analogue signal waveform - for instance, that representing speech - may be of benefit in many ways. For example, digital data can be transmitted long distances by the use of relay stations which, knowing the data is digitised, can reconstitute it exactly and without cumulative noise and cumulative errors being added to the signal by each relay stage (this is particularly important for trunk and international telephone links). Again, digitised data can be encrypted very securely to reduce the benefit of eavesdropping (though of especial interest in military and other government applications, encryption is becoming of more interest to commercial users).
Finally, signals in digitised form can be stored in and manipulated by computers and other electronic systems more easily than can signals in analogue form.
There are, however, problems with using digitised data. Thus, the bit rate necessary for the original analogue signal to be reconstituted sufficiently accurately may be so high as to exceed the storage capacity of the equipment employed and the bandwidth of the transmission channels utilised. Then again, the increasing load to analyse the signal, or to reconstitute it in analogue form, may well exceed the capability of the available equipment, or at the very least increase its cost beyond acceptable limits.
In digital signal coding, the original analogue signal waveform is analysed to produce a stream of digital data from which there can in due course be constructed an analogue signal waveform that approximates to the original analogue signal. The degree of approximation is often referred to as the coding error, and the way in which the coding error is defined varies from application to application. In speech coding, for instance, two main methods of defining the coding error are used. One is waveform coding error, which is usually defined as the ratio of the power in the difference signal (i.e., that obtained by subtracting the original signal from the reconstructed signal) to the power in the original signal, while the other is perceptual coding error, which measures how close to the original signal the reconstructed signal is perceived to be by human listeners.
A general description of speech coding techniques is given in the book "Speech Synthesis and Recognition" by J N Holmes, published by Van Nostrand Reinhold in the UK in 1988, reference ISBN 0-278-00013-4. Holmes classifies speech coders into three main classes, namely waveform coders, vocoders (voice coders) and intermediate coders, and the present invention - at least, when applied to speech - relates to vocoders.
Vocoders (voice coders) separately model the energy sources of the signal and the subsequent filtering effects that are an intrinsic part of the signal generation process; this is called source/filter modelling. In vocoders, the perceptual coding error is made as small as practical and the waveform coding error could be very large. Vocoders are not particularly suitable for coding signals that are not speech-like.
Generally, vocoders produce smaller digital streams of data using more complex analysis procedures than do other classes of speech coder.
Vocoders consist of two main parts, these being the analyser, which accepts the electrical signal from a microphone supplied with an acoustic speech signal and processes it into a digital stream of data, and the synthesizer, which takes that digital stream of data and processes it into an electrical signal suitable for driving a loudspeaker, etc, to reproduce the original acoustic speech signal. The vocoder analyser is usually more complex and more expensive to manufacture than the vocoder synthesizer.
The three main features of vocoders that are important in their application are the data rate, the output speech quality, and the equipment cost.
The data rate is the amount of digital data required to represent a particular amount of speech. It is usually measured as the number of bits (of binary data) per second of speech (or bpss), and is important as it determines both the load on the communications channel used to transmit the digitised speech and also the amount of memory required to store the speech. The speech quality determines how intelligible the speech is, or how pleasant or natural it sounds, after the vocoder synthesizer has reconstituted the speech signal from the data stream, and hence how acceptable the vocoder is in certain applications (trained military radio operators can tolerate and use much lower quality speech than typical telephone users).
As to cost, in general it is greater with methods that provide better quality speech and/or lower data rates. Only when the combination of data rate, speech quality and equipment cost is correct will a particular vocoder find useful application.
The essence of the vocoder is that it models the speech signal as coming from a signal generator consisting of two main parts, these being the energy source, where the sound is begun, and the vocal tract, which modifies the sound by a filtering effect as it passes therethrough.
For vowel sounds such as "ah", the energy source is vibration of the glottal folds in the larynx. For fricative sounds, the energy source is turbulent airflow through a constriction in the vocal tract (for example, in "s" the constriction is between the tip of the tongue and the alveolar ridge, which is just behind the upper front teeth). For plosive sounds such as "p" and "t", the energy source is the sudden release of air under pressure which has been built up behind a total obstruction of the vocal tract (in the case of "p" this is the closed lips).
Some speech sounds have mixed energy sources. For example: for "z" there is both vibration of the glottal folds and turbulent airflow; for "b" and "g" there is both vibration of the glottal folds and a sudden release of pressure. In some languages, other energy sources are important, such as the implosive or click sounds.
The vocal tract consists of the oral tract and nasal cavity. As the sound from the energy sources passes through these its characteristics are changed.
The most important effect is the filtering which causes some frequencies to be attenuated more than others; other effects include energy absorption by the lungs and the lining of the vocal tract. The filtering characteristics are changed in various ways: for example, change of length of the vocal tract (by raising and lowering the larynx and retracting or protruding the lips), changing the cross-sectional area of the vocal tract along its length (by moving the position of the tongue), connecting the nasal cavity to a greater or lesser extent (by raising or lowering the velum), spreading or rounding the lips, etc.
Even though the amplitude of the speech signal changes rapidly, the characteristics of the energy source and vocal tract filter change much more slowly.
By measuring these characteristics at an appropriate rate (typically 50 to 100 times per second) and transmitting that information (usually as a digital data stream) to the vocoder synthesizer, an analogue signal waveform can be produced which is perceived as an approximation to the original speech signal. And because the source/filter characteristics change relatively slowly, and thus need to be measured relatively infrequently, the information when digitised produces a relatively low bit rate, making it easier to store, transmit and process the data.

The short periods (typically around 0.02 seconds) for which the source and filter information is coded are usually called frames, and frame by frame the input signal is analysed to determine a suitable code defining its energy source and the vocal tract filter. That part of the data stream relating to the energy source, usually referred to as the side information or the excitation information, defines the combination of energy sources and their relative powers, the frequency of vibration of the glottal folds (the pitch) if they are vibrating, and the total power from the energy sources (although often this is taken as part of the description of the vocal tract filter). The first of these nearly always consists only of the proportion of energy from vibration of the glottal folds and from frication (the degree of voicing), and in many vocoders it is assumed that only one energy source is in operation at any one time (i.e., all energy is from the larynx or all is from frication).
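The framing step described above can be sketched in a few lines. This is an illustrative sketch only, not taken from the patent; the 8 kHz sample rate and 0.02 s frame duration are assumed values chosen to match the typical figures quoted in the text.

```python
# A minimal sketch of splitting a digitised signal into fixed-duration
# analysis frames. 8 kHz and 0.02 s are illustrative assumptions.

def split_into_frames(samples, sample_rate=8000, frame_duration=0.02):
    """Return a list of frames, each a list of consecutive samples."""
    frame_len = int(sample_rate * frame_duration)  # 160 samples per frame
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# One second of signal at 8 kHz yields 50 frames of 160 samples each.
signal = [0.0] * 8000
frames = split_into_frames(signal)
assert len(frames) == 50 and len(frames[0]) == 160
```

Each of these frames would then be analysed for its source and filter characteristics as described below.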
As to analysing and describing the vocal tract filter, several methods are in common use, though in each case the description includes an approximation to the short-term power spectrum of the speech, ignoring the fine harmonic structure introduced by the vibration of the glottal folds. The three main methods result in what are known as a channel vocoder, a linear predictive coding (LPC) vocoder, and a formant vocoder; the present invention relates mainly, though not exclusively, to use of the latter.
In a channel vocoder the vocal tract is described as a bank of bandpass filters, arranged in parallel and covering the frequency range of speech. The coded information is the amount of amplification to apply to each bandpass filter (channel) in the vocoder synthesizer. In a linear predictive coding vocoder the vocal tract is modelled as an all-pole filter. One interpretation of this is that the vocal tract is assumed to be a single tube consisting of a small number (typically 10 to 20) of equal-length sections, each of different uniform cross-section, and joined together. The coded information consists of the cross-sectional area of each of the sections of the tube. The LPC vocoder has found favour because of its low cost of manufacture and reasonably good speech quality.
Finally, in a formant vocoder there is made use of the well known fact that, for most speech sounds, the vocal tract has a small number of main resonances called formants. In a formant vocoder the coded information consists of the centre frequency and amplitude of each of the formants. Usually, information on 3 to 5 formants is coded.
Having found the source and filter description for a frame, these are quantised to give the coded data.
This is usually done either by a scalar quantisation method, in which each type of value to be coded (e.g., the degree of voicing, the pitch, the amplitude of each channel in a channel vocoder) is approximated by the index number of the nearest entry in a fixed table for that type of value, or by a vector quantisation (VQ) method, in which many of the types of values to be coded are approximated together by the index number of the nearest entry in a fixed table (usually called a VQ codebook) of vectors, where each dimension of the vectors represents one type of value. The VQ approach, used in the method of the invention, is more efficient in its allocation of binary bits of the coded data than is scalar quantisation. This means that the coded data rate should be lower for the same speech quality than for scalar quantisation (although against this the cost of the vocoder analyser is usually greater). Usually, VQ is applied only to the filter values, and scalar quantisation is applied to the side information.
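The nearest-entry lookup at the heart of vector quantisation can be sketched as follows. This is a generic illustration, not the patent's implementation: the two-dimensional codebook contents are invented, and squared Euclidean distance is assumed as the nearness measure.

```python
# Hedged sketch of vector quantisation: a frame's filter values (a
# vector) are replaced by the index of the nearest codebook entry.
# The codebook and the Euclidean distance measure are assumptions.

def vq_quantise(frame_vector, codebook):
    """Return the index of the codebook entry nearest to frame_vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: sq_dist(codebook[i], frame_vector))

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
assert vq_quantise((0.9, 0.1), codebook) == 1  # nearest to (1.0, 0.0)
```

Only the index (here, 1) needs to be transmitted, which is why VQ allocates coded bits more efficiently than quantising each value separately.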
For low data rate vocoders additional techniques have been used to model the speech over a series of frames. A useful review of these techniques is given in "A Survey of Low Bit Rate Vocoders" by C J Jaskie and B Fette, published in the Official Proceedings of Voice Systems Worldwide 1992. These techniques include superframe coding, variable frame rate (VFR) coding, matrix quantisation (MQ), and phonetic or segment vocoders. The present invention makes use of a modified form of the second of these, variable frame rate coding, which is a technique in which only one out of every few frames is transmitted, but which frame is chosen depends upon the original speech signal (such that a frame is transmitted when the speech signal changes), and at the vocoder synthesizer the "missing" data between the received frames is filled in by, for example, linear interpolation or replication.
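The "fill in by linear interpolation" step of variable frame rate coding can be sketched as below. This is an illustrative sketch under the assumption that each frame is a vector of filter values; the function name and frame contents are invented.

```python
# Sketch of the synthesizer's infill step in variable frame rate coding:
# given two transmitted frames bracketing a gap, reconstruct the missing
# frames between them by linear interpolation.

def interpolate_frames(start_frame, end_frame, n_missing):
    """Linearly interpolate n_missing frames between two transmitted frames."""
    frames = []
    for k in range(1, n_missing + 1):
        t = k / (n_missing + 1)
        frames.append([a + t * (b - a) for a, b in zip(start_frame, end_frame)])
    return frames

# Three missing frames between [0, 0] and [4, 8] fall at 1/4, 1/2 and 3/4.
assert interpolate_frames([0.0, 0.0], [4.0, 8.0], 3) == [
    [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
```

Replication, the other infill method mentioned, would simply repeat the previous transmitted frame for each missing position.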
Vocoders with data rates as low as 2400 bpss have been used in operational military and other government communications systems since the late 1970s. However, the speech quality obtained from vocoders at lower rates has not yet been judged adequate for widespread operational use, especially at rates below 800 bpss where there are military needs. Performance problems have included: highly variable speech quality with different talkers, especially with vocoders including VQ, MQ and phonetic approaches; poor performance if the original speech waveform has a noticeable level of background noise; high levels of confusion of certain consonant sounds, especially those that change rapidly in time such as the plosives.
Particular technical weaknesses in previous approaches have included: insufficiently large VQ codebooks to deal well with all talkers and languages; high data rates needed to code side information and changes of the vocal tract filter with time; vulnerability of variable frame rate approaches to background noise; and the great difficulty of accurate formant analysis (which means the benefits of lower data rates implicit in formant vocoders have not been obtained in practice).
The present invention is based on vector quantisation and variable duration segments with sophisticated segment models. The segment models used allow not only more accurate modelling of the frame analysis, which improves speech quality, but also the modelling of longer segments, which reduces the data rate. With less data needed to describe fewer segments, more data can be used to describe each segment, and in particular there can be employed a long VQ codebook which improves the speech quality for most talkers and for different languages. Optionally, the invention includes a technique for simultaneous selection of the optimal or near-optimal sequence of segment lengths, VQ entries and segment infill (or interpolation or trajectory approximation) models; this gives better speech quality for a given data rate than sequential selection. By concentrating on modelling correctly the slowly varying periods of speech and implicitly finding the transitions between them, the invention's approach is less vulnerable to background noise than cruder VFR approaches.
The present invention can be used with several different filter models, and in particular with line spectral pairs (LSP, a particular type of LPC), with log area ratios (another LPC type), and with the formant vocoder model (and when used with the latter it overcomes the difficulty of the accuracy of that model).
Optionally, the present invention can include more accurate modelling of the pitch variation with time, and so give higher speech quality for longer segments and hence a lower data rate.
In summary, the present invention describes the signal in terms of a sequence of concatenated variable duration segments, where each segment is described by part of the digital data sequence. The advantages of this approach over the Prior Art accrue largely from having better descriptions of variable duration segments that are longer on average and from considering very large numbers of possible sequences of segments of various different lengths and choosing the best sequence from all those considered.
In one aspect, therefore, the invention provides a method for processing an analogue signal waveform into a digital data sequence, in which method the following steps are carried out:
- the analogue signal waveform is digitised;
- the digitised signal waveform is analysed in short sections called frames, and each such frame is described in terms of the combination and characteristics of the energy sources and subsequent filtering effects;
- a sequence of frame descriptions is examined to find:
  - the starts and ends of the periods of relatively slow change in the combination and characteristics of the energy source and the subsequent filtering effects, such periods being called variable duration segments;
  - the entries in a previously-prepared vector quantised codebook that describe the start and end of each variable duration segment;
  - the method, chosen from a previously-prepared list of methods, for approximating the trajectory (or otherwise filling in the frame values) between the vector codebook entries chosen for the start and end of that variable duration segment, that interpolates or otherwise defines intervening frame values that most closely match the frame values of the analysed signal;
- and finally, the description of each variable duration segment, which consists of a set of digital values, is formatted suitably for subsequent use.
As will be seen, the method of the invention processes the analogue signal waveform into a digital data sequence which represents the signal as a concatenated sequence of variable duration segments each of which is described by digital values of the following types:
- the duration of the segment (usually as a number of frames);
- the entry numbers in the vector quantised codebook that define the frame values at the start and end of the segment; and
- the method of trajectory modelling or infill to be used in reconstituting the segment;
together with, as described in more detail hereinafter, optional additional descriptions of the segment that are necessary to reconstitute the signal waveform but are not contained in the vector quantised codebook.
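The per-segment values listed above can be collected into a simple data structure. This is an illustrative sketch only: the field names are invented, and the patent does not prescribe any particular encoding of these values.

```python
# Sketch of the per-segment description summarised above. All field
# names are hypothetical; only the types of value follow the text.

from dataclasses import dataclass, field

@dataclass
class VariableDurationSegment:
    duration_frames: int        # segment duration, as a number of frames
    start_codebook_index: int   # VQ codebook entry for the first frame
    end_codebook_index: int     # VQ codebook entry for the last frame
    infill_method: int          # index into the list of trajectory/infill methods
    extra: dict = field(default_factory=dict)  # optional descriptions, e.g. pitch

seg = VariableDurationSegment(duration_frames=12, start_codebook_index=413,
                              end_codebook_index=87, infill_method=2)
assert seg.duration_frames == 12
```

The coded data sequence for an utterance would then be a concatenation of such segment descriptions.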
The invention provides a method for processing an analogue signal waveform into a digital data sequence.
Although notionally the analogue signal to be processed could be of any type, the invention is primarily concerned with signals that are, or originate as, sound signals, and specifically speech signals. Thus, in its preferred embodiments the invention provides a method of processing speech into digitised and encoded form with a very low bits-per-second value but which despite this can be decoded and used to synthesise an intelligible version of the original speech.
In the first stage of the method of the invention the analogue signal waveform, optionally preprocessed by any appropriate techniques such as amplification and analogue filtering, is digitised by analogue-to-digital conversion. This may be carried out in any convenient way, to any degree of accuracy, employing any suitable apparatus, and no more need be said about it here.
In the second stage of the method of the invention the digitised signal waveform is analysed in short sections called frames (these are usually of fixed duration, typically 0.005 to 0.03, and especially 0.01 to 0.025, seconds long) and each such frame is described in terms of the combination and characteristics of the energy sources and subsequent filtering effects. For example, for a speech signal the energy source could be described by the degree of voicing and pitch, while the filtering effect of the vocal tract could be described in terms of the short-term power spectrum derived by the discrete Fourier transform.
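The short-term power spectrum mentioned as an example can be sketched as below. This is a deliberately naive illustration: a real analyser would use an FFT with windowing and pre-emphasis, none of which the sketch includes.

```python
# Hedged sketch of one per-frame filter description: the short-term
# power spectrum of a frame, by a direct discrete Fourier transform.
# Pure Python for illustration only; a practical coder would use an FFT.

import cmath
import math

def power_spectrum(frame):
    """Return |DFT|^2 of the frame at each discrete frequency bin."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n)]

# A pure cosine completing one cycle per frame puts its power in bins 1 and n-1.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
spec = power_spectrum(frame)
assert abs(spec[1] - 16.0) < 1e-9 and abs(spec[7] - 16.0) < 1e-9
```

The sequence of such per-frame descriptions is the input to the segmentation stage described next.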
In the third stage of the method of the invention the sequence of frame descriptions from the second stage is examined to find the starts and ends of the periods of relatively slow change in the sequence of frames, and to describe these periods. These periods will generally be segments of various lengths, so they are known as variable duration segments. One or more possible ways of dividing the sequence of frames into variable duration segments are considered; each way considered consists of a sequence of hypothesised variable duration segments.
For each hypothesised variable duration segment one or more ways of describing the sequence of frames within that hypothesised segment are considered. Each way of describing a hypothesised variable duration segment consists of two entries in the VQ codebook, one for the start of the hypothesised segment and one for the end, and one entry for the method, chosen from the list of methods, for approximating the trajectory or otherwise filling in the frame values between the start and end VQ codebook entries. The possible ways of dividing the sequence of frames into segments, together with the possible ways of describing each hypothesised segment, are compared with each other to find the one way of dividing the sequence of frames into variable duration segments, together with the one way of describing each of the segments in that sequence, that most closely matches the actual frame values of the analysed signal.
Thus this third stage of the method of the invention describes the speech signal in terms of an ordered sequence of variable duration segments. Each variable duration segment of the signal is described, as previously defined herein, in terms of its start time and end time, the VQ codebook entries used to describe the frames at the start and end of the segment, and the method of infill for the other frames of the segment.
The selection of the VQ codebook entries to describe the start and end frames of a variable duration segment can be chosen from the whole of the VQ codebook or from a previously defined part of it - or even from previously prepared shortlists of VQ codebook entries applicable respectively to the frames of the analysed signal at the start and end of that variable duration segment or from a combination of such shortlists for frames within or near to that variable duration segment.
One way of drawing up the shortlist for each frame is to use those entries in the VQ codebook that have the shortest weighted Euclidean distances to the description of that frame of analysed signal. Optionally, the weights used can be different for each frame of analysed signal, and derived from the appropriate frame.
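The shortlisting step just described can be sketched as follows. The codebook, weights and shortlist length are invented for illustration; only the weighted Euclidean distance criterion comes from the text.

```python
# Sketch of drawing up a per-frame shortlist: the codebook entries with
# the smallest weighted Euclidean distance to the analysed frame.

def shortlist(frame, codebook, weights, n=2):
    """Return indices of the n codebook entries closest to the frame."""
    def wdist(entry):
        return sum(w * (x - y) ** 2 for w, x, y in zip(weights, entry, frame))
    return sorted(range(len(codebook)), key=lambda i: wdist(codebook[i]))[:n]

codebook = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
# With heavy weight on the first dimension, the frame (1.8, 0.2)
# shortlists the two entries whose first component is 2.0.
assert shortlist((1.8, 0.2), codebook, weights=(10.0, 1.0)) == [1, 3]
```

Restricting the segment-description search to such shortlists reduces the number of codebook entries that must be considered for each hypothesised segment.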
Although the method of finding the boundaries and description of each of the variable duration segments has been discussed as though the three aspects (boundary times, start and end entries of the VQ codebook, and filling in) could/should be ascertained in sequence, in fact it is possible to deal with them in combination.
Indeed, doing it in combination is to be preferred, because choosing similar aspects to each of the above in isolation is known from other technical areas not to give the best results (in particular, choosing the boundaries first and then describing the segments could lead to segments that are difficult to describe well, and could also give poor quality boundaries for signals corrupted by noise).
One good method of finding simultaneously all the boundaries, VQ codebook entries and infill methods uses the technique of dynamic programming graph search, a well-known algorithmic technique for choosing the best amongst many competing combinations of options without having to consider them all. A comprehensive review of the field is given in "Applied Dynamic Programming" by Bellman and Dreyfus (Princeton University Press, 1962).
By way of a simple example of dynamic programming graph search, consider the problem of finding the shortest route between two towns A and Z joined by many different routes which all go through one or other of the eight intermediate towns B, C, D, E, F, G, H or J, each of the four towns B, C, D and E being joined to Z not only directly but also via each of the four towns
F, G, H and J (and the map of possible routes is the "graph" to be "searched"). The total number of different routes from A to Z is therefore 4 times 5 (through B, C, D or E, and then either direct or via one of the other for) plus 4 (avoiding B, C, D aiid E) which is 24 routes in total made up of 32 sections of road between pairs of towns.Of the 24 routes, 16 have 3 sections of road and 8 have 2 sections of road; thus the direct calculation of the shortest route requires 40 additions to be made to give the lengths of the 24 routes, followed by 23 comparisons to find the shortest route. However, using dynamic programming graph search, the problem can be broken down into the "simpler" one of finding the shortest partial route from the start (town
A) to each of B, C, D and E, then finding the shortest partial routes to each of F, G, H and J (either from A directly or from B, C, D or E), and so on. Finding the best route from A to Z using dynamic programming graph search therefore requires a total of 20 additions and 23 comparisons; this is half the number of additions and the same number of comparisons required for the direct calculation of the shortest of all 24 routes.
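By way of illustration only, the staged search described above can be sketched in a few lines; the road lengths below are invented for the example, and only the shape of the map follows the text:

```python
# Staged dynamic-programming shortest-route search over the town map above.
# Road lengths are illustrative assumptions; only the graph shape follows the text.
road = {}
for t, d in zip("BCDEFGHJ", [4, 6, 3, 5, 9, 8, 7, 10]):
    road[("A", t)] = d                      # A joined to all eight towns
for t, d in zip("BCDE", [7, 2, 6, 4]):
    road[(t, "Z")] = d                      # B..E joined directly to Z
for t, d in zip("FGHJ", [3, 5, 2, 6]):
    road[(t, "Z")] = d                      # F..J joined to Z
for i, x in enumerate("BCDE"):
    for j, y in enumerate("FGHJ"):
        road[(x, y)] = 1 + i + j            # B..E each joined to each of F..J

# Dynamic programming: best partial route to each town, stage by stage.
best = {"A": 0}
for x in "BCDE":
    best[x] = road[("A", x)]
for y in "FGHJ":
    best[y] = min(road[("A", y)],
                  min(best[x] + road[(x, y)] for x in "BCDE"))
best["Z"] = min(best[t] + road[(t, "Z")] for t in "BCDEFGHJ")

# Brute-force enumeration of all 24 routes for comparison.
routes = ([("A", x, "Z") for x in "BCDEFGHJ"] +
          [("A", x, y, "Z") for x in "BCDE" for y in "FGHJ"])
brute = min(sum(road[(r[k], r[k + 1])] for k in range(len(r) - 1)) for r in routes)
assert best["Z"] == brute
```

The staged search considers each town once, which is why the number of additions falls while the answer remains identical to the exhaustive enumeration.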
From the above simple example, dynamic programming graph search can be seen to reduce the total number of calculations required. For larger graphs much larger proportional savings can be made. For 1 second of speech described by 100 frames of data, there are approximately 6×10¹⁹ - a very large number - ways of segmenting this speech into 20 variable duration segments with segment lengths limited from 1 to 16 frames. For the direct calculation of the best segmentation, 19 additions would be required to find the coding error for each of these ways of segmenting the speech (i.e., approximately 10²¹ additions) and then the lowest coding error would have to be found (requiring approximately 6×10¹⁹ comparisons). However, only approximately 20,720 additions and 20,720 comparisons are required to find the segmentation with the lowest coding error using dynamic programming graph search.
(Note: the above estimates of additions and comparisons are for the combination and selection part of the calculations only; in both cases it is assumed that the coding error for all possible hypothesised segments has previously been calculated and stored.) Thus, dynamic programming graph search can make practical a seemingly intractable calculation.
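By way of illustration only, the count of possible segmentations quoted above can itself be obtained by a small dynamic-programming recursion over partial counts (the function name is illustrative):

```python
# Counting the ways of segmenting 100 frames into 20 variable duration
# segments of 1..16 frames each, by dynamic programming over partial counts.
def segmentations(frames=100, segments=20, max_len=16):
    # ways[m][j] = number of ways to cover the first j frames with m segments
    ways = [[0] * (frames + 1) for _ in range(segments + 1)]
    ways[0][0] = 1
    for m in range(1, segments + 1):
        for j in range(1, frames + 1):
            ways[m][j] = sum(ways[m - 1][j - i]
                             for i in range(1, min(max_len, j) + 1))
    return ways[segments][frames]

n = segmentations()   # of the order of 6×10^19, in line with the figure above
```

The same trick of tabulating partial results is what makes the full segmentation search tractable.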
For each variable duration segment, additional descriptive information can optionally be derived by referring back to the frame values of the analysed signal. For example, for speech, it is usual not to use the frequency of vibration of the glottal folds (the pitch) in determining the boundaries and description of the variable duration segments as mentioned above; instead, the description of the pitch for each variable duration segment is found once the boundaries are known.
One way of doing this is to calculate a pitch value for the start and end of every variable duration segment, and to select one of several methods of filling in which most closely matches the pitch values of the sequence of frames in the analysed speech.
Finally, the description produced by the method of the invention of each variable duration segment, which consists of a set of digital values, is formatted suitably for subsequent use. For example, for a vocoder to be used over a communications system, this could require quantising some of the digital values to approximate digital values, merging all of the digital values into a data packet of fixed length, adding synchronisation bits, and sending the data packet to a communications modem.
The method of the invention can be implemented in software using a suitably-programmed general-purpose computer, or as a mixture of software and dedicated, purpose-built hardware (which is rather faster).
As will be apparent from the aforesaid, the present invention relates to digital signal coding, in particular to methods of deriving a digital stream of data that represents an analogue signal waveform. It especially concerns methods of coding based on source/ filter modelling, vector quantisation and variable duration segment modelling, which methods are of particular interest when applied to the coding of speech signals. The present invention, when applied to speech and in relation to the Prior Art, produces a smaller digital stream of data for the same speech quality and/or a better quality for the same amount of data.
The invention has so far been defined in terms of a method of processing an analogue signal waveform into a digital data stream. Of course, the invention also extends to a digital information system in which an analogue signal is so processed, and thereafter the resulting digital data is used in any desired manner.
For example, that signal might be stored in situ for subsequent use at the same site to re-construct a reasonable facsimile of the original analogue signal, or it might be transferred or transmitted in any convenient way - for instance, by being stored in suitable form on a paper or magnetic disk, a magnetic drum, or an optical disk, and then being physically carried, or being kept in "electrical" pulse form and being sent by electrical or optical landline, or by some wireless, possibly radio, communications channel - to a distant location where it is used to re-construct the original signal.
A first embodiment of the invention, as applied to a vocoder analyser, has been implemented by way of non-real-time emulation by computer programs on a general purpose computer. This embodiment is now described, though by way of illustration only, with reference to the accompanying Drawings in which:
Figure 1 shows a block diagram of the various
stages/equipment modules used in the
method of the invention; and
Figure 2 shows an example analysis of some speech
using the method to which Figure 1
relates.
The embodiment of Figure 1 implements a vocoder analyser and synthesizer which takes an analogue speech waveform and codes it into a digital data stream at a variety of data rates, but mainly at 600 bpss. The vocal tract filter is described first in terms of spectral channel amplitudes and subsequently in terms of formants.
Figure 1 shows a block diagram of the embodiment.
The analogue speech signal was captured by a microphone of medium to high quality (both an electret microphone and a condenser microphone have proven satisfactory), then amplified, bandpass-filtered in the range of approximately 30 Hz to 4500 Hz, sampled by an analogue-to-digital converter of at least 12 bits accuracy at a rate of 10,000 samples per second, and stored on computer disc prior to processing.
Optionally, an automatic gain control (AGC) could be used to optimize signal levels during the analogue filtering and sampling by the A-D converter.
Alternatively, the analogue signals were captured using a high-quality microphone, bandpass filtered in the range 0 Hz to 7000 Hz, sampled at 20,000 samples per second, digitally low-pass filtered at 4500 Hz, and subsampled to 10,000 samples per second. This is not thought to be a material difference. Likewise, the exact characteristics of the analogue bandpass and anti-aliasing filters are not thought to be material, providing they are of general good response.
Optionally, the sampled signal at 10,000 s/s was pre-emphasised to give approximately 6 dB boost per octave, and this was done with a first-order FIR filter with z-transform 1.0 - 0.995z^-1.
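By way of illustration only, this pre-emphasis filter amounts to one subtraction per sample (the function name is illustrative):

```python
# A sketch of the optional pre-emphasis stage: the FIR filter
# y[n] = x[n] - 0.995*x[n-1], giving roughly a 6 dB/octave treble boost.
def pre_emphasise(samples, coeff=0.995):
    out = [samples[0]]                     # first sample passes through
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out
```

A constant (DC) input is almost entirely cancelled, while rapid sample-to-sample changes pass through, which is the desired high-frequency emphasis.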
The signal was then analysed at a fixed frame rate of 100 frames per second to give a description for each frame in terms of:
- the short-term power spectrum at 32 frequency points on a perceptual (Mel) scale covering the range approximately 100 Hz to 4.5 kHz;
- the degree of voicing on a scale of 4 levels representing totally unvoiced (0) through two levels of partial voicing to fully voiced (3);
- for partially or fully voiced speech, the pitch frequency;
- the term-averaged energy (i.e., that averaged over several syllables);
- the term-averaged spectral tilt (i.e., that averaged over several syllables).
In fact, the term-averaged energy and spectral tilt were not used, but could be used later to improve speech quality for any given VQ codebook length.
The frame analysis method used was as follows.
1. An estimate of the likelihood of each of a range of pitch periods was made employing the algorithm described in "Pseudo-Maximum-Likelihood Speech Pitch Extraction", D H Friedman, IEEE Trans. on Acoustics Speech and Signal Processing, Vol. ASSP-25, No 3, June 1977. For the male speech analysed, a Hamming window of length 50 ms was used and the likelihoods were calculated for pitches in the approximate range 50 Hz to 300 Hz.

2. From these likelihood values, preliminary estimates of the degree of voicing and pitch were made for the frame on the basis of picking the largest two peaks in the likelihood values for the frame. These were then refined by a smoothing algorithm that operated over three frames to correct pitch doubling and halving and remove isolated voiced and unvoiced frames.

3. An order 20 covariance method LPC analysis without windowing was then done on the (optionally pre-emphasised) sampled speech signal. For unvoiced speech frames the window length was fixed, usually at 30 ms. For partially or fully voiced speech frames, the window length was chosen to be the largest integral multiple of the pitch period less than a fixed length, usually 30 ms.

4. A spectral cross section (usually with 128 equally spaced frequency bins) was then obtained from the LPC coefficients by, in effect, discrete Fourier transform of the impulse response of the LPC resynthesis filter.

5. The spectral cross section was then integrated within each of the 32 spectral channels on the perceptual scale to give the channel amplitudes, with channel amplitudes limited to the range of 0 to 32,000 on a logarithmic scale of milli-Bels (i.e., 1/100 of a dB).

6. The term-averaged spectral channel amplitudes were then calculated by averaging over a long window, and the term-averaged energy and spectral tilt were derived by solution of the best linear fit. The term averages applicable to each frame were saved as part of that frame's values, but were not used in the subsequent processing in our initial embodiment.
Next, the nearest few VQ codebook entries (derived previously: see below) to each analysed frame (the shortlist) were found using a weighted Euclidean distance in spectral space (i.e., that of the 32 spectral channel amplitudes and side information).
Many different weightings were used. A good one had the spectral channel amplitudes at unity weight, a weight of 200,000 for the degree of voicing, and weights of 0 for the pitch and the term-averaged energy and spectral tilt. Various short-list lengths were tried, from 6 to 50.
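By way of illustration only, shortlist selection under such a weighted Euclidean distance might be sketched as follows (the function names are illustrative):

```python
import heapq

# A sketch of shortlist selection: the few codebook entries nearest to an
# analysed frame under a weighted Euclidean distance. Frames and codebook
# entries are equal-length vectors (e.g. 32 channel amplitudes plus side
# information); the weight vector has the same length.
def weighted_dist2(frame, entry, weights):
    return sum(w * (f - e) ** 2 for f, e, w in zip(frame, entry, weights))

def shortlist(frame, codebook, weights, length=14):
    return heapq.nsmallest(
        length, range(len(codebook)),
        key=lambda i: weighted_dist2(frame, codebook[i], weights))
```

A large weight on the degree of voicing, as in the weighting quoted above, effectively prevents voiced frames from being matched against unvoiced codebook entries.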
Various VQ codebooks had been previously derived from example speech from one or several talkers.
Codebooks were derived from large quantities of speech (typically in excess of 50,000 frames) and codebook lengths of up to 8192 were derived by "clustering" in "formant" space using a weighted Euclidean distance metric. Various clustering algorithms were tried, including cluster initialisation with the well known k-D tree algorithm followed by optimization using the well known LBG algorithm.
The "formant" space parameters were derived from the training speech using the frame analysis described above and a formant assignment algorithm modified (to use the different frame analysis) from that in "Automatic Generation of Control Signals for a Parallel
Formant Synthesizer", by P M Seeviour, J N Holmes and
M W Judd, Proc. IEEE Conf. on Acoustics Speech and
Signal Processing, Philadelphia USA, 1976. Smoothing of formant parameters was done using a dynamic programming graph search algorithm very similar to that described below for the vocoder analyser.
During low energy portions of the signal, some or all of the short-list entries were replaced by
VQ codebook entries with low energy and formant frequencies close to those of the following energy onset. This was done to reduce allocation of segments to model irrelevant low-energy spectral variations in stop gaps.
The sequence of variable duration segments, their lengths, the VQ entries to characterise the start and end of each segment, and the infill model to use were all simultaneously found using a pruned dynamic programming graph search modified from that described in "A Method for Segmenting Acoustic Patterns with
Applications to Automatic Speech Recognition" by
J S Bridle and N C Sedgwick in Proc. IEEE Conf. on
Acoustics Speech and Signal Processing, Hartford USA, 1977. The dynamic programming step used was as follows:
F(m,j,b) = MIN over a, i and t of [ F(m-1, j-i, a) + SMF(a, b, t, j-i+1, j) ]

where:

F(m,j,b) is the score (coding error) of the partial path to the jth frame with m segments, the last segment being characterised by a segment-final VQ entry of b;

SMF(a,b,t,j-i+1,j) is the segment matching function (error) that indicates how well the speech from frames j-i+1 to j (i.e., a segment of length i ending at frame j) is modelled by a segment characterised by a segment-initial VQ entry a, a segment-final VQ entry b and a segment trajectory of type t;

and the optimization step is made over the following:

a, the segment-initial VQ entry, which is the segment-final VQ entry for the preceding segment on the partial path;

i, the segment length in frames;

t, the segment trajectory type, which is 0 for a segment with piecewise constant formant parameters equal to the segment-final VQ entry, and 1 for a segment with formant parameters interpolated linearly between the segment-final VQ entry for the preceding segment (a) and the segment-final VQ entry for this segment (b).
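By way of illustration only, the recurrence above might be implemented as follows; this is a minimal sketch that omits the pruning, forced traceback and shortlist restriction described elsewhere, and all names are illustrative:

```python
# A sketch of the dynamic programming step: F[m][j][b] is the best coding
# error reaching frame j with m segments, the last characterised by
# segment-final VQ entry b. smf(a, b, t, start, end) is supplied by the
# caller as the segment matching function.
def segment_dp(n_frames, n_segments, codebook_entries, smf, max_len=16):
    INF = float("inf")
    F = [[dict() for _ in range(n_frames + 1)] for _ in range(n_segments + 1)]
    back = [[dict() for _ in range(n_frames + 1)] for _ in range(n_segments + 1)]
    for b in codebook_entries:
        F[0][0][b] = 0.0            # empty path; b acts as the "preceding" entry
    for m in range(1, n_segments + 1):
        for j in range(1, n_frames + 1):
            for b in codebook_entries:
                best, arg = INF, None
                for i in range(1, min(max_len, j) + 1):      # segment length
                    for a, prev in F[m - 1][j - i].items():  # preceding entry
                        for t in (0, 1):     # piecewise constant / linear infill
                            s = prev + smf(a, b, t, j - i + 1, j)
                            if s < best:
                                best, arg = s, (a, i, t)
                if arg is not None:
                    F[m][j][b] = best
                    back[m][j][b] = arg      # (a, i, t) for traceback
    return F, back
```

The `back` table allows the chosen boundaries, VQ entries and trajectory types to be recovered by traceback from the best final score.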
The segment matching function was the sum of the weighted Euclidean distances in spectral channel space between the frames of the hypothesised segment and the respective frame analysis of the original speech, plus a proportion (different for each trajectory type) of a weighted Euclidean distance in formant space between the segment-final VQ entries of the preceding and current segments. This latter distance penalised segments that introduced rapid formant transitions or steps without a good (compensating) spectral match.
Partial path traceback and output of a segment description was forced after every occurrence of a fixed number of input frames (typically 5 for a data rate of 480 bpss and 4 for a data rate of 600 bpss) after initial filling of the traceback buffers (typically 50 frames). Options were provided to: prune all partial paths worse by some threshold than the best partial path to each frame for each number of segments; limit the minimum and maximum number of segments since the end of the forced common path (i.e., the end of the last segment output so far); etc. These options were provided for the purpose of reducing the computational load at some cost to the quality (i.e., optimality) of the graph search.
Segment lengths from 1 to 16 frames were allowed, with the average length forced to be the same as the fixed number of input frames between forcing output of a segment (i.e., typically 4 or 5).
The degree of voicing for each segment was taken as that of the segment-final VQ entry.
Given the start and end of each segment, the pitch contour for partially or fully voiced frames was taken to be one of 8 types:
- piecewise constant, at the mean of the pitch values in the frame analysis of the original speech;
- piecewise linear, starting at the segment-final value of the preceding segment and having a segment-final value for the current segment that gave the minimum least squared error over the segment;
- quadratic rise-fall or fall-rise trajectory with one of two fixed ratios of the peak/trough to final value, starting and ending as for piecewise linear and calculated to give the minimum least squared error over the segment;
- cubic rise-fall-rise or fall-rise-fall trajectory with one fixed ratio of the peak and trough to final value, starting and ending as for piecewise linear and calculated to give the minimum least squared error over the segment.
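By way of illustration only, the piecewise linear case might be fitted as follows: with the segment-initial value fixed at the preceding segment's final value, only the segment-final value is free, and its least-squares optimum has a closed form (the function name is illustrative):

```python
# A sketch of fitting the piecewise linear pitch contour: the trajectory is
# p0 + (f - p0) * k/N for frames k = 1..N, with p0 fixed (the preceding
# segment's final value) and the segment-final value f chosen to minimise
# the squared error against the analysed frame pitches.
def fit_linear_final(p0, frame_pitches):
    n = len(frame_pitches)
    xs = [(k + 1) / n for k in range(n)]     # position along the segment
    rs = [p - p0 for p in frame_pitches]     # residual above the start value
    d = sum(r * x for r, x in zip(rs, xs)) / sum(x * x for x in xs)
    return p0 + d                            # least-squares segment-final value
```

When the analysed pitches already lie exactly on a line through the start value, the fit recovers the true final value.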
The segment-final pitch values were scalar quantised on a 32-point logarithmic scale of frequency.
The data stream output for each variable duration segment was then formatted, usually as follows (in Table I below), although alternatives with different bit assignments could optionally be selected.

Table I

DESCRIPTION OF FIELD                                     BITS USED
Segment length (frames)                                          4
Formant trajectory type within segment                           1
Segment-final VQ codebook entry                                 13
Degree of voicing (replicates information in VQ entry)           0
Segment-final pitch index                                        5
Pitch contour method (only 2 of 8 methods used)                  1
Energy scaling for segment (not used)                            0
TOTAL BITS PER SEGMENT                                          24
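By way of illustration only, packing one segment description into this 24-bit format might be sketched as follows; the ordering of fields within the word is an assumption:

```python
# A sketch of packing a segment description into the 24-bit layout of
# Table I (4 + 1 + 13 + 5 + 1 bits). The field order is an assumption.
def pack_segment(length, traj_type, vq_entry, pitch_index, pitch_method):
    assert 1 <= length <= 16 and traj_type in (0, 1)
    assert 0 <= vq_entry < 8192            # 13-bit codebook index
    assert 0 <= pitch_index < 32 and pitch_method in (0, 1)
    word = ((length - 1) << 20) | (traj_type << 19) | (vq_entry << 6) \
           | (pitch_index << 1) | pitch_method
    return word.to_bytes(3, "big")         # 24 bits = 3 bytes per segment
```

Note that segment lengths 1..16 fit in 4 bits only by storing length - 1, and that the unused fields of Table I consume no bits at all.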
A vocoder synthesizer was also implemented to allow listening to the output speech, as shown in Figure 1.
This consisted of a computer program to expand the description of each variable duration segment into a sequence of frame descriptions in terms of the frequencies and amplitudes of the formants, the degree of voicing and the pitch. Then commercially available equipment was used to generate an analogue signal waveform; that equipment implements the Parallel Formant
Synthesizer described in "A Versatile Software
Parallel-Formant Speech Synthesizer" by J M Rye and
J N Holmes, JSRU Research Report 1016, November 1982, available from the Speech Research Unit, Defence
Research Agency, St. Andrews Road, Malvern, Worcs
WR14 3PS.
A second embodiment of the invention, as applied to a vocoder analyser, has been implemented by way of non-real-time emulation by computer. This embodiment is now described primarily in the ways it differs from the first embodiment just described hereinabove.
The analogue speech signal was captured and optionally pre-emphasised, as for the first embodiment.
The signal was then analysed at a fixed frame rate, again as for the first embodiment. However, the frame analysis method used was different, as follows:

1. The speech waveform (after pre-emphasis) was pre-emphasised for a second time. Then, for each 10 ms frame of data, the peak absolute value in the doubly pre-emphasised waveform was used to define a 4 ms analysis window for that frame starting just after the peak, and typically 200 microseconds after the peak.
[Note: for voiced speech the chosen 4 ms window would usually be during the closed glottis period, and this facilitates spectral analysis as described in "Speech Synthesis and Recognition" by Dr J N Holmes.]

2. The 32 samples of the singly pre-emphasised speech signal, over the above-found 4 ms window, were padded with zero samples, typically to make a 256 sample signal. A discrete Fourier transform (DFT) of this signal was then taken, and converted to a spectral cross section (magnitude only) with typically 128 equally spaced frequency bins.
[Note: this spectral cross section differs from that of the first embodiment in that it represents the spectral cross section of the speech signal convolved with the spectral shaping from the 4 ms rectangular time window.]

3. The spectral cross section derived above was then integrated within each of the 32 spectral channels on the perceptual scale to give the channel amplitudes, in the same way as for the first embodiment.
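By way of illustration only, the zero-padded DFT of step 2 might be sketched as follows; a direct DFT is written out for clarity where a practical implementation would use an FFT:

```python
import cmath

# A sketch of step 2: a 4 ms (32 sample) window zero-padded to 256 samples,
# with the magnitude spectrum kept for 128 equally spaced frequency bins.
# At 10,000 samples per second these bins cover 0 Hz to just under 5 kHz.
def spectral_cross_section(window, n_fft=256, n_bins=128):
    x = list(window) + [0.0] * (n_fft - len(window))   # zero padding
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_bins)]
```

Zero padding does not add spectral information; it merely interpolates the 32-point spectrum onto a finer frequency grid before the channel integration of step 3.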
4. The pre-emphasised speech signal over a window of length 51.2 ms (i.e., 512 samples), centred on the centre of the 10 ms frame, was then analysed as follows. The signal was windowed using a Hamming window, and then a DFT was taken. The logarithm of the spectral magnitude was then calculated for each of the frequency bins, and those bins with frequencies above and below fixed frequency thresholds were set to zero. Then an inverse DFT was taken. This processing gives what is usually referred to as the "cepstrum" of the speech signal.

5. From this cepstrum, preliminary estimates of the degree of voicing and pitch were made for the frame on the basis of picking the largest two peaks in the cepstral values for the frame. These were then refined by a smoothing algorithm that operated over three frames to correct pitch doubling and halving and remove isolated voiced and unvoiced frames. Note that only 2 degrees of voicing were actually used for initial experiments with this second embodiment.

6. The term-averaged spectral channel amplitudes and term-averaged spectral tilt were then calculated in the same way as for the first embodiment, and again were not used in the subsequent processing for initial experiments.
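By way of illustration only, the cepstral pitch estimation of steps 4 and 5 might be sketched as follows; the parameters are illustrative, the frequency thresholding of step 4 is omitted, and direct DFTs are used for clarity:

```python
import cmath
import math

# A sketch of steps 4 and 5: Hamming window, DFT, log magnitude, inverse
# DFT (the cepstrum), then the largest cepstral peak in the plausible
# pitch-lag range taken as the pitch period. Parameters are illustrative.
def cepstrum(frame):
    n = len(frame)
    w = [(0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1))) * s
         for k, s in enumerate(frame)]                     # Hamming window
    spec = [sum(w[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
    logmag = [math.log(abs(s) + 1e-12) for s in spec]      # log magnitude
    return [abs(sum(logmag[k] * cmath.exp(2j * cmath.pi * k * t / n)
                    for k in range(n)) / n)
            for t in range(n)]                             # inverse DFT

def pitch_estimate(frame, fs=10000, f_lo=50, f_hi=300):
    c = cepstrum(frame)
    lags = range(fs // f_hi, fs // f_lo + 1)   # candidate pitch periods
    best = max(lags, key=lambda t: c[t])
    return fs / best                           # pitch frequency in Hz
```

For voiced speech the cepstrum concentrates the regular harmonic structure of the log spectrum into a peak at a lag equal to the pitch period, which is what makes the two-peak picking of step 5 workable.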
Next, the nearest few VQ codebook entries (derived previously as described for the first embodiment) to each analysed frame (the shortlist) were found using a weighted Euclidean distance in spectral space, in a similar way to that in the first embodiment. However the weights used for the weighted Euclidean distance metric were different for each analysed frame, as follows. For each spectral channel, the weight was a fixed weight (as for the first embodiment) multiplied by a weight equal to the linear amplitude for that spectral channel in the analysed frame divided by the average linear amplitude for all spectral channels in the analysed frame. Fixed weights were used for the side information, as in the first embodiment. Again, various shortlist lengths were tried, and a length of 14 was used for most initial experiments, as this was thought a good compromise between processing time and speech quality.
Various VQ codebooks had been previously derived, as for the first embodiment.
During low energy portions of the signal, shortlist entries were replaced, as for the first embodiment.
The sequence of variable duration segments, their lengths, the VQ entries to characterise the start and end frame of each segment, and the infill model to use, were all simultaneously found using the pruned dynamic programming graph search, as for the first embodiment.
For the segment matching function, the weighted
Euclidean distance metric in spectral channel space used different weights for each frame of analysed signal, as described above for finding the shortlist of VQ codebook entries for this second embodiment. Also the method of transforming the descriptions of VQ codebook entries and infilled frames from formant space to spectral channel space was modified to include spectral smearing equivalent to that introduced by the 4ms rectangular window used as part of the frame analysis for the second embodiment.
Partial path traceback was performed as for the first embodiment; the forcing of traceback and output of a segment description was as shown in Table II below, for the different data rates tried.
Table II

Data rate (bpss)   Frames between segment outputs   Initial filling of traceback buffers (frames)
480                5                                50
600                4                                48
800                3                                48
The pitch contour for partially or fully voiced frames was coded as for the first embodiment.
The data stream output for each variable duration segment was then formatted, as for the first embodiment.
Thus: with an average of 20 segments per second the data rate is 480 bpss, with an average of 25 segments per second the data rate is 600 bpss, and with an average of 33⅓ segments per second the data rate is 800 bpss. For the second embodiment, experiments were conducted with speech coded at these three data rates.
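By way of illustration only, the data-rate arithmetic above follows directly from the 24-bit segment format; exact rational arithmetic is used for the 33⅓ case:

```python
from fractions import Fraction

# The data-rate arithmetic above: 24 bits per segment at average segment
# rates of 20, 25 and 33 1/3 segments per second.
BITS_PER_SEGMENT = 24
rates = {Fraction(20): 480, Fraction(25): 600, Fraction(100, 3): 800}
for segs_per_sec, bpss in rates.items():
    assert BITS_PER_SEGMENT * segs_per_sec == bpss
```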
A vocoder synthesizer was also implemented, as for the first embodiment.
The speech quality from the second embodiment of the vocoder was found to be better, at all three data rates tried, than the speech quality from the first embodiment (which was only tried at a data rate of 480 bpss).
Thus, with an average rate of 20 segments per second of speech and 24 bits per segment, the vocoder data rate was 480 bpss. Now, very high quality broadcast recordings (including music) use a data rate of 1,058,400 bits per second of signal per channel; such a data rate would involve no perceivable degradation to the signal. For ordinary speech signals, data rates of 280,000 bpss and above would be considered as having negligible degradation, while waveform coding of speech for telephone trunk calls uses a data rate of 64,000 bpss, and for the next generation of mobile cellular telephones speech will be waveform coded at a data rate of 32,000 bpss, with quality substantially equivalent to that used for telephone trunk calls. Good waveform or intermediate coders can give telephone quality speech at data rates as low as 8,000 bpss.
Vocoders in operational use for military and other government purposes use data rates as low as 2,400 bpss, but with noticeable degradation. For message storage and replay, with careful manual correction of vocoder analyser errors, data rates as low as 1,000 bpss are in use, with degraded but intelligible speech.
Experimental vocoders of various sorts have been demonstrated in the laboratory with data rates in the range 400 to 2,400 bpss, with a range of degraded speech qualities, including barely intelligible. The system of the invention is capable of providing speech intelligible in context to users familiar with the system - albeit a little distorted, and somewhat slurred - at a bit rate as low as 480 bpss; at 600 bpss the speech is much crisper, and is intelligible in context even to unfamiliar users.
Figure 2 shows the speech signal (approximately 2.0s long) for the utterance "Jill tumbled after Jack" by a male talker in various forms at various stages during the processing in the vocoder analyser and vocoder synthesizer as follows.
1. The upper section of Figure 2, marked W, shows the speech waveform after speech signal capture. This is the representation of the speech signal at point A in Figure 1.
(Note: the very low frequency signal component came from an external source (passing traffic), and does not adversely affect the coding process. Also, the waveform is misaligned by approximately 0.2 secs with the other sections.)

2. The second section of Figure 2, marked X, shows the spectral channel amplitudes on the perceptual frequency scale in the usual form of a pseudo-spectrogram (where the blacker the spectrogram the greater the energy at that channel for that frame).
(Note: the side information is not shown. This is the representation of the speech signal at point B in Figure 1.)

3. Table III below shows a textual representation of the transmitted data stream (i.e., the vocoded speech signal for the first 1 sec of signal). Each line of the table describes a single variable duration segment. This is the representation of the speech signal at point C in Figure 1.
Table III

Segment  Formant   Segment-  Degree   Segment-  Pitch    Energy
length   traj'ory  final VQ  of       final     contour  scaling
         type      entry     voicing  pitch     method
                                      index
   1        0          0        0        15        0        0
   1        0       1656        0        15        0        0
   2        0       1917        0        15        0        0
   2        1       1908        0        15        0        0
   3        1       1644        0        15        0        0
   6        1       1982        0        15        0        0
   7        1       1490        0        15        0        0
   4        1       3712        0        15        0        0
   8        1       3842        0        15        0        0
   8        1       3809        0        25        1        0
   8        1       4021        0        29        1        0
   9        1       1560        0        30        1        0
   2        1        982        0        30        0        0
   3        1        986        0        30        0        0
   2        1        263        0        30        0        0
   2        1       1400        0         0        1        0
   2        1       3988        0         8        0        0
   4        1       6553        0        15        1        0
   4        1       7005        3        18        0        0
   5        1       7370        3        19        1        0
   8        1       5004        3        19        1        0
   4        1         49        0        19        0        0
   2        1         66        0        19        0        0
   2        1       3654        0        19        0        0

4. The third section of Figure 2, marked Y, shows the formant description of the re-synthesised speech, frame by frame. The four uppermost and widest traces represent the first four formants (resonances) of the vocal tract. The centre of each trace gives the frequency of the formant according to the left-hand vertical scale. The width of the trace gives the amplitude of the formant. The hatching across the width of the trace of the second formant has dotted lines for unvoiced frames and has solid lines for partially or fully voiced frames. This is the representation of the speech signal at point D in Figure 1.
5. The lower section of Figure 2, marked Z, shows the
speech waveform after vocoder synthesis. Note that
the timescales of the plots in Figure 2 do not
allow sufficient detail of the differences between
the upper and lower sections (the original and
re-synthesised speech waveforms). This is the
representation of the speech signal at point E in
Figure 1. Note also that around 1.58secs into the
signal the burst in section W (of the original) is
missing in section Z (of the re-synthesized
output). This burst represents the "t" sound in
the word "after", and is an example of one of the
relatively few cases where the method of the
invention does not produce entirely satisfactory
results.
Claims (15)
1. A method for processing an analogue signal waveform into a digital data sequence, in which method the following steps are carried out:
the analogue signal waveform is digitised;
the digitised signal waveform is analysed in
short sections called frames, and each such frame
is described in terms of the combination and
characteristics of the energy sources and
subsequent filtering effects;
a sequence of frame descriptions is examined to find:
- the starts and ends of the periods of
relatively slow change in the combination and
characteristics of the energy source and the
subsequent filtering effects, such periods being
called variable duration segments;
- the entries in a previously-prepared vector
quantised codebook that describe the start and
end of each variable duration segment;
- the method, chosen from a previously-prepared
list of methods, for approximating the
trajectory (or otherwise filling in the frame
values) between the vector codebook entries
chosen for the start and end of that variable
duration segment, that interpolates or otherwise
defines intervening frame values that most
closely match the frame values of the analysed
signal;
and finally, the description of each variable
duration segment, which consists of a set of
digital values, is formatted suitably for
subsequent use.
2. A method as claimed in Claim 1, in which the analogue signal waveform to be processed into a digital data sequence is (or takes the form of), originally, a speech signal.
3. A method as claimed in either of the preceding
Claims, in which the analogue signal waveform is first preprocessed by amplification and analogue filtering.
4. A method as claimed in any of the preceding Claims, in which, in the second stage, the digitised signal waveform is analysed in short sections called frames of from 0.005 to 0.03 sec long, and each such frame is described in terms of the combination and characteristics of the energy sources and subsequent filtering effects.
5. A method as claimed in any of the preceding Claims, in which, in the third stage:
one or more possible way of dividing the sequence of frames into variable duration segments is considered, each way providing a sequence of hypothesised variable duration segments;
for each hypothesised variable duration segment one or more way of describing the sequence of frames within that hypothesised segment is considered, each such way providing two entries in the VQ codebook, one for the start of the hypothesised segment and one for the end, and one entry for the method, chosen from the list of methods, for approximating the trajectory or otherwise filling in the frame values between the start and end VQ codebook entries; and
the possible ways of dividing the sequence of frames into segments together with the possible ways of describing each hypothesised segment are compared with each other to find the one way of dividing the sequence of frames into variable duration segments together with the one way of describing each of the segments in that sequence that most closely matches the actual frame values of the analysed signal.
6. A method as claimed in any of the preceding Claims, in which the selection of the VQ codebook entries to describe the start and end frames of a variable duration segment is chosen from one or a combination of previously prepared shortlists of VQ codebook entries applicable respectively to the frames of the analysed signal at the start and end of that variable duration segment.
7. A method as claimed in Claim 6, in which the shortlist for each frame is drawn up by using those entries in the VQ codebook that have the shortest weighted Euclidean distances to the description of that frame of analysed signal.
8. A method as claimed in Claim 7, in which the weights used are different for each frame of analysed signal, and derived from the appropriate frame.
9. A method as claimed in any of the preceding Claims, in which the ascertaining of the three aspects - boundary times, start and end entries of the VQ codebook, and filling in - is effected in combination.
10. A method as claimed in Claim 9, in which there is employed the technique of dynamic programming graph search.
11. A method as claimed in any of the preceding Claims, in which additional descriptive information for each variable duration segment is derived by referring back to the frame values of the analysed signal.
12. A method as claimed in Claim 11, in which, for a speech signal, the description of the pitch for each variable duration segment is found once the boundaries are known, this being done by calculating a pitch value for the start and end of every variable duration segment, and selecting that one of several methods of filling in which most closely matches the pitch values of the sequence of frames in the analysed speech.
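Claim 12 might be sketched as follows (pitch values and fill-in methods are invented for illustration): once the segment boundaries are known, pitch values are taken at the segment's start and end, each fill-in method is evaluated against the per-frame pitch track, and the closest match is selected.

```python
def fill_errors(pitch_track, methods):
    """Squared error of each fill-in method against the segment's pitch track."""
    start, end, n = pitch_track[0], pitch_track[-1], len(pitch_track)
    errs = {}
    for name, fill in methods.items():
        approx = fill(start, end, n)
        errs[name] = sum((a - p) ** 2 for a, p in zip(approx, pitch_track))
    return errs

methods = {
    "linear": lambda s, e, n: [s + (e - s) * i / (n - 1) for i in range(n)],
    "constant": lambda s, e, n: [s] * n,
}
track = [100.0, 105.0, 110.0, 115.0, 120.0]   # pitch per frame, in Hz
errs = fill_errors(track, methods)
chosen = min(errs, key=errs.get)
```

For this rising toy track the linear fill-in matches exactly, so it would be the method encoded for the segment.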
13. A method as claimed in any of the preceding Claims, in which, for a vocoder to be used over a communications system, the formatting of the description of each variable duration segment involves quantising some of the digital values to approximate digital values, merging all of the digital values into a data packet of fixed length, and adding synchronisation bits.
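The formatting step of Claim 13 can be illustrated with a toy bit-packer; the field widths, parameter ranges and sync word below are all invented, not taken from the patent. Segment parameters are quantised, merged into a fixed-length packet, and synchronisation bits are added.

```python
SYNC = "10110"                      # hypothetical synchronisation pattern

def quantise(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer of the given bit width."""
    levels = (1 << bits) - 1
    q = round((value - lo) / (hi - lo) * levels)
    return max(0, min(levels, q))

def make_packet(duration, start_idx, end_idx, pitch_hz):
    """Merge the segment's digital values into one fixed-length bit string."""
    fields = [
        (duration, 5),                            # segment length in frames
        (start_idx, 6),                           # start VQ codebook entry
        (end_idx, 6),                             # end VQ codebook entry
        (quantise(pitch_hz, 50.0, 400.0, 7), 7),  # quantised pitch value
    ]
    body = "".join(format(v, f"0{w}b") for v, w in fields)
    return SYNC + body                            # fixed length: 5 + 24 bits

packet = make_packet(duration=12, start_idx=40, end_idx=7, pitch_hz=120.0)
```

Every packet has the same length regardless of the segment it describes, which is what lets a decoder resynchronise on the sync bits over a communications channel.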
14. A method for processing an analogue signal waveform into a digital data sequence as claimed in any of the preceding Claims and substantially as described hereinbefore.
15. A digital information system in which an analogue signal is processed by a method as claimed in any of the preceding Claims.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9208177A GB9208177D0 (en) | 1992-04-13 | 1992-04-13 | Digital signal coding |
Publications (3)
Publication Number | Publication Date |
---|---|
GB9307485D0 GB9307485D0 (en) | 1993-06-02 |
GB2266213A true GB2266213A (en) | 1993-10-20 |
GB2266213B GB2266213B (en) | 1995-10-11 |
Family
ID=10713999
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB9208177A Pending GB9208177D0 (en) | 1992-04-13 | 1992-04-13 | Digital signal coding |
GB9307485A Expired - Fee Related GB2266213B (en) | 1992-04-13 | 1993-04-08 | Digital signal coding |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB9208177A Pending GB9208177D0 (en) | 1992-04-13 | 1992-04-13 | Digital signal coding |
Country Status (2)
Country | Link |
---|---|
GB (2) | GB9208177D0 (en) |
WO (1) | WO1993021627A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2284732A (en) * | 1993-11-22 | 1995-06-14 | British Tech Group | Spectral analysis of speech signals |
EP1439525A1 (en) * | 2003-01-16 | 2004-07-21 | Siemens Aktiengesellschaft | Optimisation of transition distortion |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4625286A (en) * | 1982-05-03 | 1986-11-25 | Texas Instruments Incorporated | Time encoding of LPC roots |
CA1243779A (en) * | 1985-03-20 | 1988-10-25 | Tetsu Taguchi | Speech processing system |
- 1992-04-13: GB application GB9208177A (published as GB9208177D0), status: Pending
- 1993-04-08: GB application GB9307485A (published as GB2266213B), status: Expired - Fee Related
- 1993-04-08: WO application PCT/GB1993/000748 (published as WO1993021627A1), status: Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO1993021627A1 (en) | 1993-10-28 |
GB9208177D0 (en) | 1992-05-27 |
GB9307485D0 (en) | 1993-06-02 |
GB2266213B (en) | 1995-10-11 |
Similar Documents
Publication | Title
---|---
US5305421A (en) | Low bit rate speech coding system and compression
US5255339A (en) | Low bit rate vocoder means and method
US7496505B2 (en) | Variable rate speech coding | |
AU639394B2 (en) | Speech synthesis using perceptual linear prediction parameters | |
US8229738B2 (en) | Method for differentiated digital voice and music processing, noise filtering, creation of special effects and device for carrying out said method | |
KR20010014352A (en) | Method and apparatus for speech enhancement in a speech communication system | |
JPH10187197A (en) | Voice coding method and device executing the method | |
WO1998040878A1 (en) | Vocoder for coding speech using correlation between spectral magnitudes and candidate excitations | |
KR100216018B1 (en) | Method and apparatus for encoding and decoding of background sounds | |
EP1497631B1 (en) | Generating lsf vectors | |
KR0155315B1 (en) | Celp vocoder pitch searching method using lsp | |
US7089180B2 (en) | Method and device for coding speech in analysis-by-synthesis speech coders | |
GB2266213A (en) | Digital signal coding | |
Dankberg et al. | Development of a 4.8-9.6 kbps RELP Vocoder | |
GB2343822A (en) | Using LSP to alter frequency characteristics of speech | |
Chandra et al. | Linear prediction with a variable analysis frame size | |
Wong | On understanding the quality problems of LPC speech | |
Holmes | Towards a unified model for low bit-rate speech coding using a recognition-synthesis approach. | |
Un | A low-rate digital formant vocoder | |
JP4230550B2 (en) | Speech encoding method and apparatus, and speech decoding method and apparatus | |
Yuan | The weighted sum of the line spectrum pair for noisy speech | |
Halaly et al. | A phonetic vocoder with scalable adaptation to speaker codebooks | |
Reddy et al. | Use of segmentation and labeling in analysis-synthesis of speech | |
Malah | Efficient spectral matching of the LPC residual signal | |
Linggard | Neural networks for speech processing: An introduction |
Legal Events
Date | Code | Title | Description
---|---|---|---
20030408 | PCNP | Patent ceased through non-payment of renewal fee | Effective date: 20030408