WO1997025708A1 - Method and system for coding human speech for subsequent reproduction thereof - Google Patents


Info

Publication number
WO1997025708A1
Authority
WO
WIPO (PCT)
Application number
PCT/IB1996/001448
Other languages
French (fr)
Inventor
Raymond Nicolaas Johan Veldhuis
Paul Augustinus Peter Kaufholz
Original Assignee
Philips Electronics N.V.
Philips Norden Ab
Application filed by Philips Electronics N.V., Philips Norden Ab
Priority to JP9525031A (published as JPH11502326A)
Priority to EP96940095A (published as EP0815555A1)
Publication of WO1997025708A1

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/06 Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G10L25/15 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information

Definitions

  • Both the glottal pulse sequence and the filter characteristic are adapted to attain improved sound quality with the available facilities.
  • the problems to be solved are: a. what filter coefficients correspond to the original filter; b. what filter coefficients correspond to the spectral behaviour of the input pulse sequence (here, the one according to Figure 4c).
  • the phase of the processing result of the glottal pulse sequence is taken into account, which in existing technology was considered to be rather ill-behaved.
  • the filter used is a so-called minimum-phase filter that controls the phase relationships. In particular, it models the resonances of the vocal tract.
  • the remainder of the transfer function is modelled through shaping the glottal pulses themselves. Now, the transfer function of the filter can be written as:
  • H(ω) = 1 / {1 − a_1 e^(−jω) − a_2 e^(−2jω) − a_3 e^(−3jω) − ...};
  • H^(−1)(ω) = (1 − α_1 e^(−jω)) · (1 − α_2 e^(−jω)) · (1 − α_3 e^(−jω)) · ...;
  • each α_k is a complex pole lying inside the unit circle; since the filter coefficients are real, the complex conjugate of each pole is a pole too.
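As a numerical illustration of the factorization above, the complex poles can be obtained directly as the roots of the LPC denominator polynomial. This is a minimal sketch assuming NumPy; the second-order coefficients are illustrative values, not taken from the patent.

```python
import numpy as np

def lpc_poles(a):
    """Return the complex poles of H(z) = 1 / (1 - sum_k a[k] z^-k).

    `a` holds the prediction coefficients a_1..a_p; multiplying the
    denominator by z^p, the poles are the roots of
    z^p - a_1 z^(p-1) - ... - a_p.
    """
    denom = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.roots(denom)

# Illustrative 2nd-order filter with a resonance at 0.1 * f_s:
# a conjugate pole pair r*exp(+/-j*theta) gives a_1 = 2r*cos(theta), a_2 = -r^2.
r, theta = 0.95, 2 * np.pi * 0.1
poles = lpc_poles([2 * r * np.cos(theta), -r ** 2])
# Both poles lie inside the unit circle and form a conjugate pair.
```

Each real second-order section of the filter thus contributes one such conjugate pole pair; collecting all pairs reproduces the product form of H^(−1)(ω).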
  • Figure 8 is a pole plot of a filter as used.
  • a pole 30 and its complex conjugate 32 of the above function have been shown, as corresponding to a particular resonance of the human vocal tract.
  • a shaded region has been shown in the pole plot. At right, this comprises a sector between angles ±ω_min, which correspond to the lowest resonance frequencies of the human vocal tract, slightly dependent on age, gender, etcetera.
  • a common value for this angle corresponds to a frequency of 200Hz, which may depend on the particular voice type selected.
  • Figure 12 is a graph illustrating spectral tilt, as present in the real part of the transfer function of such a 'rest' filter as a function of ⁇ .
  • the curve starts at a value of 1, and more or less gradually decreases for higher values of ⁇ .
  • the initial downward slope is called the spectral tilt of the filter.
  • the glottal pulse sequence must now have an initial spectral tilt that has substantially the same value as the transfer function shown. This is effected by shaping the parameters of the LF model.
  • the spectral tilt influences the 'warmth' of the speech as subjectively felt by a human listener: a steeper slope gives a 'warmer' sound.
  • the tilt is connected with the closing speed of the vocal chords. If the closing is fast, relatively much high-frequency energy persists, but if the closing is slow, relatively little high-frequency energy is present in the voice.
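The spectral tilt discussed above can be estimated numerically as the initial slope, in dB, of a filter's magnitude response. A sketch assuming NumPy; the filters and the frequency range over which the slope is fitted are illustrative assumptions.

```python
import numpy as np

def spectral_tilt(b, a, f_lo=0.0, f_hi=0.1):
    """Estimate the initial slope (dB per unit of normalized frequency) of
    the magnitude response of a rational filter b(z)/a(z) near omega = 0.
    A more negative value means a steeper tilt, i.e. a 'warmer' voice.
    """
    w = np.linspace(f_lo, f_hi, 64) * 2 * np.pi   # normalized rad/sample
    z = np.exp(1j * w)
    # b and a hold coefficients of powers of z^-1, lowest power first.
    H = np.polyval(b[::-1], 1 / z) / np.polyval(a[::-1], 1 / z)
    mag_db = 20 * np.log10(np.abs(H))
    # Least-squares slope of magnitude (dB) versus normalized frequency.
    return np.polyfit(w / (2 * np.pi), mag_db, 1)[0]

# A one-pole low-pass 'rest' filter 1 / (1 - 0.9 z^-1) tilts downward,
# while the trivial filter H = 1 has no tilt at all.
tilt_lp = spectral_tilt([1.0], [1.0, -0.9])
tilt_flat = spectral_tilt([1.0], [1.0])
```

Matching this slope between the estimated 'rest' filter and the shaped glottal pulse is what the text above prescribes.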
  • the coefficients of the vocal-tract filter and the spectral representation of the glottal pulse are derived from the coefficients of the synthesis filter as follows. First, all formant frequencies are assumed to lie above 200 Hz, and the magnitudes of the complex poles of Hv are assumed to lie above a threshold of 0.85, but within the unit circle. Separating the complex poles that correspond to formants from those that do not results in a representation of the transfer function as a product:
  • the first factor is the estimate of the glottal-pulse filter Hs/Hv in (1), and contains all poles that cannot be assigned to formants.
  • the second factor is the estimate for the vocal-tract filter, that contains all formant poles.
  • Figure 9a shows a comparison of the vocal-tract filters by means of closed-phase analysis and by means of the above approximation. The same comparison is made for the glottal-pulse filters in Figure 9b. There are only limited differences to be found around the formant frequencies. These are produced because the closed-phase analysis generally favours sharper formant peaks.
  • the separation criterion used here was as follows: all poles corresponding to frequencies below the mentioned threshold frequency 200 Hz were assumed to be unrelated to formant frequencies.
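The separation criterion just stated can be sketched as follows, assuming NumPy; the sampling rate is an illustrative assumption, while the 200 Hz and 0.85 thresholds follow the values given above.

```python
import numpy as np

def split_poles(poles, fs=8000.0, f_min=200.0, r_min=0.85):
    """Split the complex poles of a synthesis filter into formant and
    non-formant groups.

    A pole (with its conjugate) is treated as a formant pole when its
    angle corresponds to a frequency of at least f_min Hz and its
    magnitude is at least r_min; the remaining poles are assigned to
    the glottal-pulse ('rest') filter.
    """
    poles = np.asarray(poles)
    freqs = np.abs(np.angle(poles)) * fs / (2 * np.pi)
    is_formant = (freqs >= f_min) & (np.abs(poles) >= r_min)
    return poles[is_formant], poles[~is_formant]

# One low-frequency (glottal) pole pair and one pole pair at ~1000 Hz (formant):
glottal = 0.9 * np.exp(1j * np.array([1, -1]) * 2 * np.pi * 50 / 8000)
formant = 0.95 * np.exp(1j * np.array([1, -1]) * 2 * np.pi * 1000 / 8000)
f_poles, g_poles = split_poles(np.concatenate([glottal, formant]))
```

The formant group defines the vocal-tract filter; the remainder defines the glottal-pulse factor, as in the product representation above.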
  • The separation between formant poles and non-formant poles of Hs is particularly simple if Hs is itself represented as a product of second-order sections, or in so-called PQ pairs, which are different representations of the formant parameters, cf. John R. Deller, Jr. et al., Discrete-Time Processing of Speech Signals, Macmillan 1993, pp. 331-333.
  • the LF parameters can be estimated according to the following example.
  • the quantities A (arbitrary amplitude), α, ω_g, t_e, ε, and the LF parameter pitch T_0 are the generation parameters, of which the middle four have yet to be ascertained, and which are most apt for attaining a closed mathematical expression.
  • the pitch is known in the synthesizer.
  • the other parameters must be optimized in a systematic manner. A first approach to this optimization is to tune the four parameters until there is a good magnitude match in the frequency domain between the glottal-pulse filter and the LF filter.
  • the estimated glottal-pulse filter is an all-pole filter of a certain order.
  • This filter can be taken as a reference for an all-pole filter of the same order derived from the LF pulse.
  • the parameters of the LF must then be adapted until a sufficient match occurs.
  • An all- pole filter can be derived from the LF pulse by first finding a correlation function:
  • R_LF(k) = Σ_n (dg(n)/dt) · (dg(n+k)/dt)    (4)
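An all-pole filter can then be derived from the correlation function of expression (4) via the Levinson-Durbin recursion. A sketch assuming NumPy; the test signal is an illustrative stand-in for the LF-pulse time derivative, chosen so the recovered coefficients can be checked.

```python
import numpy as np

def autocorr(x, order):
    """R(k) = sum_n x(n) * x(n+k) for k = 0..order, as in expression (4)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])

def levinson(R):
    """Levinson-Durbin recursion: prediction coefficients a_1..a_p and the
    residual error for H(z) = g / (1 - sum_k a_k z^-k), from lags R(0)..R(p)."""
    p = len(R) - 1
    a = np.zeros(p)
    err = R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[:i - 1], R[i - 1:0:-1])) / err
        a[:i] = np.concatenate([a[:i - 1] - k * a[:i - 1][::-1], [k]])
        err *= (1.0 - k * k)
    return a, err

# Impulse response of an illustrative all-pole filter
# 1 / (1 - 0.8 z^-1 + 0.64 z^-2), standing in for the LF-pulse derivative.
x = np.zeros(400)
x[0] = 1.0
x[1] = 0.8 * x[0]
for n in range(2, 400):
    x[n] = 0.8 * x[n - 1] - 0.64 * x[n - 2]

a_est, err = levinson(autocorr(x, 2))   # recovers roughly [0.8, -0.64]
```

Taking the reference filter and the LF-derived filter to the same order, as the text prescribes, makes their coefficient sets directly comparable.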
  • a second exemplary procedure is to measure certain characteristic parameters from the estimated glottal pulse filter, such as the spectral tilt recited supra, and generate an LF pulse having the same characteristics.
  • the relation between the LF parameters and the estimated characteristics is determined by the eventual outcome.
  • a further feasible procedure is to choose the amplitude of the LF pulse in such a manner that its energy measured over one pitch period can be rendered equal to the energy of the response of the glottal pulse filter when excited with an impulse with the magnitude of the gain parameter.
  • the required quantities are calculated in a straightforward manner.
  • the quality of the results attained is advantageously evaluated in a perceptual manner.
  • the objects to be compared are preferably a sustained but short vowel in the respective three variants: the original one, the mono-pulse synthesized vowel, and the vowel synthesized with the improved glottal modelling.
  • the estimating of the complex poles of the transfer function of the LPC speech synthesis filter which has a spectral envelope corresponding to the human speech information includes estimating a fixed first line spectrum that is associated to expression 5 hereinafter. Moreover, the procedure includes estimating a fixed second line spectrum that is associated to expression 7 hereinafter, as pertaining to the human vocal tract model. Furthermore, the procedure includes finding a variable third line spectrum, associated to expression 8 hereinafter, which corresponds to the glottal pulse related sequence, for matching said third line spectrum to the estimated first line spectrum, until an appropriate matching level is attained.
  • Figures 13a, 13b give an exemplary glottal pulse, and its time derivative, respectively, as modelled.
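A pulse of the kind shown in Figures 13a and 13b can be sketched with a simplified Liljencrants-Fant formulation. The two branch equations follow the standard LF model from the literature; the parameter values, the fixed open-phase growth factor, and the eps ≈ 1/t_a shortcut are assumptions for illustration, not values from the patent.

```python
import numpy as np

fs, T0 = 16000.0, 0.01              # sampling rate and pitch period (illustrative)
tp, te, ta, Ee = 0.0045, 0.006, 0.0003, 1.0
alpha = 300.0                       # open-phase growth factor (assumed, not fitted)

def lf_derivative():
    """Sampled time derivative of a simplified LF glottal pulse.

    Open phase (t <= te): E0 * exp(alpha*t) * sin(pi*t/tp), with E0 scaled
    so the branch ends at -Ee (the closure peak).  Return phase: exponential
    recovery; the exact LF model solves an implicit equation for eps, which
    is approximated here by 1/ta.
    """
    t = np.arange(0.0, T0, 1.0 / fs)
    wg = np.pi / tp                                 # puts the flow peak near tp
    E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))
    eps = 1.0 / ta
    open_phase = E0 * np.exp(alpha * t) * np.sin(wg * t)
    ret_phase = -(Ee / (eps * ta)) * (np.exp(-eps * (t - te))
                                      - np.exp(-eps * (T0 - te)))
    return np.where(t <= te, open_phase, ret_phase)

dgdt = lf_derivative()              # cf. Figure 13b
g = np.cumsum(dgdt) / fs            # the glottal pulse itself, cf. Figure 13a
```

The sharp negative excursion at t_e corresponds to the instant of glottal closure mentioned earlier.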
  • the sampling frequency is f s
  • the fundamental frequency is f 0
  • the parameters used hereinafter are the so-called specification parameters, that are equivalent with the generation parameters but are more closely related to the physical aspects of the speech generation instrument.
  • t e and t a have no immediate translation to the generation parameters.
  • the signal segment shown in the Figures contains at least two fundamental periods.
  • a window function e.g. the Hanning window
  • the glottal pulse parameters t e , t p , t a are obtained as the minimizing arguments of the function
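The minimization just described can be sketched as a matching of line-spectrum magnitudes. For clarity, a grid search over t_e alone stands in here for the full minimization over t_e, t_p, t_a; NumPy is assumed, and the pulse family and all values are illustrative.

```python
import numpy as np

def pulse_spectrum(dgdt, n_harm, period):
    """Magnitudes of the first n_harm harmonics of a periodic pulse,
    i.e. a discrete line spectrum of the kind being matched above."""
    spec = np.fft.rfft(dgdt, n=period)
    return np.abs(spec[1:n_harm + 1])

def fit_te(target_dgdt, make_pulse, te_grid, n_harm=10):
    """Pick t_e from a grid so that the candidate pulse's line spectrum
    best matches the target's, in a least-squares sense."""
    period = len(target_dgdt)
    target = pulse_spectrum(target_dgdt, n_harm, period)
    errs = [np.sum((pulse_spectrum(make_pulse(te), n_harm, period) - target) ** 2)
            for te in te_grid]
    return te_grid[int(np.argmin(errs))]

# Toy pulse family: a raised-sine burst whose open-phase length is set by t_e.
fs, T0 = 16000.0, 0.01
def make_pulse(te):
    n, n_open = int(round(T0 * fs)), int(round(te * fs))
    p = np.zeros(n)
    p[:n_open] = np.sin(np.linspace(0.0, np.pi, n_open)) ** 2
    return p

target = make_pulse(0.006)
best = fit_te(target, make_pulse, np.array([0.004, 0.005, 0.006, 0.007]))
```

A full implementation would minimize the same criterion jointly over t_e, t_p and t_a, as stated in the text.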

Abstract

Human speech is coded for subsequent reproduction through the following steps: (a) receiving an amount of human-speech-expressive information; (b) estimating all complex poles of a transfer function of an LPC speech synthesis filter that has a spectral envelope corresponding to the information; (c) singling out from the transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; (d) defining a glottal pulse related sequence representing the singled out poles; (e) defining a second filter with a complex transfer function to express all the other poles; (f) outputting speech represented by a filter based on combining the glottal pulse related sequence and a representation of the second filter.

Description

Method and system for coding human speech for subsequent reproduction thereof.
BACKGROUND TO THE INVENTION
The invention relates to a method for coding human speech for subsequent reproduction thereof. A well-known method is based on the principles of LPC-coding, but the results thereof have proven to be of moderate quality only. The present inventors have discovered that the principles of LPC coding represent a good starting point for undertaking further effort for improvement. In particular, values of various filter LPC-characteristics can be adapted to get an improved result, when the various influences on speech generation are taken into account in a more refined manner.
SUMMARY TO THE INVENTION
Accordingly, amongst other things it is an object of the invention to improve speech generation filter characteristics for operating with the above recited technology, and in particular to maintain compatibility with LPC-data bases to a certain extent. Now, according to one of its aspects, the method according to the invention comprises the further steps of receiving an amount of human-speech-expressive information; estimating all complex poles of a transfer function of an LPC speech synthesis filter which has a spectral envelope corresponding to said information; singling out from said transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; defining a glottal pulse related sequence representing said singled out poles; defining a second filter having a complex transfer function, as expressing said all other poles; outputting speech represented by filter means based on combining said glottal pulse related sequence and a representation of said second filter.
Distinguishing the complex poles into two groups allows each of the groups to be modelled separately in an optimal manner.
Advantageously, said estimating furthermore includes estimating a fixed first line spectrum (expr.5) associated to said human speech expressive information, estimating a fixed second line spectrum (expr.7) as pertaining to said human vocal tract model, and furthermore including finding of a variable third line spectrum (expr.8) corresponding to said glottal pulse related sequence, for matching said third line spectrum to said estimated first line spectrum, until attaining an appropriate matching level. It has been found that this way of matching is straightforward, yet results in a very good performance.
Preferably, said singling out pertains to all poles associated to a frequency below a predetermined threshold frequency. In this way, the distinguishing is simple and straightforward to implement. In practice, these low-frequency poles are just the ones that must be singled out. Advantageously, a method as recited uses an LPC-compatible speech data base. Such databases are readily available for a great variety of speech types and languages. The invention also relates to a system for executing a method for coding human speech as described supra. Further advantageous aspects of the invention are recited in dependent Claims. By itself, manipulating speech in various ways has been disclosed in EP
527 527, corresponding US 5,479,564 (PHN 13801), in EP 527 529, corresponding US Application Serial No. 07/924,726 (PHN 13993), and EP 95203210.0, corresponding US Application Serial No. 08/...,... (PHN 15553), all to the present assignee. The first two references describe the affecting of speech duration through systematically inserting and/or deleting pitch periods of the unprocessed speech. The third reference operates in a comparable manner on a short-time Fourier transform of the speech. As stated earlier, the present invention aims at compact storage of coded speech as a low-cost solution. The references require rather more extensive storage space.
BRIEF DESCRIPTION OF THE DRAWING
These and other aspects and advantages of the invention will be described with reference to the preferred embodiments disclosed hereinafter, and in particular with reference to the appended Figures that show:
Figure 1, a known mono-pulse vocoder;
Figure 2a, excitation of such vocoder;
Figure 2b, an exemplary speech signal generated thereby;
Figure 3a, a speech generation model on a filter basis;
Figure 3b, a secondary model derived therefrom;
Figure 4a, a transfer function of a vocal tract;
Figure 4b, a transfer function of a synthesis filter;
Figure 4c, a transfer function of a glottal pulse filter;
Figure 5a, an exemplary natural speech signal;
Figure 5b, a sequence of glottal pulses associated therewith;
Figure 5c, the same sequence differentiated versus time;
Figure 6, an impulse response of a glottal pulse filter;
Figure 7, a proposed synthesizer;
Figure 8, a pole plot of a filter as used;
Figure 9a, two transfer functions compared;
Figure 9b, two further transfer functions compared;
Figure 10, an exemplary glottal pulse time derivative;
Figures 11a, 11b, an all-pole spectral representation of the pulse of Figure 10;
Figure 12, a graph illustrating spectral tilt;
Figures 13a, 13b, a glottal pulse and its time derivative.
DESCRIPTION OF THE PRINCIPLES OF THE INVENTION
Figure 1 shows a mono-pulse or LPC (linear predictive coding) based vocoder according to the state of the art, such as described in many textbooks, of which a relevant citation is Douglas O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley 1987, p. 336-364. Advantages of LPC are the extremely compact manner of storage and its facility for manipulating speech so coded in an easy manner. A disadvantage is the relatively poor quality of the speech produced. Conceptually, speech synthesis is by means of all-pole filter 54 that receives the coded speech and outputs a sequence of speech frames on output 58. Input 40 symbolizes the actual pitch frequency, which at the actual pitch period recurrence is fed to item 42 that controls the generating of voiced frames. In contradistinction, item 44 controls the generating of unvoiced frames, which are generally represented by (white) noise. Multiplexer 46, as controlled by selection signals 48, selects between voiced and unvoiced. Amplifier block 52, as controlled by item 50, can vary the actual gain factor. Filter 54 has time-varying filter coefficients as symbolized by controlling item 56. Typically, the various parameters are updated every 5-20 milliseconds. The synthesizer is called mono-pulse-excited, because there is only one single excitation pulse per pitch period. The input from amplifier block 52 into filter 54 is called the excitation signal. Generally, Figure 1 is a parametric model that has no direct relationship to the properties of the human vocal tract. The approach according to Figure 1 is widespread, and a large data base has been compiled for application in many fields.
In this respect, Figure 2a shows an excitation example of such vocoder, and Figure 2b the speech signal generated thereby, wherein time has been denoted in seconds and actual speech signal amplitude in arbitrary units.
Now, the present invention intends to improve on the above reproduction of voiced speech in a simple manner. Here, the prime point of view of the invention is to mimic in a certain way the physical generation of human speech. For explanation, Figure 3a is a speech generation model on a filter basis, as based on the way speech is generated in the human vocal tract: in contradistinction to Figure 1, Figure 3a is a physical, or even physiological, model, in that it is much more closely related to the geometrical and physical properties of the vocal tract. Block 20 is again an all-pole filter, fed by source 22 with a sequence of glottal pulses in the form of a pulsating air flow, as will be shown in Figure 5. In humans, the sound radiated from the lips on notional output 26 is in this radiating process more or less differentiated, as symbolized by differentiator or high-pass filter 24. The set-up itself of the module is similar to that of Figure 1, but both source 22 and filters 20, 24 have different characteristics. Through combining the differentiator with the source, an amended set-up is now reached according to Figure 3b, where the source 23 produces the time derivative of the glottal air flow. One of the advantages of the present invention is the possible use of the LPC-inspired data base. In the future, data bases that will have been improved with a view to the present invention will provide still better performance.
In view of this differential feature, the original sound track (of an exemplary vowel /a/) that has been shown in Figure 5a is represented by the associated glottal pulse stream of Figure 5b, with a view to separating the properties of the glottal pulses from those of the vocal tract proper. The speech generated is based on both these constituents, through feeding the vocal tract parameter representation with the glottal pulse representation. Next, the glottal pulses are translated into their time-differential, according to Figure 5c. In this latter Figure, the sharp peaks indicate the instants of glottal closure, which is the prime instant for the inputting. The segment length as shown corresponds to the typical length of a synthesis frame. The glottal pulse and its time derivative have been obtained by an inverse filtering technique called closed-phase analysis. In this technique, first an estimate is made of the intervals of glottal closure. Within those intervals, the speech consists only of resonances of the vocal tract. These intervals are subsequently used to produce an all-zero inverse filter. The glottal pulse time derivative is then obtained by inverse filtering with this filter. The glottal pulse itself is subsequently obtained by integrating the time derivative. The vocal-tract filter is the inverse of the obtained all-zero filter. The magnitude of the transfer function of the vocal-tract filter Hv is shown in Figure 4a. The magnitude of the transfer function of the synthesis filter Hs for the same segment is shown in Figure 4b. The two transfer functions apparently contain the same formant resonances, but are different at low frequencies. This is caused by the fact that Hs describes both the spectral behaviour of the vocal tract and the spectral behaviour of the glottal pulse time derivative, whereas Hv only describes the behaviour of the vocal tract. Figure 4c gives the transfer function of the glottal-pulse filter. By itself, D.G.
Childers and C.K. Lee, Vocal quality factors: Analysis, synthesis, and perception, J. Accoust. Soc. Am. 90(5), November 1991, p. 2394-2410 describes the influence of the glottal pulse on the sounding of a voice.
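The closed-phase analysis described above can be sketched in outline. This is a minimal illustration, not the patent's implementation: the per-interval analysis is replaced by a plain autocorrelation fit over the pooled closed-phase samples, and all function and variable names are ours.

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Solve the normal equations for the prediction polynomial A(z)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] += k * a[1:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def closed_phase_decompose(speech, closed_idx, order=12):
    """Fit an all-zero inverse filter on closed-phase samples only,
    inverse-filter the whole frame to estimate the glottal-pulse time
    derivative, then integrate to recover the glottal pulse itself."""
    seg = np.asarray(speech, float)[closed_idx]
    r = np.array([np.dot(seg[:len(seg) - k], seg[k:])
                  for k in range(order + 1)])
    a, _ = levinson_durbin(r, order)
    d_glottal = lfilter(a, [1.0], speech)   # all-zero inverse filtering
    glottal = np.cumsum(d_glottal)          # integrate the time derivative
    return a, d_glottal, glottal
```

The vocal-tract filter is then simply 1/A(z), the inverse of the fitted all-zero filter, as stated in the text.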
The synthesis system of Figure 1 is now compared with the model of Figure 3b by writing:
Hs(z) = {Hs(z)/Hv(z)} * Hv(z) = Hg(z) * Hv(z)    (1)
Herein Hg is a linear filter, called the glottal pulse filter. Its impulse response models the glottal-pulse time derivative in the synthesizer. The filter Hg has a minimum-phase transfer function. This follows from both Hs and Hv being stable all-pole filters. The glottal-pulse-filter transfer function is shown in Figure 4c and the impulse response is shown in Figure 6. Comparing this synthesis model of the glottal-pulse time derivative with the true time derivative in Figure 5c shows that although the spectral magnitudes may be identical, their time-domain representations are quite different. Such a difference also exists between the time-domain representations of the original speech and the synthesized speech.
Clearly, the implicit glottal-pulse model of the mono-pulse vocoder differs from the true glottal pulse. The reason is that the true glottal-pulse time derivative cannot be approximated closely as the impulse response of a minimum-phase system. It is proposed that a synthesizer derived from the model of Figure 3b, provided with an improved representation of the glottal-pulse time derivative and a synthesis filter that only models the resonances of the vocal tract, will result in a better perceptual speech quality.
The proposed synthesizer is shown in Figure 7. A specific requirement is to remain compatible with existing data bases, which necessitates generating the parameters pertaining to the sources 40, 48, 50 and 56 in Figure 1. This is realized as follows. The filter coefficients of the original synthesis filter are used to derive the coefficients of the vocal-tract filter and the glottal-pulse filter. By way of preferred example, the Liljencrants-Fant (LF) model is used for describing the glottal pulse, of which a lucid explanation has also been given in the above-referred-to Childers-Lee publication (references to Fant, and Fant et al.). The parameters thereof are tuned to attain magnitude matching in the frequency domain between the glottal pulse filter and the LF pulse. This leads to an excitation of the vocal tract filter that has both the desired spectral characteristics and a realistic temporal representation. The necessary steps are recited hereinafter.
According to the invention, both the glottal pulse sequence and the filter characteristic are adapted to attain improved sound quality for the available facilities. The problems to be solved are: a. what filter coefficients correspond to the original filter; b. what filter coefficients correspond to the spectral behaviour of the input pulse sequence (here, the one according to Figure 4c). In particular, the phase of the processing result of the glottal pulse sequence is taken into account, which existing technology considered rather ill-behaved. The filter used is a so-called minimum-phase filter that controls the phase relationships. In particular, it models the resonances of the vocal tract. The remainder of the transfer function is modelled through shaping the glottal pulses themselves. Now, the transfer function of the filter can be written as:
H = 1/{1 + a1e^(-jθ) + a2e^(-2jθ) + a3e^(-3jθ) + ...};
Herein θ varies between 0 and π, that is, up to one half of the sample frequency. Another representation is:
H^(-1) = (1 - α1e^(-jθ)) * (1 - α2e^(-jθ)) * (1 - α3e^(-jθ)) * ...;
Herein, each αk is a complex pole lying inside the unit circle, which means that its complex conjugate is a pole too. In this respect, Figure 8 is a pole plot of a filter as used. As an example, a pole 30 and its complex conjugate 32 of the above function have been shown, corresponding to a particular resonance of the human vocal tract. Now in Figure 8, a shaded region has been shown in the pole plot. At right, this comprises a sector between angles +/- θmin, corresponding to the lowest resonance frequencies of the human vocal tract, slightly dependent on age, gender, etcetera. A common value for this angle corresponds to a frequency of 200 Hz, which may depend on the particular voice type selected. Also, a narrow strip along the negative real axis may comprise poles that would not spring from the above resonances. Therefore, a new filter is constructed that retains only the poles in the unshaded region. In this respect, Figure 12 is a graph illustrating the spectral tilt present in the real part of the transfer function of such a 'rest' filter as a function of θ. The curve starts at a value of 1, and more or less gradually decreases for higher values of θ. The initial downward slope is called the spectral tilt of the filter. The glottal pulse sequence must now have an initial spectral tilt that has substantially the same value as the transfer function shown. This is effected by shaping the parameters of the LF model.
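The pole screening just described can be illustrated with a short sketch: the poles of the all-pole filter are classified against the shaded region of Figure 8, using the 200 Hz sector from the text; the half-width of the strip along the negative real axis is our own assumption, as are all names.

```python
import numpy as np

def split_poles(a, fs, f_min=200.0, strip_halfwidth=0.05):
    """Split the poles of an all-pole filter 1/A(z) into formant poles and
    residual ('glottal') poles.  A pole counts as non-formant when its angle
    falls inside the low-frequency sector below f_min, or when it lies in a
    narrow strip along the negative real axis."""
    poles = np.roots(a)                      # a = [1, a1, ..., ap]
    theta_min = 2.0 * np.pi * f_min / fs
    formant, rest = [], []
    for p in poles:
        in_sector = abs(np.angle(p)) < theta_min
        in_strip = p.real < 0.0 and abs(p.imag) < strip_halfwidth
        (rest if in_sector or in_strip else formant).append(p)
    # Rebuild real-coefficient polynomials from the two pole sets.
    a_v = np.real(np.poly(formant)) if formant else np.array([1.0])
    a_g = np.real(np.poly(rest)) if rest else np.array([1.0])
    return a_v, a_g
```

For a conjugate pole pair at 500 Hz (a formant) and one at 100 Hz (below the sector threshold), the function returns one second-order polynomial for each set.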
In particular, the spectral tilt influences the 'warmth' of the speech as subjectively perceived by a human listener: a steeper slope gives a 'warmer' sound. Physiologically, the tilt is connected with the closing speed of the vocal cords. If the closing is fast, relatively much high-frequency energy persists; if the closing is slow, relatively little high-frequency energy is present in the voice.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The coefficients of the vocal tract filter and the spectral representation of the glottal pulse are derived from the coefficients of the synthesis filter as follows. First, all formant frequencies are assumed to lie above 200 Hz, and the magnitudes of the complex poles of Hv to lie above a threshold of 0.85, but within the unit circle. Separating the complex poles that correspond to formants from those that do not results in a representation of the transfer function as a product:
Hs = {Ag(z)}^(-1) * {Av(z)}^(-1)    (2)
Herein, the first factor is the estimate of the glottal-pulse filter Hs/Hv in (1), which contains all poles that cannot be assigned to formants. The second factor is the estimate for the vocal-tract filter, which contains all formant poles. In this respect, Figure 9a shows a comparison of the vocal-tract filters obtained by means of closed-phase analysis and by means of the above approximation. The same comparison is made for the glottal-pulse filters in Figure 9b. There are only limited differences to be found around the formant frequencies. These are produced because the closed-phase analysis generally favours sharper formant peaks. The separation criterion used here was as follows: all poles corresponding to frequencies below the mentioned threshold frequency of 200 Hz were assumed to be unrelated to formant frequencies.
The separation between formant poles and non-formant poles of Hs is particularly simple if Hs is itself represented as a product of second-order sections, or in so-called PQ pairs, which are different representations of the formant parameters, cf. John R. Deller, Jr. et al., Discrete-Time Processing of Speech Signals, Macmillan 1993, pp. 331-333. The LF parameters can be estimated according to the following example.
First, a time-continuous version of the LF model of the glottal-pulse time derivative is employed as follows:

δg(t)/δt = 0,  t < 0
         = A sin(ωp t) exp(α t),  0 ≤ t ≤ te
         = A sin(ωp te) exp(α te) exp(-ε(t - te)),  te < t ≤ t0    (3)
Herein, the quantities A (an arbitrary amplitude), ωp, α, te, ε, and the pitch T0 are the generation parameters, of which the middle four need yet to be ascertained, and which are most apt for attaining a closed mathematical expression. There are also other sets of parameters that describe the LF glottal pulse. The pitch is known in the synthesizer. The other parameters must be optimized in a systematic manner. A first approach to this optimization is to tune the four parameters until there is a good magnitude match in the frequency domain between the glottal-pulse filter and the LF filter. The estimated glottal-pulse filter is an all-pole filter of a certain order. This filter can be taken as a reference for an all-pole filter of the same order derived from the LF pulse. The parameters of the LF model must then be adapted until a sufficient match occurs. An all-pole filter can be derived from the LF pulse by first finding a correlation function:
RLF(k) = Σn (δg(n)/δt) * (δg(n+k)/δt)    (4)
and then applying the Levinson-Durbin method to obtain the filter coefficients. For the Levinson-Durbin algorithm, cf. Deller, op. cit., pp. 297-302. Figures 11a, 11b show the spectral magnitude of the LF pulse in Figure 10 obtained in this manner.
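The two-step fit, the correlation function of expression (4) followed by the Levinson-Durbin recursion, can be sketched as follows; this is a minimal stand-alone version with our own naming, not the patent's code.

```python
import numpy as np

def allpole_from_pulse(dpulse, order):
    """Derive an all-pole filter from a glottal-pulse time derivative:
    autocorrelation as in expression (4), then Levinson-Durbin."""
    d = np.asarray(dpulse, float)
    n = len(d)
    r = np.array([np.dot(d[:n - k], d[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):          # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] += k * a[1:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

Applied to the impulse response of a known all-pole filter, the recursion recovers the original coefficients closely, which is the sense in which the LF-derived filter can be compared against the estimated glottal-pulse filter.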
A second exemplary procedure is to measure certain characteristic parameters from the estimated glottal pulse filter, such as the spectral tilt recited supra, and generate an LF pulse having the same characteristics. The relation between the LF parameters and the estimated characteristics is determined by the eventual outcome.
A further feasible procedure is to choose the amplitude of the LF pulse in such a manner that its energy measured over one pitch period is rendered equal to the energy of the response of the glottal pulse filter when excited with an impulse with the magnitude of the gain parameter. The required quantities are calculated in a straightforward manner. The quality of the results attained is advantageously evaluated in a perceptual manner. The objects to be compared are preferably a sustained but short vowel in the respective three variants: the original one, the mono-pulse synthesized vowel, and the vowel synthesized with the improved glottal modelling. A further extension of the procedure is as follows. The estimating of the complex poles of the transfer function of the LPC speech synthesis filter, which has a spectral envelope corresponding to the human speech information, includes estimating a fixed first line spectrum that is associated to expression 5 hereinafter. Moreover, the procedure includes estimating a fixed second line spectrum that is associated to expression 7 hereinafter, as pertaining to the human vocal tract model. Furthermore, the procedure includes finding a variable third line spectrum, associated to expression 8 hereinafter, which corresponds to the glottal pulse related sequence, for matching said third line spectrum to the estimated first line spectrum, until attaining an appropriate matching level.
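The amplitude choice just described can be sketched as follows: the LF pulse is rescaled so that its energy over one pitch period equals the energy of the glottal-pulse filter's response to a single impulse of the gain parameter's magnitude. Names are ours; `a_g` denotes the denominator coefficients of the all-pole glottal-pulse filter.

```python
import numpy as np
from scipy.signal import lfilter

def match_amplitude(lf_pulse, a_g, gain, period_len):
    """Rescale lf_pulse so that its energy over one pitch period equals the
    energy of the all-pole glottal-pulse filter 1/A_g(z) excited by a single
    impulse of height `gain`."""
    impulse = np.zeros(period_len)
    impulse[0] = gain
    ref = lfilter([1.0], a_g, impulse)     # reference impulse response
    e_ref = np.dot(ref, ref)               # its energy over one period
    e_lf = np.dot(lf_pulse[:period_len], lf_pulse[:period_len])
    return np.asarray(lf_pulse, float) * np.sqrt(e_ref / e_lf)
```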
Figures 13a, 13b give an exemplary glottal pulse, and its time derivative, respectively, as modelled. The sampling frequency is fs, the fundamental frequency is f0, the fundamental period is t0 = 1/f0. Further, tp = 2π/ωp. The parameters used hereinafter are the so-called specification parameters, which are equivalent to the generation parameters but are more closely related to the physical aspects of the speech generation instrument. In particular, te and ta have no immediate translation to the generation parameters. Note that the signal segment shown in the Figures contains at least two fundamental periods.
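A sampled version of the piecewise LF time derivative introduced above can be sketched as follows. Parameter names mirror the generation parameters (A, ωp, α, ε, te); the return-phase form exp(-ε(t - te)), which keeps the pulse continuous at te, is our reading of the model's notation.

```python
import numpy as np

def lf_derivative(fs, f0, A, omega_p, alpha, eps, t_e):
    """One pitch period of the LF glottal-flow time derivative:
    a growing sinusoid up to t_e, then an exponential return phase."""
    t0 = 1.0 / f0
    t = np.arange(0.0, t0, 1.0 / fs)
    rising = A * np.sin(omega_p * t) * np.exp(alpha * t)
    falling = (A * np.sin(omega_p * t_e) * np.exp(alpha * t_e)
               * np.exp(-eps * (t - t_e)))
    return t, np.where(t <= t_e, rising, falling)
```

The sharp negative spike near te in such a pulse marks the instant of glottal closure, matching the peaks visible in Figure 5c.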
Now, the signal line spectrum is given by expression (5), reproduced in the original only as an image, with wk, k = 0, ..., M-1, a window function, e.g. the Hanning window, and with expression (6), likewise reproduced only as an image, giving the number of spectral lines in the spectrum.
The vocal-tract line spectrum is given by expression (7), reproduced in the original only as an image, with A(exp(jθ)) the transfer function of the vocal-tract filter. The glottal-pulse line spectrum:

Gl = | ∫ (δg(t)/δt) exp(-2πj l f0 t) dt |,  l = 1, ..., L,    (8)

with δg(t)/δt the time derivative of the glottal pulse, e.g. according to the LF model. The glottal pulse parameters te, tp, ta are obtained as the minimizing arguments of a distance function, reproduced in the original only as an image, with β added to increase the perceptual relevance of this distance measure. It has been found that β = 1/3 gives satisfactory results.
An alternative distance measure is given by a further expression, reproduced in the original only as an image.
By itself, minimizing function values until attaining either the overall minimum, or at least an appropriate level, is a straightforward mathematical procedure. It has been found that the above minimizing leads to extremely agreeable speech generation.
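The β-compressed line-spectrum matching can be illustrated with a sketch. The exact distance expression is reproduced only as an image in the source, so the form below, a squared error between the β-th powers of the signal line spectrum and the product of the glottal-pulse and vocal-tract line spectra, is our own plausible reading, with β = 1/3 as stated in the text.

```python
import numpy as np

def spectral_distance(S, G, V, beta=1.0 / 3.0):
    """Perceptually weighted distance between the measured line spectrum S
    and the model line spectrum G*V; beta < 1 compresses large magnitudes
    so that weaker spectral lines still contribute to the match."""
    S = np.asarray(S, float)
    M = np.asarray(G, float) * np.asarray(V, float)
    return float(np.sum((S ** beta - M ** beta) ** 2))
```

An optimizer would vary the glottal-pulse parameters (te, tp, ta) until this distance reaches a minimum, or at least an acceptably low level.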

CLAIMS:
1. A method for coding human speech for subsequent reproduction thereof, said method comprising the steps of: receiving an amount of human-speech-expressive information; estimating all complex poles of a transfer function of an LPC speech synthesis filter which has a spectral envelope corresponding to said information; singling out from said transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; defining a glottal pulse related sequence representing said singled out poles; defining a second filter having a complex transfer function, as expressing said all other poles; outputting speech represented by filter means based on combining said glottal pulse related sequence and a representation of said second filter.
2. A method as claimed in Claim 1, wherein said estimating furthermore includes estimating a fixed first line spectrum associated to said human speech expressive information, estimating a fixed second line spectrum as pertaining to said human vocal tract model, and furthermore including finding of a variable third line spectrum corresponding to said glottal pulse related sequence, for matching said third line spectrum to said estimated first line spectrum, until attaining an appropriate matching level.
3. A method as claimed in Claims 1 or 2, wherein said singling out pertains exclusively to all poles associated to a frequency below a predetermined threshold frequency.
4. A method as claimed in Claims 1, 2 or 3, wherein said glottal pulse sequence is modelled according to a Liljencrants-Fant model.
5. A method as claimed in any of Claims 1 to 4, wherein before said outputting various parameters of the glottal pulse related sequence are manipulated.
6. A system for coding human speech for subsequent reproduction thereof, comprising: input means for receiving an amount of human-speech-expressive information; storage means for storing an estimation of all complex poles of a transfer function of an LPC speech synthesis filter which has a spectral envelope corresponding to said information, and singled out from said transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; defining means fed by said storage means for defining a glottal pulse related sequence representing said singled out poles; a second filter defined by a complex transfer function, as expressing said all other poles; filter means for outputting speech as represented by combining said glottal pulse related sequence and a representation of said second filter.
7. A system as claimed in Claim 6, wherein said estimation furthermore accommodates an estimation of a fixed first line spectrum (expr.1) associated to said human speech expressive information, and further an estimation of a fixed second line spectrum (expr.3) as pertaining to said human vocal tract model, and said system furthermore comprising matching means for finding a variable third line spectrum (expr.4) corresponding to said glottal pulse related sequence, and for matching said third line spectrum to said estimated first line spectrum, and said system including detecting means for detecting attainment of an appropriate matching level.
8. A system as claimed in Claims 6 or 7, wherein said singled-out poles are associated to a frequency below a predetermined threshold frequency.
9. A system as claimed in Claims 6, 7 or 8, based on using an LPC- compatible data base.
PCT/IB1996/001448 1996-01-04 1996-12-18 Method and system for coding human speech for subsequent reproduction thereof WO1997025708A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP9525031A JPH11502326A (en) 1996-01-04 1996-12-18 Method and system for encoding and subsequently playing back human speech
EP96940095A EP0815555A1 (en) 1996-01-04 1996-12-18 Method and system for coding human speech for subsequent reproduction thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP96200015 1996-01-04
EP96200015.4 1996-01-04

Publications (1)

Publication Number Publication Date
WO1997025708A1 true WO1997025708A1 (en) 1997-07-17

Family

ID=8223569

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB1996/001448 WO1997025708A1 (en) 1996-01-04 1996-12-18 Method and system for coding human speech for subsequent reproduction thereof

Country Status (3)

Country Link
EP (1) EP0815555A1 (en)
JP (1) JPH11502326A (en)
WO (1) WO1997025708A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10063402A1 (en) * 2000-12-19 2002-06-20 Dietrich Karl Werner Electronic reproduction of a human individual and storage in a database by capturing the identity optically, acoustically, mentally and neuronally

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0599569A2 (en) * 1992-11-26 1994-06-01 Nokia Mobile Phones Ltd. A method of coding a speech signal


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ICASSP-92, Volume 1, March 1992, M. DUNN et al., "Pole-zero Code Excited Linear Prediction Using a Perceptually Weighted Error Criterion", pages I-637 - I-639. *
IEEE TRANS. ON COMMUNICATIONS, Volume 42, No. 1, January 1994, M. YONG, "A New LPC Interpolation Technique for CELP Coders", pages 34-38. *
IEEE TRANS. ON SPEECH AND AUDIO PROCESSING, Volume 3, No. 6, November 1995, Q. LIN, "A Fast Algorithm for Computing the Vocal-tract Impulse Response from the Transfer Function", pages 449-457. *
JOURNAL OF THE ACOUSTIC SOCIETY OF AMERICA, Volume 88, No. 3, Sept. 1990, Y. QI, "Replacing Tracheoesophageal Voicing Sources Using LPC Synthesis", pages 1228-1235. *


Also Published As

Publication number Publication date
JPH11502326A (en) 1999-02-23
EP0815555A1 (en) 1998-01-07


Legal Events

AK Designated states: JP (Kind code of ref document: A1)
AL Designated countries for regional patents: AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE (Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 1996940095; Country of ref document: EP)
ENP Entry into the national phase (Ref country code: JP; Ref document number: 1997 525031; Kind code of ref document: A; Format of ref document f/p: F)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWP Wipo information: published in national office (Ref document number: 1996940095; Country of ref document: EP)
WWR Wipo information: refused in national office (Ref document number: 1996940095; Country of ref document: EP)
WWW Wipo information: withdrawn in national office (Ref document number: 1996940095; Country of ref document: EP)