WO1997025708A1 - Method and system for coding human speech for subsequent reproduction thereof - Google Patents


Info

Publication number
WO1997025708A1
Authority
WO
WIPO (PCT)
Application number
PCT/IB1996/001448
Other languages
French (fr)
Inventor
Raymond Nicolaas Johan Veldhuis
Paul Augustinus Peter Kaufholz
Original Assignee
Philips Electronics N.V.
Philips Norden Ab
Application filed by Philips Electronics N.V., Philips Norden Ab
Priority to JP9525031A (published as JPH11502326A)
Priority to EP96940095A (published as EP0815555A1)
Publication of WO1997025708A1

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/06 Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G10L25/15 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information

Definitions

  • Both the glottal pulse sequence and the filter characteristic are adapted to attain improved sound quality with the available facilities.
  • the problems to be solved are: a. what filter coefficients correspond to the original filter; b. what filter coefficients correspond to the spectral behaviour of the input pulse sequence (here, the one according to Figure 4c).
  • the phase of the processing result of the glottal pulse sequence is taken into account, which in existing technology was considered to be rather ill-behaved.
  • the filter used is a so-called minimum-phase filter that controls the phase relationships. In particular, it models the resonances of the vocal tract.
  • the remainder of the transfer function is modelled through shaping the glottal pulses themselves. Now, the transfer function of the filter can be written as:
  • H(ω) = 1 / {1 − a_1 e^(−jω) − a_2 e^(−2jω) − a_3 e^(−3jω) − ...};
  • H^(−1)(ω) = (1 − α_1 e^(−jω)) · (1 − α_2 e^(−jω)) · (1 − α_3 e^(−jω)) · ...;
  • each α_k is a complex pole lying inside the unit circle; since the filter coefficients are real, the complex conjugate of each pole is a pole too.
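As a numerical illustration of the factorization above, the complex poles can be obtained directly as the roots of the LPC denominator polynomial. This is a minimal sketch assuming NumPy; the second-order coefficients are illustrative values, not taken from the patent.

```python
import numpy as np

def lpc_poles(a):
    """Return the complex poles of H(z) = 1 / (1 - sum_k a[k] z^-k).

    `a` holds the prediction coefficients a_1..a_p; multiplying the
    denominator by z^p, the poles are the roots of
    z^p - a_1 z^(p-1) - ... - a_p.
    """
    denom = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.roots(denom)

# Illustrative 2nd-order filter with a resonance at 0.1 * f_s:
# a conjugate pole pair r*exp(+/-j*theta) gives a_1 = 2r*cos(theta), a_2 = -r^2.
r, theta = 0.95, 2 * np.pi * 0.1
poles = lpc_poles([2 * r * np.cos(theta), -r ** 2])
# Both poles lie inside the unit circle and form a conjugate pair.
```

Each real second-order section of the filter thus contributes one such conjugate pole pair; collecting all pairs reproduces the product form of H^(−1)(ω).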
  • Figure 8 is a pole plot of a filter as used.
  • a pole 30 and its complex conjugate 32 of the above function have been shown, as corresponding to a particular resonance of the human vocal tract.
  • a shaded region has been shown in the pole plot. At right, this comprises a sector between angles ±ω_min, which correspond to the lowest resonance frequencies of the human vocal tract, slightly dependent on age, gender, etcetera.
  • a common value for this angle corresponds to a frequency of 200Hz, which may depend on the particular voice type selected.
  • Figure 12 is a graph illustrating spectral tilt, as present in the real part of the transfer function of such a 'rest' filter as a function of ⁇ .
  • the curve starts at a value of 1, and more or less gradually decreases for higher values of ⁇ .
  • the initial downward slope is called the spectral tilt of the filter.
  • the glottal pulse sequence must now have an initial spectral tilt that has substantially the same value as the transfer function shown. This is effected by shaping the parameters of the LF model.
  • the spectral tilt influences the 'warmth' of the speech as subjectively felt by a human listener: a steeper slope gives a 'warmer' sound.
  • the tilt is connected with the closing speed of the vocal chords. If the closing is fast, relatively much high-frequency energy persists, but if the closing is slow, relatively little high-frequency energy is present in the voice.
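The spectral tilt discussed above can be estimated numerically as the initial slope, in dB, of a filter's magnitude response. A sketch assuming NumPy; the filters and the frequency range over which the slope is fitted are illustrative assumptions.

```python
import numpy as np

def spectral_tilt(b, a, f_lo=0.0, f_hi=0.1):
    """Estimate the initial slope (dB per unit of normalized frequency) of
    the magnitude response of a rational filter b(z)/a(z) near omega = 0.
    A more negative value means a steeper tilt, i.e. a 'warmer' voice.
    """
    w = np.linspace(f_lo, f_hi, 64) * 2 * np.pi   # normalized rad/sample
    z = np.exp(1j * w)
    # b and a hold coefficients of powers of z^-1, lowest power first.
    H = np.polyval(b[::-1], 1 / z) / np.polyval(a[::-1], 1 / z)
    mag_db = 20 * np.log10(np.abs(H))
    # Least-squares slope of magnitude (dB) versus normalized frequency.
    return np.polyfit(w / (2 * np.pi), mag_db, 1)[0]

# A one-pole low-pass 'rest' filter 1 / (1 - 0.9 z^-1) tilts downward,
# while the trivial filter H = 1 has no tilt at all.
tilt_lp = spectral_tilt([1.0], [1.0, -0.9])
tilt_flat = spectral_tilt([1.0], [1.0])
```

Matching this slope between the estimated 'rest' filter and the shaped glottal pulse is what the text above prescribes.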
  • the coefficients of the vocal-tract filter and the spectral representation of the glottal pulse are derived from the coefficients of the synthesis filter as follows. First, all formant frequencies are assumed to lie above 200 Hz, and the magnitudes of the complex poles of Hv are assumed to lie above a threshold of 0.85, but within the unit circle. Separating the complex poles that correspond to formants from those that do not results in a representation of the transfer function as a product:
  • the first factor is the estimate of the glottal-pulse filter Hs/Hv in (1), and contains all poles that cannot be assigned to formants.
  • the second factor is the estimate for the vocal-tract filter, that contains all formant poles.
  • Figure 9a shows a comparison of the vocal-tract filters by means of closed-phase analysis and by means of the above approximation. The same comparison is made for the glottal-pulse filters in Figure 9b. There are only limited differences to be found around the formant frequencies. These are produced because the closed-phase analysis generally favours sharper formant peaks.
  • the separation criterion used here was as follows: all poles corresponding to frequencies below the mentioned threshold frequency 200 Hz were assumed to be unrelated to formant frequencies.
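The separation criterion just stated can be sketched as follows, assuming NumPy; the sampling rate is an illustrative assumption, while the 200 Hz and 0.85 thresholds follow the values given above.

```python
import numpy as np

def split_poles(poles, fs=8000.0, f_min=200.0, r_min=0.85):
    """Split the complex poles of a synthesis filter into formant and
    non-formant groups.

    A pole (with its conjugate) is treated as a formant pole when its
    angle corresponds to a frequency of at least f_min Hz and its
    magnitude is at least r_min; the remaining poles are assigned to
    the glottal-pulse ('rest') filter.
    """
    poles = np.asarray(poles)
    freqs = np.abs(np.angle(poles)) * fs / (2 * np.pi)
    is_formant = (freqs >= f_min) & (np.abs(poles) >= r_min)
    return poles[is_formant], poles[~is_formant]

# One low-frequency (glottal) pole pair and one pole pair at ~1000 Hz (formant):
glottal = 0.9 * np.exp(1j * np.array([1, -1]) * 2 * np.pi * 50 / 8000)
formant = 0.95 * np.exp(1j * np.array([1, -1]) * 2 * np.pi * 1000 / 8000)
f_poles, g_poles = split_poles(np.concatenate([glottal, formant]))
```

The formant group defines the vocal-tract filter; the remainder defines the glottal-pulse factor, as in the product representation above.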
  • The separation between formant poles and non-formant poles of Hs is particularly simple if Hs is itself represented as a product of second-order sections, or in so-called PQ pairs, which are different representations of the formant parameters, cf. John R. Deller, Jr. et al., Discrete-Time Processing of Speech Signals, Macmillan 1993, pp. 331-333.
  • the LF parameters can be estimated according to the following example.
  • the quantities A (arbitrary amplitude), α, ω_g, t_e, ε, and the LF parameter pitch T_0 are the generation parameters, of which the middle four have yet to be ascertained, and which are most apt for attaining a closed mathematical expression.
  • the pitch is known in the synthesizer.
  • the other parameters must be optimized in a systematic manner. A first approach to this optimization is to tune the four parameters until there is a good magnitude match in the frequency domain between the glottal-pulse filter and the LF filter.
  • the estimated glottal-pulse filter is an all-pole filter of a certain order.
  • This filter can be taken as a reference for an all-pole filter of the same order derived from the LF pulse.
  • the parameters of the LF must then be adapted until a sufficient match occurs.
  • An all- pole filter can be derived from the LF pulse by first finding a correlation function:
  • R_LF(k) = Σ_n (dg(n)/dt) · (dg(n+k)/dt)    (4)
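An all-pole filter can then be derived from the correlation function of expression (4) via the Levinson-Durbin recursion. A sketch assuming NumPy; the test signal is an illustrative stand-in for the LF-pulse time derivative, chosen so the recovered coefficients can be checked.

```python
import numpy as np

def autocorr(x, order):
    """R(k) = sum_n x(n) * x(n+k) for k = 0..order, as in expression (4)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])

def levinson(R):
    """Levinson-Durbin recursion: prediction coefficients a_1..a_p and the
    residual error for H(z) = g / (1 - sum_k a_k z^-k), from lags R(0)..R(p)."""
    p = len(R) - 1
    a = np.zeros(p)
    err = R[0]
    for i in range(1, p + 1):
        k = (R[i] - np.dot(a[:i - 1], R[i - 1:0:-1])) / err
        a[:i] = np.concatenate([a[:i - 1] - k * a[:i - 1][::-1], [k]])
        err *= (1.0 - k * k)
    return a, err

# Impulse response of an illustrative all-pole filter
# 1 / (1 - 0.8 z^-1 + 0.64 z^-2), standing in for the LF-pulse derivative.
x = np.zeros(400)
x[0] = 1.0
x[1] = 0.8 * x[0]
for n in range(2, 400):
    x[n] = 0.8 * x[n - 1] - 0.64 * x[n - 2]

a_est, err = levinson(autocorr(x, 2))   # recovers roughly [0.8, -0.64]
```

Taking the reference filter and the LF-derived filter to the same order, as the text prescribes, makes their coefficient sets directly comparable.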
  • a second exemplary procedure is to measure certain characteristic parameters from the estimated glottal pulse filter, such as the spectral tilt recited supra, and generate an LF pulse having the same characteristics.
  • the relation between the LF parameters and the estimated characteristics is determined by the eventual outcome.
  • a further feasible procedure is to choose the amplitude of the LF pulse in such a manner that its energy measured over one pitch period can be rendered equal to the energy of the response of the glottal pulse filter when excited with an impulse with the magnitude of the gain parameter.
  • the required quantities are calculated in a straightforward manner.
  • the quality of the results attained is advantageously evaluated in a perceptual manner.
  • the objects to be compared are preferably a sustained but short vowel in the respective three variants: the original one, the mono-pulse synthesized vowel, and the vowel synthesized with the improved glottal modelling.
  • the estimating of the complex poles of the transfer function of the LPC speech synthesis filter which has a spectral envelope corresponding to the human speech information includes estimating a fixed first line spectrum that is associated to expression 5 hereinafter. Moreover, the procedure includes estimating a fixed second line spectrum that is associated to expression 7 hereinafter, as pertaining to the human vocal tract model. Furthermore, the procedure includes finding a variable third line spectrum, associated to expression 8 hereinafter, which corresponds to the glottal pulse related sequence, for matching said third line spectrum to the estimated first line spectrum, until an appropriate matching level is attained.
  • Figures 13a, 13b give an exemplary glottal pulse, and its time derivative, respectively, as modelled.
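A pulse of the kind shown in Figures 13a and 13b can be sketched with a simplified Liljencrants-Fant formulation. The two branch equations follow the standard LF model from the literature; the parameter values, the fixed open-phase growth factor, and the eps ≈ 1/t_a shortcut are assumptions for illustration, not values from the patent.

```python
import numpy as np

fs, T0 = 16000.0, 0.01              # sampling rate and pitch period (illustrative)
tp, te, ta, Ee = 0.0045, 0.006, 0.0003, 1.0
alpha = 300.0                       # open-phase growth factor (assumed, not fitted)

def lf_derivative():
    """Sampled time derivative of a simplified LF glottal pulse.

    Open phase (t <= te): E0 * exp(alpha*t) * sin(pi*t/tp), with E0 scaled
    so the branch ends at -Ee (the closure peak).  Return phase: exponential
    recovery; the exact LF model solves an implicit equation for eps, which
    is approximated here by 1/ta.
    """
    t = np.arange(0.0, T0, 1.0 / fs)
    wg = np.pi / tp                                 # puts the flow peak near tp
    E0 = -Ee / (np.exp(alpha * te) * np.sin(wg * te))
    eps = 1.0 / ta
    open_phase = E0 * np.exp(alpha * t) * np.sin(wg * t)
    ret_phase = -(Ee / (eps * ta)) * (np.exp(-eps * (t - te))
                                      - np.exp(-eps * (T0 - te)))
    return np.where(t <= te, open_phase, ret_phase)

dgdt = lf_derivative()              # cf. Figure 13b
g = np.cumsum(dgdt) / fs            # the glottal pulse itself, cf. Figure 13a
```

The sharp negative excursion at t_e corresponds to the instant of glottal closure mentioned earlier.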
  • the sampling frequency is f s
  • the fundamental frequency is f 0
  • the parameters used hereinafter are the so-called specification parameters, that are equivalent with the generation parameters but are more closely related to the physical aspects of the speech generation instrument.
  • t e and t a have no immediate translation to the generation parameters.
  • the signal segment shown in the Figures contains at least two fundamental periods.
  • a window function e.g. the Hanning window
  • the glottal pulse parameters t e , t p , t a are obtained as the minimizing arguments of the function
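The minimization just described can be sketched as a matching of line-spectrum magnitudes. For clarity, a grid search over t_e alone stands in here for the full minimization over t_e, t_p, t_a; NumPy is assumed, and the pulse family and all values are illustrative.

```python
import numpy as np

def pulse_spectrum(dgdt, n_harm, period):
    """Magnitudes of the first n_harm harmonics of a periodic pulse,
    i.e. a discrete line spectrum of the kind being matched above."""
    spec = np.fft.rfft(dgdt, n=period)
    return np.abs(spec[1:n_harm + 1])

def fit_te(target_dgdt, make_pulse, te_grid, n_harm=10):
    """Pick t_e from a grid so that the candidate pulse's line spectrum
    best matches the target's, in a least-squares sense."""
    period = len(target_dgdt)
    target = pulse_spectrum(target_dgdt, n_harm, period)
    errs = [np.sum((pulse_spectrum(make_pulse(te), n_harm, period) - target) ** 2)
            for te in te_grid]
    return te_grid[int(np.argmin(errs))]

# Toy pulse family: a raised-sine burst whose open-phase length is set by t_e.
fs, T0 = 16000.0, 0.01
def make_pulse(te):
    n, n_open = int(round(T0 * fs)), int(round(te * fs))
    p = np.zeros(n)
    p[:n_open] = np.sin(np.linspace(0.0, np.pi, n_open)) ** 2
    return p

target = make_pulse(0.006)
best = fit_te(target, make_pulse, np.array([0.004, 0.005, 0.006, 0.007]))
```

A full implementation would minimize the same criterion jointly over t_e, t_p and t_a, as stated in the text.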

Abstract

Human speech is coded for subsequent reproduction through the following steps: (a) receiving an amount of human-speech-expressive information; (b) estimating all complex poles of a transfer function of an LPC speech synthesis filter that has a spectral envelope corresponding to the information; (c) singling out from the transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; (d) defining a glottal pulse related sequence representing the singled out poles; (e) defining a second filter with a complex transfer function to express all the other poles; (f) outputting speech represented by a filter based on combining the glottal pulse related sequence and a representation of the second filter.

Description

Method and system for coding human speech for subsequent reproduction thereof.
BACKGROUND TO THE INVENTION
The invention relates to a method for coding human speech for subsequent reproduction thereof. A well-known method is based on the principles of LPC-coding, but the results thereof have proven to be of moderate quality only. The present inventors have discovered that the principles of LPC coding represent a good starting point for undertaking further effort for improvement. In particular, values of various filter LPC-characteristics can be adapted to get an improved result, when the various influences on speech generation are taken into account in a more refined manner.
SUMMARY TO THE INVENTION
Accordingly, amongst other things it is an object of the invention to improve speech generation filter characteristics for operating with the above recited technology, and in particular to maintain compatibility with LPC-data bases to a certain extent. Now, according to one of its aspects, the method according to the invention comprises the further steps of receiving an amount of human-speech-expressive information; estimating all complex poles of a transfer function of an LPC speech synthesis filter which has a spectral envelope corresponding to said information; singling out from said transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; defining a glottal pulse related sequence representing said singled out poles; defining a second filter having a complex transfer function, as expressing said all other poles; outputting speech represented by filter means based on combining said glottal pulse related sequence and a representation of said second filter.
Distinguishing the complex poles into two groups allows each of the groups to be modelled separately in an optimal manner.
Advantageously, said estimating furthermore includes estimating a fixed first line spectrum (expr.5) associated to said human speech expressive information, estimating a fixed second line spectrum (expr.7) as pertaining to said human vocal tract model, and furthermore including finding of a variable third line spectrum (expr.8) corresponding to said glottal pulse related sequence, for matching said third line spectrum to said estimated first line spectrum, until attaining an appropriate matching level. It has been found that this way of matching is straightforward, yet results in a very good performance.
Preferably, said singling out pertains to all poles associated to a frequency below a predetermined threshold frequency. In this way, the distinguishing is simple and straightforward to implement. In practice, these low-frequency poles are just the ones that must be singled out. Advantageously, a method as recited uses an LPC-compatible speech data base. Such databases are readily available for a great variety of speech types and languages. The invention also relates to a system for executing a method for coding human speech as described supra. Further advantageous aspects of the invention are recited in dependent Claims. By itself, manipulating speech in various ways has been disclosed in EP
527 527, corresponding US 5,479,564 (PHN 13801), in EP 527 529, corresponding US Application Serial No. 07/924,726 (PHN 13993), and EP 95203210.0, corresponding US Application Serial No. 08/...,... (PHN 15553), all to the present assignee. The first two references describe the affecting of speech duration through systematically inserting and/or deleting pitch periods of the unprocessed speech. The third reference operates in a comparable manner on a short-time Fourier transform of the speech. As stated earlier, the present invention aims at compact storage of coded speech as a low-cost solution. The references require rather more extensive storage space.
BRIEF DESCRIPTION OF THE DRAWING
These and other aspects and advantages of the invention will be described with reference to the preferred embodiments disclosed hereinafter, and in particular with reference to the appended Figures that show:
Figure 1, a known mono-pulse vocoder;
Figure 2a, excitation of such vocoder;
Figure 2b, an exemplary speech signal generated thereby;
Figure 3a, a speech generation model on a filter basis;
Figure 3b, a secondary model derived therefrom;
Figure 4a, a transfer function of a vocal tract;
Figure 4b, a transfer function of a synthesis filter;
Figure 4c, a transfer function of a glottal pulse filter;
Figure 5a, an exemplary natural speech signal;
Figure 5b, a sequence of glottal pulses associated therewith;
Figure 5c, the same sequence differentiated versus time;
Figure 6, an impulse response of a glottal pulse filter;
Figure 7, a proposed synthesizer;
Figure 8, a pole plot of a filter as used;
Figure 9a, two transfer functions compared;
Figure 9b, two further transfer functions compared;
Figure 10, an exemplary glottal pulse time derivative;
Figures 11a, 11b, an all-pole spectral representation of the pulse of Figure 10;
Figure 12, a graph illustrating spectral tilt;
Figures 13a, 13b, a glottal pulse and its time derivative.
DESCRIPTION OF THE PRINCIPLES OF THE INVENTION
Figure 1 shows a mono-pulse or LPC (linear predictive coding) based vocoder according to the state of the art, such as described in many textbooks, of which a relevant citation is Douglas O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley 1987, p. 336-364. Advantages of LPC are the extremely compact manner of storage and its facility for manipulating speech so coded in an easy manner. A disadvantage is the relatively poor quality of the speech produced. Conceptually, speech synthesis is by means of all-pole filter 54 that receives the coded speech and outputs a sequence of speech frames on output 58. Input 40 symbolizes the actual pitch frequency, which at the actual pitch period recurrence is fed to item 42 that controls the generating of voiced frames. In contradistinction, item 44 controls the generating of unvoiced frames, which are generally represented by (white) noise. Multiplexer 46, as controlled by selection signals 48, selects between voiced and unvoiced. Amplifier block 52, as controlled by item 50, can vary the actual gain factor. Filter 54 has time-varying filter coefficients as symbolized by controlling item 56. Typically, the various parameters are updated every 5-20 milliseconds. The synthesizer is called mono-pulse-excited, because there is only one single excitation pulse per pitch period. The input from amplifier block 52 into filter 54 is called the excitation signal. Generally, Figure 1 is a parametric model that has no direct relationship to the properties of the human vocal tract. The approach according to Figure 1 is widespread, and a large data base has been compiled for application in many fields.
In this respect, Figure 2a shows an excitation example of such vocoder, and Figure 2b the speech signal generated thereby, wherein time has been denoted in seconds and actual speech signal amplitude in arbitrary units.
Now, the present invention intends to improve on the above reproduction of voiced speech in a simple manner. Here, the prime point of view of the invention is to mimic in a certain way the physical generation of human speech. For explanation, Figure 3a is a speech generation model on a filter basis, as based on the way speech is generated in the human vocal tract: in contradistinction to Figure 1, Figure 3a is a physical, or even physiological, model, in that it is much more closely related to the geometrical and physical properties of the vocal tract. Block 20 is again an all-pole filter, fed by source 22 with a sequence of glottal pulses in the form of a pulsating air flow, as will be shown in Figure 5. In humans, the sound radiated from the lips on notional output 26 is in this radiating process more or less differentiated, as symbolized by differentiator or high-pass filter 24. The set-up itself of the module is similar to that of Figure 1, but both source 22 and filters 20, 24 have different characteristics. Through combining the differentiator with the source, an amended set-up is now reached according to Figure 3b, where the source 23 produces the time derivative of the glottal air flow. One of the advantages of the present invention is the possible use of the LPC-inspired data base. In the future, data bases that will have been improved with a view to the present invention will provide still better performance.
In view of this differential feature, the original sound track (of an exemplary vowel /a/) that has been shown in Figure 5a is represented by the associated glottal pulse stream of Figure 5b, with a view to separating the properties of the glottal pulses from those of the vocal tract proper. The speech generated is based on both these constituents, through feeding the vocal tract parameter representation with the glottal pulse representation. Next, the glottal pulses are translated into their time-differential, according to Figure 5c. In this latter Figure, the sharp peaks indicate the instants of glottal closure, which is the prime instant for the inputting. The segment length as shown corresponds to the typical length of a synthesis frame. The glottal pulse and its time derivative have been obtained by an inverse filtering technique called closed-phase analysis. In this technique, first an estimate is made of the intervals of glottal closure. Within those intervals, the speech consists only of resonances of the vocal tract. These intervals are subsequently used to produce an all-zero inverse filter. The glottal pulse time derivative is then obtained by inverse filtering with this filter. The glottal pulse itself is subsequently obtained by integrating the time derivative. The vocal-tract filter is the inverse of the obtained all-zero filter. The magnitude of the transfer function of the vocal-tract filter Hv is shown in Figure 4a. The magnitude of the transfer function of the synthesis filter Hs for the same segment is shown in Figure 4b. The two transfer functions apparently contain the same formant resonances, but are different at low frequencies. This is caused by the fact that Hs describes both the spectral behaviour of the vocal tract and the spectral behaviour of the glottal pulse time derivative, whereas Hv only describes the behaviour of the vocal tract. Figure 4c gives the transfer function of the glottal-pulse filter. By itself, D.G.
Childers and C.K. Lee, Vocal quality factors: Analysis, synthesis, and perception, J. Accoust. Soc. Am. 90(5), November 1991, p. 2394-2410 describes the influence of the glottal pulse on the sounding of a voice.
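The closed-phase analysis described above can be sketched in outline. This is a minimal illustration, not the patent's implementation: the per-interval analysis is replaced by a plain autocorrelation fit over the pooled closed-phase samples, and all function and variable names are ours.

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Solve the normal equations for the prediction polynomial A(z)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] += k * a[1:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def closed_phase_decompose(speech, closed_idx, order=12):
    """Fit an all-zero inverse filter on closed-phase samples only,
    inverse-filter the whole frame to estimate the glottal-pulse time
    derivative, then integrate to recover the glottal pulse itself."""
    seg = np.asarray(speech, float)[closed_idx]
    r = np.array([np.dot(seg[:len(seg) - k], seg[k:])
                  for k in range(order + 1)])
    a, _ = levinson_durbin(r, order)
    d_glottal = lfilter(a, [1.0], speech)   # all-zero inverse filtering
    glottal = np.cumsum(d_glottal)          # integrate the time derivative
    return a, d_glottal, glottal
```

The vocal-tract filter is then simply 1/A(z), the inverse of the fitted all-zero filter, as stated in the text.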
The synthesis system of Figure 1 is now compared with the model of Figure 3b by writing:
Hs(z) = {Hs(z)/Hv(z)} * Hv(z) = Hg(z) * Hv(z)    (1)
Herein Hg is a linear filter, called the glottal pulse filter. Its impulse response models the glottal-pulse time derivative in the synthesizer. The filter Hg has a minimum-phase transfer function. This follows from both Hs and Hv being stable all-pole filters. The glottal-pulse-filter transfer function is shown in Figure 4c and the impulse response is shown in Figure 6. Comparing this synthesis model of the glottal-pulse time derivative with the true time derivative in Figure 5c shows that although the spectral magnitudes may be identical, their time-domain representations are quite different. Such a difference also exists between the time-domain representations of the original speech and the synthesized speech.
Clearly, the implicit glottal-pulse model of the mono-pulse vocoder differs from the true glottal pulse. The reason is that the true glottal-pulse time derivative cannot be approximated closely as the impulse response of a minimum-phase system. It is proposed that a synthesizer derived from the model of Figure 3b, provided with an improved representation of the glottal-pulse time derivative and a synthesis filter that only models the resonances of the vocal tract, will result in a better perceptual speech quality.
The proposed synthesizer is shown in Figure 7. A specific requirement is to remain compatible with existing data bases, which necessitates generating the parameters pertaining to the sources 40, 48, 50 and 56 in Figure 1. This is realized as follows. The filter coefficients of the original synthesis filter are used to derive the coefficients of the vocal-tract filter and the glottal-pulse filter. By way of preferred example, the Liljencrants-Fant (LF) model is used for describing the glottal pulse, of which a lucid explanation has also been given in the above-referred-to Childers-Lee publication (references to Fant, and Fant et al.). The parameters thereof are tuned to attain magnitude matching in the frequency domain between the glottal pulse filter and the LF pulse. This leads to an excitation of the vocal tract filter that has both the desired spectral characteristics and a realistic temporal representation. The necessary steps are recited hereinafter.
According to the invention, both the glottal pulse sequence and the filter characteristic are adapted to attain improved sound quality for the available facilities. The problems to be solved are: a. what filter coefficients correspond to the original filter; b. what filter coefficients correspond to the spectral behaviour of the input pulse sequence (here, the one according to Figure 4c). In particular, the phase of the processing result of the glottal pulse sequence is taken into account, which existing technology considered rather ill-behaved. The filter used is a so-called minimum-phase filter that controls the phase relationships. In particular, it models the resonances of the vocal tract. The remainder of the transfer function is modelled through shaping the glottal pulses themselves. Now, the transfer function of the filter can be written as:
H = 1/{1 + a1e^(-jθ) + a2e^(-2jθ) + a3e^(-3jθ) + ...};
Herein θ varies between 0 and π, that is, up to one half of the sample frequency. Another representation is:
H^(-1) = (1 - α1e^(-jθ)) * (1 - α2e^(-jθ)) * (1 - α3e^(-jθ)) * ...;
Herein, each αk is a complex pole lying inside the unit circle, which means that its complex conjugate is a pole too. In this respect, Figure 8 is a pole plot of a filter as used. As an example, a pole 30 and its complex conjugate 32 of the above function have been shown, corresponding to a particular resonance of the human vocal tract. Now in Figure 8, a shaded region has been shown in the pole plot. At right, this comprises a sector between angles +/- θmin, corresponding to the lowest resonance frequencies of the human vocal tract, slightly dependent on age, gender, etcetera. A common value for this angle corresponds to a frequency of 200 Hz, which may depend on the particular voice type selected. Also, a narrow strip along the negative real axis may comprise poles that would not spring from the above resonances. Therefore, a new filter is constructed that retains only the poles in the unshaded region. In this respect, Figure 12 is a graph illustrating the spectral tilt present in the real part of the transfer function of such a 'rest' filter as a function of θ. The curve starts at a value of 1, and more or less gradually decreases for higher values of θ. The initial downward slope is called the spectral tilt of the filter. The glottal pulse sequence must now have an initial spectral tilt that has substantially the same value as the transfer function shown. This is effected by shaping the parameters of the LF model.
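The pole screening just described can be illustrated with a short sketch: the poles of the all-pole filter are classified against the shaded region of Figure 8, using the 200 Hz sector from the text; the half-width of the strip along the negative real axis is our own assumption, as are all names.

```python
import numpy as np

def split_poles(a, fs, f_min=200.0, strip_halfwidth=0.05):
    """Split the poles of an all-pole filter 1/A(z) into formant poles and
    residual ('glottal') poles.  A pole counts as non-formant when its angle
    falls inside the low-frequency sector below f_min, or when it lies in a
    narrow strip along the negative real axis."""
    poles = np.roots(a)                      # a = [1, a1, ..., ap]
    theta_min = 2.0 * np.pi * f_min / fs
    formant, rest = [], []
    for p in poles:
        in_sector = abs(np.angle(p)) < theta_min
        in_strip = p.real < 0.0 and abs(p.imag) < strip_halfwidth
        (rest if in_sector or in_strip else formant).append(p)
    # Rebuild real-coefficient polynomials from the two pole sets.
    a_v = np.real(np.poly(formant)) if formant else np.array([1.0])
    a_g = np.real(np.poly(rest)) if rest else np.array([1.0])
    return a_v, a_g
```

For a conjugate pole pair at 500 Hz (a formant) and one at 100 Hz (below the sector threshold), the function returns one second-order polynomial for each set.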
In particular, the spectral tilt influences the 'warmth' of the speech as subjectively perceived by a human listener: a steeper slope gives a 'warmer' sound. Physiologically, the tilt is connected with the closing speed of the vocal cords. If the closing is fast, relatively much high-frequency energy persists; if the closing is slow, relatively little high-frequency energy is present in the voice.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The coefficients of the vocal tract filter and the spectral representation of the glottal pulse are derived from the coefficients of the synthesis filter as follows. First, all formant frequencies are assumed to lie above 200 Hz, and the magnitudes of the complex poles of Hv to lie above a threshold of 0.85, but within the unit circle. Separating the complex poles that correspond to formants from those that do not results in a representation of the transfer function as a product:
Hs = {Ag(z)}^(-1) * {Av(z)}^(-1)    (2)
Herein, the first factor is the estimate of the glottal-pulse filter Hs/Hv in (1), which contains all poles that cannot be assigned to formants. The second factor is the estimate for the vocal-tract filter, which contains all formant poles. In this respect, Figure 9a shows a comparison of the vocal-tract filters obtained by means of closed-phase analysis and by means of the above approximation. The same comparison is made for the glottal-pulse filters in Figure 9b. There are only limited differences to be found around the formant frequencies. These are produced because the closed-phase analysis generally favours sharper formant peaks. The separation criterion used here was as follows: all poles corresponding to frequencies below the mentioned threshold frequency of 200 Hz were assumed to be unrelated to formant frequencies.
The separation between formant poles and non-formant poles of Hs is particularly simple if Hs is itself represented as a product of second-order sections, or in so-called PQ pairs, which are different representations of the formant parameters, cf. John R. Deller, Jr. et al., Discrete-Time Processing of Speech Signals, Macmillan 1993, pp. 331-333. The LF parameters can be estimated according to the following example.
First, a time-continuous version of the LF model of the glottal-pulse time derivative is employed as follows:

δg(t)/δt = 0,  t < 0
         = A sin(ωp t) exp(α t),  0 ≤ t ≤ te
         = A sin(ωp te) exp(α te) exp(-ε(t - te)),  te < t ≤ t0    (3)
Herein, the quantities A (an arbitrary amplitude), ωp, α, te, ε, and the pitch T0 are the generation parameters, of which the middle four need yet to be ascertained, and which are most apt for attaining a closed mathematical expression. There are also other sets of parameters that describe the LF glottal pulse. The pitch is known in the synthesizer. The other parameters must be optimized in a systematic manner. A first approach to this optimization is to tune the four parameters until there is a good magnitude match in the frequency domain between the glottal-pulse filter and the LF filter. The estimated glottal-pulse filter is an all-pole filter of a certain order. This filter can be taken as a reference for an all-pole filter of the same order derived from the LF pulse. The parameters of the LF model must then be adapted until a sufficient match occurs. An all-pole filter can be derived from the LF pulse by first finding a correlation function:
RLF(k) = Σn (δg(n)/δt) * (δg(n+k)/δt)    (4)
and then applying the Levinson-Durbin method to obtain the filter coefficients. For the Levinson-Durbin algorithm, cf. Deller, op. cit., pp. 297-302. Figures 11a, 11b show the spectral magnitude of the LF pulse in Figure 10 obtained in this manner.
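The two-step fit, the correlation function of expression (4) followed by the Levinson-Durbin recursion, can be sketched as follows; this is a minimal stand-alone version with our own naming, not the patent's code.

```python
import numpy as np

def allpole_from_pulse(dpulse, order):
    """Derive an all-pole filter from a glottal-pulse time derivative:
    autocorrelation as in expression (4), then Levinson-Durbin."""
    d = np.asarray(dpulse, float)
    n = len(d)
    r = np.array([np.dot(d[:n - k], d[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):          # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] += k * a[1:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

Applied to the impulse response of a known all-pole filter, the recursion recovers the original coefficients closely, which is the sense in which the LF-derived filter can be compared against the estimated glottal-pulse filter.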
A second exemplary procedure is to measure certain characteristic parameters from the estimated glottal pulse filter, such as the spectral tilt recited supra, and generate an LF pulse having the same characteristics. The relation between the LF parameters and the estimated characteristics is determined by the eventual outcome.
A further feasible procedure is to choose the amplitude of the LF pulse in such a manner that its energy measured over one pitch period is rendered equal to the energy of the response of the glottal pulse filter when excited with an impulse with the magnitude of the gain parameter. The required quantities are calculated in a straightforward manner. The quality of the results attained is advantageously evaluated in a perceptual manner. The objects to be compared are preferably a sustained but short vowel in the respective three variants: the original one, the mono-pulse synthesized vowel, and the vowel synthesized with the improved glottal modelling. A further extension of the procedure is as follows. The estimating of the complex poles of the transfer function of the LPC speech synthesis filter, which has a spectral envelope corresponding to the human speech information, includes estimating a fixed first line spectrum that is associated to expression 5 hereinafter. Moreover, the procedure includes estimating a fixed second line spectrum that is associated to expression 7 hereinafter, as pertaining to the human vocal tract model. Furthermore, the procedure includes finding a variable third line spectrum, associated to expression 8 hereinafter, which corresponds to the glottal pulse related sequence, for matching said third line spectrum to the estimated first line spectrum, until attaining an appropriate matching level.
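The amplitude choice just described can be sketched as follows: the LF pulse is rescaled so that its energy over one pitch period equals the energy of the glottal-pulse filter's response to a single impulse of the gain parameter's magnitude. Names are ours; `a_g` denotes the denominator coefficients of the all-pole glottal-pulse filter.

```python
import numpy as np
from scipy.signal import lfilter

def match_amplitude(lf_pulse, a_g, gain, period_len):
    """Rescale lf_pulse so that its energy over one pitch period equals the
    energy of the all-pole glottal-pulse filter 1/A_g(z) excited by a single
    impulse of height `gain`."""
    impulse = np.zeros(period_len)
    impulse[0] = gain
    ref = lfilter([1.0], a_g, impulse)     # reference impulse response
    e_ref = np.dot(ref, ref)               # its energy over one period
    e_lf = np.dot(lf_pulse[:period_len], lf_pulse[:period_len])
    return np.asarray(lf_pulse, float) * np.sqrt(e_ref / e_lf)
```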
Figures 13a, 13b give an exemplary glottal pulse, and its time derivative, respectively, as modelled. The sampling frequency is fs, the fundamental frequency is f0, the fundamental period is t0 = 1/f0. Further, tp = 2π/ωp. The parameters used hereinafter are the so-called specification parameters, which are equivalent to the generation parameters but are more closely related to the physical aspects of the speech generation instrument. In particular, te and ta have no immediate translation to the generation parameters. Note that the signal segment shown in the Figures contains at least two fundamental periods.
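A sampled version of the piecewise LF time derivative introduced above can be sketched as follows. Parameter names mirror the generation parameters (A, ωp, α, ε, te); the return-phase form exp(-ε(t - te)), which keeps the pulse continuous at te, is our reading of the model's notation.

```python
import numpy as np

def lf_derivative(fs, f0, A, omega_p, alpha, eps, t_e):
    """One pitch period of the LF glottal-flow time derivative:
    a growing sinusoid up to t_e, then an exponential return phase."""
    t0 = 1.0 / f0
    t = np.arange(0.0, t0, 1.0 / fs)
    rising = A * np.sin(omega_p * t) * np.exp(alpha * t)
    falling = (A * np.sin(omega_p * t_e) * np.exp(alpha * t_e)
               * np.exp(-eps * (t - t_e)))
    return t, np.where(t <= t_e, rising, falling)
```

The sharp negative spike near te in such a pulse marks the instant of glottal closure, matching the peaks visible in Figure 5c.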
Now, the signal line spectrum is given by expression (5), reproduced in the original only as an image, with wk, k = 0, ..., M-1, a window function, e.g. the Hanning window, and with expression (6), likewise reproduced only as an image, giving the number of spectral lines in the spectrum.
The vocal-tract line spectrum is given by expression (7), reproduced in the original only as an image, with A(exp(jθ)) the transfer function of the vocal-tract filter. The glottal-pulse line spectrum:

Gl = | ∫ (δg(t)/δt) exp(-2πj l f0 t) dt |,  l = 1, ..., L,    (8)

with δg(t)/δt the time derivative of the glottal pulse, e.g. according to the LF model. The glottal pulse parameters te, tp, ta are obtained as the minimizing arguments of a distance function, reproduced in the original only as an image, with β added to increase the perceptual relevance of this distance measure. It has been found that β = 1/3 gives satisfactory results.
An alternative distance measure is given by a further expression, reproduced in the original only as an image.
By itself, minimizing function values until attaining either the overall minimum, or at least an appropriate level, is a straightforward mathematical procedure. It has been found that the above minimizing leads to extremely agreeable speech generation.
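The β-compressed line-spectrum matching can be illustrated with a sketch. The exact distance expression is reproduced only as an image in the source, so the form below, a squared error between the β-th powers of the signal line spectrum and the product of the glottal-pulse and vocal-tract line spectra, is our own plausible reading, with β = 1/3 as stated in the text.

```python
import numpy as np

def spectral_distance(S, G, V, beta=1.0 / 3.0):
    """Perceptually weighted distance between the measured line spectrum S
    and the model line spectrum G*V; beta < 1 compresses large magnitudes
    so that weaker spectral lines still contribute to the match."""
    S = np.asarray(S, float)
    M = np.asarray(G, float) * np.asarray(V, float)
    return float(np.sum((S ** beta - M ** beta) ** 2))
```

An optimizer would vary the glottal-pulse parameters (te, tp, ta) until this distance reaches a minimum, or at least an acceptably low level.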

CLAIMS:
1. A method for coding human speech for subsequent reproduction thereof, said method comprising the steps of: receiving an amount of human-speech-expressive information; estimating all complex poles of a transfer function of an LPC speech synthesis filter which has a spectral envelope corresponding to said information; singling out from said transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; defining a glottal pulse related sequence representing said singled out poles; defining a second filter having a complex transfer function, as expressing said all other poles; outputting speech represented by filter means based on combining said glottal pulse related sequence and a representation of said second filter.
2. A method as claimed in Claim 1, wherein said estimating furthermore includes estimating a fixed first line spectrum associated to said human speech expressive information, estimating a fixed second line spectrum as pertaining to said human vocal tract model, and furthermore including finding of a variable third line spectrum corresponding to said glottal pulse related sequence, for matching said third line spectrum to said estimated first line spectrum, until attaining an appropriate matching level.
3. A method as claimed in Claims 1 or 2, wherein said singling out pertains exclusively to all poles associated to a frequency below a predetermined threshold frequency.
4. A method as claimed in Claims 1, 2 or 3, wherein said glottal pulse sequence is modelled according to a Liljencrants-Fant model.
5. A method as claimed in any of Claims 1 to 4, wherein before said outputting various parameters of the glottal pulse related sequence are manipulated.
6. A system for coding human speech for subsequent reproduction thereof, comprising: input means for receiving an amount of human-speech-expressive information; storage means for storing an estimation of all complex poles of a transfer function of an LPC speech synthesis filter which has a spectral envelope corresponding to said information, and singled out from said transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; defining means fed by said storage means for defining a glottal pulse related sequence representing said singled out poles; a second filter defined by a complex transfer function, as expressing said all other poles; filter means for outputting speech as represented by combining said glottal pulse related sequence and a representation of said second filter.
7. A system as claimed in Claim 6, wherein said estimation furthermore accommodates an estimation of a fixed first line spectrum (expr.1) associated to said human speech expressive information, and further an estimation of a fixed second line spectrum (expr.3) as pertaining to said human vocal tract model, and said system furthermore comprising matching means for finding a variable third line spectrum (expr.4) corresponding to said glottal pulse related sequence, and for matching said third line spectrum to said estimated first line spectrum, and said system including detecting means for detecting attainment of an appropriate matching level.
8. A system as claimed in Claims 6 or 7, wherein said singled-out poles are associated to a frequency below a predetermined threshold frequency.
9. A system as claimed in Claims 6, 7 or 8, based on using an LPC- compatible data base.
PCT/IB1996/001448 1996-01-04 1996-12-18 Method and system for coding human speech for subsequent reproduction thereof WO1997025708A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP9525031A JPH11502326A (en) 1996-01-04 1996-12-18 Method and system for encoding and subsequently playing back human speech
EP96940095A EP0815555A1 (en) 1996-01-04 1996-12-18 Method and system for coding human speech for subsequent reproduction thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP96200015 1996-01-04
EP96200015.4 1996-01-04

Publications (1)

Publication Number Publication Date
WO1997025708A1 true WO1997025708A1 (en) 1997-07-17

Family

ID=8223569

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB1996/001448 WO1997025708A1 (en) 1996-01-04 1996-12-18 Method and system for coding human speech for subsequent reproduction thereof

Country Status (3)

Country Link
EP (1) EP0815555A1 (en)
JP (1) JPH11502326A (en)
WO (1) WO1997025708A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10063402A1 (en) * 2000-12-19 2002-06-20 Dietrich Karl Werner Electronic reproduction of a human individual and storage in a database by capturing the identity optically, acoustically, mentally and neuronally

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0599569A2 (en) * 1992-11-26 1994-06-01 Nokia Mobile Phones Ltd. A method of coding a speech signal


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ICASSP-92, Volume 1, March 1992, M. DUNN et al., "Pole-zero Code Excited Linear Prediction Using a Perceptually Weighted Error Criterion", pages I-637 - I-639. *
IEEE TRANS. ON COMMUNICATIONS, Volume 42, No. 1, January 1994, M. YONG, "A New LPC Interpolation Technique for CELP Coders", pages 34-38. *
IEEE TRANS. ON SPEECH AND AUDIO PROCESSING, Volume 3, No. 6, November 1995, Q. LIN, "A Fast Algorithm for Computing the Vocal-tract Impulse Response from the Transfer Function", pages 449-457. *
JOURNAL OF THE ACOUSTIC SOCIETY OF AMERICA, Volume 88, No. 3, Sept. 1990, Y. QI, "Replacing Tracheoesophageal Voicing Sources Using LPC Synthesis", pages 1228-1235. *


Also Published As

Publication number Publication date
JPH11502326A (en) 1999-02-23
EP0815555A1 (en) 1998-01-07


Legal Events

AK Designated states: JP (Kind code of ref document: A1)
AL Designated countries for regional patents: AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE (Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 1996940095; Country of ref document: EP)
ENP Entry into the national phase (Ref country code: JP; Ref document number: 1997 525031; Kind code of ref document: A; Format of ref document f/p: F)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWP Wipo information: published in national office (Ref document number: 1996940095; Country of ref document: EP)
WWR Wipo information: refused in national office (Ref document number: 1996940095; Country of ref document: EP)
WWW Wipo information: withdrawn in national office (Ref document number: 1996940095; Country of ref document: EP)