WO1997025708A1 - Method and system for coding human speech for subsequent reproduction thereof - Google Patents
Method and system for coding human speech for subsequent reproduction thereof
- Publication number
- WO1997025708A1 (PCT/IB1996/001448)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- poles
- filter
- speech
- human
- transfer function
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/06—Determination or coding of the spectral characteristics, e.g. of the short-term prediction coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/15—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
Definitions
- the invention relates to a method for coding human speech for subsequent reproduction thereof.
- a well-known method is based on the principles of LPC-coding, but the results thereof have proven to be of moderate quality only.
- the present inventors have discovered that the principles of LPC coding represent a good starting point for undertaking further effort for improvement.
- values of various LPC filter characteristics can be adapted to give an improved result when the various influences on speech generation are taken into account in a more refined manner.
- the method according to the invention comprises the further steps of receiving an amount of human-speech-expressive information; estimating all complex poles of a transfer function of an LPC speech synthesis filter which has a spectral envelope corresponding to said information; singling out from said transfer function all poles that are unrelated to any particular resonance of a human vocal tract model, while maintaining all other poles; defining a glottal pulse related sequence representing said singled out poles; defining a second filter having a complex transfer function, as expressing said all other poles; outputting speech represented by filter means based on combining said glottal pulse related sequence and a representation of said second filter.
- Distinguishing the complex poles into two groups allows each group to be modelled separately in an optimal manner.
- said estimating furthermore includes estimating a fixed first line spectrum (expr. 5) associated with said human-speech-expressive information, estimating a fixed second line spectrum (expr. 7) pertaining to said human vocal tract model, and finding a variable third line spectrum (expr. 8) corresponding to said glottal pulse related sequence, for matching said third line spectrum to said estimated first line spectrum until an appropriate matching level is attained. It has been found that this way of matching is straightforward, yet results in very good performance.
- said singling out pertains to all poles associated to a frequency below a predetermined threshold frequency.
- these low-frequency poles are just the ones that must be singled out.
- a method as recited uses an LPC-compatible speech data base. Such databases are readily available for a great variety of speech types and languages.
- the invention also relates to a system for executing a method for coding human speech as described supra. Further advantageous aspects of the invention are recited in dependent Claims. By itself, manipulating speech in various ways has been disclosed in EP
- Figure 1 a known mono-pulse vocoder
- Figure 2a excitation of such vocoder
- Figure 2b an exemplary speech signal generated thereby
- Figure 3a a speech generation model on a filter basis
- Figure 4a a transfer function of a vocal tract
- Figure 4b a transfer function of a synthesis filter
- Figure 4c a transfer function of a glottal pulse filter
- Figure 5a an exemplary natural speech signal
- Figure 5b a sequence of glottal pulses associated therewith;
- Figure 5c the same sequence differentiated versus time;
- Figure 6 an impulse response of a glottal pulse filter
- Figure 8 a pole plot of a filter as used
- Figure 9a two transfer functions compared
- Figure 9b two further transfer functions compared
- Figures 11a, 11b an all-pole spectral representation of the pulse of Figure 10;
- Figure 12 a graph illustrating spectral tilt
- Figures 13a, 13b a glottal pulse and its time derivative.
- Figure 1 shows a mono-pulse or LPC (linear predictive coding) based vocoder according to the state of the art, such as described in many textbooks, of which a relevant citation is Douglas O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley 1987, pp. 336-364.
- Advantages of LPC are the extremely compact manner of storage and the ease with which speech so coded can be manipulated.
- a disadvantage is the relatively poor quality of the speech produced.
- speech synthesis is by means of all-pole filter 54 that receives the coded speech and outputs a sequence of speech frames on output 58.
- Input 40 symbolizes the actual pitch frequency, which is fed, at the actual pitch period recurrence, to item 42 that controls the generating of voiced frames.
- item 44 controls the generating of unvoiced frames, that are generally represented by (white) noise.
- Multiplexer 46 as controlled by selection signals 48, selects between voiced and unvoiced.
- Amplifier block 52 as controlled by item 50, can vary the actual gain factor.
- Filter 54 has time varying filter coefficients as symbolized by controlling item 56. Typically, the various parameters are updated every 5-20 millisecs.
- the synthesizer is called mono-pulse-excited, because there is only one single excitation pulse per pitch period.
- the input from amplifier block 52 into filter 54 is called the excitation signal.
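The synthesis chain of Figure 1 can be sketched in a few lines. The following is a minimal illustration only, not the patent's implementation; frame length, filter order and parameter values are arbitrary choices for the example:

```python
import numpy as np
from scipy.signal import lfilter

def monopulse_frame(lpc_coeffs, gain, voiced, pitch_period, n=80, rng=None):
    """One frame of a mono-pulse LPC vocoder (after Figure 1).

    Voiced frames use a single excitation pulse per pitch period
    (items 40/42), unvoiced frames use white noise (item 44);
    amplifier 52 applies the gain, and all-pole filter 54 shapes
    the excitation into speech.
    """
    rng = rng or np.random.default_rng(0)
    excitation = np.zeros(n)
    if voiced:
        excitation[::pitch_period] = 1.0      # one pulse per pitch period
    else:
        excitation = rng.standard_normal(n)   # (white) noise
    excitation *= gain                        # amplifier block 52
    a = np.concatenate(([1.0], lpc_coeffs))   # denominator of 1/A(z)
    return lfilter([1.0], a, excitation)      # all-pole filter 54

frame = monopulse_frame(np.array([-0.9]), gain=0.5, voiced=True, pitch_period=40)
```

In a full vocoder this function would be called every 5-20 ms with updated parameters, as the text describes.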
- Figure 1 is a parametric model that has no direct relationship to the properties of the human vocal tract. The approach according to Figure 1 is widespread, and a large data base has been compiled for application in many fields.
- Figure 2a shows an excitation example of such vocoder
- Figure 2b the speech signal generated thereby, wherein time has been denoted in seconds and actual speech signal amplitude in arbitrary units.
- Figure 3a is a speech generation model on a filter basis, as based on the way speech is generated in the human vocal tract: in contradistinction to Figure 1, Figure 3a is a physical, or even physiological, model, in that it is much more closely related to the geometrical and physical properties of the vocal tract.
- Block 20 is again an all-pole filter, fed by source 22 with a sequence of glottal pulses in the form of a pulsating air flow, as they will be shown in Figure 5.
- the original sound track (of an exemplary vowel /a/), that has been shown in Figure 5a, is represented by the associated glottal pulse stream of Figure 5b, with a view to separating the properties of the glottal pulses from those of the vocal tract proper.
- the speech generated is based on both these constituents, through feeding the vocal tract parameter representation with the glottal pulse representation.
- the glottal pulses are translated into their time-differential, according to Figure 5c.
- the sharp peaks indicate the instants of glottal closure, which is the prime instant for the inputting.
- the segment length as shown corresponds to the typical length of a synthesis frame.
- the glottal pulse and its time derivative have been obtained by an inverse filtering technique called closed-phase analysis.
- In this technique, first an estimate is made of the intervals of glottal closure. Within those intervals, the speech consists only of resonances of the vocal tract. These intervals are subsequently used to produce an all-zero inverse filter. The glottal pulse time derivative is then obtained by inverse filtering with this filter. The glottal pulse itself is subsequently obtained by integrating the time derivative. The vocal-tract filter is the inverse of the obtained all-zero filter.
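The closed-phase analysis steps just described can be sketched as follows. This is a rough illustration assuming NumPy/SciPy, with a hand-rolled autocorrelation LPC fit standing in for whatever estimator was actually used, and with the closed-glottis interval `closed_idx` assumed to be supplied externally:

```python
import numpy as np
from scipy.signal import lfilter

def closed_phase_analysis(speech, closed_idx, order=10):
    """Sketch of closed-phase analysis.

    An all-pole model is fitted on the closed-glottis samples only; its
    inverse A(z) (an all-zero filter) applied to the whole segment gives
    the glottal-pulse time derivative, which is integrated to recover
    the glottal pulse itself.
    """
    seg = speech[closed_idx]                  # closed-glottis interval only
    # Autocorrelation-method LPC fit on the closed phase
    r = np.correlate(seg, seg, 'full')[len(seg) - 1:len(seg) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])   # Toeplitz normal equations
    A = np.concatenate(([1.0], a))            # all-zero inverse filter A(z)
    dglottal = lfilter(A, [1.0], speech)      # glottal-pulse time derivative
    glottal = np.cumsum(dglottal)             # integrate the derivative
    return A, dglottal, glottal
```

The vocal-tract filter is then simply 1/A(z), the inverse of the obtained all-zero filter.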
- the magnitude of the transfer function of the vocal tract filter H* is shown in Figure 4a.
- the magnitude of the transfer function of the synthesis filter Hs for the same segment is shown in Figure 4b.
- Hg is a linear filter, called the glottal pulse filter. Its impulse response models the glottal-pulse time derivative in the synthesizer.
- the filter Hg has a minimum-phase transfer function. This is caused by both Hs and H* being stable all-pole filters.
- the glottal-pulse-filter transfer function is shown in Figure 4c and the impulse response is shown in Figure 6. Comparing this synthesis model of the glottal-pulse time derivative with the true time derivative in Figure 5c, shows that although the spectral magnitudes may be identical, their time-domain representations are quite different. Such difference also exists between the time domain representations of the original speech and the synthesized speech.
- the implicit glottal-pulse model of the mono-pulse vocoder differs from the true glottal pulse.
- the reason is that the true glottal-pulse time derivative cannot be approximated closely as the impulse response of a minimum-phase system.
- a synthesizer derived from the model of Figure 3b, provided with an improved representation of the glottal-pulse time derivative and a synthesis filter that models only the resonances of the vocal tract, will result in better perceptual speech quality.
- the proposed synthesizer is shown in Figure 7.
- a specific requirement is to remain compatible with existing data bases, which necessitates generating the parameters pertaining to the sources 40, 48, 50 and 56 in Figure 1.
- the filter coefficients of the original synthesis filter are used to derive the coefficients of the vocal-tract filter and the glottal-pulse filter.
- the Liljencrants-Fant (LF) model is used for describing the glottal pulse, of which a lucid explanation has also been given in the above-referred-to Childers-Lee publication (references to Fant, and Fant et al.).
- the parameters thereof are tuned to attain magnitude matching in the frequency domain between the glottal pulse filter and the LF pulse. This leads to an excitation of the vocal tract filter that has both the desired spectral characteristics and a realistic temporal representation. The necessary steps are recited hereinafter.
- both the glottal pulse sequence, and also the filter characteristic are adapted to attain improved sound quality for the available facilities.
- the problems to be solved are: a. what filter coefficients correspond to the original filter; b. what filter coefficients correspond to the spectral behaviour of the input pulse sequence (here, the one according to Figure 4c).
- the phase of the processing result of the glottal pulse sequence is taken into account, whereas according to existing technology it was considered to behave rather poorly.
- the filter used is a so-called minimum-phase filter that controls the phase relationships. In particular, it models the resonances of the vocal tract.
- the remainder of the transfer function is modelled through shaping the glottal pulses themselves. Now, the transfer function of the filter can be written as:
- H = 1/{1 - a₁e^(-jθ) - a₂e^(-2jθ) - a₃e^(-3jθ) - ...};
- H⁻¹ = (1 - α₁e^(-jθ))·(1 - α₂e^(-jθ))·(1 - α₃e^(-jθ))·...;
- each α is a complex pole lying inside the unit circle; because the filter coefficients are real, its complex conjugate is a pole too.
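Numerically, the complex poles can be obtained as the roots of the denominator polynomial. A minimal sketch, with an arbitrary example second-order section rather than coefficients from the text:

```python
import numpy as np

# Denominator of H with illustrative coefficients: 1 - 1.6 z^-1 + 0.95 z^-2
a = np.array([1.0, -1.6, 0.95])
poles = np.roots(a)                      # the complex poles (the alphas)

# Real coefficients: poles occur in complex-conjugate pairs, and a stable
# all-pole filter has all of them inside the unit circle.
assert np.all(np.abs(poles) < 1.0)
assert np.isclose(poles[0], np.conj(poles[1]))
```

Each such conjugate pair corresponds to one resonance, as the pole plot of Figure 8 illustrates.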
- Figure 8 is a pole plot of a filter as used.
- a pole 30 and its complex conjugate 32 of the above function have been shown, as corresponding to a particular resonance of the human vocal tract.
- a shaded region has been shown in the pole plot. At right, this comprises a sector between angles ±θmin, which correspond to the lowest resonance frequencies of the human vocal tract, slightly dependent on age, gender, etcetera.
- a common value for this angle corresponds to a frequency of 200Hz, which may depend on the particular voice type selected.
- Figure 12 is a graph illustrating spectral tilt, as present in the real part of the transfer function of such a 'rest' filter as a function of ⁇ .
- the curve starts at a value of 1, and more or less gradually decreases for higher values of ⁇ .
- the initial downward slope is called the spectral tilt of the filter.
- the glottal pulse sequence must now have an initial spectral tilt that has substantially the same value as the transfer function shown. This is effected by shaping the parameters of the LF model.
- the spectral tilt influences the 'warmth' of the speech as subjectively felt by a human listener: a steeper slope gives a 'warmer' sound.
- the tilt is connected with the closing speed of the vocal cords. If the closing is fast, relatively much high-frequency energy persists, but if the closing is slow, relatively little high-frequency energy is present in the voice.
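The initial downward slope just defined can be measured from a filter's frequency response. The sketch below fits a line to the low-band magnitude response; the 500 Hz fitting band and the 8 kHz sampling rate are illustrative choices, not values from the text:

```python
import numpy as np
from scipy.signal import freqz

def spectral_tilt(b, a, fs=8000.0, f_hi=500.0):
    """Estimate the initial downward slope ('spectral tilt') of a
    filter's magnitude response by a least-squares line fit over the
    band [0, f_hi] Hz."""
    w, h = freqz(b, a, worN=512, fs=fs)           # w in Hz, h complex response
    band = w <= f_hi
    db = 20 * np.log10(np.abs(h[band]) + 1e-12)   # magnitude in dB
    slope, _ = np.polyfit(w[band], db, 1)         # dB per Hz near DC
    return slope
```

A one-pole lowpass, for instance, yields a negative tilt, while a flat filter yields a tilt of zero.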
- the coefficients of the vocal-tract filter and the spectral representation of the glottal pulse are derived from the coefficients of the synthesis filter as follows. First, all formant frequencies are assumed to lie above 200 Hz, and the magnitudes of the complex poles of Hv to lie above a threshold of 0.85, but within the unit circle. Separating the complex poles that correspond to formants from those that do not results in a representation of the transfer function as a product:
- the first factor is the estimate of the glottal pulse-filter Hs/H* in (1), and which contains all poles that cannot be assigned to formants.
- the second factor is the estimate for the vocal-tract filter, that contains all formant poles.
- Figure 9a shows a comparison of the vocal-tract filters obtained by means of closed-phase analysis and by means of the above approximation. The same comparison is made for the glottal-pulse filters in Figure 9b. There are only limited differences, to be found around the formant frequencies. These are produced because the closed-phase analysis generally favours sharper formant peaks.
- the separation criterion used here was as follows: all poles corresponding to frequencies below the mentioned threshold frequency of 200 Hz were assumed to be unrelated to formant frequencies.
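This criterion can be sketched directly: convert each pole's angle to a frequency, split at the threshold, and rebuild the two denominator factors. An illustrative implementation assuming NumPy, with the sampling frequency a free parameter not taken from the text:

```python
import numpy as np

def split_poles(a, fs=8000.0, f_min=200.0):
    """Split the poles of an all-pole synthesis filter into formant poles
    and non-formant poles: poles whose frequency lies below f_min are
    assumed unrelated to any formant."""
    poles = np.roots(a)                          # a = [1, a1, ..., ap]
    freqs = np.abs(np.angle(poles)) * fs / (2 * np.pi)
    formant = poles[freqs >= f_min]              # vocal-tract resonances
    non_formant = poles[freqs < f_min]           # glottal-pulse related poles
    a_vocal = np.real(np.poly(formant))          # vocal-tract filter factor
    a_glottal = np.real(np.poly(non_formant))    # glottal-pulse filter factor
    return a_vocal, a_glottal
```

By construction, the convolution of the two returned coefficient sets recovers the original denominator.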
- The separation between formant poles and non-formant poles of Hs is particularly simple if Hs is itself represented as a product of second-order sections, or in so-called PQ pairs, which are different representations of the formant parameters, cf. John R. Deller, Jr. et al., Discrete-Time Processing of Speech Signals, Macmillan 1993, pp. 331-333.
- the LF parameters can be estimated according to the following example.
- the quantities A (arbitrary amplitude), ωg, α, te, ta, and the LF parameter pitch T0 are the generation parameters, of which the middle four need yet to be ascertained, and which are most apt for attaining a closed mathematical expression.
- the pitch is known in the synthesizer.
- the other parameters must be optimized in a systematic manner. A first approach to this optimization is to tune the four parameters until there is a good magnitude match in the frequency domain between the glottal-pulse filter and the LF filter.
- the estimated glottal-pulse filter is an all-pole filter of a certain order.
- This filter can be taken as a reference for an all-pole filter of the same order derived from the LF pulse.
- the parameters of the LF model must then be adapted until a sufficient match occurs.
- An all-pole filter can be derived from the LF pulse by first finding a correlation function:
- R_LF(k) = Σn [∂g(n)/∂t]·[∂g(n+k)/∂t]   (4)
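Equation (4) together with the normal (Yule-Walker) equations yields the all-pole reference filter. A sketch assuming SciPy's Toeplitz solver; the function name and the stand-in pulse are illustrative:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def allpole_from_pulse(dg, order=8):
    """All-pole filter of a given order derived from a glottal-pulse time
    derivative dg: the correlation function of equation (4) feeds the
    Toeplitz normal equations."""
    # R_LF(k) for k = 0..order, as a finite-sum version of equation (4)
    r = np.array([np.dot(dg[:len(dg) - k], dg[k:]) for k in range(order + 1)])
    a = solve_toeplitz(r[:order], -r[1:order + 1])   # solve R a = -r
    return np.concatenate(([1.0], a))                # [1, a1, ..., ap]
```

Comparing this filter against the estimated glottal-pulse filter of the same order gives the matching criterion for tuning the LF parameters.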
- a second exemplary procedure is to measure certain characteristic parameters from the estimated glottal pulse filter, such as the spectral tilt recited supra, and generate an LF pulse having the same characteristics.
- the relation between the LF parameters and the estimated characteristics is determined by the eventual outcome.
- a further feasible procedure is to choose the amplitude of the LF pulse in such a manner that its energy measured over one pitch period can be rendered equal to the energy of the response of the glottal pulse filter when excited with an impulse with the magnitude of the gain parameter.
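This energy-matching choice can be sketched directly. An illustration assuming SciPy, with all names hypothetical:

```python
import numpy as np
from scipy.signal import lfilter

def match_amplitude(lf_pulse, a_glottal, gain, pitch_period):
    """Scale factor for the LF pulse such that its energy over one pitch
    period equals the energy of the glottal-pulse filter's response to a
    single impulse of height `gain`."""
    impulse = np.zeros(pitch_period)
    impulse[0] = gain
    ref = lfilter([1.0], a_glottal, impulse)      # filter response to impulse
    e_ref = np.sum(ref ** 2)                      # reference energy
    e_lf = np.sum(lf_pulse[:pitch_period] ** 2)   # unscaled LF-pulse energy
    return np.sqrt(e_ref / e_lf)                  # scale factor for the pulse
```

Multiplying the LF pulse by the returned factor renders the two energies equal, as the text requires.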
- the required quantities are calculated in a straightforward manner.
- the quality of the results attained is advantageously evaluated in a perceptual manner.
- the objects to be compared are preferably a sustained but short vowel in the respective three variants: the original one, the mono-pulse synthesized vowel, and the vowel synthesized with the improved glottal modelling.
- the estimating of the complex poles of the transfer function of the LPC speech synthesis filter which has a spectral envelope corresponding to the human speech information includes estimating a fixed first line spectrum that is associated with expression 5 hereinafter. Moreover the procedure includes estimating a fixed second line spectrum that is associated with expression 7 hereinafter, as pertaining to the human vocal tract model. Furthermore the procedure includes finding a variable third line spectrum, associated with expression 8 hereinafter, which corresponds to the glottal pulse related sequence, for matching said third line spectrum to the estimated first line spectrum until an appropriate matching level is attained.
- Figures 13a, 13b give an exemplary glottal pulse, and its time derivative, respectively, as modelled.
- the sampling frequency is f s
- the fundamental frequency is f 0
- the normalized angular fundamental frequency is θ0 = 2π·f0/fs
- the parameters used hereinafter are the so-called specification parameters, that are equivalent with the generation parameters but are more closely related to the physical aspects of the speech generation instrument.
- t e and t a have no immediate translation to the generation parameters.
- the signal segment shown in the Figures contains at least two fundamental periods.
- a window function e.g. the Hanning window
- the glottal pulse parameters t e , t p , t a are obtained as the minimizing arguments of the function
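Since the function to be minimized is not reproduced above, the sketch below substitutes a simple squared-error criterion and a much-simplified stand-in waveform; only the idea of obtaining (te, tp, ta) as minimizing arguments is taken from the text, and every equation in the code is an illustrative assumption rather than the patent's LF formulation:

```python
import numpy as np
from scipy.optimize import minimize

def lf_derivative(t, te, tp, ta):
    """Much-simplified stand-in for the LF glottal-flow derivative: a
    sinusoidal open phase up to te, then an exponential return phase with
    time constant ta (tp sets the sinusoid's half period)."""
    open_phase = -np.sin(np.pi * t / tp) * (t <= te)
    return_phase = -np.sin(np.pi * te / tp) * np.exp(-(t - te) / ta) * (t > te)
    return open_phase + return_phase

def fit_lf(target, t, x0=(0.6, 0.45, 0.02)):
    """(te, tp, ta) as the minimizing arguments of a squared-error
    criterion against a target derivative waveform."""
    err = lambda p: np.sum((lf_derivative(t, *p) - target) ** 2)
    return minimize(err, x0, method='Nelder-Mead').x
```

A derivative-free method such as Nelder-Mead is a natural choice here because the criterion is cheap to evaluate and only three parameters are sought.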
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP9525031A JPH11502326A (en) | 1996-01-04 | 1996-12-18 | Method and system for encoding and subsequently playing back human speech |
EP96940095A EP0815555A1 (en) | 1996-01-04 | 1996-12-18 | Method and system for coding human speech for subsequent reproduction thereof |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP96200015 | 1996-01-04 | ||
EP96200015.4 | 1996-01-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1997025708A1 | 1997-07-17 |
Family
ID=8223569
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB1996/001448 WO1997025708A1 (en) | 1996-01-04 | 1996-12-18 | Method and system for coding human speech for subsequent reproduction thereof |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP0815555A1 (en) |
JP (1) | JPH11502326A (en) |
WO (1) | WO1997025708A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE10063402A1 (en) * | 2000-12-19 | 2002-06-20 | Dietrich Karl Werner | Electronic reproduction of a human individual and storage in a database by capturing the identity optically, acoustically, mentally and neuronally |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0599569A2 (en) * | 1992-11-26 | 1994-06-01 | Nokia Mobile Phones Ltd. | A method of coding a speech signal |
- 1996
- 1996-12-18 JP JP9525031A patent/JPH11502326A/en active Pending
- 1996-12-18 WO PCT/IB1996/001448 patent/WO1997025708A1/en not_active Application Discontinuation
- 1996-12-18 EP EP96940095A patent/EP0815555A1/en not_active Ceased
Non-Patent Citations (4)
Title |
---|
ICASSP-92, Volume 1, March 1992, M. DUNN et al., "Pole-zero Code Excited Linear Prediction Using a Perceptually Weighted Error Criterion", pages I-637 - I-639. * |
IEEE TRANS. ON COMMUNICATIONS, Volume 42, No. 1, January 1994, M. YONG, "A New LPC Interpolation Technique for CELP Coders", pages 34-38. * |
IEEE TRANS. ON SPEECH AND AUDIO PROCESSING, Volume 3, No. 6, November 1995, Q. LIN, "A Fast Algorithm for Computing the Vocal-tract Impulse Response from the Transfer Function", pages 449-457. * |
JOURNAL OF THE ACOUSTIC SOCIETY OF AMERICA, Volume 88, No. 3, Sept. 1990, Y. QI, "Replacing Tracheoesophageal Voicing Sources Using LPC Synthesis", pages 1228-1235. * |
Also Published As
Publication number | Publication date |
---|---|
JPH11502326A (en) | 1999-02-23 |
EP0815555A1 (en) | 1998-01-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Valle et al. | Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens | |
US10825433B2 (en) | Electronic musical instrument, electronic musical instrument control method, and storage medium | |
US5749073A (en) | System for automatically morphing audio information | |
KR101214402B1 (en) | Method, apparatus and computer program product for providing improved speech synthesis | |
JP2003150187A (en) | System and method for speech synthesis using smoothing filter, device and method for controlling smoothing filter characteristic | |
US8996378B2 (en) | Voice synthesis apparatus | |
JP2004522186A (en) | Speech synthesis of speech synthesizer | |
CN109416911B (en) | Speech synthesis device and speech synthesis method | |
JP2010237703A (en) | Sound signal processing device and sound signal processing method | |
JP2003255998A (en) | Singing synthesizing method, device, and recording medium | |
JPS5930280B2 (en) | speech synthesizer | |
CN102473416A (en) | Voice quality conversion device, method therefor, vowel information generating device, and voice quality conversion system | |
CN111429877B (en) | Song processing method and device | |
JP2002244689A (en) | Synthesizing method for averaged voice and method for synthesizing arbitrary-speaker's voice from averaged voice | |
CN109410971B (en) | Method and device for beautifying sound | |
JP6330069B2 (en) | Multi-stream spectral representation for statistical parametric speech synthesis | |
JP2002268658A (en) | Device, method, and program for analyzing and synthesizing voice | |
KR20220134347A (en) | Speech synthesis method and apparatus based on multiple speaker training dataset | |
KR100422261B1 (en) | Voice coding method and voice playback device | |
WO1997025708A1 (en) | Method and system for coding human speech for subsequent reproduction thereof | |
JP4430174B2 (en) | Voice conversion device and voice conversion method | |
JP5106274B2 (en) | Audio processing apparatus, audio processing method, and program | |
EP0909443B1 (en) | Method and system for coding human speech for subsequent reproduction thereof | |
JP2615856B2 (en) | Speech synthesis method and apparatus | |
JPH09179576A (en) | Voice synthesizing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): JP |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1996940095 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref country code: JP Ref document number: 1997 525031 Kind code of ref document: A Format of ref document f/p: F |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWP | Wipo information: published in national office |
Ref document number: 1996940095 Country of ref document: EP |
|
WWR | Wipo information: refused in national office |
Ref document number: 1996940095 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1996940095 Country of ref document: EP |