EP0421360B1 - Method and apparatus for speech analysis-synthesis (Procédé et dispositif d'analyse par synthèse de la parole)


Publication number
EP0421360B1
Authority
EP
European Patent Office
Prior art keywords
speech
impulse
phase
filter
waveform
Legal status
Expired - Lifetime
Application number
EP90118888A
Other languages
German (de)
English (en)
Other versions
EP0421360A2 (fr)
EP0421360A3 (en)
Inventor
Masaaki Honda
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Publication of EP0421360A2 (fr)
Publication of EP0421360A3 (en)
Application granted
Publication of EP0421360B1 (fr)


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 — Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 — Speech or audio signal analysis-synthesis techniques using predictive techniques
    • G10L19/08 — Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters

Definitions

  • The present invention relates to a speech analysis-synthesis method and apparatus in which a linear filter representing the spectral envelope characteristic of speech is excited by an excitation signal to synthesize a speech signal.
  • The linear predictive vocoder and multipulse predictive coding have been proposed for use in speech analysis-synthesis systems of this kind.
  • The linear predictive vocoder is now widely used for speech coding in the low bit rate region below 4.8 kb/s; this class includes the PARCOR system and the line spectrum pair (LSP) system. These systems are described in detail in Saito and Nakata, "Fundamentals of Speech Signal Processing," ACADEMIC PRESS, INC., 1985, for instance.
  • The linear predictive vocoder is made up of an all-pole filter representing the spectral envelope characteristic of speech and an excitation signal generating part for generating a signal that excites the all-pole filter.
  • The excitation signal is a pitch-frequency impulse sequence for a voiced sound and white noise for an unvoiced sound.
  • The excitation parameters are the voiced/unvoiced decision, the pitch frequency and the magnitude of the excitation signal. These parameters are extracted as average features of the speech signal in an analysis window of about 30 msec.
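The conventional vocoder excitation just described (pitch impulse train for voiced frames, white noise for unvoiced frames) can be sketched as follows; the function name and the frame-based interface are illustrative, not taken from the patent.

```python
import numpy as np

def vocoder_excitation(voiced, pitch_period, magnitude, frame_len, rng=None):
    """Classic LPC-vocoder excitation sketch: a pitch-period impulse train
    for a voiced frame, white noise for an unvoiced frame."""
    if voiced:
        e = np.zeros(frame_len)
        e[::pitch_period] = magnitude      # impulses at pitch intervals
        return e
    rng = rng or np.random.default_rng(0)
    return magnitude * rng.standard_normal(frame_len)
```

This is exactly the excitation model whose lack of waveform fidelity motivates the invention.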
  • When the speech feature parameters extracted for each analysis window are temporally interpolated to synthesize speech, features of the waveform cannot be reproduced with sufficient accuracy if the pitch frequency, magnitude or spectral characteristic of the speech undergoes rapid changes.
  • Since an excitation signal composed only of a pitch-frequency impulse sequence and white noise is insufficient for reproducing the features of various speech waveforms, it is difficult to produce a highly natural-sounding synthesized speech. To improve the quality of the synthesized speech in the linear predictive vocoder, it is considered in the art to use an excitation which permits more accurate reproduction of the features of the speech waveform.
  • Multipulse predictive coding is a method that uses an excitation of higher reproducibility than in the conventional vocoder.
  • The excitation signal is expressed using a plurality of impulses, and two all-pole filters representing the proximity-correlation and pitch-correlation characteristics of speech are excited by the excitation signal to synthesize the speech.
  • The temporal positions and magnitudes of the impulses are selected such that an error between the input original and synthesized speech waveforms is minimized. This is described in detail in B.S. Atal, "A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates," IEEE Int. Conf. on ASSP, pp. 614-617, 1982.
  • The speech quality can be enhanced by increasing the number of impulses used, but when the bit rate is low the number of impulses is limited; consequently, reproducibility of the speech waveform is impaired and sufficient speech quality cannot be obtained. It is considered in the art that an amount of information of about 8 kb/s is needed to produce high speech quality.
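The multipulse selection of impulse positions and magnitudes is commonly done by a greedy analysis-by-synthesis search; the sketch below illustrates that idea (it is a generic Atal-style search, not the patent's own procedure).

```python
import numpy as np

def multipulse_search(target, h, n_pulses):
    """Greedy multipulse excitation search: at each step place one impulse
    where it best reduces the error between the target waveform and the
    waveform synthesized through the impulse response h."""
    residual = target.astype(float).copy()
    positions, amps = [], []
    for _ in range(n_pulses):
        # correlate the remaining error with the synthesis impulse response
        corr = np.correlate(residual, h, mode="full")[len(h) - 1:]
        pos = int(np.argmax(np.abs(corr)))
        amp = corr[pos] / np.dot(h, h)     # optimal amplitude at that position
        positions.append(pos)
        amps.append(amp)
        contrib = np.zeros_like(residual)
        contrib[pos:pos + len(h)] = amp * h[:len(residual) - pos]
        residual -= contrib                # remove this pulse's contribution
    return positions, amps, residual
```

With few pulses per frame this search cannot track every waveform detail, which is the quality limitation at low bit rates noted above.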
  • In the present invention a quasi-periodic impulse sequence having limited fluctuation in its pitch period is produced.
  • By using the quasi-periodic impulse sequence as an excitation signal for a voiced sound in the speech analysis, it is possible to reduce the amount of parameter information representing the impulse sequence.
  • Fig. 1 illustrates in block form the constitution of the speech analysis-synthesis system of the present invention.
  • A sampled digital speech signal s(t) is input via an input terminal 1.
  • A prediction residual signal e(t) of the input speech signal s(t) is obtained by an inverse filter (not shown) which uses the set of prediction coefficients as its filter coefficients.
  • In the phase equalizing-analyzing part 4, coefficients of a phase equalizing filter for rendering the phase characteristic of the speech into a zero phase, and reference time points of the phase equalization, are computed.
  • Fig. 2 shows in detail the constitution of the phase equalizing-analyzing part 4.
  • The speech signal s(t) is applied to an inverse filter 31 to obtain the prediction residual e(t).
  • The prediction residual e(t) is provided to a maximum magnitude position detecting part 32 and a phase equalizing filter 37.
  • A switch control part 33C monitors the decision signal VU fed from the linear predictive analyzing part 2 and normally connects a switch 33 to the output side of a magnitude comparing part 38; when, however, the current window is of a voiced sound V and the immediately preceding frame is of an unvoiced sound U, the switch 33 is connected to the output side of the maximum magnitude position detecting part 32.
  • The maximum magnitude position detecting part 32 detects and outputs a sample time point t'p at which the magnitude of the prediction residual e(t) is maximum.
  • Assume that phase-equalizing filter coefficients ht'i(k) have already been obtained for the currently determined reference time point t'i at a coefficient smoothing part 35.
  • The coefficients ht'i(k) are supplied from a filter coefficient holding part 36 to the phase equalizing filter 37.
  • The prediction residual e(t), which is the output of the inverse filter 31, is phase-equalized by the phase equalizing filter 37 and output therefrom as a phase-equalized prediction residual ep(t). It is well known that when the input speech signal s(t) is a voiced sound signal, the prediction residual e(t) of the speech signal has a waveform having impulses at the pitch intervals of the voiced sound.
  • The phase equalizing filter 37 produces an effect of emphasizing the magnitudes of the impulses at such pitch intervals.
  • The magnitude comparing part 38 compares levels of the phase-equalized prediction residual ep(t) with a predetermined threshold value, determines, as an impulse position, each sample time point where the sample value exceeds the threshold value, and outputs the impulse position as the next reference time point t'i+1, on the condition that an allowable minimum value of the impulse intervals is Lmin and the next reference time point t'i+1 is searched for among sample points spaced more than Lmin apart from the time point t'i.
  • The phase-equalized residual ep(t) during an unvoiced sound frame is composed of substantially random components (or white noise) which are considerably lower than the above-mentioned threshold value, so the magnitude comparing part 38 does not produce, as an output of the phase equalizing-analyzing part 4, the next reference time point t'i+1. Rather, the magnitude comparing part 38 determines a dummy reference time point t'i+1 at, for example, the last sample point of the frame (but not limited thereto), to be used for determination of smoothed filter coefficients at the smoothing part 35, as will be explained later.
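The threshold detection of the magnitude comparing part 38, with the minimum-interval condition Lmin, can be sketched as follows; the scanning interface is illustrative.

```python
def pick_reference_points(ep, threshold, l_min, start=0):
    """Sketch of the magnitude comparing part 38: scan the phase-equalized
    residual ep(t), emit each sample whose magnitude exceeds the threshold
    as a reference time point, and require successive points to be spaced
    more than l_min (the allowed minimum impulse interval) apart."""
    points = []
    t = start
    while t < len(ep):
        if abs(ep[t]) > threshold:
            points.append(t)
            t += l_min + 1   # next point must lie more than l_min away
        else:
            t += 1
    return points
```

A frame in which no sample exceeds the threshold yields an empty list, corresponding to the unvoiced case where only a dummy reference time point is assigned.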
  • The characteristic of the phase-equalizing filter 37 expressed by Eq. (2) is such that the input signal is passed through the filter intact.
  • The filter coefficients h*(k) thus calculated for the next reference time point t'i+1 are smoothed by the coefficient smoothing part 35, as will be described later, to obtain smoothed phase equalizing filter coefficients ht'i+1(k), which are held by the coefficient holding part 36 and supplied as updated coefficients ht'i(k) to the phase equalizing filter 37.
  • The phase equalizing filter 37, having its coefficients thus updated, phase-equalizes the prediction residual e(t) again and, based on its output, the next impulse position, i.e. a new next reference time point t'i+1, is determined by the magnitude comparing part 38.
  • Thus, a next reference time point t'i+1 is determined based on the phase-equalized residual ep(t) output from the phase equalizing filter 37 whose coefficients have been set to ht'i(k) and, thereafter, new smoothed filter coefficients ht'i+1(k) are calculated for the reference time point t'i+1.
  • When the prediction residual e(t) including impulses of the pitch frequency is provided, for the first time, to the phase equalizing filter 37, the filter has set therein the filter coefficients given essentially by Eq. (1).
  • In this case the magnitudes of the impulses are not emphasized and, consequently, the prediction residual e(t) is output intact from the filter 37.
  • If the magnitudes of the impulses of the pitch frequency happen to be smaller than the threshold value, the impulses cannot be detected by the magnitude comparing part 38. That is, the speech is processed as if no impulses were contained in the prediction residual, and consequently the filter coefficients h*(k) for the impulse positions are not obtained; this is not preferable from the viewpoint of speech quality in the speech analysis-synthesis.
  • To avoid this, the maximum magnitude detecting part 32 detects the maximum magnitude position t'p of the prediction residual e(t) in the voiced sound frame, provides it via the switch 33 to a filter coefficient calculating part 34 and, at the same time, outputs it as a reference time point.
  • The filter coefficient calculating part 34 calculates the filter coefficients h*(k), using the reference time point t'p in place of t'i+1 in Eq. (2).
  • The coefficient b is set to a value of about 0.97.
  • ht-1(k) represents the smoothed filter coefficients at an arbitrary sample point (t-1) in the time interval between the current reference time point t'i and the next reference time point t'i+1, and
  • ht(k) represents the smoothed filter coefficients at the next sample point. This smoothing takes place for every sample point, from the sample point next to the current reference time point t'i, for which the smoothed filter coefficients have already been obtained, up to the next reference time point t'i+1, for which the smoothed filter coefficients are to be obtained next.
  • The filter coefficient holding part 36 holds those of the thus sequentially smoothed filter coefficients ht(k) which were obtained for the last sample point, i.e. the next reference time point (that is, ht'i+1(k)), and supplies them as updated filter coefficients ht'i(k) to the phase equalizing filter 37 for determination of a subsequent next reference time point.
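The smoothing of Eq. (3) appears, from the surrounding text (leak coefficient b of about 0.97, applied sample by sample between reference time points), to be a first-order recursion toward the newly calculated coefficients h*(k); the exact form below is an assumption consistent with that description.

```python
def smooth_coefficients(h_prev, h_star, n_steps, b=0.97):
    """Sample-by-sample smoothing from the current reference time point to
    the next one, assumed form: h_t(k) = b * h_{t-1}(k) + (1 - b) * h*(k)."""
    h = list(h_prev)
    for _ in range(n_steps):
        h = [b * hk + (1.0 - b) * hs for hk, hs in zip(h, h_star)]
    return h
```

Run over the samples separating two reference points, the recursion moves the coefficients gradually toward h*(k), which avoids abrupt filter changes at each new impulse position.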
  • The phase equalizing filter 37 is supplied with the prediction residual e(t) and calculates the phase-equalized prediction residual ep(t) by the following equation:
  • The calculation of Eq. (4) needs only to be performed until the next impulse position is detected by the magnitude comparing part 38 after the reference time point t'i at which the above-said smoothed filter coefficients were obtained.
  • In the magnitude comparing part 38 the magnitude level of the phase-equalized prediction residual ep(t) is compared with a threshold value, and the sample point where the former exceeds the latter is detected as the next reference time point t'i+1 in the current frame.
  • When no such sample point is found, processing is performed by which the time point where the phase-equalized prediction residual ep(t) takes the maximum magnitude until then is detected as the next reference time point t'i+1.
  • Steps 5 through 8 are repeatedly performed in the same manner as mentioned above, whereby the smoothed filter coefficients ht'i(k) at all impulse positions in the frame can be obtained.
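Eq. (4) itself is not reproduced in this text. As an illustrative stand-in, the sketch below applies a finite FIR window ht(k) of half-width k_max around each sample; the symmetric-window form is an assumption, not the patent's equation.

```python
import numpy as np

def phase_equalize(e, h, k_max):
    """Illustrative FIR form for the phase-equalized residual: ep(t) is a
    weighted sum of e over a window of +/- k_max samples around t.
    (Assumed form; Eq. (4) is not reproduced in the text above.)"""
    ep = np.zeros(len(e), dtype=float)
    for t in range(len(e)):
        for k in range(-k_max, k_max + 1):
            if 0 <= t + k < len(e):
                ep[t] += h[k + k_max] * e[t + k]
    return ep
```

With the unit-impulse coefficient set, the filter passes the input intact, matching the pass-through characteristic described for Eq. (2).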
  • The smoothed filter coefficients ht(k) obtained in the phase equalizing-analyzing part 4 are used to control a phase equalizing filter 5.
  • In the phase equalizing filter 5, the processing expressed by the following equation is performed to obtain a phase-equalized speech signal Sp(t).
  • The voiced sound excitation source comprises an impulse sequence generating part 7 and an all-zero filter (hereinafter referred to simply as a zero filter) 10.
  • The impulse sequence generating part 7 generates such a quasi-periodic impulse sequence as shown in Fig. 3, in which the impulse position ti and the magnitude mi of each impulse are specified.
  • The temporal position (the impulse position) ti and the magnitude mi of each impulse in the quasi-periodic impulse sequence are represented as parameters.
  • The impulse position ti is produced by an impulse position generating part 6 based on the reference time point t'i, and the impulse magnitude mi is controlled by an impulse magnitude calculating part 8.
  • The impulse magnitude at each impulse position ti generated by the impulse position generating part 6 is selected so that a frequency-weighted mean square error between a synthesized speech waveform Sp'(t), produced by exciting an all-pole filter 18 with the impulse sequence created by the impulse sequence generating part 7, and the input speech waveform Sp(t) phase-equalized by the phase equalizing filter 5 is eventually minimized.
  • Fig. 6 shows the internal construction of the impulse magnitude calculating part 8.
  • The phase-equalized input speech waveform Sp(t) is supplied to a frequency weighting filter processing part 39.
  • The frequency weighting filter processing part 39 has such a construction as shown in Fig. 6A.
  • The linear prediction coefficients ai are provided to a frequency weighting filter coefficient calculating part 39A, in which coefficients γ^i·ai of a filter having a transfer characteristic A(z/γ) are calculated.
  • A zero input response calculating part 39C uses, as an initial value, the synthesized speech of the preceding frame obtained as the output of an all-pole filter 18A (see Fig. 1) of transfer characteristic 1/A(z/γ), and outputs the response when the all-pole filter 18A is excited by a zero input.
  • A target signal calculating part 39D subtracts the output of the zero input response calculating part 39C from the output S'w(t) of a frequency weighting filter 39B to obtain a frequency-weighted target signal Sw(t).
  • The output γ^i·ai of the frequency weighting filter coefficient calculating part 39A is supplied to an impulse response calculating part 40 in Fig. 6, in which an impulse response f(t) of a filter having the transfer characteristic 1/A(z/γ) is calculated.
  • Another correlation calculating part 42 calculates a covariance φ(i, j) of the impulse response for a set of impulse positions ti, tj as follows:
  • An impulse magnitude calculating part 43 obtains the impulse magnitudes mi from ψ(t) and φ(i, j) by solving the following simultaneous equations, which equivalently minimize a mean square error between a synthesized speech waveform obtainable by exciting the all-pole filter 18 with the impulse sequence thus determined and the phase-equalized speech waveform Sp(t).
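The simultaneous equations for the magnitudes have the standard normal-equation form: the cross-correlation ψ of the weighted target with the impulse response placed at each position, and the covariance φ between impulse responses at pairs of positions. The sketch below builds them directly from a given impulse response f and weighted target sw; the interface is illustrative.

```python
import numpy as np

def optimal_impulse_magnitudes(f, sw, positions):
    """Solve sum_j phi(i, j) * m_j = psi(i) for the impulse magnitudes m_i,
    where psi(i) = <sw, u_i>, phi(i, j) = <u_i, u_j>, and u_i is the
    impulse response f shifted to position t_i (sketch; symbols follow
    the surrounding text)."""
    n = len(sw)
    u = []
    for p in positions:
        ui = np.zeros(n)
        ui[p:p + len(f)] = f[:n - p]   # impulse response placed at t_i
        u.append(ui)
    psi = np.array([np.dot(sw, ui) for ui in u])
    phi = np.array([[np.dot(ui, uj) for uj in u] for ui in u])
    return np.linalg.solve(phi, psi)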
  • The impulse magnitudes mi are quantized by a quantizer 9 in Fig. 1 for each frame. This is carried out by, for example, a scalar quantization or vector quantization method.
  • In the vector quantization method, a vector (a magnitude pattern) using the respective impulse magnitudes mi as its elements is compared with a plurality of predetermined standard impulse magnitude patterns and is quantized to that one of them which minimizes the distance between the patterns.
  • A measure of the distance between the magnitude patterns corresponds essentially to a mean square error between the speech waveform Sp'(t) synthesized, without using the zero filter, from the standard impulse magnitude pattern selected in the quantizer 9 and the phase-equalized input speech waveform Sp(t).
  • The quantized value m̂ of the above-mentioned magnitude pattern is expressed by the following equation, as the standard pattern which minimizes the mean square error d(m, mc) in Eq. (12) among the afore-mentioned plurality of standard pattern vectors mci.
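Since the pattern distance is stated to correspond to a waveform mean square error, a natural realization is a quadratic form weighted by the impulse-response covariance matrix Φ; that weighting is an assumption consistent with the text, and the sketch below uses it.

```python
import numpy as np

def vq_magnitude_pattern(m, codebook, phi):
    """Vector-quantize the magnitude pattern m to the standard pattern mc
    minimizing d(m, mc) = (m - mc)^T Phi (m - mc), where Phi is the
    impulse-response covariance matrix (assumed weighted-distance form)."""
    diffs = [m - mc for mc in codebook]
    dists = [d @ phi @ d for d in diffs]
    best = int(np.argmin(dists))
    return best, codebook[best]
```

The selected codebook index, rather than the magnitudes themselves, is what would be transmitted per frame.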
  • The zero filter 10 serves to provide the input impulse sequence with features of the phase-equalized prediction residual waveform, and the coefficients of this filter are produced by a zero filter coefficient calculating part 11.
  • Fig. 7A shows an example of the phase-equalized prediction residual waveform ep(t), and
  • Fig. 7B an example of the impulse response waveform of the zero filter 10 for an input impulse.
  • The phase-equalized prediction residual ep(t) has a flat spectral envelope characteristic and a phase close to zero, and hence is impulsive and large in magnitude at the impulse positions ti, ti+1, ... but relatively small at other positions.
  • The waveform is substantially symmetric with respect to each impulse position and each midpoint between adjacent impulse positions, respectively.
  • The magnitude at the midpoint is relatively larger than at other positions (except the impulse positions), as will be seen from Fig. 7A, and this tendency increases for a speech of a long pitch period, in particular.
  • The zero filter 10 is set so that its impulse response assumes values at successive q sample points on either side of the impulse position ti and at successive r sample points on either side of the midpoint between the adjacent impulse positions ti and ti+1, as depicted in Fig. 7B.
  • The transfer characteristic of the zero filter 10 is expressed as follows:
  • The filter coefficients vk are determined such that a frequency-weighted mean square error between the synthesized speech waveform Sp'(t) and the phase-equalized input speech waveform Sp(t) is minimized.
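The support of the zero filter response just described (2q+1 samples around each impulse, 2r+1 around each midpoint) can be sketched by expanding an impulse sequence into a shaped excitation; the layout follows Fig. 7B as described, but the function, the coefficient split into v_pulse/v_mid, and the scaling of the midpoint lobe by the left impulse magnitude are illustrative assumptions.

```python
import numpy as np

def zero_filter_excitation(positions, magnitudes, v_pulse, v_mid, n):
    """Shape a quasi-periodic impulse sequence with a zero-filter-like
    response: v_pulse holds 2q+1 response values centred on each impulse
    position, v_mid holds 2r+1 values centred on each midpoint between
    adjacent impulses (sketch of the Fig. 7B layout; midpoint lobes are
    scaled by the preceding impulse magnitude as an assumption)."""
    q = (len(v_pulse) - 1) // 2
    r = (len(v_mid) - 1) // 2
    x = np.zeros(n)
    for t, m in zip(positions, magnitudes):
        for k in range(-q, q + 1):
            if 0 <= t + k < n:
                x[t + k] += m * v_pulse[k + q]
    for (t0, m0), t1 in zip(zip(positions, magnitudes), positions[1:]):
        mid = (t0 + t1) // 2
        for k in range(-r, r + 1):
            if 0 <= mid + k < n:
                x[mid + k] += m0 * v_mid[k + r]
    return x
```

The midpoint lobes reproduce the secondary magnitude concentration visible in ep(t) for long pitch periods.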
  • Fig. 8 illustrates the construction of the zero filter coefficient calculating part 11.
  • A frequency weighting filter processing part 44 and an impulse response calculating part 45 are identical in construction with the frequency weighting filter processing part 39 and the impulse response calculating part 40 in Fig. 6, respectively.
  • A correlation calculating part 47 calculates the cross-covariance ψ(i) between the signals Sw(t) and ui(t), and another correlation calculating part 48 calculates the auto-covariance φ(i, j) between the signals ui(t) and uj(t).
  • A filter coefficient calculating part 49 calculates the coefficients vi of the zero filter 10 from the above-said cross-covariance ψ(i) and covariance φ(i, j) by solving the following simultaneous equations. These solutions eventually minimize a mean square error between a synthesized speech waveform obtainable by exciting the all-pole filter 18 with the output of the zero filter 10 and the phase-equalized speech waveform Sp(t).
  • The filter coefficients vi are quantized by a quantizer 12 in Fig. 1. This is performed by use of a scalar quantization or vector quantization technique, for example.
  • In the vector quantization, a vector (a coefficient pattern) using the filter coefficients vi as its elements is compared with a plurality of predetermined standard coefficient patterns and is quantized to the standard pattern which minimizes the distance between the patterns.
  • The quantized value v̂ of the filter coefficients is obtained by the following equation, where v is a vector using, as its elements, the coefficients v-q, v-q+1, ..., vq+2r+1 obtained by solving Eq. (16), and vci is a standard pattern vector of the filter coefficients. Further, Φ is a matrix using as its elements the covariance φ(i, j) of the impulse response ui(t).
  • As described above, the speech signal Sp'(t) is synthesized by exciting an all-pole filter featuring the speech spectrum envelope characteristic with a quasi-periodic impulse sequence, which is determined by impulse positions based on the phase-equalized residual ep(t) and by impulse magnitudes determined so that the error of the synthesized speech is minimized.
  • The impulse magnitudes mi and the coefficients vi of the zero filter are set to optimum values which minimize the matching error between the synthesized speech waveform Sp'(t) and the phase-equalized speech waveform Sp(t).
  • A random pattern generating part 13 in Fig. 1 has stored therein a plurality of patterns, each composed of a plurality of normal random numbers with mean 0 and variance 1.
  • A gain calculating part 15 calculates, for each random pattern, a gain gi which makes the power of the speech Sp'(t) synthesized from the output random pattern equal to the power of the phase-equalized speech Sp(t); the gain ĝi scalar-quantized by a quantizer 16 is used to control an amplifier 14.
  • A matching error between the synthesized speech waveform Sp'(t), obtained by applying each of the random patterns to the all-pole filter 18, and the phase-equalized speech Sp(t) is obtained by a waveform matching error calculating part 19.
  • The errors thus obtained are examined by an error deciding part 20, and the random pattern generating part 13 searches for the optimum random pattern which minimizes the waveform matching error.
  • One frame is composed of three successive random patterns. This random pattern sequence is applied as the excitation signal to the all-pole filter 18 via the amplifier 14.
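The power-matching gain of part 15 can be sketched as below; synth_filter stands in for the all-pole filter 18 and the function name is illustrative.

```python
import numpy as np

def unvoiced_gain(sp, pattern, synth_filter):
    """Sketch of the gain calculating part 15: scale a random pattern so
    that the power of the synthesized speech equals the power of the
    phase-equalized speech sp. synth_filter is any callable mapping an
    excitation to a synthesized waveform (stand-in for filter 18)."""
    synth = synth_filter(pattern)
    p_target = np.mean(sp ** 2)
    p_synth = np.mean(synth ** 2)
    return np.sqrt(p_target / p_synth)
```

The gain is computed per candidate pattern, and the pattern/gain pair giving the smallest waveform matching error is the one retained.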
  • The speech signal is represented by the linear prediction coefficients ai and the voiced/unvoiced parameter VU; a voiced sound is represented by the impulse positions ti, the quantized impulse magnitudes m̂i and the quantized zero filter coefficients v̂i, and an unvoiced sound is represented by the random number code pattern (number) ci and the gain ĝi.
  • These speech parameters are coded by a coding part 21 and then transmitted or stored. In the speech synthesizing part the speech parameters are decoded by a decoding part 22.
  • For a voiced sound, an impulse sequence composed of the impulse positions ti and the impulse magnitudes m̂i is produced in an impulse sequence generating part 23 and is applied to a zero filter 24 to create an excitation signal.
  • For an unvoiced sound, a random pattern is selectively generated by a random pattern generating part 25 using the random number code (signal) ci and is applied to an amplifier 26, which is controlled by the gain ĝi and in which it is magnitude-controlled to produce an excitation signal.
  • One of the excitation signals thus produced is selected by a switch 27, which is controlled by the voiced/unvoiced parameter VU, and the selected excitation signal is applied to an all-pole filter 28 to excite it, providing a synthesized speech at its output end 29.
  • The filter coefficients of the zero filter 24 are controlled by v̂i and the filter coefficients of the all-pole filter 28 are controlled by ai.
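The voiced synthesis path of the decoder (impulse sequence generating part 23, zero filter 24, all-pole filter 28) can be sketched end to end; the direct-form FIR and all-pole recursions are standard, and the function interface is illustrative.

```python
import numpy as np

def synthesize_voiced(positions, magnitudes, v, a, n):
    """Voiced synthesis path sketch: impulse sequence (part 23) -> zero
    filter 24 (FIR, coefficients v) -> all-pole filter 28 (prediction
    coefficients a, direct form: s(t) = exc(t) - sum_i a_i * s(t-i))."""
    # impulse sequence generating part 23
    imp = np.zeros(n)
    for t, m in zip(positions, magnitudes):
        imp[t] = m
    # zero filter 24 (all-zero / FIR)
    exc = np.zeros(n)
    for t in range(n):
        for k, vk in enumerate(v):
            if t - k >= 0:
                exc[t] += vk * imp[t - k]
    # all-pole filter 28
    s = np.zeros(n)
    for t in range(n):
        s[t] = exc[t]
        for i, ai in enumerate(a, start=1):
            if t - i >= 0:
                s[t] -= ai * s[t - i]
    return s
```

For an unvoiced frame, the excitation would instead be the gain-scaled random pattern selected by ci, switched in by part 27.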
  • In a first modified form, the impulse excitation source is used in common for voiced and unvoiced sounds in the construction of Fig. 1. That is, the random pattern generating part 13, the amplifier 14, the gain calculating part 15, the quantizer 16 and the switch 17 are omitted, and the output of the zero filter 10 is applied directly to the all-pole filter 18.
  • In this case the bit rate is reduced by 60 bits per second.
  • In another modified form, the zero filter 10 is not included in the impulse excitation source in Fig. 1; that is, the zero filter 10, the zero filter coefficient calculating part 11 and the quantizer 12 are omitted, and the output of the impulse sequence generating part 7 is provided via the switch 17 to the all-pole filter 18. (The zero filter 24 is also omitted accordingly.)
  • In this case the natural-sounding property of the synthesized speech is somewhat degraded for a male voice of low pitch frequency, but the removal of the zero filter 10 reduces the scale of the hardware used, and the bit rate is reduced by the 600 bits per second which are needed for coding the filter coefficients.
  • In a third modified form, processing by the impulse magnitude calculating part 8 and processing by the vector quantizing part 9 in Fig. 1 are integrated for calculating a quantized value of the impulse magnitudes.
  • Fig. 9 shows the construction of this modified form.
  • A frequency weighting filter processing part 50, an impulse response calculating part 51, a correlation calculating part 52 and another correlation calculating part 53 are identical in construction with those in Fig. 6.
  • The constructions of Figs. 6 and 9 are nearly equal in the amount of data to be processed for obtaining the optimum impulse magnitudes, but in Fig. 9 the processing for solving the simultaneous equations included in the processing of Fig. 6 is not required, and the processor is correspondingly simple in structure.
  • In the construction of Fig. 6 the maximum value of the impulse magnitude can be scalar-quantized, whereas in Fig. 9 it is premised that the vector quantization method is used.
  • In a further modified form, the impulse position generating part 6 is not provided and, consequently, the processing shown in Fig. 4 is not involved; instead, all the reference time points t'i provided from the phase equalizing-analyzing part 4 are used as the impulse positions ti.
  • The throughput for enhancing the quality of the synthesized speech by the use of the zero filter 10 may also be assigned to the reduction of the impulse position information at the expense of the speech quality.
  • The constant J representing the allowed limit of fluctuations in the impulse frequency in the impulse source, the allowed maximum number of impulses per frame Np, and the allowed minimum value of the impulse intervals Lmin are dependent on the number of bits assigned for coding of the impulse positions.
  • For example, it is preferable that:
  • the difference ΔT between adjacent impulse intervals be equal to or smaller than 5 samples,
  • the maximum number of impulses Np be equal to or smaller than 6, and
  • the allowed minimum impulse interval Lmin be equal to or greater than 13 samples.
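The quasi-periodicity constraints just quoted can be checked for one frame of candidate impulse positions as follows; the function name is illustrative and the defaults are the example values above.

```python
def valid_impulse_positions(positions, delta_t_max=5, np_max=6, l_min=13):
    """Check the quasi-periodicity constraints for one frame: at most
    np_max impulses, each interval at least l_min samples, and the
    difference between adjacent intervals at most delta_t_max samples."""
    if len(positions) > np_max:
        return False
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    if any(iv < l_min for iv in intervals):
        return False
    return all(abs(b - a) <= delta_t_max
               for a, b in zip(intervals, intervals[1:]))
```

Bounding the interval fluctuation is what allows the positions to be coded differentially with few bits.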
  • The random pattern vector ci is composed of 40 samples (5 ms) and is selected from 512 kinds of patterns (9 bits).
  • The gain gi is scalar-quantized using 6 bits, including a sign bit.
  • The speech coded using the above conditions is far more natural-sounding than the speech produced by the conventional vocoder, and its quality is close to that of the original speech. Further, the dependence of the speech quality on the speaker is lower in the present invention than in the case of the prior art vocoder. It has been ascertained that the quality of the coded speech is apparently higher than in the cases of conventional multipulse predictive coding and code excited predictive coding.
  • The spectral envelope error of a speech coded at 4.8 kb/s is about 1 dB.
  • The coding delay of this invention is 45 ms, which is equal to or shorter than that of conventional low bit rate speech coding schemes.
  • The reproducibility of speech waveform information is higher than in the conventional vocoder, and the excitation signal can be expressed with a smaller amount of information than in conventional multipulse predictive coding.
  • The present invention enhances matching between the synthesized speech waveform and the input speech waveform, as compared with the prior art which utilizes an error between the input speech itself and the synthesized speech, and hence permits an accurate estimation of the excitation parameters.
  • The zero filter produces the effect of reproducing fine spectral characteristics of the original speech, thereby making the synthesized speech more natural-sounding.


Claims (9)

  1. Procédé d'analyse de la parole pour produire un signal d'excitation pour exciter un filtre linéaire représentant une caractéristique d'enveloppe spectrale de la parole comprenant :
    une étape dans laquelle on détermine des positions d'impulsions et l'on produit une séquence d'impulsions dans lesdites positions prédéterminées lors de la production dudit signal d'excitation ;
    une étape dans laquelle on détermine des paramètres représentant ledit signal d'excitation de façon à minimiser une erreur entre une forme d'onde de parole égalisée en phase, après égalisation de phase d'une parole d'entrée, et une forme d'onde de parole synthétisée, pouvant être obtenue en excitant ledit filtre linéaire à l'aide de ladite séquence d'impulsions ;
    une étape dans laquelle on produit un résidu de prédiction égalisée en phase de la forme d'onde de parole d'entrée ;
    caractérisé en ce que :
    ladite étape de détermination des positions d'impulsions et de production de la séquence d'impulsions produisant ledit signal d'excitation comprend :
    une étape dans laquelle on détermine des points temporels de référence où les niveaux dudit résidu de prédiction égalisée en phase dépassent un seuil prédéterminé ; et
    une étape dans laquelle on détermine des positions d'impulsions d'une séquence d'impulsions quasi-périodique en tant que positions d'impulsions de ladite séquence d'impulsions en se basant sur les points temporels de référence de sorte que la fluctuation d'intervalles de temps successifs des positions d'impulsions reste à l'intérieur d'une plage limitée.
  2. Procédé selon la revendication 1, comprenant une étape dans laquelle on détermine des coefficients d'un filtre de zéro (10), caractérisant une structure spectrale fine de ladite parole, de façon à minimiser une erreur entre ladite forme d'onde de parole égalisée en phase et une forme d'onde de parole synthétisée, pouvant être obtenue en excitant ledit filtre linéaire (18) avec la sortie dudit filtre de zéro, lesdits coefficients dudit filtre de zéro étant utilisés comme l'un desdits paramètres représentant ledit signal d'excitation.
  3. The method of claim 1 or 2, wherein said excitation signal is used for a voiced sound and a random sequence selected from a plurality of random patterns is used as the excitation signal for an unvoiced sound, and including a step of determining parameters representing said excitation signal for said unvoiced sound so as to minimize an error between said phase-equalized speech waveform and a synthesized speech waveform obtainable by exciting said linear filter with said random patterns.
  4. The method of claim 1 or 2, wherein said parameters representing said excitation signal include a parameter representing the amplitude of each pulse of said pulse sequence, said amplitude parameter being determined so as to minimize an error between said phase-equalized speech waveform and a synthesized speech waveform obtainable by exciting said linear filter with said pulse sequence.
  5. A speech analysis apparatus comprising:
    linear predictive analysis means (2) for performing a linear predictive analysis of an input speech signal (s(t)) for each analysis window of a fixed length, to obtain prediction coefficients (ai);
    inverse filter means (31), controlled by said prediction coefficients, for obtaining a prediction residual (e(t)) from said input speech signal (s(t));
    speech phase equalization filter means (5) for zero-phasing said input speech signal, to obtain a phase-equalized speech signal (Sp(t));
    prediction residual phase equalization filter means (37) for zero-phasing said prediction residual (e(t)), to obtain a phase-equalized prediction residual signal (ep(t));
    means (4, 6, 7) for determining pulse positions and for producing, as an excitation signal, a pulse sequence at said positions;
    all-pole filter means (18), controlled by said prediction coefficients and excited by said pulse sequence, for producing synthesized speech; and
    pulse amplitude calculating means (8) by which amplitude values of said pulse sequence are determined so as to minimize an error between a waveform of synthesized speech obtainable by exciting said all-pole filter means with said pulse sequence and a waveform of said phase-equalized speech, parameters including said pulse positions and said pulse amplitude values being output from the speech analysis apparatus;
    characterized in that said means (4, 6, 7) for determining the pulse positions and producing the pulse sequence comprises:
    reference time point generating means (4, 38) for detecting pulses of amplitude greater than a predetermined threshold value in said phase-equalized prediction residual signal and for outputting the positions of said pulses as reference time points; and
    pulse position generating means (6) for determining, on the basis of said reference time positions, the positions of pulses having a pitch frequency with a limited fluctuation width.
  6. The apparatus of claim 5, further comprising:
    zero filter means (10), fed with said pulse sequence, for imparting to said pulse sequence the waveform features of said phase-equalized prediction residual signal and for supplying its output, as the excitation signal, to said all-pole filter means (18); and
    zero filter coefficient calculating means (11) for determining the coefficients of said zero filter means so as to minimize the error between a waveform of synthesized speech, obtained by exciting said all-pole filter means with the output of said zero filter means, and a waveform of said phase-equalized speech.
  7. The apparatus of claim 5 or 6, wherein said linear predictive analysis means (2) includes means for determining whether said input signal in an analysis window of a fixed length is voiced or unvoiced and for outputting a voiced/unvoiced decision signal (VU), the apparatus further comprising random pattern generating means (13) for producing a random pattern which minimizes the error between a waveform of synthesized speech, obtained by exciting said all-pole filter means (18) with one of a plurality of random patterns, and a waveform of said phase-equalized speech in a window during which said decision signal is unvoiced.
  8. The apparatus of claim 5 or 6, wherein said means (4, 6, 7) for producing said pulse sequence includes vector quantization means (9) for vector-quantizing the amplitude values of said pulses determined by said pulse amplitude calculating means (8), whereby said pulse sequence has said quantized amplitude values.
  9. A speech synthesis apparatus for synthesizing speech in response to parameters representing an excitation signal output from a speech analysis apparatus according to claim 5 or 6, comprising:
    pulse sequence generating means (23) for producing a pulse sequence based on said parameters;
    zero filter means (24), excited by said pulse sequence under control of zero filter coefficients supplied thereto as one of said parameters, for imparting to said pulse sequence the spectral characteristic of the speech; and
    all-pole filter means (28), excited by the output of said zero filter means under control of prediction coefficients representing a spectral envelope characteristic of the speech, for synthesizing a speech waveform.
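The voiced-excitation model in the claims above can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names are invented, the pitch-fluctuation constraint of claim 1 is reduced to a single lower bound on the pulse interval, and the zero filter, pulse amplitudes, and least-squares error minimization are omitted.

```python
def reference_points(residual, threshold):
    # Claim 1, first step: time points where the magnitude of the
    # phase-equalized prediction residual exceeds a fixed threshold.
    return [n for n, v in enumerate(residual) if abs(v) > threshold]

def pulse_positions(refs, min_interval):
    # Claim 1, second step (simplified): keep a quasi-periodic subset
    # of the reference points so that successive pulse intervals never
    # fall below min_interval. The patent bounds the fluctuation of
    # the interval on both sides; only the lower bound is sketched here.
    positions = []
    for r in refs:
        if not positions or r - positions[-1] >= min_interval:
            positions.append(r)
    return positions

def synthesize(excitation, lpc):
    # Claim 9 style all-pole synthesis with prediction coefficients
    # lpc = [a_1, ..., a_p]:  s[n] = e[n] + sum_k a_k * s[n-k].
    out = []
    for n, e in enumerate(excitation):
        y = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                y += a * out[n - k]
        out.append(y)
    return out
```

For example, a unit pulse driving a first-order all-pole filter with a_1 = 0.5 decays geometrically: `synthesize([1, 0, 0, 0], [0.5])` returns `[1, 0.5, 0.25, 0.125]`.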
EP90118888A 1989-10-02 1990-10-02 Method and apparatus for speech analysis by synthesis Expired - Lifetime EP0421360B1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP257503/89 1989-10-02
JP1257503A JPH0782360B2 (ja) 1989-10-02 1989-10-02 Speech analysis-synthesis method

Publications (3)

Publication Number Publication Date
EP0421360A2 EP0421360A2 (fr) 1991-04-10
EP0421360A3 EP0421360A3 (en) 1991-12-27
EP0421360B1 true EP0421360B1 (fr) 1996-01-17

Family

ID=17307200

Family Applications (1)

Application Number Title Priority Date Filing Date
EP90118888A Expired - Lifetime EP0421360B1 (fr) 1989-10-02 1990-10-02 Method and apparatus for speech analysis by synthesis

Country Status (4)

Country Link
EP (1) EP0421360B1 (fr)
JP (1) JPH0782360B2 (fr)
CA (1) CA2026640C (fr)
DE (1) DE69024899T2 (fr)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2741744B1 (fr) * 1995-11-23 1998-01-02 Thomson Csf Method and device for evaluating the energy of the speech signal by sub-band for low bit rate vocoders
CN1252679C (zh) 1997-03-12 2006-04-19 Mitsubishi Electric Corp Speech coding apparatus, speech coding and decoding apparatus, and speech coding method
US6385573B1 (en) 1998-08-24 2002-05-07 Conexant Systems, Inc. Adaptive tilt compensation for synthesized speech residual
JP4999757B2 (ja) * 2008-03-31 2012-08-15 Nippon Telegraph and Telephone Corp Speech analysis-synthesis apparatus, speech analysis-synthesis method, computer program, and recording medium
JP5325130B2 (ja) * 2010-01-25 2013-10-23 Nippon Telegraph and Telephone Corp LPC analysis apparatus, LPC analysis method, speech analysis-synthesis apparatus, speech analysis-synthesis method, and program
CN108281150B (zh) * 2018-01-29 2020-11-17 上海泰亿格康复医疗科技股份有限公司 Voice pitch and timbre modification method based on a differential glottal wave model
CN113066476B (zh) * 2019-12-13 2024-05-31 iFLYTEK Co., Ltd. Synthesized speech processing method and related apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0163829B1 (fr) * 1984-03-21 1989-08-23 Nippon Telegraph And Telephone Corporation Dispositif pour le traitement des signaux de parole

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8620647B2 (en) 1998-09-18 2013-12-31 Wiav Solutions Llc Selection of scalar quantixation (SQ) and vector quantization (VQ) for speech coding
US8635063B2 (en) 1998-09-18 2014-01-21 Wiav Solutions Llc Codebook sharing for LSF quantization
US8650028B2 (en) 1998-09-18 2014-02-11 Mindspeed Technologies, Inc. Multi-mode speech encoding system for encoding a speech signal used for selection of one of the speech encoding modes including multiple speech encoding rates
US9190066B2 (en) 1998-09-18 2015-11-17 Mindspeed Technologies, Inc. Adaptive codebook gain control for speech coding
US9269365B2 (en) 1998-09-18 2016-02-23 Mindspeed Technologies, Inc. Adaptive gain reduction for encoding a speech signal
US9401156B2 (en) 1998-09-18 2016-07-26 Samsung Electronics Co., Ltd. Adaptive tilt compensation for synthesized speech

Also Published As

Publication number Publication date
EP0421360A2 (fr) 1991-04-10
DE69024899D1 (de) 1996-02-29
JPH03119398A (ja) 1991-05-21
CA2026640A1 (fr) 1991-04-03
EP0421360A3 (en) 1991-12-27
DE69024899T2 (de) 1996-07-04
CA2026640C (fr) 1996-07-09
JPH0782360B2 (ja) 1995-09-06

Similar Documents

Publication Publication Date Title
US5293448A (en) Speech analysis-synthesis method and apparatus therefor
McCree et al. A mixed excitation LPC vocoder model for low bit rate speech coding
US5305421A (en) Low bit rate speech coding system and compression
US6345248B1 (en) Low bit-rate speech coder using adaptive open-loop subframe pitch lag estimation and vector quantization
EP0745971A2 (fr) Pitch lag estimation system using residual prediction coding
US4918734A (en) Speech coding system using variable threshold values for noise reduction
US5884251A (en) Voice coding and decoding method and device therefor
JP3687181B2 (ja) Voiced/unvoiced decision method and apparatus, and speech coding method
US4720865A (en) Multi-pulse type vocoder
JP3180786B2 (ja) Speech coding method and speech coding apparatus
US8195463B2 (en) Method for the selection of synthesis units
EP0421360B1 (fr) Method and apparatus for speech analysis by synthesis
KR100421648B1 (ko) Adaptive criteria for speech coding
EP1204092A2 (fr) Speech decoder with background noise reproduction
EP0745972B1 (fr) Speech coding method and apparatus
JP3531780B2 (ja) Speech coding method and decoding method
Yeldener et al. A mixed sinusoidally excited linear prediction coder at 4 kb/s and below
JP3490324B2 (ja) Acoustic signal coding apparatus, decoding apparatus, methods therefor, and program recording medium
Lee et al. Applying a speaker-dependent speech compression technique to concatenative TTS synthesizers
JPH0830299A (ja) Speech coding apparatus
JP2001318698A (ja) Speech coding apparatus and speech decoding apparatus
EP0713208B1 (fr) Fundamental frequency estimation system
JP3552201B2 (ja) Speech coding method and apparatus
JP3192051B2 (ja) Speech coding apparatus
Hernandez-Gomez et al. On the behaviour of reduced complexity code-excited linear prediction (CELP)

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19901002

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): DE FR GB SE

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): DE FR GB SE

17Q First examination report despatched

Effective date: 19940526

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE FR GB SE

REF Corresponds to:

Ref document number: 69024899

Country of ref document: DE

Date of ref document: 19960229

ET Fr: translation filed
REG Reference to a national code

Ref country code: FR

Ref legal event code: CA

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed
REG Reference to a national code

Ref country code: GB

Ref legal event code: IF02

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20070926

Year of fee payment: 18

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20071030

Year of fee payment: 18

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: SE

Payment date: 20071004

Year of fee payment: 18

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20070716

Year of fee payment: 18

EUG Se: european patent has lapsed
GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 20081002

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST

Effective date: 20090630

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20090501

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20081031

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20081002

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20081003