JPH0782360B2

JPH0782360B2 - Speech analysis and synthesis method

Info

Publication number: JPH0782360B2
Application number: JP1257503A
Authority: JP
Inventors: 雅彰誉田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1989-10-02
Filing date: 1989-10-02
Publication date: 1995-09-06
Anticipated expiration: 2010-09-06
Also published as: CA2026640A1; EP0421360A3; JPH03119398A; CA2026640C; DE69024899T2; EP0421360B1; EP0421360A2; DE69024899D1

Description

[Detailed Description of the Invention] "Field of Industrial Application" The present invention relates to a speech analysis-synthesis method that synthesizes a speech signal by driving a linear filter, which represents the spectral envelope of speech, with an excitation signal.

"Prior Art" Prior art related to the present invention includes the linear predictive vocoder and multi-pulse predictive coding. The linear predictive vocoder has been widely used as a speech coding method in the low bit rate range of 4.8 kb/s and below, in variants such as the PARCOR method and the line spectrum pair (LSP) method. Details of these methods are given, for example, in Saito and Nakata, "Fundamentals of Speech Information Processing" (Ohmsha). A linear predictive vocoder consists of an all-pole filter representing the spectral envelope of speech and a generator for the excitation signal that drives it. The excitation signal is a pitch-period pulse train for voiced sounds and white noise for unvoiced sounds. The excitation parameters are the voiced/unvoiced decision, the pitch period, and the amplitude of the excitation signal; they are extracted as average features of the speech signal over an analysis interval of about 30 ms. Because the vocoder synthesizes speech by temporally interpolating feature parameters extracted once per fixed analysis interval, it cannot reproduce the speech waveform with sufficient accuracy when the pitch period, amplitude, or spectral characteristics change rapidly. Furthermore, an excitation signal made up of a periodic pulse train and white noise is insufficient to reproduce the variety of speech waveform features, so it has been difficult to obtain highly natural synthesized speech. Improving the quality of synthesized speech in a linear predictive vocoder has therefore required an excitation source that reproduces the features of the speech waveform more faithfully.
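As a rough illustration of the conventional vocoder excitation described above, the following Python sketch builds one frame of excitation (a pitch-period pulse train for voiced frames, white noise for unvoiced frames) and drives an all-pole filter with it. The function names, the sign convention A(z) = 1 + sum a_i z^-i, and the example coefficient are assumptions made for illustration; they are not taken from the patent.

```python
import random

def vocoder_excitation(voiced, pitch_period, amplitude, n, seed=0):
    """One frame of classical vocoder excitation (illustrative only)."""
    if voiced:
        # Pitch-period pulse train: one pulse every `pitch_period` samples.
        return [amplitude if t % pitch_period == 0 else 0.0 for t in range(n)]
    rng = random.Random(seed)
    # White noise scaled by the frame amplitude.
    return [amplitude * rng.gauss(0.0, 1.0) for _ in range(n)]

def all_pole_filter(x, a):
    """y(t) = x(t) - sum_i a_i * y(t-i), i.e. 1/A(z) with A(z) = 1 + sum a_i z^-i."""
    y = []
    for t, xt in enumerate(x):
        acc = xt
        for i, ai in enumerate(a, start=1):
            if t - i >= 0:
                acc -= ai * y[t - i]
        y.append(acc)
    return y

frame = vocoder_excitation(voiced=True, pitch_period=4, amplitude=1.0, n=8)
speech = all_pole_filter(frame, a=[-0.5])  # hypothetical one-pole envelope
```

Because every pulse in the voiced frame has the same spacing and amplitude, the sketch also shows why such an excitation cannot follow rapid pitch or amplitude changes within a frame.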

Multi-pulse predictive coding, on the other hand, uses an excitation source with a higher reproduction capability than the conventional vocoder (Patent 1234567). In this method the excitation signal is represented by a plurality of pulses, and speech is synthesized by driving two all-pole filters that represent the short-term correlation and the pitch correlation of speech. The temporal position and amplitude of each pulse are determined so as to minimize the error between the input speech waveform and the synthesized speech waveform. Details are given in B. S. Atal, "A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates", IEEE Int. Conf. on ASSP, pp. 614-617, 1982. In multi-pulse predictive coding, speech quality can be improved by increasing the number of pulses; conversely, when the bit rate is lowered the number of pulses is limited, the reproduction accuracy of the speech waveform deteriorates, and sufficient speech quality can no longer be obtained. About 8 kb/s of information has been required for good speech quality.

Whereas multi-pulse predictive coding determines the excitation so as to reproduce the input speech waveform itself, a method has been proposed, as seen in the embodiment of Japanese Patent Application No. Sho 59-53757, "Speech Signal Processing Method", in which multi-pulse predictive coding is applied to a phase-equalized speech signal obtained by equalizing the phase component of the speech waveform to a fixed phase. Because this method removes from the speech waveform the phase component to which hearing is insensitive, the excitation signal can be reproduced with fewer pulses, and speech quality at low bit rates is improved. Even with this method, however, when the bit rate drops to about 4.8 kb/s the number of pulses becomes insufficient, the features of the speech waveform cannot be adequately reproduced, and high-quality speech could not be obtained.

An object of the present invention is to provide a high-quality speech analysis-synthesis method in the boundary region (2.4-4.8 kb/s) between the linear predictive vocoder and waveform coding.

"Means for Solving the Problem" The present invention is characterized in that the excitation signal for voiced sounds used in speech analysis-synthesis is represented by a quasi-periodic pulse train, in which the magnitude of the pitch-period fluctuation is limited, together with an all-zero filter that characterizes the prediction residual of phase-equalized speech, and in that the parameters constituting the excitation signal, namely the temporal positions and amplitudes of the pulses and the coefficients of the all-zero filter, are determined so as to minimize the error between the speech waveform synthesized from this excitation signal and the phase-equalized input speech waveform.

Whereas a conventional vocoder uses as its excitation signal a periodic pulse train generated from the pitch period and amplitude obtained for each fixed analysis interval, the present invention determines the position and amplitude of a pulse for every pitch period and newly introduces an all-zero filter, thereby improving the reproducibility of the speech waveform. Whereas conventional multi-pulse predictive coding represents the excitation signal of one pitch period by a plurality of pulses, the present invention represents it by one pulse per pitch period and an all-zero filter set for each fixed analysis interval, thereby reducing the amount of information in the excitation signal.

Furthermore, whereas conventional methods use the error relative to the input speech waveform as the criterion for determining the excitation parameters, the present invention uses the error relative to the phase-equalized speech waveform. Using an error measure defined on the phase-equalized speech waveform makes it possible to improve the match between the speech waveform synthesized from the excitation signal of the present invention and the input speech waveform. Because the phase-equalized speech waveform and the synthesized speech waveform are close to each other, determining the excitation parameters by comparing these two waveforms allows the number of excitation parameters to be reduced. Finally, the present invention differs from the conventional method combining phase equalization with multi-pulse predictive coding in the excitation signal it uses and in the way the excitation parameters are determined.

"Embodiment" FIG. 1 shows the configuration of the speech analysis-synthesis method according to the present invention. A sampled digital speech signal s(t) is supplied to input terminal 1. The linear prediction analysis unit 2 first stores N samples of the speech signal in a data buffer and performs linear prediction analysis on them to compute the prediction coefficients a_i (i = 1, 2, ..., p); the prediction coefficients a_i are then quantized by quantizer 3. A prediction residual signal is also obtained with an inverse filter whose coefficients are the prediction coefficients, and the voiced/unvoiced decision VU is made by a level decision on the maximum value of the autocorrelation coefficients of the prediction residual signal. Details of these processing methods are described in the above-mentioned book by Saito et al.
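The analysis steps in this paragraph (linear prediction, inverse filtering, and a voiced/unvoiced decision from the residual autocorrelation peak) can be sketched as follows. This is a minimal sketch assuming the autocorrelation method with the Levinson-Durbin recursion and the convention A(z) = 1 + sum a_i z^-i; the patent defers the actual procedure to the cited book, and the lag range and threshold of 0.3 are hypothetical values chosen only for the example.

```python
def autocorr(x, lag):
    """Autocorrelation of x at the given lag (no windowing)."""
    return sum(x[t] * x[t - lag] for t in range(lag, len(x)))

def levinson_durbin(x, p):
    """Prediction coefficients a_1..a_p for A(z) = 1 + sum a_i z^-i.

    Assumes x is not all zeros (r[0] > 0)."""
    r = [autocorr(x, k) for k in range(p + 1)]
    a = [1.0] + [0.0] * p
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # remaining prediction error
    return a[1:]

def inverse_filter(x, a):
    """Prediction residual e(t) = x(t) + sum_i a_i x(t-i)."""
    return [x[t] + sum(ai * x[t - i] for i, ai in enumerate(a, start=1) if t - i >= 0)
            for t in range(len(x))]

def is_voiced(e, min_lag, max_lag, threshold=0.3):
    """Voiced if the peak normalized residual autocorrelation exceeds the threshold."""
    r0 = autocorr(e, 0)
    if r0 == 0.0:
        return False
    peak = max(autocorr(e, lag) for lag in range(min_lag, max_lag + 1))
    return peak / r0 > threshold

x = [0.5 ** t for t in range(50)]           # response of a one-pole filter
a = levinson_durbin(x, 1)
e = inverse_filter(x, [-0.5])
```

For this toy signal the recovered coefficient is close to -0.5 and the residual collapses to a single impulse, which is the behaviour the inverse filter is meant to produce.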

The phase equalization analysis unit 4 computes the coefficients of a phase equalization filter that makes the phase characteristic of speech zero-phase, together with the reference time points for phase equalization. FIG. 2 shows the detailed configuration of the phase equalization analysis unit 4. The speech signal s(t) is fed to inverse filter 31 to obtain the prediction residual e(t), which is supplied to the maximum amplitude position detector 32 and to the phase equalization filter 37. Switch 33 is normally set to the output of the amplitude comparator 38, and is set to the output of the maximum amplitude position detector 32 only when the current analysis frame is voiced and the preceding analysis frame is unvoiced. In that case the maximum amplitude position detector 32 detects the time point t'_0 at which the amplitude of the prediction residual is maximal; this time point is supplied to the filter coefficient calculator 34, and the coefficients of the phase equalization filter are obtained by the following equation.

Switch 33 then returns to the output of the amplitude comparator 38, and the output of the amplitude comparator 38 is supplied to the filter coefficient calculator 34.

When the frame is voiced, the filter coefficient calculator 34 computes the coefficients for each reference time point t_i by the following equation, in the same way as the equation above.

When the frame is unvoiced, the coefficients are set as follows.

The output of the filter coefficient calculator 34 is supplied to the smoothing unit 35, where the phase equalization filter coefficients h^*(m) are smoothed in time using, for example, a first-order filter such as the following.

h_t(m) = b h_{t-1}(m) + (1 − b) h^*(m), t_{i-1} < t ≤ t_i

Here the coefficient b is set to a value of about 0.97. The filter coefficient holding unit 36 holds the value h_{t_i}(m) of the smoothed filter coefficients h_t(m) at each reference time point and controls the phase equalization filter 37. The prediction residual e(t) is supplied to the phase equalization filter 37, which outputs the phase-equalized prediction residual e_p(t) according to the following equation.
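The first-order smoothing of the phase equalization filter coefficients can be illustrated with a few lines of Python. This is only a sketch: the function name, the three-tap filter, and the starting and target coefficient values are hypothetical, while b = 0.97 is the value suggested in the text.

```python
def smooth_coefficients(h_prev, h_star, b=0.97):
    """One step of h_t(m) = b*h_{t-1}(m) + (1-b)*h*(m) for every tap m."""
    return [b * hp + (1.0 - b) * hs for hp, hs in zip(h_prev, h_star)]

h = [0.0, 0.0, 0.0]          # hypothetical coefficients before the update
target = [0.0, 1.0, 0.0]     # hypothetical newly computed h*(m)
for _ in range(100):         # 100 samples between two reference time points
    h = smooth_coefficients(h, target)
```

With b = 0.97 the coefficients move most of the way to the new target within about one hundred samples, so the filter changes gradually rather than jumping at each reference time point.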

The amplitude comparator 38 compares the amplitude level of the phase-equalized prediction residual e_p(t) with a threshold and, when the level exceeds the threshold, detects that time point as the next reference time point t'_i.
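The threshold comparison can be sketched as below. The text does not say whether the amplitude level is signed, or how neighbouring samples that both exceed the threshold are handled, so the use of the absolute value and the `min_gap` parameter are assumptions introduced for this example.

```python
def detect_reference_points(e_p, threshold, min_gap=1):
    """Sample indices where |e_p(t)| exceeds the threshold.

    min_gap (hypothetical) is the minimum spacing between accepted points."""
    points = []
    for t, v in enumerate(e_p):
        if abs(v) > threshold and (not points or t - points[-1] >= min_gap):
            points.append(t)
    return points

residual = [0.1, 0.9, 0.8, 0.0, 0.0, 1.2, 0.1]   # toy phase-equalized residual
refs = detect_reference_points(residual, threshold=0.5, min_gap=3)
```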

As shown in FIG. 1, the filter coefficients h_t(m) obtained by the phase equalization analysis unit 4 control the phase equalization filter 5. Feeding the speech signal s(t) to the phase equalization filter 5 yields the phase-equalized speech signal s_p(t) at its output.

Next, the excitation parameter analysis unit 30 is described. This analysis-synthesis method uses separate excitation sources for voiced and unvoiced sounds, and switch 17 is switched according to the voiced/unvoiced parameter VU. The excitation source for voiced sounds consists of the pulse sequence generator 7 and the all-zero filter 10.

The pulse sequence generator 7 generates a quasi-periodic pulse train as shown in FIG. 3. The quasi-periodic pulse train is described by the temporal position (pulse position) t_i and the amplitude m_i of each pulse. Pulse positions are controlled by the pulse position generator 6 and pulse amplitudes by the pulse amplitude calculator 8. The pulse positions are constrained so that their spacing is quasi-periodic. That is, the pulse position interval T_i = t_i − t_{i-1} in FIG. 3 is constrained by the following conditions so that the difference between successive intervals does not exceed a fixed value and the sum of these differences within the analysis frame does not exceed a fixed value.

Condition 1: ΔT_i = |T_i − T_{i-1}| ≤ J

Condition 2: the sum of the differences ΔT_i within the analysis frame does not exceed J_sum

Here n_p is the number of pulses in the analysis frame, and J and J_sum are constants. Based on the reference time points t'_i obtained by the phase equalization analysis unit 4, the pulse position generator 6 generates a sequence of pulse positions satisfying the above constraints. FIG. 4 shows the procedure for generating a pulse position sequence from the reference time points. First, condition 1 is tested on the differences between the intervals derived from the reference time points; where condition 1 is not met, pulse positions are inserted, removed, or corrected according to the procedure of FIG. 4. If, as a result, all reference time points satisfy condition 1, condition 2 is tested; if condition 2 is satisfied, the reference time points are used as the pulse positions. If condition 2 is not satisfied, all pulse position sequences in the neighborhood of the reference time points that satisfy condition 2 are generated as candidates.

If condition 1 is not satisfied, the number of reference time points is compared with the maximum possible number N_P; when it is smaller than the maximum number of pulses, the reference time points are used as the pulse positions as they are. When the number of reference time points exceeds the maximum number of pulses, all combinations of pulse positions that select the maximum number of pulses from among the reference time points are generated. When there are several candidate pulse position sequences, the waveform distortion calculator 19 computes, for each candidate, the error between the synthesized speech waveform and the phase-equalized input speech waveform, and the distortion decision unit 20 selects the pulse positions that minimize the error.
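The two quasi-periodicity constraints can be checked with a short helper. This sketch assumes that condition 1 bounds each difference ΔT_i by J and that condition 2 bounds the sum of the differences in the frame by J_sum, as the prose describes; the function names and the example values are hypothetical, and the interval carried over from the previous frame is ignored for simplicity.

```python
def intervals(positions):
    """Pulse position intervals T_i = t_i - t_{i-1}."""
    return [b - a for a, b in zip(positions, positions[1:])]

def satisfies_conditions(positions, J, J_sum):
    """Return (condition 1 holds, condition 2 holds) for one frame."""
    T = intervals(positions)
    deltas = [abs(b - a) for a, b in zip(T, T[1:])]   # delta T_i
    cond1 = all(d <= J for d in deltas)
    cond2 = sum(deltas) <= J_sum
    return cond1, cond2

positions = [0, 50, 101, 150, 202]   # toy reference time points (samples)
ok = satisfies_conditions(positions, J=5, J_sum=10)
```

A candidate generator in the spirit of FIG. 4 would call such a check on each modified position sequence and keep only the sequences for which both flags are true.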

The pulse amplitude calculator 8 determines the amplitude of each pulse so as to minimize the frequency-weighted mean square error between the synthesized speech waveform and the phase-equalized input speech waveform. FIG. 5 shows the internal configuration of the pulse amplitude calculator 8. The phase-equalized input speech waveform s_p(t) is supplied to the frequency weighting filter 39, which suppresses the strong frequency components of the speech spectrum and whose transfer characteristic is expressed as follows.

Here a_i are the linear prediction coefficients and z^-1 denotes a one-sample delay. γ is a parameter that controls the degree of suppression; it takes values in the range 0 < γ ≤ 1, and the smaller its value, the stronger the suppression. Values of 0.7 to 0.9 are normally used. The frequency weighting filter 39 passes the phase-equalized speech signal through the weighting filter and subtracts from the output the zero-input response of the filter 1/A(γz), started from the initial state given by the synthesized speech of the preceding analysis frame, to obtain the signal s_w(t). Meanwhile, the linear prediction coefficients a_i are supplied to the impulse response calculator 40, which computes the impulse response f(t) of the filter with transfer characteristic 1/A(γz). For each pulse position t_i, correlator 41 computes the cross-covariance ψ(i) between the impulse response f(t − t_i) and the frequency-weighted signal s_w(t) by the following equation.

Correlator 42 computes the autocovariance φ(i, j) of the impulse response for each pair of pulse positions t_i, t_j by the following equation.

The pulse amplitude calculator 43 obtains the pulse amplitudes from ψ(i) and φ(i, j) by solving the following simultaneous equations.
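The least-squares computation of the pulse amplitudes can be sketched as follows. The equations themselves are not reproduced in this text, so the sketch assumes the usual forms: ψ(i) is the inner product of s_w(t) with f(t − t_i), φ(i, j) is the inner product of f(t − t_i) with f(t − t_j), and the amplitudes solve sum_j φ(i, j) m_j = ψ(i). It also assumes that 1/A(γz) means the all-pole filter with coefficients a_i γ^i. All function names are hypothetical.

```python
def impulse_response(a, gamma, n):
    """f(t) of the assumed filter 1/(1 + sum a_i gamma^i z^-i), t = 0..n-1."""
    f = []
    for t in range(n):
        acc = 1.0 if t == 0 else 0.0
        for i, ai in enumerate(a, start=1):
            if t - i >= 0:
                acc -= ai * (gamma ** i) * f[t - i]
        f.append(acc)
    return f

def shifted(f, shift, n):
    """f(t - shift) sampled on t = 0..n-1 (zero before the pulse)."""
    return [f[t - shift] if 0 <= t - shift < len(f) else 0.0 for t in range(n)]

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def solve(A, b):
    """Gaussian elimination with partial pivoting (A assumed nonsingular)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def pulse_amplitudes(s_w, positions, f):
    """Amplitudes m_i minimizing the squared error for fixed pulse positions."""
    n = len(s_w)
    basis = [shifted(f, t_i, n) for t_i in positions]
    psi = [dot(s_w, u) for u in basis]                   # cross-covariance
    phi = [[dot(u, w) for w in basis] for u in basis]    # autocovariance
    return solve(phi, psi)

f = impulse_response([-0.9], 0.8, 40)
s_w = [2.0 * u - 1.0 * w for u, w in zip(shifted(f, 3, 40), shifted(f, 20, 40))]
m = pulse_amplitudes(s_w, [3, 20], f)
```

In this toy case the target was built from two pulses of amplitude 2 and -1, and the solver recovers those amplitudes.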

The pulse amplitudes are quantized by quantizer 9 in FIG. 1, for example by vector quantization. With vector quantization, the vector whose elements are the pulse amplitudes (the amplitude pattern) is compared with a plurality of standard pulse amplitude patterns and quantized to the standard pattern at minimum distance. The distance measure between amplitude patterns is the mean square error between the speech waveform synthesized from the standard pulse amplitude pattern without the all-zero filter and the phase-equalized input speech waveform. Writing the amplitude pattern vector as m = (m_1, m_2, ..., m_{n_p})^t (t denotes transposition) and the standard pattern vectors as m_ci (i = 1, 2, ..., N_c), the mean square error is given by the following equation.

d(m, m_ci) = (m − m_ci)^t Φ (m − m_ci)

Here Φ is the matrix whose elements are the autocovariances φ(i, j) of the impulse response. The quantized amplitude pattern is then obtained, by the following equation, as the standard pattern that minimizes the mean square error.
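The weighted-distance vector quantization can be illustrated with a small example. The quadratic form is the one given in the text; the selection rule is assumed to be a plain minimum over the codebook, and the matrix, amplitude vector, and codebook entries are made-up values.

```python
def weighted_distance(m, c, Phi):
    """d(m, c) = (m - c)^t Phi (m - c)."""
    d = [x - y for x, y in zip(m, c)]
    n = len(d)
    return sum(d[i] * Phi[i][j] * d[j] for i in range(n) for j in range(n))

def quantize(m, codebook, Phi):
    """Index of the standard pattern with the smallest weighted distance."""
    return min(range(len(codebook)),
               key=lambda i: weighted_distance(m, codebook[i], Phi))

Phi = [[2.0, 0.5], [0.5, 1.0]]                    # toy autocovariance matrix
m = [1.0, 0.2]                                    # toy amplitude pattern
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy standard patterns
index = quantize(m, codebook, Phi)
```

Because Φ weights the error by the energy of the synthesized impulse responses, the chosen pattern minimizes the error in the synthesized waveform rather than the plain Euclidean distance between amplitude vectors.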

The all-zero filter 10 characterizes the prediction residual waveform after phase equalization; its coefficients are controlled by the all-zero filter coefficient calculator 11. FIG. 6 shows an example of a phase-equalized prediction residual waveform and the corresponding impulse response of the all-zero filter 10. Because its spectral envelope is flat and its phase is close to zero phase, the phase-equalized prediction residual is impulse-like: it has a large amplitude at each pulse position and a comparatively small amplitude elsewhere. The waveform is also nearly symmetric about each pulse position and about the midpoint between adjacent pulse positions. As FIG. 6 also shows, the amplitude at the midpoint between pulse positions is often larger than in the other intervals, and this tendency is stronger for speech with a long pitch period. As shown in FIG. 6, the all-zero filter 10 is set so that its impulse response takes values at q time points on either side of the pulse position and at r time points on either side of the midpoint between pulse positions. The transfer characteristic of the all-zero filter 10 is then expressed as follows.

For given pulse positions and pulse amplitudes, the all-zero filter coefficient calculator 11 computes the filter coefficients v_k so as to minimize the frequency-weighted mean square error between the synthesized speech waveform and the phase-equalized input speech waveform. FIG. 7 shows the configuration of the filter coefficient calculator 11. The frequency weighting filter 44 and the impulse response calculator 45 have the same configurations as the frequency weighting filter 39 and the impulse response calculator 40 of FIG. 5, respectively. Adder 46 sums the impulse responses f(t) according to the following equation.

Correlator 47 computes the cross-covariance ψ(i) between the signals s_w(t) and u_i(t), and correlator 48 computes the autocovariance φ(i, j) between the signals u_i(t) and u_j(t). The filter coefficient calculator 49 computes the coefficients v_i of the all-zero filter 10 from ψ(i) and φ(i, j) by solving the following simultaneous equations.

The filter coefficients v_i are quantized by quantizer 12 in FIG. 1, for example by vector quantization. With vector quantization, the vector whose elements are the filter coefficients is compared with a plurality of standard patterns and quantized to the standard pattern at minimum distance. As with the vector quantization of the pulse amplitudes, when the mean square error between the synthesized speech waveform and the phase-equalized input speech waveform is taken as the distance measure, the quantized filter coefficients are obtained by the following equation.

d(v, v_ci) = (v − v_ci)^t Φ (v − v_ci)

Here v is the vector whose elements are the filter coefficients and v_ci is a standard pattern vector. Φ is the matrix whose elements are the autocovariances φ(i, j) of the impulse responses u_i(t).

To summarize, in voiced intervals speech is synthesized by passing the quasi-periodic pulse train, defined by the pulse positions and amplitudes, through the all-zero filter 10 and using the result as the excitation signal that drives the all-pole filter 18 characterizing the spectral envelope of speech. Among the excitation parameters, the pulse amplitudes and the all-zero filter coefficients are set, for given pulse positions, to the optimum values that minimize the error between the synthesized speech waveform and the phase-equalized input speech waveform. When there are several candidate pulse position sequences, this error is computed for each candidate and the optimum pulse positions, those with the smallest error, are determined by exhaustive search.

Next, the excitation source for unvoiced intervals is described. In unvoiced intervals a random pattern is used as the excitation signal, as in code-excited predictive coding (Schroeder et al., "Code excited linear prediction (CELP)", IEEE Int. Conf. on ASSP, pp. 937-940, 1985). The random pattern generator 13 of FIG. 1 stores a plurality of patterns, each consisting of a number of samples of normal random numbers with mean 0 and variance 1. For each random pattern, the random amplitude calculator 15 computes the optimum gain that minimizes the error between the synthesized speech waveform and the phase-equalized input speech waveform, and the gain quantized by quantizer 16 controls the gain amplifier 14. Then the error between the synthesized speech and the phase-equalized speech is computed for each random pattern, the optimum random pattern, the one with the smallest error, is found by exhaustive search, and this sequence of random patterns is supplied through the gain amplifier 14 to the all-pole filter 18 as the excitation signal.
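The exhaustive search over random patterns can be sketched as follows. This is a simplified illustration: the optimum gain is taken to be the standard least-squares gain, gain quantization and frequency weighting are omitted, filter memory from the previous frame is ignored, and the codebook size, pattern length, and filter coefficient are hypothetical.

```python
import random

def all_pole(x, a):
    """y(t) = x(t) - sum_i a_i y(t-i), assuming A(z) = 1 + sum a_i z^-i."""
    y = []
    for t, xt in enumerate(x):
        acc = xt
        for i, ai in enumerate(a, start=1):
            if t - i >= 0:
                acc -= ai * y[t - i]
        y.append(acc)
    return y

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def best_gain_and_error(target, y):
    """Least-squares gain g and squared error of g*y against the target."""
    yy = dot(y, y)
    g = dot(target, y) / yy if yy > 0.0 else 0.0
    err = sum((s - g * v) ** 2 for s, v in zip(target, y))
    return g, err

def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the pattern with the smallest error."""
    best = None
    for idx, pattern in enumerate(codebook):
        g, err = best_gain_and_error(target, all_pole(pattern, a))
        if best is None or err < best[2]:
            best = (idx, g, err)
    return best

rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(8)]
a = [-0.8]                                            # hypothetical envelope
target = [1.5 * v for v in all_pole(codebook[3], a)]  # toy target frame
idx, gain, err = search_codebook(target, codebook, a)
```

Here the target was generated from pattern 3 with gain 1.5, so the search returns that pattern and gain with essentially zero error.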

Through the above procedure, the speech signal is represented by the linear prediction coefficients a_i and the voiced/unvoiced parameter VU; for voiced sound, by the pulse positions t_i, the pulse amplitudes m_i, and the zero filter coefficients v_i; and for unvoiced sound, by the random code pattern (number) c_i and the gain g_i. These speech parameters are encoded by the encoding unit 21 and then transmitted or stored. In the speech synthesis section, the speech parameters are decoded by the decoding unit 22. For voiced sound, the pulse train generated by the pulse sequence generation unit 23 from the pulse positions t_i and the pulse amplitudes m_i is passed through the zero filter 24 to produce the driving sound source signal. For unvoiced sound, the random pattern generation unit 25 selects and generates a random pattern according to the random code pattern (number) c_i, and this pattern is passed through the amplifier 26, which is controlled by the gain g_i, to set its amplitude and produce the driving sound source signal. The switch 27, which is switched according to the voiced/unvoiced decision, selects one of the two driving sound source signals. The selected signal drives the all-pole filter 28, and the synthesized speech is output at its output terminal 29. The filter coefficients of the zero filter 24 are controlled by v_i, and those of the all-pole filter 28 by a_i.
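
The synthesis path just described (units 23 to 28) can be sketched as below. The code is illustrative only: the function names, the sign convention A(z) = 1 + Σ a_k z^-k, the centering of the zero filter taps on each pulse, and the zero initial filter state are assumptions, not details confirmed by the text.

```python
import numpy as np

def all_pole_filter(excitation, a):
    """All-pole filter 28: apply 1/A(z), A(z) = 1 + sum_k a[k-1] z^-k (assumed sign)."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * out[n - k]
        out[n] = acc
    return out

def voiced_excitation(n_samples, t, m, v):
    """Pulse sequence generation (23) followed by the zero filter (24)."""
    pulses = np.zeros(n_samples)
    for pos, amp in zip(t, m):
        pulses[pos] += amp
    # mode="same" keeps the frame length and centers an odd-length tap vector
    # on each pulse (assumed alignment of the zero filter).
    return np.convolve(pulses, v, mode="same")

def unvoiced_excitation(codebook, c, g):
    """Random pattern generation (25) followed by the gain amplifier (26)."""
    return np.concatenate([gain * codebook[idx] for idx, gain in zip(c, g)])

def synthesize_frame(vu, a, n_samples=120, t=(), m=(), v=(1.0,),
                     codebook=None, c=(), g=()):
    """Switch 27 selects the excitation; the all-pole filter 28 produces the output."""
    if vu:
        excitation = voiced_excitation(n_samples, t, m, np.asarray(v, dtype=float))
    else:
        excitation = unvoiced_excitation(codebook, c, g)
    return all_pole_filter(excitation, a)
```

For example, `synthesize_frame(True, [-0.5], n_samples=8, t=[0], m=[1.0], v=[0.0, 1.0, 0.0])` passes a single unit pulse through an identity zero filter and a one-pole filter, giving a decaying exponential.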

Modified examples. The driving sound source is not distinguished between voiced and unvoiced sound, and the pulse driving sound source is used in both cases. In this case, the quality of fricative consonants degrades slightly, but the processing configuration is simple, the amount of processing is reduced, and the hardware scale is small. In addition, since the voiced/unvoiced parameter does not need to be transmitted, the bit rate is reduced by 60 bits per second.

A configuration in which the pulse driving sound source does not include the zero filter. With this method, the naturalness of the synthesized speech degrades slightly, particularly for male voices with a low pitch frequency, but removing the zero filter reduces the hardware scale, and the bit rate is reduced by the 600 bits per second required to encode the filter coefficients.

A configuration in which the processing of the pulse amplitude calculation unit 8 and the vector quantization unit 9 is integrated so that the quantized value of the pulse amplitude is calculated directly. The configuration for this method is shown in FIG. 8. The frequency weighting filter 50, the impulse response calculation unit 51, the correlator 52, and the correlator 53 have the same configurations as their counterparts in FIG. 5 of Embodiment 1. For each pulse amplitude standard pattern m_ci (i = 1, 2, ..., N_c) stored in the pattern codebook 55, the pulse amplitude quantization unit 54 calculates the mean square error between the speech waveform synthesized with that standard pattern and the phase-equalized input speech waveform, and finds the standard pattern that gives the smallest error. The distance calculation is performed according to the following equation.

Here, Φ is the matrix whose elements are the autocovariances φ(i, j) of the impulse response f(t), and ψ is the column vector whose elements are the cross-covariances ψ(i) (i = 1, 2, ..., n_P) between the impulse response and the output s_w(t) of the frequency weighting filter.
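
The distance equation itself is not reproduced in this text. One form consistent with the definitions of Φ and ψ comes from expanding the squared error ||s_w − F m||², where the columns of F hold the impulse response shifted to each pulse position: with Φ = FᵀF and ψ = Fᵀs_w, the part that depends on m is d = mᵀΦm − 2mᵀψ. The sketch below searches a codebook with that assumed distance; the function names and the matrix F are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def response_matrix(f, positions, n):
    """Column j holds the impulse response f shifted to pulse position positions[j]."""
    F = np.zeros((n, len(positions)))
    for j, t in enumerate(positions):
        seg = f[: n - t]  # truncate the response at the frame end
        F[t : t + len(seg), j] = seg
    return F

def search_amplitude_codebook(codebook, Phi, psi):
    """Return (best index, distances) for d = m^T Phi m - 2 m^T psi (assumed form).

    codebook : array of shape (N_c, n_P), one amplitude standard pattern per row
    Phi      : (n_P, n_P) autocovariance matrix of the impulse response
    psi      : (n_P,) cross-covariance vector with the weighted speech s_w
    """
    quad = np.einsum("ci,ij,cj->c", codebook, Phi, codebook)
    d = quad - 2.0 * (codebook @ psi)
    return int(np.argmin(d)), d
```

Because d differs from the true squared error only by the constant s_wᵀs_w, the pattern that minimizes d also minimizes the waveform error, and no linear system has to be solved.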

FIG. 8 and FIG. 5 require roughly the same amount of processing to obtain the optimal pulse amplitudes, but FIG. 8 does not need to solve the simultaneous equations that are part of the processing in FIG. 5, so its processing configuration is simpler. However, in FIG. 5 the optimal pulse amplitudes can be scalar-quantized after they are obtained, whereas FIG. 8 presupposes vector quantization as the quantization method.
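
For contrast, the FIG. 5 style route (solve first, quantize afterwards) might look like the sketch below. It assumes that the simultaneous equations are the least-squares normal equations Φm = ψ and that the scalar quantizer is a uniform one with a fixed step; both are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np

def optimal_amplitudes(Phi, psi):
    """Solve the simultaneous equations Phi m = psi for the unquantized amplitudes."""
    return np.linalg.solve(Phi, psi)

def scalar_quantize(m, step):
    """Uniform scalar quantizer (assumed here purely for illustration)."""
    return step * np.round(np.asarray(m) / step)

# Small usage example with made-up numbers.
Phi = np.array([[2.0, 0.5], [0.5, 1.0]])
psi = Phi @ np.array([1.1, -0.4])
m_opt = optimal_amplitudes(Phi, psi)     # recovers approximately [1.1, -0.4]
m_q = scalar_quantize(m_opt, step=0.25)  # each amplitude quantized independently
```

The design trade-off is the one the text names: this route needs a linear solve but leaves the choice of quantizer open, while the integrated search avoids the solve but only works with a vector codebook.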

In the same manner as in FIG. 8, the calculation of the zero filter coefficients can also be integrated with vector quantization so that the quantized coefficient values are calculated directly.

"Effect of the Invention" To examine the effect of the speech analysis and synthesis method according to the present invention, an analysis-synthesis experiment was conducted under the following conditions. Speech in the 0-4 kHz band was sampled at a sampling frequency of 8 kHz, the speech signal was multiplied by a Hamming window with an analysis window length of 30 ms, and linear prediction analysis by the autocorrelation method was performed with an analysis order of 12 to obtain 12 prediction coefficients and the voiced/unvoiced parameter. The analysis frame length for encoding was 15 ms (120 speech samples). The prediction coefficients were quantized using differential multistage vector quantization, with a frequency-weighted cepstrum distance as the distance measure. At a bit rate of 4.8 kb/s, the number of bits per frame is 72, broken down as follows.

The constants J and J_sum, which represent the allowable range of fluctuation of the pulse period in the pulse sound source, and the maximum number of pulses N_P used when the fluctuation falls outside the allowable range, are determined by the number of bits allocated to encoding the pulse positions. When the pulse positions are encoded with 29 bits per frame, the difference ΔT between adjacent pulse periods is at most 5 samples, and its sum within the frame is at most 14 samples. The maximum number of pulses when the fluctuation falls outside the allowable range is 5. A 7th-order zero filter (q = r = 1) was used. Each random pattern vector consists of 40 samples (5 ms) and is selected from 512 patterns (9 bits). The random pattern amplitude is quantized with 6 bits, including the sign.
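
The figures quoted for the experiment are mutually consistent, which a few lines of arithmetic can confirm. The constant names below are introduced only for this check; every value comes from the conditions stated above.

```python
SAMPLE_RATE_HZ = 8000      # sampling frequency
FRAME_MS = 15              # analysis frame length for encoding
BITS_PER_FRAME = 72        # total bits per frame at 4.8 kb/s
PATTERN_SAMPLES = 40       # length of one random pattern vector
PATTERN_COUNT = 512        # number of random patterns in the codebook

# Samples per frame: 8000 samples/s * 15 ms = 120 samples.
frame_samples = SAMPLE_RATE_HZ * FRAME_MS // 1000

# Bit rate: 72 bits every 15 ms = 4800 bits/s.
bit_rate_bps = BITS_PER_FRAME * 1000 // FRAME_MS

# Duration of one random pattern: 40 samples at 8 kHz = 5 ms.
pattern_ms = PATTERN_SAMPLES * 1000 / SAMPLE_RATE_HZ

# Index bits needed to address 512 patterns.
pattern_index_bits = (PATTERN_COUNT - 1).bit_length()

print(frame_samples, bit_rate_bps, pattern_ms, pattern_index_bits)
```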

Speech encoded under the above conditions is far more natural than that of a conventional vocoder, and its quality is close to that of the original speech. Its quality also depends less on the speaker than with a conventional vocoder. It was further confirmed that the quality of the coded speech is clearly higher than that of conventional multipulse predictive coding and code-excited predictive coding. The spectral envelope distortion of speech coded at 4.8 kb/s is about 1 dB. The time delay introduced by coding is 45 ms, which is comparable to or less than that of conventional methods in the low bit rate region.

The effect of the present invention is that, by expressing the driving sound source signal for voiced sound as a quasi-periodic pulse train, the waveform information of the speech is reproduced more faithfully than with a conventional vocoder, and the driving sound source signal can be expressed with less information than in conventional multipulse predictive coding. In addition, because the parameters of this driving sound source signal are estimated from the input speech using the error with respect to the phase-equalized speech waveform as the evaluation measure, the synthesized speech waveform matches the input speech waveform more closely than with conventional methods that use the error with respect to the input speech itself, and the sound source parameters can be estimated more accurately. Furthermore, the zero filter reproduces fine features of the speech spectrum, which improves the naturalness of the synthesized speech.

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of the analysis and synthesis method according to the present invention; FIG. 2 is a block diagram showing a configuration example of the phase equalization analysis unit 4; FIG. 3 is an explanatory diagram of the quasi-periodic pulse driving sound source signal; FIG. 4 is a flowchart of the processing for generating pulse positions; FIG. 5 is a block diagram showing a configuration example of the pulse amplitude calculation unit 8; FIG. 6 is an explanatory diagram of the zero filter; FIG. 7 is a block diagram showing a configuration example of the zero filter coefficient calculation unit 11; and FIG. 8 is a block diagram showing another configuration example of the pulse amplitude calculation unit 8.

Claims

[Claims]

1. A speech analysis and synthesis method using a linear filter that represents a speech spectral envelope characteristic and a sound source signal generating section that drives the linear filter, wherein the sound source signal is expressed as a quasi-periodic pulse train in which the magnitude of fluctuation of the pitch period is limited; the parameters constituting the sound source signal are determined so as to minimize the error between the synthesized speech waveform and a phase-equalized speech waveform obtained by pitch-synchronously converting the phase of the input speech to zero phase; and a speech signal is synthesized by driving the linear filter representing the speech spectral envelope characteristic with the sound source signal.

2. The speech analysis and synthesis method according to claim 1, wherein said sound source signal is used for voiced sound; for unvoiced sound, a random number sequence that is selected from a plurality of random number patterns and has its average power set is used as the sound source signal; and the parameters constituting the sound source signal for unvoiced sound are determined so as to minimize the error between the phase-equalized speech waveform and the synthesized speech waveform.

3. The speech analysis and synthesis method according to claim 1, wherein the sound source signal expressed as a quasi-periodic pulse train in which the fluctuation of the pitch period is limited is supplied to the linear filter through a zero filter that characterizes the fine structure of the speech spectrum, and the coefficients of the zero filter are determined so as to minimize the error between the phase-equalized speech waveform and the synthesized speech waveform.