JPH04346400A

JPH04346400A - Voice analysis/synthesis method

Info

Publication number: JPH04346400A
Application number: JP3119965A
Authority: JP
Inventors: Masaaki Yoda; 雅彰誉田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1991-05-24
Filing date: 1991-05-24
Publication date: 1992-12-02

Abstract

PURPOSE: To provide a speech analysis/synthesis method that can analyze and synthesize high-quality speech at a low bit rate and is robust against ambient noise. CONSTITUTION: A pulse sequence generation section 10 generates a quasi-periodic pulse train in which the magnitude of the pitch-period fluctuation of the input speech is limited, and a noise code selected from a noise codebook 12 is converted into a periodic noise sequence by a pitch correlation filter 13. The sum of the quasi-periodic pulse train and the noise sequence is used as the excitation signal to drive an all-pole filter 19 having the spectral envelope characteristics of the speech, thereby synthesizing a speech signal $s'(t)$. The amplitudes of the quasi-periodic pulse train and of the noise sequence, and the pattern of the noise sequence, are determined so as to minimize the error between the synthesized speech signal $s'(t)$ and a speech signal $s_p(t)$ obtained by converting the input speech to zero phase pitch-synchronously with a phase equalization filter 4.

Description

[Detailed description of the invention]

[0001]

[Industrial Application Field] This invention relates to high-efficiency speech coding for providing high-quality speech with a small amount of information, and in particular to a speech analysis/synthesis method that achieves high-quality speech coding at bit rates of 2.4 to 4.8 kb/s, the boundary region between conventional speech analysis/synthesis systems known as vocoders and waveform coding.

[0002]

[Prior Art] Prior art related to this invention includes the linear predictive vocoder and code-excited linear prediction (CELP) coding. The linear predictive vocoder has been widely used as a speech coding method in the low-bit-rate region of 4.8 kb/s and below, and includes schemes such as the PARCOR method and the line spectrum pair (LSP) method. Details of these schemes are given, for example, in Saito and Nakata, "Fundamentals of Speech Information Processing" (Ohmsha). A linear predictive vocoder consists of an all-pole filter representing the spectral envelope of speech and a generator for the excitation signal that drives it. The excitation is a pitch-periodic pulse train for voiced sounds and white noise for unvoiced sounds. The excitation parameters are the voiced/unvoiced decision, the pitch period, and the amplitude, and they are extracted as average features of the speech signal over an analysis interval of about 30 ms. Because the linear predictive vocoder synthesizes speech by interpolating in time the feature parameters extracted in each fixed analysis interval, it cannot reproduce the features of the speech waveform with sufficient accuracy when the pitch period, amplitude, or spectral characteristics of the speech change rapidly. Furthermore, an excitation consisting of a periodic pulse train and white noise is insufficient to reproduce the features of diverse speech waveforms, so it has been difficult to obtain highly natural synthesized speech. Improving the quality of synthesized speech in a linear predictive vocoder has therefore required an excitation that reproduces the features of the speech waveform more faithfully.

[0003] In code-excited linear prediction coding, on the other hand, speech is synthesized by using a noise sequence as the excitation to drive two all-pole filters that represent the short-term (near-sample) correlation and the pitch correlation of speech. The noise sequences are prepared in advance as a set of code patterns, and the code pattern that minimizes the error between the input speech waveform and the synthesized speech waveform is selected from among them. Details are given in Schroeder et al., "Code-excited linear prediction (CELP)", IEEE Int. Conf. on ASSP, pp. 937-940, 1985. In CELP coding, the reproduction accuracy of the coded speech waveform is proportional to the number of code patterns. Preparing many code patterns therefore raises the reproduction accuracy of the speech waveform and, with it, the quality. However, when the bit rate is reduced to 4 kb/s or below, the number of code patterns is limited and sufficient speech quality can no longer be obtained. About 4.8 kb/s of information has been needed to obtain good speech quality.

[0004] Whereas CELP coding determines the excitation so as to reproduce the speech waveform itself, a coding method has been proposed that determines the excitation so as to reproduce the waveform that remains after removing the phase component of the speech waveform, to which hearing is insensitive; that is, the zero-phase waveform. Details are described in Japanese Patent Application No. 59-53757, "Speech Signal Processing Method", and Japanese Patent Application No. 1-257503, "Speech Analysis/Synthesis Method". In this method, the short-time phase of the prediction residual waveform of the speech, which corresponds to the excitation signal, is approximately equalized to zero phase, so the zero-phase speech waveform is converted into a waveform with large peaks at the pitch excitation instants. As a result, the zero-phase prediction residual waveform can be coded with less information than the original waveform. The aforementioned application "Speech Analysis/Synthesis Method" (Japanese Patent Application No. 1-257503) describes a method of representing the zero-phase prediction residual waveform (excitation) for voiced sounds by a quasi-periodic pulse train and the coefficients of an all-zero filter. This method provides high speech quality at bit rates of 4 kb/s or below when the input speech contains no noise. When ambient noise is mixed into the input speech, however, this excitation signal cannot represent the noise component superimposed on the voiced speech, so the noise turns into a different, non-noise-like distortion that degrades the quality of the coded speech.

[0005] An object of this invention is to provide a speech analysis/synthesis method that offers high speech quality and excellent robustness to ambient noise in the boundary region between the linear predictive vocoder and waveform coding (2.4 to 4.8 kb/s).

[0006]

[Means for Solving the Problems] According to this invention, the excitation signal for voiced sounds used in speech analysis/synthesis is the sum of a noise sequence and a quasi-periodic pulse train in which the magnitude of the pitch-period fluctuation of the input speech is limited. This excitation signal drives a linear filter representing the spectral envelope of the speech to synthesize a speech waveform, and the excitation parameters, namely the time positions and amplitudes of the pulse sequence and the pattern and amplitude of the noise sequence, are determined so as to minimize the error between the synthesized speech waveform and the phase-equalized input speech waveform.

[0007] A conventional vocoder uses as its excitation a periodic pulse train generated from the average pitch period and amplitude obtained in each fixed analysis interval. In this invention, by contrast, the excitation is the sum of a quasi-periodic pulse train, whose pulse positions and amplitudes are given for each pitch period, and a noise sequence of fixed block length. Conventional CELP coding builds the excitation from a noise sequence alone, whereas this invention builds it from the sum of a pulse sequence with one pulse per pitch period and a noise sequence of fixed block length. The conventional multipulse predictive coding method builds the excitation from multiple pulses determined independently of the pitch period, whereas this invention builds it from the sum of one pulse per pitch period and a noise sequence of fixed block length. Moreover, CELP and multipulse coding conventionally use the squared error between the input speech waveform and the synthesized speech waveform as the criterion for determining the excitation parameters, whereas this invention uses the squared error between the phase-equalized speech waveform and the synthesized speech waveform. Finally, the aforementioned application "Speech Analysis/Synthesis Method" (Japanese Patent Application No. 1-257503) builds the excitation from a quasi-periodic pulse train and the coefficients of an all-zero filter, whereas this invention uses a noise sequence in place of the all-zero filter.

[0008]

[Embodiment] Fig. 1 shows the configuration of an apparatus to which the speech analysis/synthesis method of this invention is applied. A sampled speech signal $s(t)$ is input at input terminal 1. The linear prediction analysis section 2 first stores $N$ samples of the speech signal in a data buffer and then performs linear prediction analysis on them to compute the prediction coefficients $a_i$ ($i = 1, 2, \ldots, p$). The prediction residual signal $e(t)$ is then obtained with an inverse filter whose coefficients are the prediction coefficients, according to the following equation, where the sum runs over $i = 1$ to $p$.

[0009]

$$e(t) = s(t) - \sum_{i=1}^{p} a_i\, s(t-i)$$

Next, the autocorrelation coefficients of the prediction residual are computed, a threshold decision is applied to their maximum value, and the voiced/unvoiced parameter VUV of the analysis frame is determined. Details of these procedures are given in the aforementioned book by Saito et al. The phase equalization analysis section 3 computes the coefficients $c_t(n)$ of the phase equalization filter 4, which converts the phase characteristic of the speech to zero phase, and the reference time points $t'_i$ for phase equalization. Details of its configuration are described in the aforementioned application "Speech Analysis/Synthesis Method" (Japanese Patent Application No. 1-257503). The filter coefficients obtained by the phase equalization analysis section 3 control the phase equalization filter 4 sample by sample. Feeding the speech signal from terminal 1 into the phase equalization filter 4 yields the phase-equalized speech signal $s_p(t)$ at its output according to the following equation, where the sum runs over $i = -M/2$ to $M/2$.

[0010]

$$s_p(t) = \sum_{i=-M/2}^{M/2} c_t(i)\, s(t-i)$$

This analysis/synthesis method uses separate excitations for voiced and unvoiced sounds, and switch 18 selects between them according to the voiced/unvoiced parameter VUV. The configuration of the voiced excitation is described first. The voiced excitation consists of a pulse sequence generation section 10 and a noise codebook 12. The pulse sequence generation section 10 generates a quasi-periodic pulse sequence from the given pulse time points $t_i$. The amplitude of each pulse is controlled in the pulse amplitude control section 11 by multiplying it by a gain $m_i$.
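The inverse filter of paragraph [0009] and the sample-by-sample phase equalization filter of paragraph [0010] can be sketched in a few lines of Python. This is a minimal illustration only: the function names `prediction_residual` and `phase_equalize` are hypothetical, samples outside the frame are assumed to be zero, and the time-varying taps are taken as given because their derivation is in the referenced application rather than in this text.

```python
def prediction_residual(s, a):
    """e(t) = s(t) - sum_{i=1..p} a[i-1] * s(t-i).

    Assumption: samples before t = 0 are treated as zero.
    """
    p = len(a)
    e = []
    for t in range(len(s)):
        pred = sum(a[i - 1] * s[t - i] for i in range(1, p + 1) if t - i >= 0)
        e.append(s[t] - pred)
    return e


def phase_equalize(s, c, M):
    """s_p(t) = sum_{i=-M/2..M/2} c[t][i + M/2] * s(t-i).

    c[t] holds the M+1 taps in force at sample t (one tap set per sample).
    Assumption: samples outside the frame are treated as zero.
    """
    half = M // 2
    out = []
    for t in range(len(s)):
        acc = 0.0
        for i in range(-half, half + 1):
            if 0 <= t - i < len(s):
                acc += c[t][i + half] * s[t - i]
        out.append(acc)
    return out
```

With a first-order predictor `a = [0.5]`, the decaying sequence `[1.0, 0.5, 0.25, 0.125]` leaves a residual that is a single pulse, which is the behaviour the inverse filter is meant to have on a signal that matches its model.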

[0011] The noise codebook 12 stores a number of noise sequences (noise vectors) of fixed block length. A sequence of normally distributed random numbers with mean 0 and variance 1, for example, is used as a noise sequence. The noise sequence output from the noise codebook 12 is fed to the pitch correlation filter 13 and converted into noise with periodicity. The pitch correlation filter 13 is implemented as a digital filter with the following transfer characteristic.

[0012]

$$B(z) = \frac{1}{1 - \gamma_b\, b\, z^{-T_p}}$$

Here $b$ is the pitch gain, $T_p$ is the pitch period, and $\gamma_b$ is the periodicity emphasis coefficient. The pitch gain $b$ is computed in the pitch gain calculation section 5 as the autocorrelation coefficient of the phase-equalized prediction residual signal at a time lag equal to the pitch period. The pitch gain obtained in the pitch gain calculation section 5 is quantized in the quantization section 5a and supplied to the pitch correlation filter 13 as the pitch gain $b$. The pitch period $T_p$ is given as the average spacing between adjacent pulse time points of the quasi-periodic pulse train. The amplitude of the output signal of the pitch correlation filter 13 is controlled in the amplitude control section 14 by multiplying it by a gain $G_V$.
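In the time domain, the transfer characteristic $B(z)$ corresponds to the recursion $y(t) = x(t) + \gamma_b\, b\, y(t - T_p)$. The sketch below illustrates that recursion together with one plausible way of computing the pitch gain. The names `pitch_correlation_filter` and `pitch_gain` are hypothetical, the filter is assumed to start from a zero state, the pitch period is assumed to be an integer number of samples, and the normalization of the autocorrelation by the residual energy is an assumption; the text only says the gain is an autocorrelation coefficient at lag $T_p$.

```python
def pitch_correlation_filter(x, b, Tp, gamma_b=1.0):
    """Apply B(z) = 1 / (1 - gamma_b * b * z^-Tp) to the sequence x.

    Assumption: zero initial state, integer pitch period Tp in samples.
    """
    y = []
    for t, xt in enumerate(x):
        feedback = y[t - Tp] if t >= Tp else 0.0
        y.append(xt + gamma_b * b * feedback)
    return y


def pitch_gain(residual, Tp):
    """Autocorrelation coefficient of the residual at lag Tp.

    Assumption: normalized by the residual energy.
    """
    num = sum(residual[t] * residual[t - Tp] for t in range(Tp, len(residual)))
    den = sum(v * v for v in residual)
    return num / den if den > 0.0 else 0.0
```

Feeding a unit impulse through the filter with `b = 0.5` and `Tp = 3` produces echoes of 0.5 and 0.25 at samples 3 and 6, which is how an aperiodic noise vector acquires pitch periodicity.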

[0013] The quasi-periodic pulse sequence $p(t)$ from the pulse characteristic control section 11 and the noise sequence $v(t)$ from the amplitude control section 14, obtained in this way, are added sample by sample in adder 15 to produce the excitation signal $e_v(t)$. The configuration of the unvoiced excitation is described next. For unvoiced sounds, a noise sequence is used as the excitation signal. Like the noise codebook 12 for voiced sounds, the noise codebook 16 stores a number of noise sequences of fixed block length. The amplitude of the noise sequence output from the noise codebook 16 is controlled in the amplitude control section 17 by multiplying it by a gain $G_U$, producing the excitation signal $e_u(t)$. Because the periodicity of speech is weak in unvoiced sounds, no pitch correlation filter is included. In general, the block length of the unvoiced noise sequences differs from that of the voiced noise sequences.
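The two excitation paths can be illustrated with a short Python sketch. It is not the apparatus itself: the function names are hypothetical, pulse times are assumed to be integer sample indices, and the `periodic_noise` argument stands for a codebook vector that has already passed through the pitch correlation filter.

```python
def pulse_train(n, times, gains):
    """p(t): pulses of amplitude gains[i] at sample indices times[i]."""
    p = [0.0] * n
    for t_i, m_i in zip(times, gains):
        if 0 <= t_i < n:
            p[t_i] += m_i
    return p


def voiced_excitation(n, times, gains, periodic_noise, gv):
    """e_v(t) = p(t) + v(t), where v(t) = gv * periodic_noise(t)."""
    p = pulse_train(n, times, gains)
    return [p[t] + gv * periodic_noise[t] for t in range(n)]


def unvoiced_excitation(noise, gu):
    """e_u(t) = gu * noise(t); no pulse train and no pitch filter."""
    return [gu * c for c in noise]
```

For example, `voiced_excitation(5, [1, 4], [2.0, -1.0], [1.0] * 5, 0.5)` places pulses of 2.0 and -1.0 on top of a noise floor of 0.5.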

[0014] Speech is synthesized by selecting, with switch 18, the excitation signal $e_v(t)$ or $e_u(t)$ according to the voiced/unvoiced parameter VUV and driving the all-pole (linear) filter 19, which characterizes the spectral envelope of the speech. The all-pole filter 19 is implemented as a digital filter with the following transfer characteristic $A(z)$.

$$A(z) = \frac{1}{1 + a_1 z^{-1} + \cdots + a_p z^{-p}}$$

Here $a_i$ are the linear prediction coefficients, $z^{-1}$ is the unit sample delay, and $p$ is the filter order. The linear prediction coefficients $a_i$ used in synthesis are obtained in the linear prediction analysis section 6 by linear prediction analysis of the phase-equalized speech output from the phase equalization filter 4, and are quantized in the quantization section 6a.
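With the sign convention of $A(z)$ as written in paragraph [0014], the synthesis recursion is $s'(t) = e(t) - \sum_{i=1}^{p} a_i\, s'(t-i)$. The following Python sketch shows that recursion. The name `all_pole_synthesis` and the optional `state` argument, which carries the last $p$ output samples across frames, are illustrative assumptions.

```python
def all_pole_synthesis(excitation, a, state=None):
    """Drive 1 / (1 + a[0] z^-1 + ... + a[p-1] z^-p) with the excitation.

    state, if given, lists the previous outputs s'(t-1), s'(t-2), ...
    Assumption: zero state when none is supplied.
    """
    p = len(a)
    hist = list(state) if state is not None else [0.0] * p
    out = []
    for e in excitation:
        y = e - sum(a[i] * hist[i] for i in range(p))
        out.append(y)
        hist = ([y] + hist)[:p]  # newest sample first, keep p samples
    return out
```

Driving the filter with an all-zero excitation and a non-zero `state` gives the zero-input response that the later error expressions subtract as $f_z(t)$.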

[0015] The method of analyzing the excitation parameters is described next. The pulse time calculation section 7 computes the pulse time points of the quasi-periodic pulse sequence. The pulse time points are constrained so that their spacing is quasi-periodic. Specifically, the pulse intervals $T_i = t_i - t_{i-1}$ in Fig. 2 are constrained by the following inequality so that the difference between successive pulse intervals does not exceed a fixed value.

[0016]

$$\Delta T_i = |T_i - T_{i-1}| \le J$$

Here $J$ is the allowable difference between pulse intervals. Taking the reference time points $t'_i$ obtained in the phase equalization analysis section 3 as initial values, a sequence of pulse time points $t_i$ that satisfies this constraint is determined, quantized in the quantization section 7a, and supplied to the pulse sequence generation section 10. Fig. 3 shows the procedure for generating the sequence of pulse time points $t_i$ from the reference time points $t'_i$. In this procedure the reference time points $t'_i$ are input as the initial values of the pulse time points $t_i$. First the number of reference time points is checked (S1); if it is two or fewer, the reference time points are used as the pulse time points. If it is three or more, the difference $\Delta T_i$ between adjacent intervals of the reference time points is computed (S2) and compared with the allowable value $J$ (S3). If $\Delta T_i$ is $J$ or less, the procedure moves to step S4. If not, it is checked whether one half of $\Delta T_i$ is $J$ or less (S5); if so, the pulse interval is too wide, so a pulse position is inserted at the midpoint and the procedure moves to step S4 (S6). Otherwise, the difference $\Delta T_i$ between the interval $t_{i+1} - t_{i-1}$, which skips one reference time point, and the preceding interval $t_{i-1} - t_{i-2}$ is computed (S7) and compared with $J$ (S8); if it is $J$ or less, the pulse interval is too narrow, so the pulse time point $t_i$ is removed and the procedure moves to step S4 (S9). If it is not $J$ or less at step S8, it is checked whether one half of $\Delta T_i$ is $J$ or less (S10); if so, the pulse position $t_i$ is moved to the midpoint between $t_{i+1}$ and $t_{i-1}$ and the procedure moves to step S4 (S11). If it is not $J$ or less at step S10, the amplitude of each pulse at the reference time points is computed with the pulse amplitude calculation method described later (S12), the time point with the smallest pulse amplitude is deleted from the reference time points, and the procedure returns to step S1 (S13). Step S4 checks whether all pulse positions (time points) have been examined; if not, the procedure returns to step S1, and if so, it ends. Pulse time points are thus inserted, removed, and corrected repeatedly until the pulse time points are determined.
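The quasi-periodicity constraint that drives this procedure can be checked with a few lines of Python. The sketch below covers only the test $|T_i - T_{i-1}| \le J$ and the two-point shortcut of step S1; it does not attempt to reproduce the full flow of Fig. 3, and the function names are hypothetical.

```python
def pulse_intervals(times):
    """Intervals between consecutive time points: times[k+1] - times[k]."""
    return [times[k + 1] - times[k] for k in range(len(times) - 1)]


def interval_violations(times, J):
    """Indices i into times where |T_i - T_{i-1}| > J, with T_i = t_i - t_{i-1}.

    Two or fewer time points are accepted as they are (step S1).
    """
    if len(times) <= 2:
        return []
    T = pulse_intervals(times)  # T[i - 1] holds T_i
    return [i for i in range(2, len(times)) if abs(T[i - 1] - T[i - 2]) > J]
```

For reference points `[0, 80, 160, 320, 400]` with `J = 5`, the check flags indices 3 and 4 because of the double-width gap between 160 and 320. Inserting the midpoint 240, as step S6 does for an over-wide interval, clears both violations.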

[0017] The pulse amplitude calculation section 8 computes the amplitude of each pulse of the quasi-periodic pulse train. The pulse amplitudes are determined so as to minimize the frequency-weighted mean squared error between the speech waveform synthesized from the quasi-periodic pulse sequence and the phase-equalized input speech waveform. The frequency-weighted mean squared error is given by the following equation, where the outer sum runs over $t = 0$ to $N-1$, the inner sum runs over $j = 1$ to $n_p$, and $n_p$ is the number of pulses in the analysis frame.

[0018]

$$d = \sum_{t=0}^{N-1} \Bigl\{ \Bigl( s_p(t) - f_z(t) - f(t) * \sum_{j=1}^{n_p} m_j\, \delta(t - t_j) \Bigr) * w(t) \Bigr\}^2$$

Here $\delta(\cdot)$ denotes the delta function and $*$ denotes convolution. $f(t)$ is the impulse response of the all-pole filter 19. $f_z(t)$ is the initial-value (zero-input) response obtained when a filter with transfer characteristic $A(z)$ is driven with zero input, taking the synthesized speech $s'(t)$ of the preceding analysis frame as its initial state. $w(t)$ is the impulse response of the frequency weighting filter, whose transfer characteristic is given as follows.

[0019]

$$W(z) = \frac{A(z)}{A(\gamma z)}$$

Here $\gamma$ is a parameter that controls the degree of frequency weighting. It takes a value in the range $0 < \gamma \le 1$, and values of 0.7 to 0.9 are normally used. Fig. 4 shows the internal configuration of the pulse amplitude calculation section 8, which finds the pulse amplitudes that minimize the above mean squared error. Taking the phase-equalized speech $s_p(t)$ as input, filter 41 computes $s_p(t) * w(t)$ and filter 42 computes $f_z(t) * w(t)$; adder 43 subtracts the output of filter 42 from the output of filter 41 to obtain $s_w(t)$. The impulse response calculation section 44 computes the impulse response $f_w(t)$ of a filter with transfer characteristic $1/A(\gamma z)$. For each pulse time point $t_i$, correlator 45 computes the cross-covariance $\psi(i)$ between the impulse response $f_w(t)$ and the signal $s_w(t)$ by the following equation, where the sum runs over $t = 0$ to $N-1$.

[0020]

$$\psi(i) = \sum_{t=0}^{N-1} f_w(t - t_i)\, s_w(t)$$

Correlator 46 computes, for each pair of pulse time points $t_i$, $t_j$, the autocovariance $\phi(i, j)$ of the impulse response by the following equation, where the sum runs over $t = 0$ to $N-1$.

$$\phi(i, j) = \sum_{t=0}^{N-1} f_w(t - t_i)\, f_w(t - t_j)$$

The pulse amplitude calculation section 47 computes the pulse amplitudes by solving the following simultaneous equations.

[0021]

[Math 1]
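The equation image for [Math 1] is not reproduced in this text. The Python sketch below therefore assumes that the simultaneous equations are the usual least-squares normal equations $\sum_j \phi(i, j)\, m_j = \psi(i)$ built from the quantities defined in paragraph [0020]; that form is an assumption, not a quotation. The function names are hypothetical, the impulse response is assumed causal and truncated to the length of `fw`, pulse times are assumed to be integer sample indices, and the small Gaussian elimination routine stands in for whatever solver an implementation would actually use.

```python
def shifted(fw, t, ti):
    """f_w(t - t_i), taken as zero outside the stored impulse response."""
    k = t - ti
    return fw[k] if 0 <= k < len(fw) else 0.0


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def pulse_amplitudes(sw, fw, times):
    """Amplitudes m_j solving sum_j phi(i, j) m_j = psi(i) (assumed form)."""
    N = len(sw)
    psi = [sum(shifted(fw, t, ti) * sw[t] for t in range(N)) for ti in times]
    phi = [[sum(shifted(fw, t, ti) * shifted(fw, t, tj) for t in range(N))
            for tj in times] for ti in times]
    return solve(phi, psi)
```

When `sw` is built from known amplitudes and the same `fw`, the routine returns those amplitudes, which is a convenient sanity check before using it on real frames.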

[0022] These pulse amplitudes $m_i$ are quantized in the quantization section 8a and supplied to the amplitude control section 11. The noise sequence and noise gain calculation section 9 determines the noise sequence for voiced sounds and its gain (the noise gain). The noise gain is determined so as to minimize the frequency-weighted mean squared error between the phase-equalized input speech waveform and the speech waveform synthesized with the sum of the quasi-periodic pulse sequence and the noise sequence as the excitation. Let $C_{Vi}(t)$ be the $i$-th noise sequence in the noise codebook 12 and $h(t)$ the impulse response of the composite synthesis filter $1/A(z) \cdot 1/B(z)$. The frequency-weighted mean squared error of the synthesized speech is then given by the following equation, where the sum runs over $t = 0$ to $N-1$.

[0023]

$$d_i = \sum_{t=0}^{N-1} \Bigl\{ \bigl( s_p(t) - f_z(t) - p(t) * f(t) - h_z(t) - G_{Vi}\, C_{Vi}(t) * h(t) \bigr) * w(t) \Bigr\}^2$$

Here $p(t)$ is the quasi-periodic pulse sequence determined by the method described above, and $h_z(t)$ is the zero-input (initial-value) response of the composite synthesis filter. The optimal gain that minimizes the squared error is then computed by the following equation, where each sum runs over $t = 0$ to $N-1$.

$$G_{Vi} = \frac{\sum_t z(t)\, y(t)}{\sum_t y^2(t)}$$

where $z(t) = \bigl( s_p(t) - f_z(t) - p(t) * f(t) - h_z(t) \bigr) * w(t)$ and $y(t) = C_{Vi}(t) * h(t) * w(t)$.
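As a brief check of the gain formula, the derivation below treats $z(t)$ and $y(t)$ as fixed real sequences for one candidate $i$ and uses the linearity of convolution to write the error as a quadratic in the gain $G$. It is a sketch added for clarity and uses only the symbols defined in this paragraph.

```latex
\begin{aligned}
d_i(G) &= \sum_{t=0}^{N-1} \bigl( z(t) - G\, y(t) \bigr)^2, \\
\frac{\partial d_i}{\partial G} &= -2 \sum_{t=0}^{N-1} y(t) \bigl( z(t) - G\, y(t) \bigr) = 0
\;\Longrightarrow\;
G_{Vi} = \frac{\sum_t z(t)\, y(t)}{\sum_t y^2(t)}, \\
d_i(G_{Vi}) &= \sum_t z^2(t) - G_{Vi} \sum_t z(t)\, y(t).
\end{aligned}
```

The last line shows that the minimum error for each candidate needs only three sums: the energy of $z(t)$, the energy of $y(t)$, and their cross term.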

[0024] Fig. 5 shows the internal configuration of the noise sequence and noise gain calculation section 9. Its input is the signal $s_w(t)$ obtained in the pulse amplitude calculation section 8 of Fig. 4. Filter 51 computes $p(t) * f(t) * w(t)$ and filter 52 computes $h_z(t) * w(t)$; adders 53 and 54 take the sample-by-sample differences between the signals to obtain the signal $z(t)$. Filter 55 takes the noise sequence $C_{Vi}$ as input and computes $y(t) = C_{Vi}(t) * h(t) * w(t)$. Correlator 56 computes the correlation between the signals $z(t)$ and $y(t)$ as $c_{zy} = \sum z(t)\, y(t)$, and correlator 57 computes the power of the signal $y(t)$ as $c_{yy} = \sum y^2(t)$ (both sums run over $t = 0$ to $N-1$). Divider 59 computes $G_{Vi} = c_{zy} / c_{yy}$, which gives the optimal gain. Correlator 58 computes the power $c_{zz}$ of the signal $z(t)$, multiplier 60 multiplies $G_{Vi}$ by $c_{zy}$, and adder 61 subtracts $G_{Vi}\, c_{zy}$ from $c_{zz}$ to obtain the mean squared error $d_i$ of the synthesized speech. The minimum value selection section 62 selects, from the noise sequences in the noise codebook 12, the one that minimizes the mean squared error $d_i$ of the synthesized speech, and outputs its index $I_{CV}$ and the optimal noise gain $G_V$. The gain $G_V$ is quantized in the quantization section 9a and supplied to the amplitude control section 14.
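The search performed by the correlators, divider, and minimum value selection section can be sketched in Python as a loop over the codebook. This is an illustration under stated assumptions rather than the apparatus itself: the names `convolve` and `search_codebook` are hypothetical, the argument `hw` stands for the already combined impulse response $h(t) * w(t)$ truncated to a finite length, and the target `z` is assumed to have been computed beforehand.

```python
def convolve(x, h, n):
    """First n samples of the causal convolution x * h."""
    out = []
    for t in range(n):
        lo = max(0, t - len(h) + 1)
        hi = min(t, len(x) - 1)
        out.append(sum(x[k] * h[t - k] for k in range(lo, hi + 1)))
    return out


def search_codebook(z, codebook, hw):
    """Return (index, gain, error) of the codevector with the smallest error.

    For each candidate: gain = c_zy / c_yy and error = c_zz - gain * c_zy.
    Candidates whose filtered output has zero energy are skipped.
    """
    n = len(z)
    czz = sum(v * v for v in z)
    best = None
    for idx, code in enumerate(codebook):
        y = convolve(code, hw, n)
        czy = sum(a * b for a, b in zip(z, y))
        cyy = sum(b * b for b in y)
        if cyy == 0.0:
            continue
        gain = czy / cyy
        err = czz - gain * czy
        if best is None or err < best[2]:
            best = (idx, gain, err)
    return best
```

If the target is exactly three times the filtered second codevector, the search returns index 1 with a gain of 3 and an error of zero up to rounding. The same loop applies to the unvoiced codebook with `hw` replaced by $f(t) * w(t)$ and `z` by $s_w(t)$.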

[0025] The method of determining the optimal noise sequence and optimal gain for unvoiced sounds is described next. The noise sequence and noise gain calculation section 20 determines the noise sequence for unvoiced sounds and its gain (the noise gain). The noise gain is determined so as to minimize the frequency-weighted mean squared error between the phase-equalized input speech waveform and the speech waveform synthesized with the noise sequence as the excitation. Let $C_{Ui}(t)$ be the $i$-th noise sequence in the noise codebook 16 and $f(t)$ the impulse response of the synthesis filter $1/A(z)$. The frequency-weighted mean squared error of the synthesized speech is then given by the following equation, where the sum runs over $t = 0$ to $N-1$.

[0026]
di = Σ{(sp(t) - fz(t) - GUi CUi(t)*f(t))*w(t)}²
Here, fz(t) is the zero-input initial-value response of the synthesis filter described above. The optimal gain that minimizes this squared error is calculated by the following equation, where each Σ runs from t=0 to N-1.
GUi = Σ sw(t) y(t) / Σ y²(t)
where sw(t) = (sp(t) - fz(t))*w(t) and y(t) = CUi(t)*f(t)*w(t).
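The gain formula follows from setting the derivative of di with respect to GUi to zero. The short derivation below is a sketch that uses only the symbols defined above, plus the shorthand c_ss = Σ sw²(t), c_sy = Σ sw(t) y(t), and c_yy = Σ y²(t). It relies on the linearity of convolution, so that the weighted error signal equals sw(t) - GUi y(t).

```latex
\begin{aligned}
d_i &= \sum_{t=0}^{N-1}\bigl(s_w(t) - G_{Ui}\,y(t)\bigr)^2
     = c_{ss} - 2\,G_{Ui}\,c_{sy} + G_{Ui}^{2}\,c_{yy},\\[4pt]
\frac{\partial d_i}{\partial G_{Ui}} &= -2\,c_{sy} + 2\,G_{Ui}\,c_{yy} = 0
  \quad\Longrightarrow\quad G_{Ui} = \frac{c_{sy}}{c_{yy}},\\[4pt]
d_i\Big|_{G_{Ui} = c_{sy}/c_{yy}} &= c_{ss} - G_{Ui}\,c_{sy}
  = c_{ss} - \frac{c_{sy}^{2}}{c_{yy}}.
\end{aligned}
```

The last line shows that, for a fixed target sw(t), choosing the sequence with the smallest di is equivalent to choosing the one with the largest c_sy²/c_yy.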

[0027] FIG. 6 shows the internal configuration of the noise sequence/noise gain calculation unit 20. The filter 63, to which the phase-equalized speech signal sp(t) is input, calculates sp(t)*w(t); the filter 64 calculates fz(t)*w(t); and the adder 65 takes the sample-by-sample difference between these signals to obtain the signal sw(t). The filter 66 takes the noise sequence CUi as input and computes y(t) = CUi(t)*f(t)*w(t). The correlator 67 obtains the correlation between sw(t) and y(t) as csy = Σ sw(t) y(t), and the correlator 68 obtains the power of y(t) as cyy = Σ y²(t) (each Σ runs from t=0 to N-1). The divider 70 computes GUi = csy/cyy to obtain the optimal gain. The correlator 69 calculates the power css of the signal sw(t), the multiplier 71 multiplies GUi by csy, and the adder 72 subtracts GUi csy from css to obtain the mean square error di of the synthesized speech. The minimum value selection unit 73 selects, from among the noise sequences in the noise codebook 16, the one that minimizes the mean square error di of the synthesized speech, and outputs its number ICU and the optimal noise gain GU. The gain GU is quantized by the quantization section 20a and used to control the amplitude control section 17.
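The search performed by blocks 66 to 73 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the argument `fw` (the impulse response of the cascade f(t)*w(t), truncated to a finite length), and the toy codebook are all hypothetical, and the target sw(t) is assumed to have been computed already by filters 63 and 64 and adder 65.

```python
def convolve_truncated(x, h, n):
    """Return the first n samples of the linear convolution x * h."""
    out = [0.0] * n
    for t in range(n):
        acc = 0.0
        for k in range(min(t, len(h) - 1) + 1):
            if t - k < len(x):
                acc += h[k] * x[t - k]
        out[t] = acc
    return out


def search_codebook(sw, codebook, fw):
    """Return (index, gain, error) of the entry that minimizes d_i.

    sw       : weighted target (sp - fz) * w, length N
    codebook : candidate noise sequences CU_i, each of length N
    fw       : impulse response of f * w (hypothetical, finite length)
    """
    n = len(sw)
    css = sum(s * s for s in sw)                  # correlator 69
    best = None
    for i, cu in enumerate(codebook):
        y = convolve_truncated(cu, fw, n)         # filter 66
        csy = sum(s * v for s, v in zip(sw, y))   # correlator 67
        cyy = sum(v * v for v in y)               # correlator 68
        if cyy == 0.0:
            continue                              # an all-zero entry has no defined gain
        gain = csy / cyy                          # divider 70
        err = css - gain * csy                    # multiplier 71 and adder 72
        if best is None or err < best[2]:         # minimum value selection 73
            best = (i, gain, err)
    return best


# Toy check: build the target from entry 1 scaled by 2, then search.
fw = [1.0, 0.5]
codebook = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [1.0, -1.0, 1.0, -1.0]]
target = [2.0 * v for v in convolve_truncated(codebook[1], fw, 4)]
index, gain, err = search_codebook(target, codebook, fw)
# Entry 1 is recovered with gain 2.0 and zero error.
```

The gain found here is the unquantized optimum; in the text the selected gain GU is quantized afterwards by the quantization section 20a.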

[0028] Through the processing described above, the speech signal is represented as follows: for both voiced and unvoiced sounds, by the linear prediction coefficients ai and the voiced/unvoiced parameter VUV; for voiced sounds, by the pulse times ti, the pulse amplitudes mi, the noise sequence number ICV and amplitude GV, and the pitch correlation filter coefficient b; and for unvoiced sounds, by the noise sequence number ICU and amplitude GU. These speech parameters are encoded by the encoding section 21 and then transmitted or stored.

[0029] In the speech synthesis section, as shown in FIG. 7, the decoding section 22 decodes all the speech parameters, and the driving sound source signal is then reconstructed according to the voiced/unvoiced parameter VUV. For a voiced sound, the pulse sequence generation unit 23 generates a quasi-periodic pulse train from the pulse times ti, and the amplitude control unit 24 sets the amplitude of each pulse to mi. In addition, the noise sequence corresponding to the noise sequence number ICV is read from a noise codebook 25 identical to the noise codebook 12 on the transmitting side. This noise sequence is passed through the pitch correlation filter 26 and then multiplied by the noise gain GV in the amplitude control unit 27 to generate the noise driving sequence. The adder 28 adds these two signal sequences sample by sample to generate the driving sound source signal. For an unvoiced sound, the noise sequence corresponding to the noise sequence number ICU is read from a noise codebook 29 identical to the noise codebook 16 on the transmitting side, and the amplitude control unit 30 multiplies it by the noise gain GU to generate the driving sound source signal. The switch 31 selects the driving sound source signal according to the voiced/unvoiced parameter VUV, and the selected signal drives the all-pole filter 32, whose filter coefficients are set to the linear prediction coefficients ai, so that synthesized speech is output at the output terminal 33.
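The decoder path just described can be sketched for a single frame as follows. This is a hedged illustration only. All function and parameter names are hypothetical; the sign convention A(z) = 1 + Σ ai z^-i is an assumption; the voiced noise sequence is assumed to have already passed through the pitch correlation filter 26; and filter memory carried over between frames is ignored for brevity.

```python
def all_pole_filter(excitation, a):
    """All-pole synthesis: s(t) = e(t) - sum_i a[i] * s(t - 1 - i).

    Assumes A(z) = 1 + sum_i a_i z^-i and zero initial state.
    """
    out = []
    for t, e in enumerate(excitation):
        acc = e
        for i, ai in enumerate(a):
            if t - 1 - i >= 0:
                acc -= ai * out[t - 1 - i]
        out.append(acc)
    return out


def voiced_excitation(n, pulse_times, pulse_amps, noise, gv):
    """Pulse train plus gain-scaled noise, added per sample (adder 28)."""
    exc = [gv * v for v in noise[:n]]            # amplitude control 27
    for t, m in zip(pulse_times, pulse_amps):    # units 23 and 24
        exc[t] += m
    return exc


def unvoiced_excitation(n, noise, gu):
    """Gain-scaled noise sequence (amplitude control 30)."""
    return [gu * v for v in noise[:n]]


def synthesize_frame(vuv, a, n, **p):
    """Select the excitation by VUV (switch 31), then filter it (filter 32)."""
    if vuv:
        exc = voiced_excitation(n, p["pulse_times"], p["pulse_amps"],
                                p["noise"], p["gv"])
    else:
        exc = unvoiced_excitation(n, p["noise"], p["gu"])
    return all_pole_filter(exc, a)


# A single unit pulse through a one-pole filter gives a decaying response.
voiced = synthesize_frame(True, [-0.5], 4, pulse_times=[0], pulse_amps=[1.0],
                          noise=[0.0] * 4, gv=0.0)
unvoiced = synthesize_frame(False, [], 3, noise=[1.0, -1.0, 2.0], gu=0.5)
```

A practical decoder would also keep the filter state from the previous frame, which is what the zero-input response fz(t) accounts for on the analysis side.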

[0030] For simplicity, FIG. 1 shows the difference between the output sp(t) of the phase equalization filter 4 and the output s′(t) of the all-pole filter 19 being input to the pulse amplitude calculation unit 8 and to the noise gain calculation units 9 and 20. As is clear from FIGS. 4, 5, and 6, however, the outputs sp(t) and s′(t) are in practice each input separately to the pulse amplitude calculation unit 8 and the noise gain calculation units 9 and 20.

[0031]

[Effects of the Invention] To examine the effect of the speech analysis and synthesis method according to the present invention, an analysis-synthesis experiment was conducted under the following conditions. Speech in the 0 to 4 kHz band is sampled at a sampling frequency of 8 kHz. The speech signal is then multiplied by a Hamming window with an analysis window length of 30 ms, and 12th-order linear prediction analysis by the autocorrelation method is performed to obtain 12 prediction coefficients and the voiced/unvoiced parameter. The analysis frame length for encoding is 20 ms (160 speech samples). The prediction coefficients are multistage vector quantized using the Euclidean distance between LSP parameters. The pulse amplitudes are treated together as a single vector and vector quantized. The noise gain is scalar quantized. The pulse times are encoded as the position of the first pulse in the frame, the interval between the first and second pulse times, and, from the third pulse onward, the difference between successive pulse intervals. The block length of the noise sequence is 20 ms (160 samples) for voiced sounds and 5 ms (40 samples) for unvoiced sounds. At a bit rate of 3.7 kb/s, the number of bits per frame is 74, broken down as follows.

[0032]
Parameter                                   Bits/frame
Prediction coefficients                     24
Voiced/unvoiced parameter                   1
Driving sound source (voiced)
    Pulse times                             30
    Pulse amplitudes                        8
    Noise sequence number                   6
    Noise gain                              3
    Pitch prediction filter coefficient     2
Driving sound source (unvoiced)
    Noise sequence number                   36 (= 9 × 4)
    Noise gain                              12 (= 3 × 4)

Speech encoded under the above conditions is far more natural than that of a conventional vocoder and achieves high speech quality. Its quality also depends less on the speaker than with conventional vocoders. The quality of the encoded speech was also confirmed to be higher than that of conventional multipulse predictive coding and code-excited predictive coding. The time delay introduced by encoding is 60 ms, which is comparable to or less than that of conventional methods in the low bit rate range. Furthermore, when ambient noise is mixed into the input speech, the noisy speech is reproduced as is, yielding more natural speech than when only a conventional quasi-periodic pulse sound source is used, so robustness to ambient noise is improved.
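The pulse-time coding used in the experiment (first pulse position, then the first interval, then differences between successive intervals) can be illustrated with a small round-trip sketch. The function names and the example pulse positions are hypothetical, and how the resulting values are mapped onto the 30 bits allotted to pulse times is not specified here. Because the pitch-period fluctuation of the pulse train is limited, the interval differences can be expected to stay small, which is presumably what makes this representation economical.

```python
def encode_pulse_times(times):
    """[t1, t2, t3, ...] -> [t1, t2 - t1, (t3 - t2) - (t2 - t1), ...]."""
    if not times:
        return []
    codes = [times[0]]                      # position of the first pulse
    prev_interval = None
    for prev, cur in zip(times, times[1:]):
        interval = cur - prev
        if prev_interval is None:
            codes.append(interval)          # first interval, sent as is
        else:
            codes.append(interval - prev_interval)  # change in interval
        prev_interval = interval
    return codes


def decode_pulse_times(codes):
    """Invert encode_pulse_times."""
    if not codes:
        return []
    times = [codes[0]]
    interval = 0
    for k, c in enumerate(codes[1:]):
        interval = c if k == 0 else interval + c
        times.append(times[-1] + interval)
    return times


# Hypothetical pulse positions within a 160-sample frame.
times = [12, 70, 129, 187 - 40]
codes = encode_pulse_times(times)
restored = decode_pulse_times(codes)
```

For the positions above the intervals are 58, 59, and 18, so the encoded values are the first position, the first interval, and then the two interval changes.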

[0033] The effect of the present invention is that it enables speech coding with extremely natural speech quality and robustness to ambient noise at a low bit rate of 4 kb/s or less.
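As a check on the 3.7 kb/s figure quoted for the experiment, the per-frame bit allocation can be summed and converted to a bit rate. The sketch below assumes a frame of 160 samples at 8 kHz (20 ms); the dictionary and function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 160          # 20 ms at 8 kHz

COMMON = {"prediction coefficients": 24, "voiced/unvoiced parameter": 1}
VOICED = {
    "pulse times": 30,
    "pulse amplitudes": 8,
    "noise sequence number": 6,
    "noise gain": 3,
    "pitch prediction filter coefficient": 2,
}
UNVOICED = {
    "noise sequence number": 9 * 4,   # four 5 ms blocks per frame
    "noise gain": 3 * 4,
}


def bits_per_frame(excitation):
    """Total bits in a frame: common parameters plus excitation parameters."""
    return sum(COMMON.values()) + sum(excitation.values())


def bit_rate(bits):
    """Bits per second for the assumed frame length."""
    return bits * SAMPLE_RATE_HZ / FRAME_SAMPLES


voiced_bits = bits_per_frame(VOICED)
unvoiced_bits = bits_per_frame(UNVOICED)
voiced_rate = bit_rate(voiced_bits)
unvoiced_rate = bit_rate(unvoiced_bits)
```

With these figures a voiced frame uses 74 bits, which gives 3700 b/s, and an unvoiced frame uses 73 bits, one bit short of the 74-bit frame.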

[Brief description of the drawings]

[FIG. 1] A block diagram showing the configuration of a device to which the speech analysis and synthesis method according to the present invention is applied.

[FIG. 2] An explanatory diagram of the quasi-periodic pulse driving sound source signal.

[FIG. 3] A flowchart showing an example of the process for generating pulse times.

[FIG. 4] A block diagram showing a specific example of the pulse amplitude calculation unit 8.

[FIG. 5] A block diagram showing a specific example of the noise sequence/noise gain calculation unit 9 for voiced sounds.

[FIG. 6] A block diagram showing a specific example of the noise sequence/noise gain calculation unit 20 for unvoiced sounds.

[FIG. 7] A block diagram showing the configuration of a synthesis device to which the speech analysis and synthesis method according to the present invention is applied.

Claims

[Claims]

[Claim 1] A speech analysis and synthesis method using a linear filter representing speech spectral envelope characteristics and a sound source signal generation section that generates a sound source signal for driving the linear filter, characterized in that: a signal obtained by mixing a quasi-periodic pulse train, in which the magnitude of pitch period fluctuation is limited, with a noise sequence is used as the sound source signal; the linear filter is driven by the sound source signal to synthesize a speech signal; and sound source signal generation parameters of the sound source signal generation section are determined so that the error between a phase-equalized speech signal, obtained by converting the phase of the input speech to zero phase pitch period by pitch period, and the synthesized speech signal is minimized.