JP3583945B2

JP3583945B2 - Audio coding method

Info

Publication number: JP3583945B2
Application number: JP10816199A
Authority: JP
Inventors: 祐介日和▲崎▼; 一則間野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-04-15
Filing date: 1999-04-15
Publication date: 2004-11-04
Anticipated expiration: 2019-04-15
Also published as: JP2000298500A

Abstract

PROBLEM TO BE SOLVED: To improve quantization efficiency at low bit rates. SOLUTION: A linear prediction inverse filter obtains the residual signal of the input speech. A segment of one pitch period is cut out of the residual signal and cyclically shifted so that its correlation with a reference pulse becomes large, giving the PW. The PW is periodized at the pitch period (15), and the output is convolved with an impulse response (16) to obtain a target waveform. A code vector is selected from a PW codebook 17 and given a gain, one pitch period is cut out of it (19), and the output is convolved with the impulse response (22) to obtain a synthesized waveform. The code vector and gain that minimize the squared error between the synthesized waveform and the target waveform are determined.

Description

【０００１】
【発明の属する技術分野】
この発明は、音声の信号系列を少ない情報量でディジタル符号化する高能率音声符号化方法、特に、従来ボコーダと呼ばれる音声分析合成系の領域である２．４ｋｂｉｔ／ｓ以下のビットレートで高品質な音声符号化を実現する符号化方法に関する。
【０００２】
【従来の技術】
この発明に関連する従来技術としては、線形予測ボコーダ、符号励振予測符号化法（ＣＥＬＰ：ＣｏｄｅＥｘｃｉｔｅｄＬｉｎｅａｒＰｒｅｄｉｃｔｉｏｎ）、混合領域符号化法（ＭｉｘｅｄＤｏｍａｉｎＣｏｄｉｎｇ）、代表波形補間符号化法（ＰｒｏｔｏｔｙｐｅＷａｖｅｆｏｒｍＩｎｔｅｒｐｏｌａｔｉｏｎ）がある。
【０００３】
線形予測ボコーダは、４．８ｋｂｉｔ／ｓ以下の低ビットレート領域における音声符号化方法としてこれまで広く用いられ、ＰＡＲＣＯＲ方式や、線スペクトル対（ＬＳＰ）方式などの方式がある。これらの方法の詳細は、たとえば斎藤、中田著「音声情報処理の基礎」（オーム社出版）に記載されている。線形予測ボコーダは、音声のスペクトル包絡特性をあらわす全極型のフィルタと、それを駆動する励振信号によって構成される。励振信号には、有声音に対してはピッチ周期パルス列、無声音に対しては白色雑音が用いられる。線形予測ボコーダにおいて、周期パルス列や白色雑音による励振信号では音声波形の特徴を再現するには不十分なため、自然性の高い合成音声を得ることは困難である。
【０００４】
一方、符号励振予測符号化では、雑音系列を励振信号として音声の近接相関とピッチ相関特性をあらわす２つの全極型フィルタを駆動することにより音声を合成する。雑音系列は複数個の符号パターンとしてあらかじめ用意され、その中から、入力音声波形と合成音声波形との誤差を最小にするコードパターンが選択される。その詳細は、文献Ｓｃｈｒｏｅｄｅｒ：“ＣｏｄｅＥｘｃｉｔｅｄＬｉｎｅａｒＰｒｅｄｉｃｔｉｏｎ（ＣＥＬＰ）ＨｉｇｈＱｕａｌｉｔｙＳｐｅｅｃｈａｔＶｅｒｙＬｏｗＢｉｔＲａｔｅｓ”Ｐｒｏｃ．ＩＥＥＥ．ＩＣＡＳＳＰ，ｐｐ９３７−９４０，１９８５に記載されている。符号励振予測符号化では、再現精度は符号パターンの数に依存する関係にある。したがって、多くの系列パターンを用意すれば音声波形の再現精度が高まり、それにともなって品質を高めることが出来る。しかし、音声符号化のビットレートを４ｋｂｉｔ／ｓ以下にすると、符号パターンの数が制限され、その結果十分な音声品質が得られなくなる。良好な音声品質を得るには４．８ｋｂｉｔ／ｓ程度の情報量が必要であるとされている。
【０００５】
また、混合領域符号化法（ＭｉｘｅｄＤｏｍａｉｎＣｏｄｉｎｇ）では、有声音でフレーム毎に残差波形よりピッチ周期分の波形が抽出され、前のピッチ周期分の波形との差分が時間領域で量子化される。復号器では周波数領域でこれらの波形の線形補間を行うことによって励振信号を生成し、全極フィルタを駆動して音声を合成する。無声音では符号励振予測符号化と同様な方法で符号化を行う。この方式の詳細は、文献ＤｅＭａｒｔｉｎ等“ＭｉｘｅｄＤｏｍａｉｎＣｏｄｉｎｇｏｆＳｐｅｅｃｈａｔ３ｋｂ／ｓ”Ｐｒｏｃ．ＩＥＥＥ．ＩＣＡＳＳＰ，ＰＰＩＩ／２１６−１７０，１９９６に記載されている。この方法の特徴としては、差分を求める際に、前ピッチ周期波形は、現在のフレームの波形に長さが正規化されることが挙げられる。この差分の量子化には、パルス符号帳と雑音符号帳を用いるが、３．５ｋｂｉｔ／ｓ程度の情報量が必要とされている。
【０００６】
また、代表波形補間符号化法（ＰｒｏｔｏｔｙｐｅＷａｖｅｆｏｒｍＩｎｔｅｒｐｏｌａｔｉｏｎＣｏｄｅｒ）では、プロトタイプ波形（ＰｒｏｔｏｔｙｐｅＷａｖｅｆｏｒｍ）の線形補間を行って合成した励振信号で全極フィルタを駆動することにより音声を合成する。この詳細は、文献ＫｌｅｉｊｎＷ．Ｂ．“ＥｎｃｏｄｉｎｇＳｐｅｅｃｈＵｓｉｎｇＰｒｏｔｏｔｙｐｅＷａｖｅｆｏｒｍｓ”ＩＥＥＥＴｒａｎｓ．ｏｎＳｐｅｅｃｈＡｕｄｉｏＰｒｏｃｅｓｓｉｎｇ，Ｖｏｌ．１，ｐｐ３８６−３９９１９９３に記載されている。プロトタイプ波形は、一定周期で残差波形より抽出され、フーリエ変換された後に符号化される。この方式では良好な品質を得るには３．４ｋｂｉｔ／ｓ程度の情報量が必要であるとされている。
【０００７】
雑音系列やピッチパルス列を励振信号として用いる線形予測符号化に関して、より能率的な音声波形の量子化を実現するため特開平１０−２３２６９７「音声符号化方法および復号化方法」を提案した。この提案した方法は入力音声のピッチ周期を推定し、残差信号の周期的な部分で推定されたピッチ周期分の波形を抽出し、そのピッチ周期分の波形との波形歪みが最小になるように符号ベクトルをピッチ長で打ち切ったものより決定する。ここで、入力ピッチ周期波形に、合成フィルタのインパルス応答を畳み込んだベクトルと、符号ベクトルをピッチ長で打ち切ったものに同様に畳み込んだベクトルとの距離計算をすることにより符号を選択する。
【０００８】
この方法によっていまだ必ずしも、十分効率的な符号化が行えるとは云えなかった。特に少ないビット数では量子化効率が悪かった。
【０００９】
【発明が解決しようとする課題】
この発明の課題は、雑音系列やピッチパルス列を励振信号として用いる線形予測符号化方法において、電話音声などのように入力信号の周波数帯域が制限されている場合に、より能率的な符号化を実現する方法を提供することにある。
【００１０】
【課題を解決するための手段】
この発明は特開平１０−２３２６９７に示す符号化方法を前提とし、この発明においては目標残差ベクトルに対し、ピッチ周期で周期化すると共に、合成フィルタ処理を行って目標波形ベクトルを求め、また選択した符号ベクトルに対しピッチ周期で周期化すると共に合成フィルタ処理を行って合成波形ベクトルを求める。
【００１１】
【発明の実施の形態】
実施例１
図１にこの発明の量子化方法を適用した符号化器の機能構成を示す。符号化器は、以下の手順をＮサンプル数の長さをもつフレームごとに１回行う。フレームｉにおいて、入力端子１よりの入力音声信号Ｓ（ｔ）のｐ次の線形予測係数（ＬＰＣ）ａ_ｊ（ｊ＝０，１，…，ｐ−１）を線形予測係数計算部２で計算する。この線形予測係数は線形予測係数量子化部３で量子化され、線形予測係数符号Ｉ_１として送出される。線形予測係数の量子化の詳細については「音声の線形予測パラメータ符号化方法」（特願平３−１８０８１９）に記載されている。線形予測係数量子化部３よりの線形予測係数符号Ｉ_１は復号され、その逆量子化された線形予測係数に基づいて、線形予測逆フィルタ４のフィルタ係数を定め、この逆フィルタ４に入力音声信号Ｓ（ｔ）を通して残差信号ｒ′（ｔ）を得る。逆フィルタ４は次の伝達特性を持つデジタルフィルタＡ（ｚ）で実現される。
【００１２】
Ａ（ｚ）＝１＋ａ_１ｚ^−１＋…＋ａ_ｐｚ^−ｐ（１）
ここで得られた残差信号の相関（偏相関関数）ρを相関計算部５で計算し、その相関ρの最大値ρ_ｍａｘの遅れ（間隔）をピッチ周期抽出部６で推定ピッチ周期ｐ_ｉとする。このとき、周期性判定部７で入力音声信号Ｓ（ｔ）が有声部であるか無声部であるかを、例えば以下の様にしきい値θ（０．５〜１．０）で判別する。
【００１３】
┌ ｋ_１／２＋ρ_ｍａｘ＞θ；有声部
└ ｋ_１／２＋ρ_ｍａｘ＜θ；無声部（２）
ここで、ｋ_１は線形予測係数計算部２で求まる第１次の偏自己相関（ＰＡＲＣＯＲ）係数である。
周期性判定部７が無声部と判断すると、残差信号ｒ′（ｔ）を無声部量子化部８で量子化する。無声部量子化部８では、例えば、フレームをＳ分割し、Ｎ_ｓｕｂ（＝Ｎ／Ｓ）サンプル数をサブフレームとし、そのサブフレーム中の逆フィルタ４よりの残差波形ｒ′（ｔ）の平均パワーを計算し、その平均パワーの１フレーム分をベクトル量子化して無声部符号Ｉ_２として出力する。
【００１４】
周期性判定部７が有声部と判断した場合は残差切り出し部９により推定ピッチ周期ｐ_ｉを用いて、逆フィルタ４からの残差信号ｒ′（ｔ）におけるフレームの中央付近からｐ_ｉの長さの波形を切り出す。
次に、この残差波形とパルス信号との相関が大きくなるまで、残差波形をＰＷ整列部１０で巡回させる。ここで、パルス信号との相関が大きくなるように巡回された１周期波形分の残差波形ｒ_ｐをＰＷ（ピッチ周期波形）と呼ぶ。推定ピッチ周期ｐ_ｉはピッチ周期量子化部１１で四捨五入によって整数値に量子化され、ピッチ周期符号Ｉ_３として出力される。
【００１５】
ＰＷ整列部１０からのＰＷはＰＷ量子化部１２でベクトル量子化される。ＰＷ量子化部１２は例えば図２Ａに示すように、図１中の線形予測係数量子化部３よりの逆量子化された線形予測係数によりフィルタ係数が定められた線形予測合成フィルタ１４にインパルス信号が通されて、インパルス応答ｈ_ｊが求められる。図１中のＰＷ整列部１０よりのＰＷがピッチ周期化フィルタ１５でピッチ周期化行列と掛け合わされ、このフィルタ１５の出力ＰＷに畳み込みフィルタ１６でインパルス応答ｈ_ｊに基づくインパルス応答行列Ｈが畳み込まれて音声信号（目標波形ベクトル）ｘが再生される。一方、ＰＷ符号帳１７から選択された符号ベクトルｃ ^ｉ _０に対して、利得符号帳１８より取出された利得ｇ^ｋ _０が与えられ、この利得が与えられた符号ベクトルを、その先頭から、図１中のピッチ周期抽出部６で抽出されたピッチ周期長だけ符号切り出し部１９で切り出され、この切り出された符号ベクトルに対し、ｘと同様にピッチ周期フィルタ２１でピッチ周期化行列と掛け合わせ、そのフィルタ出力に畳み込みフィルタ２２でインパルス応答行列Ｈが畳み込まれ、合成音声信号（合成波形ベクトル）が得られる。この合成音声信号の再生音声信号ｘに対する誤差が引き算部２３でとられ、その二乗誤差が最小になるように、ＰＷ符号帳１７の符号ベクトルｃ ^ｉ _０の選択と、利得符号帳１８の利得ｇ^ｋ _０の選択とが歪み計算部２４で行われる。
【００１６】
なお、ＰＷ符号帳１７の各符号ベクトルの長さｎは検出されるピッチ長ｐ_iよりも十分長くとる必要がある（ｎ：ｎ＞ｐ_max）。ここで、各符号ベクトルのピークの位相は均一とされている。図１中のＰＷ整列部１０で用いたパルス信号はベクトル長がｎであり、位相はＰＷ符号帳１７の符号ベクトルのピークと一致させてある。
【００１７】
図２Ａで説明したようにＰＷ符号は、符号ベクトルを励振信号として合成した波形（合成波形ベクトル）と、ＰＷを励振信号として合成した波形（目標波形ベクトル）との聴覚重み付け平均二乗誤差が最小になるように決定される。この歪み尺度Ｄの距離計算には以下の（３）式を用いる。
Ｄ＝‖ｘ−ｇ^ｋ _ｏ ＨＰｃ ^ｉ _ｏ‖^２（３）
ここで、ｘはターゲット（ＰＷを励振信号として合成した波形）、Ｈは量子化された線形予測係数ａ′_ｉを用いた合成フィルタ１４のインパルス応答をあらわす行列、Ｐはピッチ周期化を表すベクトル（周期化行列）、ｃ _０は符号ベクトル、ｇ_０は符号ベクトルの利得をあらわす。
【００１８】
ターゲットｘは以下の（４）式を用いて、ＰＷｒ _ｐに、ピッチ周期化フィルタ１５を表わすベクトルＰを掛け合わせたものに、合成フィルタ１６で畳み込み演算を行ったものによりあらかじめ計算する。
ｘ＝ＨＰｒ _ｐ（４）
ここで、ｒ _ｐは量子化前の原ＰＷをベクトル表示にしたものである。
【００１９】
従来のＣＥＬＰ符号化では、Ｈにはフレーム長がＮの場合通常下三角の（Ｎ×Ｎ）の正方行列を用いるが、ここではピッチ長の正方行列（ｐ_ｉ×ｐ_ｉ）を下側に（ｍ−ｐ_ｉ）行分、右側に（ｎ−ｐ_ｉ）列分拡張した（ｍ×ｎ）の非正方行列を用いる。ここで、ｍ＞ｐ_ｉ，ｎ＞ｐ_ｉである。Ｈには、聴覚重み付けを行った線形予測フィルタのインパルス応答ｈ_ｊ（ｊ＝０，１，…）を用いる。
【００２０】
【数１】

このとき、ｈ_ｊ（ｊ＝０，１，…）の計算に用いる線形予測合成フィルタ１４は、以下の伝達特性Ｈ′（ｚ）をもつデジタルフィルタで実現される。

聴覚重み付けの伝達特性は、次のように表される。
【００２１】
Ｗ（ｚ）＝Ａ（ｚ／γ_１）／Ａ（ｚ／γ_２）（７）
ここで、γ_１とγ_２は聴覚重み付けの程度を制御するパラメータであり、０＜γ_２＜γ_１＜１の値を取る。図２Ａ中の畳み込みフィルタ１６，２２での畳み込み演算に用いる行列Ｈはインパルス応答ｈより、先に述べた拡張された（ｍ×ｎ）の行列を用いる。
【００２２】
また、Ｐは以下のような（ｎ×ｐ_ｉ）の行列を用いて表現する。
【００２３】
【数２】

このように行列Ｐは対角要素が１である正方行列を行方向に繰返す（この例ではｎ＝２ｐ_ｉ）ものであるから、このｐと波形を掛算すると波形がｎ／ｐ_ｉだけ繰返されることになる。ピッチ周期化フィルタ１５の出力はＰＷがｎ／ｐ_ｉ回繰返されたものとなり、ピッチ周期化フィルタ１４の出力は切り出された符号ベクトルがｎ／ｐ_ｉ回繰返されたものとなる。
【００２４】
このようにＨとＰによる拡張のため、式（４）の演算で得られるターゲットｘも

と次数はｍとなる。ここでｘ_ｊ（ｐ_ｉ＜ｊ＜ｍ−１）は線形予測合成フィルタ１４の自由応答分に対応する成分で線形予測フィルタ１４の零入力初期値応答である。
【００２５】
なおｈの長さを非常に短い長さで打ち切って（例えば１０サンプル程度）、Ｈを構成すればさほど品質劣化を伴わずに、演算量が低減する方法を用いても良い。このとき、インパルス応答ｈをｔ次で打ち切った時のＨ行列は、以下のような（ｍ×ｎ）の行列となる。
【００２６】
【数３】

なお、逆量子化線形予測係数ａ′_ｊを用いて線形予測逆フィルタ４の係数およびＰＷ量子化部１２の合成フィルタ１４の係数を決めるが、双方に量子化していない線形予測係数ａ_ｊを用いてＨ（ｚ）を構成してもＨ′（ｚ）を用いた時と同程度の品質が得られる。
【００２７】
ＰＷ符号ｃ_０の選択では、ＰＷ符号帳１７の中から（３）式が最小となるように、符号ベクトルｃ ^ｉ _０を選択し、その理想利得ｇ^ｉ _０を計算する。実際にこの選択計算には次の等価な手法を用いる。まず、ＰＷ符号帳１７の全ての符号ベクトルｃ ^ｉ _０について式（１１）を計算し、Ｄ′_０値が最大となる符号ベクトルｃ ^ｉ _０を選択する。ｘ ^Ｔはｘの転置行列を表わす。
【００２８】
Ｄ′_０＝（ｘ ^Ｔ ＨＰｃ ^ｉ _０）^２／‖ＨＰｃ ^ｉ _０‖^２（１１）
選択された符号ベクトルの理想利得ｇ^ｉ _０の計算は、（１２）式を用いて行う。
ｇ^ｉ _０＝ｘ ^Ｔ ＨＰｃ ^ｉ _０／‖ＨＰｃ ^ｉ _０‖^２（１２）
次に、利得ｇ^ｉ _０をスカラー量子化する。これら選択した符号ベクトルの符号、選択した利得の符号をＰＷ符号Ｉ_４として出力し、更に、周期判定部７よりそのフレームが有声部か無声部かを示す周期性符号Ｉ_５を出力する。符号Ｉ_１〜Ｉ_５がマルチプレクサ１３でまとめられ、伝送路又は蓄積部へ出力される。
【００２９】
以上のように１フレームは例えば２５ｍｓｅｃとされ、そのうちから１ピッチ周期分の残差波形（信号）が取り出され、つまり１フレーム中の例えば数分の１の部分しか取り出されていない。一方合成フィルタ１４は入力を零として駆動しても、その直前の状態に応じた出力、いわゆる零入力応答が存在する。そのため、ＣＥＬＰ符号化においては、零入力応答を入力波形から差し引いたものをターゲットとしている。しかしこの発明では１フレーム中の一部のみを用いて符号化するため、合成フィルタ１４のインパルス応答行列をＣＥＬＰ符号化よりも零応答に対応する分拡張して、１ピッチ周期分の波形を零入力応答（自由応答）を含めて、これに近い符号ベクトルの選択を行う。
【００３０】
以上のように波形情報については１フレーム中の１ピッチ周期分しか符号化していないため、それだけ少ないビット数で表現でき、又ピーク位置を正規化（一定位相）としているため、この点に置いても符号化ビット数を少なくすることができる。
ＰＷは残差信号の１ピッチ周期であるから、その波形はインパルス波形に近いものであって、その周波数スペクトルは図６中の破線で示すようにほぼ平坦なものとなる。このＰＷに対してピッチ周期で周期化したものは、図６中の実線で示すようにスペクトルの強弱が繰返されたものとなる。ピッチ周期長で切り出した符号ベクトルをピッチ周期で周期化したものは図６中の実線のような凹凸が繰返されるスペクトラムとなる。従ってこれら周期化されたＰＷと符号ベクトルとを比較して量子化を行うため、パワースペクトルが大きい部分が重み付けされて（重点的に）距離計算がなされ、効率的な量子化が行える。
【００３１】
図２Ａ中の線形予測合成フィルタ１４、畳み込みフィルタ１６，２２を省略して代りに図７Ａに示すように量子化線形予測係数でフィルタ係数が決められる合成フィルタ２５，２６を用いてもよい。また図７Ｂに示すように、ピッチ周期化フィルタ１５のピッチ周期化行列と合成フィルタ２５（又は畳み込みフィルタ１６）のインパルス応答行列Ｈとを掛けた特性をもつフィルタ２７，２８を用いてもよい。ピッチ周期化フィルタ２１と畳み込みフィルタ２２又は合成フィルタ２６も同様に１つのフィルタとして用いることもできる。
【００３２】
次に図１に示した符号化方法の実施例と対応した、この発明の復号化方法の実施例を適用した復号器を図３に示す。この復号器は特開平１０−２３２６９７に示したものと同一である。ここでは、入力端子３１に入力された符号Ｉ_１〜Ｉ_５はデマルチプレクサ３２で全ての符号が分離復号された後、無声・有声（ＰＭ）符号Ｉ_２，Ｉ_４によって励振信号を生成する。周期性符号Ｉ_５が無声部の場合は、無声部符号Ｉ_２を無声部復号部４１で励振信号を再生する。無声部復号部４１では白色雑音生成部よりの白色雑音に、無声部符号Ｉ_２の復号パワー符号を利得計算処理して無声部の合成残差波形を生成する。つまりＮサンプルの白色雑音を生成し、各々のサブフレーム（Ｎ_ｓｕｂ長）中の平均パワーを、復号された対応するサブフレームの平均パワーと一致するように利得を計算して乗じたものを励振信号とする。
【００３３】
周期性符号Ｉ_５が有声部を示す場合は、図３においてＰＷ符号Ｉ_４をＰＷ復号部３３で式（１３）に示すように、符号ベクトルｃ ^ｉに利得ｇ^ｉを乗じて、ＰＷ波形ｒ ^ｉを復号する。
ｒ ^ｉ＝ｇ^ｉ _０ｃ ^ｉ _０（１３）
図に示していないが、図２Ａ中のＰＷ符号帳１７、利得符号帳１８と同一のものを備え、符号ベクトルｃ ^ｉ _１がＰＷ符号帳の符号ベクトルの先頭からピッチ周期長ｐ_ｉだけ符号ベクトル切り出し部３４で切り出される。
【００３４】
次に、この復号ＰＷ波形ｒ ^ｉと前ＰＷバッファ３５の内容ｒ ^ｉ−１との間の線形補間を線形補間部３６で行い、中間のＰＷ波形ｒ ^ｉｎｔｍを得る。この線形補間には、例えば式（１４）を用いる。
ｒ ^ｉｎｔｍ（ｊ）＝（１−α）ｒ ^ｉ−１（ｊ）＋αｒ ^ｉ（ｊ）（１４）
（ｊ＝０，１，…，ｐ−１；０＜α＜１）
ここで、αは、波形がＮサンプル長のフレーム中のどの位置にあるかを表す値、ｐは前後のピッチ（ｐ^ｉもしくはｐ^ｉ−１）の長い方のサンプル数、ｒ ^ｉ−１はｒ ^ｉのひとつ前のＰＷベクトルで、ｒ ^ｉｎｔｍは補間されて出来たベクトルをあらわす。短いピッチ長のベクトルの余りの部分は零詰めされ、長い方とベクトル長を一致させた後に補間を行う。
【００３５】
つまり、復号化側では残差波形は各フレーム中の１ピッチ周期分しか切り出されていない。従って、現フレームで切り出された波形と、前フレームで切り出された波形との間には本来は、１ピッチ周期から数ピッチ周期分の波形が存在する。この本来は存在すべき波形を前フレームの復号ＰＷ波形ｒ ^ｉ−１と現フレームの復号ＰＷ波形ｒ ^ｉとで線形補間する。この補間される波形が、前フレームの切り出された波形と現フレームの切り出された波形との間に補間されるべき波形の何番目かに応じてαが決定される。ピッチ周期符号Ｉ_３はピッチ周期復号部３７で復号され、その復号ピッチ周期とフレーム長とから補間する波形数が決められる。
【００３６】
また、復号ピッチ周期と前ピッチ周期バッファ３８の内容とにより、前フレームの切り出し波形のピッチ周期と、現フレームの切り出し波形のピッチ周期との間の各ピッチ周期をピッチ補間部３９で線形補間して求め、その補間ピッチ周期を用いて、線形補間部３６で求めた中間ＰＷ波形を残差信号合成部３９で順次つなぎ、これを励振信号とする。
【００３７】
ここで、補間の際にＰＷの長さをサンプリング変換により正規化して前後のベクトルを同一の長さ（Ｎサンプル）にして式（１４）と同様に以下の式（１５）に基づいて２つのベクトルの線形補間を行うことも可能である。
Ｓ ^ｉｎｔｍ（ｊ）＝（１−α）Ｓ ^ｉ−１（ｊ）＋αＳ ^ｉ（ｊ）（１５）
（ｊ＝０，１，…，Ｎ；０＜α＜１）
ここで、αは、波形がＮサンプル長のフレーム中のどの位置にあるかを表す値、Ｎは正規化ベクトル長、Ｓ ^ｉ−１はｒ ^ｉのひとつ前のＰＷベクトルを、Ｓ ^ｉはｒ ^ｉのＰＷベクトルを、それぞれ正規化したもので、Ｓ ^ｉｎｔｍは補間されて出来た正規化ベクトルをあらわす。このＳ ^ｉｎｔｍは、サンプリング変換によって上記と同様に補間されたピッチ周期長に直してから順次つながれる。
【００３８】
周期性信号Ｉ_５が無声部を示す時は無声部復号部４１からの合成励振信号を、Ｉ_５が有声部を示す時は残差信号合成部３９からの合成励振信号を用いて線形予測合成フィルタ４２を駆動し、出力音声を出力端子４３に得る。ここで、線形予測係数符号Ｉ_１を線形予測係数復号部４４で復号し、この線形予測係数についても前係数バッファ４５の内容を用いて前フレーム中の１ピッチ周期分の線形予測係数と現フレーム中の１ピッチ周期分の線形予測係数との間を線形予測係数補間部４６で式（１４）と同様に線形補間を行い、合成フィルタ４２の係数を決定する。なお一般的に線形予測係数の補間はＬＳＰ領域で行う。
実施例２
図１中のＰＷ量子化部１２で多段量子化する場合の実施例のＰＷ量子化部を図４に示す。図４において、図２Ａと対応する部分に同一符号をつけてあり、この例は２段量子化の場合で、新たにＰＷ符号帳５１が設けられ、このＰＷ符号帳５１より選択した符号ベクトルｃ ^ｊ _１に対し、利得符号帳５２から選択された利得ｇ^ｋ _１が与えられ、これが符号切り出し部５３でピッチ周期長ｐ_ｉだけ先頭から切り出されてピッチ周期化フィルタ５４、畳み込みフィルタ５５に順次与えられる。これにより利得が与えられた符号ベクトルの切り出されたものに対しピッチ周期化行列Ｐとインパルス応答Ｈが畳み込まれて合成波形が得られ、この合成波形は、引き算部２３よりの誤差信号から引き算部５６で差し引かれ、その残りが歪み計算部５７に与えられ、歪み計算部５７は引き算部５６の出力の二乗が最小になるようにＰＷ符号帳５１の符号ベクトルｃ ^ｊ _１の選択と利得符号帳５２の利得ｇ^ｋ _１の選択とが行われる。この場合も全体として、符号ベクトルを励振信号として合成した波形と、ＰＷ波形を励振信号として合成した波形との聴覚重みつき平均二乗誤差が最小になるように符号ベクトルｃ ^ｉ _０，ｃ ^ｊ _１、利得ｇ^ｋ _０，ｇ^ｋ _１が決定される。この歪み尺度の距離計算には式（１６）を用いる。
【００３９】
Ｄ＝‖ｘ−ｇ^ｋ _０ ＨＰｃ ^ｉ _０−ｇ^ｋ _１ ＨＰｃ ^ｊ _１‖^２（１６）
ここで、ｘは（４）式で求めたターゲット、Ｈは量子化された線形予測係数ａ′_ｉを用いた合成フィルタのインパルス応答をあわらす行列、Ｐはピッチ周期化をあらわす行列、ｃ _０およびｃ _１は符号ベクトル、そしてｇ_０，ｇ_１はそれぞれの符号ベクトルの利得をあらわす。
【００４０】
まず、図４について説明したとおりに１段目のｃ _０とその理想利得ｇ^ｉ _０を定める。次に、ＰＷ符号帳５１の中から、（１６）式が最小となるような符号ベクトルｃ ^ｊ _１を選択し、その理想利得ｇ^ｊ _１を計算し、ｃ ^ｉ _０の理想利得であるｇ^ｉ _０を再計算する。これは、符号ベクトルｃ ^ｉ _０，ｃ ^ｉ _１のベクトル直交化を行い符号化を行う。このベクトル直交化に基づくベクトル量子化の詳細については、「励振信号直交化音声符号化法」（特開平７−２５３７９５）に記載されている。
【００４１】
選択には、以下の式（１７）のＤ′_１値が最大となる符号ベクトルｃ ^ｉ _１を閉ループで選択する。
【００４２】
【数４】

選択された符号ベクトルの理想利得ｇ^ｊ _１の計算は、式（１８）を用いて行う。
【００４３】
【数５】

また、理想利得ｇ^ｉ _０は式（１９）を用いて再計算を行う。
【００４４】
【数６】

以上の手続きで、符号ベクトルの選択は終了しているため、（１６）式が最小となるような（ｇ^ｋ _０，ｇ^ｋ _１）を選択する。
この場合におけるＰＷの復号には以下の（２０）式を用いる。
ｒ ^ｉ＝ｇ^ｉ _０ｃ ^ｉ _０＋ｇ^ｉ _１ｃ ^ｉ _１（２０）
上述において、ＰＷ符号帳には図５に示すような適応符号帳（ａ）、固定（雑音）符号帳（ｂ）、パルス符号帳（ｃ）のいずれを用いることも可能である。適応符号帳（ａ）は過去の残差波形であり、例えば図２Ａ中の符号切り出し部１９の出力が用いられる。パルス符号帳（ｃ）は規則によりその都度パルスを生成することができるものである。
実施例３
図１中のＰＷ量子化器１２として共役構造の符号帳（２つ）を用いて量子化する場合の実施例を図２Ｂにあらわし、図２Ａと対応する部分に同一符号をつけてある。図２Ａと比較すると、ＰＷ符号帳６１が更に設けられる。このＰＷ符号帳６１の各符号ベクトルおよびＰＷ符号帳１７の符号ベクトルは互いに共役構造を持つもの、つまり互いに直交関係にあるものが選択される。ＰＷ符号帳１７，６１から選択された各符号ベクトルは、利得符号帳１８から選択された利得が与えられ、この利得が与えられた両符号ベクトルが加算部６２で加算され、励振信号としてピッチ周期化フィルタ２１、合成フィルタ２２に順次与えられる。この符号ベクトルを励振信号として合成した波形と、ＰＷ波形を励振信号として合成した波形との聴覚重み付き平均二乗誤差が最小になるようにＰＷ符号帳１７，６１の各符号ベクトルとその利得が決定される。この歪み尺度の距離計算には、実施例２と同様に式（１６）を用いる。この共役構造の符号帳を用いる符号化方法の詳細については「多重ベクトル量子化方法およびその装置」（特願昭６３−２４９４５０）に記載されている。
【００４５】
この場合も、符号帳には図５に示すような適応符号帳、固定符号帳、パルス符号帳といったものを用いることが可能である。上述において、複数の符号帳を用いる場合は、図５に示した複数種類の符号帳から、例えば適応符号帳と、固定符号帳というように組み合わせて用いても良い。
多段ベクトル量子化や、共役構造ベクトル量子化に対する図３中のＰＷ復号部３３は、入力符号ベクトル数と対応する符号帳を用意しておき、これら符号帳からそれぞれ入力ＰＷ符号Ｉ_４に応じた符号ベクトルをそれぞれ取り出し、かつそれらに対して、入力ＰＷ符号Ｉ_４中の利得符号により利得符号帳から得た各対応する利得をそれぞれ与えればよい。このようにして復号されたＰＷベクトルを加算して、前フレームの加算ＰＷベクトルと線形補間を行い順次つなぐことによって連続した信号として合成フィルタ４２に供給する。
【００４６】
【発明の効果】
以上説明したように、この発明の音声波形量子化方法によれば、ＰＷ符号ベクトルに対しピッチ周期化を行うために、ピッチの周期化による重み付けされた量子化がなされるため、ピッチ周期波形（ＰＷ）の量子化効率が向上する。また、ピッチ周期化の操作を全て実時間領域で行うことは、周波数領域で行うものよりも低い演算量で実現できる。
【００４７】
この発明の音声波形量子化方法の効果を調べるために、以下の条件で分析合成音声実験を行った。入力音声としては、０−４ｋＨｚ帯域の音声を標本化周波数８．０ｋＨｚで標本化した後に、電話機の特性と対応するＩＲＳ特性フィルタを通したものを用いた。符号化器は実施例２の構成のものを用いた。まず、この信号に２５ｍｓ（２００サンプル）毎に音声信号に分析窓長３０ｍｓのハミング窓を乗じ、分析次数を１２次として自己相関法による線形予測分析を行い、１２個の予測係数を求める。予測係数はＬＳＰパラメータのユークリッド距離を用いてベクトル量子化する。入力音声の状態が有声と判断された場合、得られるＰＷベクトルを２つの雑音符号ｃ ^ｉ _０，ｃ ^ｊ _１を用いてベクトル量子化する。偏自己相関法で求めたピッチは整数値へと四捨五入を用いてスカラー量子化を行い、ピッチ周期化の値として用いる。
【００４８】
上記の条件でピッチ周期化なしで量子化した音声波形と比べて、ピッチ周期化ありで量子化した音声波形の方が対雑音比が２ｄＢ以上も改善された。
【図面の簡単な説明】
【図１】この発明の符号化方法の実施例を適用した符号化器の機能構成例を示すブロック図。
【図２】Ａは図１中のＰＷ量子化部１２の具体的機能構成例を示すブロック図、Ｂは共役構造ベクトル量子化した場合の機能構成例を示すブロック図である。
【図３】この発明により量子化された符号を復号する復号化方法を適用した復号化器の機能構成例を示すブロック図。
【図４】多段ベクトル量子化の場合のＰＷ量子化部１２の具体的機能構成例を示すブロック図。
【図５】この発明に用いる量子化法のため符号帳の例を示す図。
【図６】ピッチ周期化されていないＰＷのパワースペクトルと、ピッチ周期化されたＰＷのパワースペクトルを概念的に示す図。
【図７】ＰＷ量子化部の一部変形を示す図。
[0001]
[Technical field of the invention]
The present invention relates to a high-efficiency speech coding method that digitally encodes a speech signal sequence with a small amount of information. In particular, it relates to a coding method that achieves high-quality speech coding at bit rates of 2.4 kbit/s or less, the range traditionally covered by speech analysis-synthesis systems known as vocoders.
[0002]
[Prior art]
Prior art related to the present invention includes the linear prediction vocoder, code-excited linear prediction (CELP) coding, mixed domain coding, and prototype waveform interpolation coding.
[0003]
The linear prediction vocoder has been widely used for speech coding in the low bit rate region of 4.8 kbit/s or less, and includes schemes such as the PARCOR method and the line spectrum pair (LSP) method. Details are described, for example, in Saito and Nakata, "Basics of Speech Information Processing" (Ohmsha). A linear prediction vocoder consists of an all-pole filter that represents the spectral envelope of speech and an excitation signal that drives it. The excitation is a pitch-period pulse train for voiced sounds and white noise for unvoiced sounds. Because an excitation made of a periodic pulse train or white noise is insufficient to reproduce the characteristics of the speech waveform, it is difficult to obtain highly natural synthesized speech with a linear prediction vocoder.
[0004]
In code-excited linear prediction (CELP) coding, on the other hand, speech is synthesized by using a noise sequence as the excitation signal to drive two all-pole filters that represent the short-term correlation and the pitch correlation of speech. The noise sequences are prepared in advance as a set of code patterns, from which the code pattern that minimizes the error between the input speech waveform and the synthesized speech waveform is selected. Details are given in Schroeder, "Code Excited Linear Prediction (CELP) High Quality Speech at Very Low Bit Rates," Proc. IEEE ICASSP, pp. 937-940, 1985. In CELP coding, reproduction accuracy depends on the number of code patterns: preparing more patterns raises the accuracy of waveform reproduction and hence the quality. However, when the bit rate is set to 4 kbit/s or less, the number of code patterns is limited and sufficient speech quality can no longer be obtained. About 4.8 kbit/s is said to be needed for good speech quality.
[0005]
In mixed domain coding, for voiced sounds a waveform of one pitch period is extracted from the residual waveform in each frame, and its difference from the preceding pitch-period waveform is quantized in the time domain. The decoder generates the excitation signal by linearly interpolating these waveforms in the frequency domain and drives an all-pole filter to synthesize speech. Unvoiced sounds are coded in the same way as in CELP coding. Details are given in De Martin et al., "Mixed Domain Coding of Speech at 3 kb/s," Proc. IEEE ICASSP, pp. II/216-170, 1996. A feature of this method is that, when the difference is computed, the preceding pitch-period waveform is length-normalized to the waveform of the current frame. The difference is quantized with a pulse codebook and a noise codebook, but about 3.5 kbit/s is required.
[0006]
In prototype waveform interpolation coding, speech is synthesized by driving an all-pole filter with an excitation signal built by linear interpolation of prototype waveforms. Details are given in W. B. Kleijn, "Encoding Speech Using Prototype Waveforms," IEEE Trans. on Speech and Audio Processing, Vol. 1, pp. 386-399, 1993. The prototype waveform is extracted from the residual waveform at fixed intervals and coded after a Fourier transform. About 3.4 kbit/s is said to be needed for good quality with this method.
[0007]
For linear predictive coding that uses a noise sequence or a pitch pulse train as the excitation signal, the inventors proposed Japanese Patent Laid-Open No. 10-232697, "Speech Coding Method and Decoding Method," to achieve more efficient quantization of the speech waveform. That method estimates the pitch period of the input speech, extracts a waveform of one estimated pitch period from the periodic portion of the residual signal, and determines the code from code vectors truncated at the pitch length so that the waveform distortion with respect to that pitch-period waveform is minimized. The code is selected by computing the distance between the vector obtained by convolving the input pitch-period waveform with the impulse response of the synthesis filter and the vector obtained by convolving the truncated code vector in the same way.
[0008]
Even this method could not always be said to achieve sufficiently efficient coding. Quantization efficiency was poor especially when few bits were available.
[0009]
[Problems to be solved by the invention]
An object of the present invention is to provide a method that achieves more efficient coding, in a linear predictive coding method using a noise sequence or a pitch pulse train as the excitation signal, when the frequency band of the input signal is limited, as with telephone speech.
[0010]
[Means for Solving the Problems]
The present invention builds on the coding method disclosed in Japanese Patent Laid-Open No. 10-232697. In the present invention, the target residual vector is periodized at the pitch period and passed through synthesis filtering to obtain a target waveform vector, and the selected code vector is likewise periodized at the pitch period and passed through synthesis filtering to obtain a synthesized waveform vector.
[0011]
[Embodiments of the invention]
Example 1
FIG. 1 shows the functional configuration of an encoder to which the quantization method of the present invention is applied. The encoder performs the following procedure once per frame of N samples. In frame i, the linear prediction coefficient calculation unit 2 calculates the p-th order linear prediction coefficients (LPC) a_j (j = 0, 1, ..., p-1) of the input speech signal S(t) from input terminal 1. These coefficients are quantized by the linear prediction coefficient quantization unit 3 and sent out as the linear prediction coefficient code I_1. Details of this quantization are described in "Speech Linear Prediction Parameter Coding Method" (Japanese Patent Application No. 3-180819). The code I_1 from the quantization unit 3 is decoded, the filter coefficients of the linear prediction inverse filter 4 are set from the dequantized linear prediction coefficients, and the input speech signal S(t) is passed through the inverse filter 4 to obtain the residual signal r'(t). The inverse filter 4 is realized as a digital filter A(z) with the following transfer characteristic.
[0012]
A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p}    (1)
The correlation (partial correlation function) ρ of the residual signal obtained here is calculated by the correlation calculation unit 5, and the pitch period extraction unit 6 takes the lag (interval) at which ρ reaches its maximum ρ_max as the estimated pitch period p_i. At this point the periodicity determination unit 7 decides whether the input speech signal S(t) is voiced or unvoiced, for example with a threshold θ (0.5 to 1.0) as follows.
[0013]
┌ k_1/2 + ρ_max > θ : voiced part
└ k_1/2 + ρ_max < θ : unvoiced part    (2)
Here, k_1 is the first-order partial autocorrelation (PARCOR) coefficient obtained by the linear prediction coefficient calculation unit 2.
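The analysis steps up to the voiced/unvoiced decision can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the lag search range, the default threshold of 0.7, and the use of a normalized autocorrelation for ρ are all assumptions made for the example.

```python
def inverse_filter(s, a):
    """Residual r'(t) = s(t) + sum_k a[k-1]*s(t-k), i.e. filtering by A(z) of equation (1)."""
    return [
        s[t] + sum(a[k - 1] * s[t - k] for k in range(1, len(a) + 1) if t - k >= 0)
        for t in range(len(s))
    ]


def normalized_correlation(r, lag):
    """Normalized correlation of r with itself delayed by `lag` (assumed form of rho)."""
    num = sum(r[t] * r[t - lag] for t in range(lag, len(r)))
    e_now = sum(r[t] ** 2 for t in range(lag, len(r)))
    e_past = sum(r[t - lag] ** 2 for t in range(lag, len(r)))
    den = (e_now * e_past) ** 0.5
    return num / den if den > 0 else 0.0


def estimate_pitch(r, min_lag, max_lag):
    """Return (lag, rho_max): the lag with the largest correlation and that value."""
    best = max(range(min_lag, max_lag + 1), key=lambda lag: normalized_correlation(r, lag))
    return best, normalized_correlation(r, best)


def is_voiced(k1, rho_max, theta=0.7):
    """Decision rule (2): voiced when k1/2 + rho_max exceeds the threshold theta."""
    return k1 / 2 + rho_max > theta
```

For a residual that is a clean pulse train with a 40-sample period, `estimate_pitch(residual, 20, 70)` would pick lag 40 with a correlation close to 1, and the frame would be classified as voiced.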
When the periodicity determination unit 7 judges the frame to be unvoiced, the residual signal r'(t) is quantized by the unvoiced part quantization unit 8. The unvoiced part quantization unit 8, for example, divides the frame into S parts, treats each run of N_sub (= N/S) samples as a subframe, calculates the average power of the residual waveform r'(t) from the inverse filter 4 within each subframe, vector-quantizes the average powers of one frame, and outputs the result as the unvoiced part code I_2.
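The per-subframe power measurement described above can be sketched as below. The function name is hypothetical, the frame length is assumed to be divisible by S, and the subsequent vector quantization of the power vector is not shown.

```python
def subframe_powers(residual, num_subframes):
    """Average power of each of the S subframes of one frame of the residual r'(t)."""
    n_sub = len(residual) // num_subframes  # N_sub = N / S
    return [
        sum(v * v for v in residual[k * n_sub:(k + 1) * n_sub]) / n_sub
        for k in range(num_subframes)
    ]
```

The returned list has one entry per subframe and would be the input to the vector quantizer that produces I_2.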
[0014]
When the periodicity determination unit 7 judges the frame to be voiced, the residual cutout unit 9 uses the estimated pitch period p_i to cut out a waveform of length p_i from near the center of the frame in the residual signal r'(t) from the inverse filter 4.
Next, the PW alignment unit 10 cyclically shifts this residual waveform until its correlation with a pulse signal becomes large. The one-period residual waveform r_p, cyclically shifted so that its correlation with the pulse signal is large, is called the PW (pitch-period waveform). The estimated pitch period p_i is quantized to an integer by rounding in the pitch period quantization unit 11 and output as the pitch period code I_3.
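A minimal sketch of the cutout, alignment, and pitch rounding steps follows. The names are hypothetical, and two choices are assumptions of the example: the shift is chosen by exhaustive search for the maximum circular correlation (the text only says the correlation should become large), and the reference pulse is passed already truncated to the pitch length.

```python
import math


def quantize_pitch(p):
    """Round the estimated pitch period half up to an integer number of samples."""
    return int(math.floor(p + 0.5))


def extract_pitch_cycle(residual, pitch):
    """Cut `pitch` samples out of the residual, centered in the frame."""
    start = (len(residual) - pitch) // 2
    return residual[start:start + pitch]


def align_to_pulse(cycle, reference):
    """Cyclically shift `cycle` to maximize its correlation with `reference`."""
    p = len(cycle)

    def correlation(shift):
        return sum(cycle[(j + shift) % p] * reference[j] for j in range(p))

    best_shift = max(range(p), key=correlation)
    return [cycle[(j + best_shift) % p] for j in range(p)]
```

With a unit pulse at index 0 as the reference, the aligned PW simply starts at the largest sample of the cycle.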
[0015]
The PW from the PW alignment unit 10 is vector-quantized by the PW quantization unit 12. In the PW quantization unit 12, as shown for example in FIG. 2A, an impulse signal is passed through the linear prediction synthesis filter 14, whose filter coefficients are set from the dequantized linear prediction coefficients from the linear prediction coefficient quantization unit 3 in FIG. 1, to obtain the impulse response h_j. The PW from the PW alignment unit 10 in FIG. 1 is multiplied by the pitch periodization matrix in the pitch periodization filter 15, and the output of this filter 15 is convolved in the convolution filter 16 with the impulse response matrix H based on h_j, which reproduces a speech signal (target waveform vector) x. Meanwhile, a code vector c_0^i selected from the PW codebook 17 is given a gain g_0^k taken from the gain codebook 18, and the code cutout unit 19 cuts out of this gain-scaled code vector, starting from its head, a segment as long as the pitch period extracted by the pitch period extraction unit 6 in FIG. 1. As with x, the cut-out code vector is multiplied by the pitch periodization matrix in the pitch periodization filter 21, and the filter output is convolved with the impulse response matrix H in the convolution filter 22 to obtain a synthesized speech signal (synthesized waveform vector). The subtraction unit 23 takes the error of this synthesized speech signal with respect to the reproduced speech signal x, and the distortion calculation unit 24 selects the code vector c_0^i of the PW codebook 17 and the gain g_0^k of the gain codebook 18 so that the squared error is minimized.
[0016]
Note that the length n of each code vector in the PW codebook 17 must be sufficiently longer than the detected pitch length p_i (n: n > p_max). The peak phase of all code vectors is made uniform. The pulse signal used in the PW alignment unit 10 in FIG. 1 has vector length n, and its phase is matched to the peak of the code vectors in the PW codebook 17.
[0017]
As described with reference to FIG. 2A, the PW code is determined so as to minimize the perceptually weighted mean squared error between the waveform synthesized with the code vector as the excitation signal (synthesized waveform vector) and the waveform synthesized with the PW as the excitation signal (target waveform vector). The distortion measure D is calculated with equation (3).
D = || x - g_0^k H P c_0^i ||^2    (3)
Here, x is the target (the waveform synthesized with the PW as the excitation signal), H is the matrix representing the impulse response of the synthesis filter 14 using the quantized linear prediction coefficients a'_i, P is the matrix representing pitch periodization (periodization matrix), c_0 is the code vector, and g_0 is the gain of the code vector.
[0018]
The target x is calculated in advance with equation (4): the PW r_p is multiplied by the matrix P representing the pitch periodization filter 15, and the product is convolved in the convolution filter 16.
x = H P r_p    (4)
Here, r_p is the original PW before quantization, written as a vector.
[0019]
In conventional CELP coding, H is normally a lower-triangular (N × N) square matrix for a frame length of N. Here, however, a non-square (m × n) matrix is used, obtained by extending the pitch-length square matrix (p_i × p_i) downward by (m - p_i) rows and to the right by (n - p_i) columns, where m > p_i and n > p_i. H is built from the impulse response h_j (j = 0, 1, ...) of the perceptually weighted linear prediction filter.
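One way to picture the extended matrix is sketched below. It assumes that H keeps the usual CELP convolution layout, with entry (r, c) equal to h_{r-c} and zeros above the diagonal, and simply carries that layout out to m rows and n columns; the original equation image is not available here, so this layout and the function names are assumptions. Perceptual weighting is left out of the impulse response for brevity.

```python
def synthesis_impulse_response(a, length):
    """First `length` samples of the impulse response of 1/A(z), A(z) = 1 + sum a_k z^-k."""
    h = []
    for j in range(length):
        x = 1.0 if j == 0 else 0.0
        h.append(x - sum(a[k - 1] * h[j - k] for k in range(1, len(a) + 1) if j - k >= 0))
    return h


def impulse_response_matrix(h, m, n):
    """(m x n) convolution matrix with H[r][c] = h[r - c], zero outside the support of h."""
    return [
        [h[r - c] if 0 <= r - c < len(h) else 0.0 for c in range(n)]
        for r in range(m)
    ]
```

Rows below the n-th carry only the decaying tail of the response, which is the part the text later calls the free response. Passing a short `h` gives the banded, truncated variant directly.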
[0020]
(Equation 1)

Here, the linear prediction synthesis filter 14 used to calculate h_j (j = 0, 1, ...) is realized by a digital filter with the following transfer characteristic H'(z).

The transfer characteristic of the auditory weighting is expressed as follows.
[0021]
W(z) = A(z/γ_1) / A(z/γ_2)    (7)
Here, γ_1 and γ_2 are parameters that control the degree of perceptual weighting and take values satisfying 0 < γ_2 < γ_1 < 1. The matrix H used for the convolution operations in the convolution filters 16 and 22 in FIG. 2A is the extended (m × n) matrix described above, built from the impulse response h.
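The weighting of equation (7) can be illustrated with a short sketch. It relies on the standard identity that A(z/γ) has coefficients a_k γ^k; the function names are hypothetical, and how W(z) is combined with the synthesis filter to form H'(z) is not reproduced here because that equation is missing from this text.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_k is scaled by gamma**k for k = 1..p."""
    return [ak * gamma ** (k + 1) for k, ak in enumerate(a)]


def weighting_impulse_response(a, gamma1, gamma2, length):
    """Impulse response of W(z) = A(z/gamma1) / A(z/gamma2), truncated to `length`."""
    num = [1.0] + bandwidth_expand(a, gamma1)  # numerator polynomial
    den = [1.0] + bandwidth_expand(a, gamma2)  # denominator polynomial
    w = []
    for t in range(length):
        x = num[t] if t < len(num) else 0.0
        w.append(x - sum(den[k] * w[t - k] for k in range(1, len(den)) if t - k >= 0))
    return w
```

For a first-order example with a_1 = -0.9, γ_1 = 0.9 and γ_2 = 0.4, the numerator becomes 1 - 0.81 z^-1 and the denominator 1 - 0.36 z^-1.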
[0022]
P is expressed as an (n × p_i) matrix of the following form.
[0023]
(Equation 2)

The matrix P thus stacks a square matrix with ones on its diagonal repeatedly in the row direction (n = 2p_i in this example), so multiplying a waveform by P repeats that waveform n/p_i times. The output of the pitch periodization filter 15 is the PW repeated n/p_i times, and the output of the pitch periodization filter 21 is the cut-out code vector repeated n/p_i times.
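The effect of the periodization matrix can be checked with a small sketch. The function names are hypothetical, and allowing n that is not a multiple of the pitch (the last repetition is then cut short) is an assumption of the example.

```python
def periodization_matrix(n, pitch):
    """(n x pitch) matrix of stacked identity blocks: row r selects sample r mod pitch."""
    return [[1.0 if r % pitch == c else 0.0 for c in range(pitch)] for r in range(n)]


def periodize(waveform, n):
    """Multiply a one-pitch-period waveform by P, repeating it out to length n."""
    pitch = len(waveform)
    P = periodization_matrix(n, pitch)
    return [sum(P[r][c] * waveform[c] for c in range(pitch)) for r in range(n)]
```

With a 3-sample waveform and n = 6, the result is the waveform written out twice, matching the n = 2p_i case in the text.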
[0024]
Because of this extension by H and P, the target x obtained by equation (4) also

has order m. Here, x_j (p_i < j < m-1) are the components corresponding to the free response of the linear prediction synthesis filter 14, that is, its zero-input initial-value response.
[0025]
A method may also be used in which h is truncated to a very short length (for example, about 10 samples) when H is constructed, which reduces the amount of computation without much loss of quality. When the impulse response h is truncated at order t, H becomes the following (m × n) matrix.
[0026]
(Equation 3)

The dequantized linear prediction coefficients a'_j are used to set the coefficients of the linear prediction inverse filter 4 and of the synthesis filter 14 in the PW quantization unit 12. However, constructing H(z) from the unquantized linear prediction coefficients a_j for both gives about the same quality as using H'(z).
[0027]
In selecting the PW code c_0, the code vector c_0^i that minimizes equation (3) is selected from the PW codebook 17 and its ideal gain g_0^i is calculated. In practice, the following equivalent procedure is used. First, equation (11) is evaluated for every code vector c_0^i in the PW codebook 17, and the code vector c_0^i with the largest value of D'_0 is selected. x^T denotes the transpose of x.
[0028]
D'_0 = (x^T H P c_0^i)^2 / || H P c_0^i ||^2    (11)
The ideal gain g_0^i of the selected code vector is calculated with equation (12).
g_0^i = x^T H P c_0^i / || H P c_0^i ||^2    (12)
Next, the gain g_0^i is scalar-quantized. The code of the selected code vector and the code of the selected gain are output as the PW code I_4, and the periodicity determination unit 7 outputs the periodicity code I_5, which indicates whether the frame is voiced or unvoiced. The codes I_1 to I_5 are combined by the multiplexer 13 and output to a transmission line or a storage unit.
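The search by equations (11) and (12) can be sketched as follows. The function names are hypothetical, H is assumed to have the convolution layout H[r][c] = h_{r-c}, and the gain is applied after the cutout rather than before it, which is equivalent because every step is linear. Scalar quantization of the gain is not shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def synthesize(code_vector, pitch, h, m, n):
    """Return H P c for the first `pitch` samples of a codebook entry."""
    cut = code_vector[:pitch]                      # code cutout
    periodic = [cut[j % pitch] for j in range(n)]  # P c: pitch periodization
    return [                                       # H (P c): convolution with h
        sum(h[r - c] * periodic[c] for c in range(n) if 0 <= r - c < len(h))
        for r in range(m)
    ]


def search_codebook(x, codebook, pitch, h, n):
    """Pick the entry maximizing D'_0 of equation (11); return (index, ideal gain)."""
    m = len(x)
    best_index, best_score, best_gain = None, -1.0, 0.0
    for i, c in enumerate(codebook):
        y = synthesize(c, pitch, h, m, n)
        energy = dot(y, y)
        if energy == 0.0:
            continue  # an all-zero synthesized vector cannot match anything
        corr = dot(x, y)
        score = corr * corr / energy               # equation (11)
        if score > best_score:
            best_index, best_score, best_gain = i, score, corr / energy  # equation (12)
    return best_index, best_gain
```

Maximizing D'_0 is equivalent to minimizing equation (3) once the gain is set to its ideal value, since the minimum of D over the gain equals ||x||^2 - D'_0.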
[0029]
As described above, one frame is, for example, 25 ms long, and only one pitch period of the residual waveform (signal) is taken from it, that is, only a fraction of the frame. On the other hand, even when the synthesis filter 14 is driven with zero input, it produces an output that depends on its preceding state, the so-called zero-input response. In CELP coding, the target is therefore the input waveform minus the zero-input response. In the present invention, however, only part of a frame is coded, so the impulse response matrix of the synthesis filter 14 is extended beyond that of CELP coding by the amount corresponding to the zero-input response, and a code vector is selected that is close to the one-pitch-period waveform including its zero-input response (free response).
[0030]
As described above, the waveform information is coded for only one pitch period per frame, so it can be represented with correspondingly fewer bits. Because the peak position is normalized (constant phase), the number of coding bits can be reduced in this respect as well.
Since the PW is one pitch period of the residual signal, its waveform is close to an impulse and its frequency spectrum is almost flat, as shown by the broken line in FIG. 6. When this PW is periodized at the pitch period, its spectrum shows alternating strong and weak regions, as shown by the solid line in FIG. 6. A code vector cut out at the pitch period length and periodized at the pitch period likewise has a spectrum with repeated peaks and valleys like the solid line in FIG. 6. Because quantization compares the periodized PW with the periodized code vector, the distance calculation emphasizes the parts where the power spectrum is large, and efficient quantization is possible.
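The spectral effect of periodization can be seen in an idealized example. The sketch below uses a direct DFT on a single unit pulse and on the same pulse repeated at a pitch of 4 samples; the lengths and the function name are chosen only for illustration and do not come from the patent.

```python
import cmath


def power_spectrum(v):
    """Power spectrum |DFT|^2 of a real sequence, computed directly."""
    n = len(v)
    return [
        abs(sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n)
    ]


single_pulse = [1.0] + [0.0] * 15          # one pitch cycle, impulse-like
periodized = [1.0, 0.0, 0.0, 0.0] * 4      # same pulse repeated at pitch 4

flat = power_spectrum(single_pulse)        # flat, like the broken line in FIG. 6
peaked = power_spectrum(periodized)        # energy only at multiples of the pitch frequency
```

In `peaked`, all the energy sits in bins 0, 4, 8 and 12, so a squared-error comparison of periodized waveforms weights those harmonic regions most heavily.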
[0031]
The linear prediction synthesis filter 14 and the convolution filters 16 and 22 in FIG. 2A may be omitted and replaced, as shown in FIG. 7A, by synthesis filters 25 and 26 whose filter coefficients are set from the quantized linear prediction coefficients. Alternatively, as shown in FIG. 7B, filters 27 and 28 may be used whose characteristic is the product of the pitch periodization matrix of the pitch periodization filter 15 and the impulse response matrix H of the synthesis filter 25 (or the convolution filter 16). In the same way, the pitch periodization filter 21 and the convolution filter 22 or the synthesis filter 26 can be combined into a single filter.
[0032]
Next, FIG. 3 shows a decoder to which an embodiment of the decoding method is applied, corresponding to the encoding embodiment described above. This decoder is the same as that shown in JP-A-10-232697. The codes I₁ to I₅ input to the input terminal 31 are separated by the demultiplexer 32 and decoded, and an excitation signal is generated from the unvoiced-part code I₂ or the voiced-part (PW) code I₄. When the periodicity code I₅ indicates an unvoiced part, the unvoiced-part code I₂ is decoded by the unvoiced-part decoding unit 41. The unvoiced-part decoding unit 41 applies the power decoded from the unvoiced-part code I₂ to white noise from a white noise generation unit to generate a synthesized residual waveform of the unvoiced part. That is, white noise of N samples is generated, its average power over each subframe (of length N_sub) is calculated, and it is multiplied by a gain so that this average power matches the decoded average power of the corresponding subframe; the result is used as the excitation signal.
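The power matching performed for unvoiced frames can be sketched as follows. This is a minimal illustration, not the decoder of the specification: the function name `unvoiced_excitation`, the use of Gaussian noise as the white noise source, and the `seed` parameter are assumptions.

```python
import math
import random

def unvoiced_excitation(subframe_powers, subframe_len, seed=0):
    """Scale white noise so each subframe has the decoded average power."""
    rng = random.Random(seed)          # assumed noise source (Gaussian)
    excitation = []
    for target_power in subframe_powers:
        noise = [rng.gauss(0.0, 1.0) for _ in range(subframe_len)]
        power = sum(v * v for v in noise) / subframe_len
        # Gain that maps the measured average power onto the decoded one.
        gain = math.sqrt(target_power / power) if power > 0.0 else 0.0
        excitation.extend(v * gain for v in noise)
    return excitation

# Two subframes of 50 samples with decoded average powers 4.0 and 0.25.
exc = unvoiced_excitation([4.0, 0.25], 50)
```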
[0033]
When the periodicity code I₅ indicates a voiced part, the PW code I₄ is decoded by the PW decoding unit 33: as shown in Expression (13), the code vector $\mathbf{c}_0^{i}$ is multiplied by the gain $g_0^{i}$ to decode the PW waveform $\mathbf{r}^{i}$.
$\mathbf{r}^{i} = g_0^{i}\,\mathbf{c}_0^{i}$                                    (13)
Although not shown in the figure, the PW decoding unit 33 has the same PW codebook 17 and gain codebook 18 as the encoder. The code vector is obtained by the code vector extraction unit 34, which cuts out the pitch period length $p_i$ from the head of a code vector of the PW codebook.
[0034]
Next, the linear interpolation unit 36 performs linear interpolation between this decoded PW waveform $\mathbf{r}^{i}$ and the contents $\mathbf{r}^{i-1}$ of the previous PW buffer 35 to obtain an intermediate PW waveform $\mathbf{r}^{intm}$. For this linear interpolation, for example, equation (14) is used.
$\mathbf{r}^{intm}(j) = (1-\alpha)\,\mathbf{r}^{i-1}(j) + \alpha\,\mathbf{r}^{i}(j)$ (14)
($j = 0, 1, \ldots, p-1$; $0 < \alpha < 1$)
Here, α is a value representing the position of the waveform in the frame of N samples, p is the number of samples of the longer of the preceding and following pitch periods ($p^{i}$ or $p^{i-1}$), $\mathbf{r}^{i-1}$ is the PW vector preceding $\mathbf{r}^{i}$, and $\mathbf{r}^{intm}$ is the interpolated vector. The remainder of the shorter pitch vector is padded with zeros so that its length matches the longer vector before interpolation is performed.
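Equation (14), together with the zero padding of the shorter vector, can be sketched as below. The function name `interpolate_pw` and the sample values are illustrative assumptions.

```python
def interpolate_pw(prev_pw, cur_pw, alpha):
    """Linear interpolation of two PW vectors per equation (14).

    The shorter vector is zero-padded to the longer pitch length first.
    """
    p = max(len(prev_pw), len(cur_pw))
    prev = list(prev_pw) + [0.0] * (p - len(prev_pw))
    cur = list(cur_pw) + [0.0] * (p - len(cur_pw))
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(prev, cur)]

# Previous PW of 2 samples, current PW of 3 samples, midway between them.
mid = interpolate_pw([1.0, 2.0], [3.0, 4.0, 5.0], 0.5)  # [2.0, 3.0, 2.5]
```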
[0035]
That is, on the decoding side, the residual waveform is available for only one pitch period in each frame. Between the waveform cut out in the current frame and the waveform cut out in the previous frame, there should originally be waveforms amounting to one to several pitch periods. These missing waveforms are obtained by linearly interpolating between the decoded PW waveform $\mathbf{r}^{i-1}$ of the previous frame and the decoded PW waveform $\mathbf{r}^{i}$ of the current frame. α is determined according to the ordinal position of the waveform being interpolated between the cut-out waveform of the previous frame and that of the current frame. The pitch period code I₃ is decoded by the pitch period decoding unit 37, and the number of waveforms to be interpolated is determined from the decoded pitch period and the frame length.
[0036]
Further, the pitch interpolator 39 linearly interpolates each pitch cycle between the pitch cycle of the cut-out waveform of the previous frame and the pitch cycle of the cut-out waveform of the current frame based on the decoded pitch cycle and the contents of the previous pitch cycle buffer 38. Using the interpolation pitch period, the intermediate PW waveforms obtained by the linear interpolation unit 36 are sequentially connected by the residual signal synthesis unit 39, and this is used as an excitation signal.
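One possible way to connect interpolated waveforms into a frame of excitation is sketched below. The specification does not give the exact rule for choosing α or for rounding the interpolated pitch, so this sketch assumes α is the relative start position of each cycle within the frame (so the first cycle uses α = 0) and rounds the interpolated pitch to the nearest integer; the function name `synthesize_frame` is hypothetical.

```python
def synthesize_frame(prev_pw, cur_pw, frame_len):
    """Fill one frame by concatenating interpolated pitch cycles (a sketch)."""
    prev_pitch, cur_pitch = len(prev_pw), len(cur_pw)
    if prev_pitch == 0 or cur_pitch == 0:
        raise ValueError("PW vectors must be non-empty")
    longest = max(prev_pitch, cur_pitch)
    prev = list(prev_pw) + [0.0] * (longest - prev_pitch)
    cur = list(cur_pw) + [0.0] * (longest - cur_pitch)
    excitation = []
    while len(excitation) < frame_len:
        # Assumption: alpha is the relative start position of this cycle.
        alpha = len(excitation) / frame_len
        # Linearly interpolated pitch period, rounded to whole samples.
        pitch = round((1.0 - alpha) * prev_pitch + alpha * cur_pitch)
        cycle = [(1.0 - alpha) * a + alpha * b for a, b in zip(prev, cur)]
        excitation.extend(cycle[:pitch])
    return excitation[:frame_len]

frame = synthesize_frame([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], 10)
```

When the previous and current PW are identical, this reduces to simple periodic repetition of the waveform, which is a convenient sanity check.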
[0037]
Alternatively, at the time of interpolation, the PW lengths may be normalized by sampling conversion so that the preceding and following vectors have the same length (N samples), and the two vectors may then be linearly interpolated using the following equation (15) in the same manner as equation (14).
$\mathbf{S}^{intm}(j) = (1-\alpha)\,\mathbf{S}^{i-1}(j) + \alpha\,\mathbf{S}^{i}(j)$ (15)
($j = 0, 1, \ldots, N$; $0 < \alpha < 1$)
Here, α is a value indicating the position of the waveform in the frame of N samples, N is the normalized vector length, $\mathbf{S}^{i-1}$ and $\mathbf{S}^{i}$ are the normalized versions of the PW vector preceding $\mathbf{r}^{i}$ and of $\mathbf{r}^{i}$, respectively, and $\mathbf{S}^{intm}$ is the normalized vector formed by interpolation. $\mathbf{S}^{intm}$ is converted by sampling conversion to the interpolated pitch period length, and the resulting waveforms are sequentially connected in the same manner as described above.
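The length-normalized variant of equation (15) can be sketched as follows. The specification does not state which sampling conversion is used, so this sketch assumes a simple linear-interpolation resampler that preserves the end points; `resample`, `interpolate_normalized`, and all numeric values are hypothetical.

```python
def resample(v, new_len):
    """Linear-interpolation resampling of v to new_len samples (assumed method)."""
    if new_len == 1 or len(v) == 1:
        return [v[0]] * new_len
    scale = (len(v) - 1) / (new_len - 1)
    out = []
    for j in range(new_len):
        pos = j * scale
        lo = min(int(pos), len(v) - 2)   # keep lo + 1 inside the vector
        frac = pos - lo
        out.append((1.0 - frac) * v[lo] + frac * v[lo + 1])
    return out

def interpolate_normalized(prev_pw, cur_pw, alpha, norm_len, out_pitch):
    """Equation (15): interpolate at a common length, then convert back."""
    s_prev = resample(prev_pw, norm_len)
    s_cur = resample(cur_pw, norm_len)
    s_mid = [(1.0 - alpha) * a + alpha * b for a, b in zip(s_prev, s_cur)]
    return resample(s_mid, out_pitch)

result = interpolate_normalized([0.0, 2.0], [2.0, 4.0, 6.0], 0.5, 5, 3)
```

Compared with the zero-padding approach of equation (14), normalizing the lengths first keeps the waveform shape aligned across pitch changes, at the cost of two resampling steps per interpolated cycle.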
[0038]
When the periodicity code I₅ indicates an unvoiced part, the synthesized excitation signal from the unvoiced-part decoding unit 41 is used, and when I₅ indicates a voiced part, the synthesized excitation signal from the residual signal synthesis unit 39 is used, to drive the linear prediction synthesis filter 42, and output speech is obtained at the output terminal 43. The linear prediction coefficient code I₁ is decoded by the linear prediction coefficient decoding unit 44. The linear prediction coefficients are also interpolated: using the contents of the previous coefficient buffer 45, the linear prediction coefficient interpolation unit 46 linearly interpolates between the coefficients for the one pitch period of the previous frame and those for the one pitch period of the current frame, in the same manner as in Expression (14), and the coefficients of the synthesis filter 42 are thereby determined. In general, interpolation of linear prediction coefficients is performed in the LSP domain.
Example 2
FIG. 4 shows an embodiment in which the PW quantization unit 12 in FIG. 1 performs multi-stage quantization. In FIG. 4, portions corresponding to those in FIG. 2A are given the same reference numerals. In this example, two-stage quantization is performed. A PW codebook 51 is newly provided; a code vector $\mathbf{c}_1^{j}$ selected from the PW codebook 51 is given a gain $g_1^{k}$ selected from the gain codebook 52, cut to the pitch period length $p_i$ by the code cutout unit 53, and supplied sequentially to the pitch period filter 54 and the convolution filter 55. In this way, the pitch periodic matrix $\mathbf{P}$ and the impulse response matrix $\mathbf{H}$ are applied to the gain-scaled, cut-out code vector to obtain a synthesized waveform. The subtraction unit 56 subtracts this synthesized waveform from the error signal output by the subtraction unit 23, and the remainder is given to the distortion calculation unit 57, which selects the code vector $\mathbf{c}_1^{j}$ of the PW codebook 51 and the gain $g_1^{k}$ of the gain codebook 52 so that the square of the remainder is minimized. In this case as well, the code vectors $\mathbf{c}_0^{i}$, $\mathbf{c}_1^{j}$ and the gains $g_0^{k}$, $g_1^{k}$ are determined so that, overall, the perceptually weighted mean square error between the waveform synthesized with the code vectors as the excitation signal and the waveform synthesized with the PW waveform as the excitation signal is minimized. Equation (16) is used to calculate this distortion measure.
[0039]
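$D = \left\| \mathbf{x} - g_0^{k}\,\mathbf{H}\mathbf{P}\,\mathbf{c}_0^{i} - g_1^{k}\,\mathbf{H}\mathbf{P}\,\mathbf{c}_1^{j} \right\|^{2}$  (16)
Here, $\mathbf{x}$ is the target obtained by equation (4), $\mathbf{H}$ is the matrix representing the impulse response of the synthesis filter using the quantized linear prediction coefficients $a'_i$, $\mathbf{P}$ is the matrix representing pitch periodicity, $\mathbf{c}_0$ and $\mathbf{c}_1$ are the code vectors, and $g_0$, $g_1$ are the gains of the respective code vectors.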
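The distortion of equation (16) can be evaluated directly once the matrices are available. The following is a minimal sketch with small hand-made matrices; the function names and all numeric values are illustrative assumptions, and no perceptual weighting is applied here.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def distortion(x, h_mat, p_mat, c0, g0, c1, g1):
    """Squared error of equation (16) for one pair of code vectors and gains."""
    y0 = matvec(h_mat, matvec(p_mat, c0))   # H P c0
    y1 = matvec(h_mat, matvec(p_mat, c1))   # H P c1
    return sum((xi - g0 * a - g1 * b) ** 2 for xi, a, b in zip(x, y0, y1))

# Pitch period 2 repeated twice, impulse response [1.0, 0.5] (illustrative).
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0, 0.0, 0.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, 0.5, 1.0, 0.0],
     [0.0, 0.0, 0.5, 1.0]]
c0, c1 = [1.0, 0.0], [0.0, 1.0]
x = [2.0, 4.0, 3.5, 4.0]   # equals 2 * H P c0 + 3 * H P c1

d_exact = distortion(x, H, P, c0, 2.0, c1, 3.0)   # both stages, matching gains
d_first = distortion(x, H, P, c0, 2.0, c1, 0.0)   # first stage only
```

Here `d_first` is the error left after the first stage alone, which is the quantity the second-stage search then reduces.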
[0040]
First, as described above, the code vector $\mathbf{c}_0$ and its ideal gain $g_0^{i}$ are determined. Next, the code vector $\mathbf{c}_1^{j}$ that minimizes expression (16) and its ideal gain $g_1^{j}$ are obtained from the PW codebook 51, and the ideal gain $g_0^{i}$ of $\mathbf{c}_0^{i}$ is recalculated. This amounts to orthogonalizing the code vectors $\mathbf{c}_0^{i}$ and $\mathbf{c}_1^{i}$ before encoding. Details of vector quantization based on vector orthogonalization are described in "Exciting Signal Orthogonalized Speech Coding Method" (Japanese Patent Laid-Open No. 7-253795).
[0041]
For the selection, the code vector $\mathbf{c}_1^{i}$ that maximizes $D'_1$ of the following equation (17) is selected in a closed loop.
[0042]
(Equation 4)

The ideal gain $g_1^{j}$ of the selected code vector is calculated using equation (18).
[0043]
(Equation 5)

The ideal gain $g_0^{i}$ is also recalculated, using equation (19).
[0044]
(Equation 6)

Once the code vectors have been selected by the above procedure, the gain pair ($g_0^{k}$, $g_1^{k}$) that minimizes expression (16) is determined.
The following equation (20) is used for PW decoding in this case.
$\mathbf{r}^{i} = g_0^{i}\,\mathbf{c}_0^{i} + g_1^{i}\,\mathbf{c}_1^{i}$ (20)
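PW decoding by equations (13) and (20) amounts to cutting each selected code vector to the pitch period length and summing the gain-scaled results. A minimal sketch follows; the function name `decode_pw` and the example vectors are hypothetical.

```python
def decode_pw(code_vectors, gains, pitch):
    """Sum of gain-scaled code vectors, each cut to `pitch` samples from its head.

    With one code vector this corresponds to equation (13); with two,
    to equation (20).
    """
    out = [0.0] * pitch
    for vec, gain in zip(code_vectors, gains):
        for j in range(pitch):
            out[j] += gain * vec[j]
    return out

# Two stored code vectors of length 4, decoded pitch period of 3 samples.
pw = decode_pw([[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]], [2.0, 0.5], 3)
```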
In the above description, any of the adaptive codebook (a), the fixed (noise) codebook (b), and the pulse codebook (c) shown in FIG. 5 can be used as the PW codebook. The adaptive codebook (a) holds past residual waveforms; for example, the output of the code cutout unit 19 in FIG. 2A is used. The pulse codebook (c) can generate its pulses each time according to a rule.
Example 3
FIG. 2B shows an embodiment in which the PW quantization unit 12 in FIG. 1 performs quantization using (two) codebooks having a conjugate structure; portions corresponding to those in FIG. 2A are denoted by the same reference numerals. In addition to the configuration of FIG. 2A, a PW codebook 61 is provided. The code vectors of the PW codebook 61 and those of the PW codebook 17 have a conjugate structure with each other, that is, code vectors that are orthogonal to each other are selected. Each of the code vectors selected from the PW codebooks 17 and 61 is given a gain selected from the gain codebook 18, the two gain-scaled code vectors are added by the adder 62, and the sum is supplied as an excitation signal sequentially to the pitch period filter 21 and the convolution filter 22. The code vectors of the PW codebooks 17 and 61 and their gains are determined so that the perceptually weighted mean square error between the waveform synthesized with the code vectors as the excitation signal and the waveform synthesized with the PW waveform as the excitation signal is minimized. Equation (16) is used for the distance calculation of the distortion measure, as in the second embodiment. Details of the encoding method using codebooks having a conjugate structure are described in "Multi-vector quantization method and apparatus" (Japanese Patent Application No. 63-249450).
[0045]
In this case as well, an adaptive codebook, a fixed codebook, a pulse codebook, or the like as shown in FIG. 5 can be used as the codebook. When a plurality of codebooks are used in the above description, a combination drawn from these types of codebooks, for example an adaptive codebook and a fixed codebook, may be used.
For multi-stage vector quantization and conjugate-structure vector quantization, the PW decoding unit 33 in FIG. 3 is provided with as many codebooks as there are code vectors. A code vector is read from each of these codebooks according to the input PW code I₄, and each is given the corresponding gain obtained from the gain codebook according to the gain code contained in the input PW code I₄. The PW vectors thus decoded are added, the sum is linearly interpolated with the summed PW vector of the previous frame, and the results are connected in sequence and supplied to the synthesis filter 42 as a continuous signal.
[0046]
【Effect of the Invention】
As described above, according to the speech waveform quantization method of the present invention, the PW code vector is pitch-periodized, so that quantization weighted by the pitch periodization is performed and the quantization efficiency of the prototype waveform (PW) is improved. In addition, because all pitch periodization operations are performed in the time domain, the method can be realized with a smaller amount of calculation than when they are performed in the frequency domain.
[0047]
To examine the effect of the speech waveform quantization method of the present invention, an analysis-synthesis speech experiment was performed under the following conditions. The input was speech in the 0-4 kHz band, sampled at a sampling frequency of 8.0 kHz and then passed through an IRS characteristic filter corresponding to telephone characteristics. The encoder having the configuration of the second embodiment was used. First, every 25 ms (200 samples), the speech signal is multiplied by a Hamming window with an analysis window length of 30 ms, and linear prediction analysis of order 12 is performed by the autocorrelation method to obtain 12 prediction coefficients. The prediction coefficients are vector-quantized using the Euclidean distance between LSP parameters. When the input speech is judged to be voiced, the obtained PW vector is vector-quantized using $\mathbf{c}_0^{i}$ and $\mathbf{c}_1^{j}$. The pitch obtained by the partial autocorrelation method is scalar-quantized by rounding to an integer value and used as the pitch period.
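The linear prediction analysis step of the experiment (Hamming window, autocorrelation method, order 12) can be sketched with the standard Levinson-Durbin recursion. This is a generic textbook formulation, not code from the specification; the function names and the synthetic test frame are assumptions.

```python
import math

def hamming(n):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1)) for t in range(n)]

def autocorrelation(x, max_lag):
    return [sum(x[t] * x[t + lag] for t in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor a[1..order] with x[t] ~ sum_k a[k] * x[t-k], plus error energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        err *= 1.0 - k_m * k_m
    return a[1:], err

def lpc_frame(frame, order=12):
    """Window one analysis frame and return its prediction coefficients."""
    window = hamming(len(frame))
    windowed = [s * w for s, w in zip(frame, window)]
    return levinson_durbin(autocorrelation(windowed, order), order)

# 30 ms at 8 kHz = 240 samples; a deterministic noise-like frame (illustrative).
frame = [((t * 7919) % 101) / 50.0 - 1.0 for t in range(240)]
coeffs, err = lpc_frame(frame)
```

In the experiment described above the window would advance by 200 samples (25 ms) per frame, so consecutive 240-sample windows overlap by 40 samples.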
[0048]
Under the above conditions, the speech waveform quantized with pitch periodization showed a signal-to-noise ratio improved by 2 dB or more compared with the speech waveform quantized without pitch periodization.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration example of an encoder to which an embodiment of an encoding method according to the present invention is applied.
FIG. 2A is a block diagram illustrating a specific functional configuration example of the PW quantization unit 12 in FIG. 1, and FIG. 2B is a block diagram illustrating a functional configuration example when conjugate structure vector quantization is performed.
FIG. 3 is a block diagram showing a functional configuration example of a decoder to which a decoding method for decoding a code quantized according to the present invention is applied.
FIG. 4 is a block diagram showing a specific functional configuration example of a PW quantization unit 12 in the case of multi-stage vector quantization.
FIG. 5 is a diagram showing an example of a codebook for a quantization method used in the present invention.
FIG. 6 is a diagram conceptually showing a power spectrum of a PW that is not pitch-periodized and a power spectrum of a PW that is pitch-periodicized.
FIG. 7 is a diagram showing a partial modification of a PW quantization unit.

Claims

A speech encoding method in which a speech signal is subjected to linear prediction analysis for each frame longer than its pitch period, and the characteristics of the speech are expressed by the linear prediction coefficients obtained by the analysis and an excitation signal for driving a linear prediction synthesis filter whose filter coefficients are based on the linear prediction coefficients, wherein
linear prediction inverse filtering is applied to the speech signal to obtain a residual signal,
the pitch period of the speech signal is extracted,
for each frame, it is determined whether the speech signal is a voiced or unvoiced section,
if the frame is a voiced section, a residual signal vector of pitch period length is extracted from the residual signal,
a target residual vector is obtained by cyclically shifting the residual signal vector so that its correlation with a predetermined reference signal vector is increased,
the target residual vector is passed through the synthesis filter to obtain a target waveform vector, and each of a plurality of predetermined code vectors, truncated to the pitch period length, is used as an excitation signal to drive the synthesis filter to obtain a synthesized waveform vector, and
the code vector that minimizes the waveform distortion of the synthesized waveform vector with respect to the target waveform vector is selected to determine a quantization code,
the method being characterized in that the target residual vector is subjected to periodization processing with the pitch period and then passed through the synthesis filter to obtain the target waveform vector, and
the code vector is subjected to periodization processing with the pitch period and then used as the excitation signal to drive the synthesis filter, thereby obtaining the synthesized waveform vector.

The target residual vector is multiplied by a pitch periodic matrix, and the multiplication result is convolved with a non-square matrix, obtained by extending downward a lower triangular square matrix based on the impulse response of the synthesis filter, to generate the target waveform vector,
the selected code vector is multiplied by the pitch periodic matrix, and the multiplication result is convolved with the non-square matrix to generate the synthesized waveform vector,
2. The speech encoding method according to claim 1, wherein the number of rows of the pitch periodic matrix is equal to the number of rows of the non-square matrix.

A speech encoding method in which a speech signal is subjected to linear prediction analysis for each frame longer than its pitch period, and the characteristics of the speech are expressed by the linear prediction coefficients obtained by the analysis and an excitation signal for driving a linear prediction synthesis filter whose filter coefficients are based on the linear prediction coefficients, wherein
linear prediction inverse filtering is applied to the speech signal to obtain a residual signal,
the pitch period of the speech signal is extracted,
for each frame, it is determined whether the speech signal is a voiced or unvoiced section,
if the frame is a voiced section, a residual signal vector of pitch period length is extracted from the residual signal,
a target residual vector is obtained by cyclically shifting the residual signal vector so that its correlation with a predetermined reference signal vector is increased,
the target residual vector is passed through the synthesis filter to obtain a target waveform vector, and each of a plurality of predetermined code vectors, truncated to the pitch period length, is used as an excitation signal to drive the synthesis filter to obtain a synthesized waveform vector, and
the code vector that minimizes the waveform distortion of the synthesized waveform vector with respect to the target waveform vector is selected to determine a quantization code,
the method being characterized in that the target residual vector is subjected to periodization processing with the pitch period and synthesis filter processing simultaneously to obtain the target waveform vector, and
the code vector is subjected to the periodization processing with the pitch period and the synthesis filter processing simultaneously to obtain the synthesized waveform vector.

A non-square matrix, obtained by extending downward a lower triangular square matrix based on the impulse response of the synthesis filter so as to obtain the free response of the filter, is multiplied by a pitch periodic matrix whose number of rows equals the number of rows of that matrix,
the product is convolved with the target residual vector to generate the target waveform vector,
4. The speech encoding method according to claim 3, wherein the product obtained by multiplying the non-square matrix by the pitch periodic matrix is convolved with the selected code vector to generate the synthesized waveform vector.

5. The speech encoding method according to claim 2, wherein an impulse response of a synthesis filter having filter coefficients based on the linear prediction coefficients is obtained, and the non-square matrix is created from the impulse response truncated to a small number of samples.

Each of a plurality of second code vectors prepared in advance is truncated to the pitch period length and subjected to the periodization processing with the pitch period and the synthesis filter processing to obtain a second synthesized waveform vector,
and the quantization code is determined by also selecting the second code vector that minimizes the waveform distortion between the second synthesized waveform vector and the error vector of the synthesized waveform vector with respect to the target waveform vector. A speech encoding method according to any one of claims 1 to 5.

6. The speech encoding method, wherein the excitation signal for obtaining the synthesized waveform vector is a weighted linear sum of code vectors respectively selected from a plurality of codebooks having a conjugate structure.