JPH11504733A

JPH11504733A - Multi-stage speech coder by transform coding of prediction residual signal with quantization by auditory model

Info

Publication number: JPH11504733A
Application number: JP9530382A
Authority: JP
Inventors: Juin-Hwey Chen (ジュインウェイチン)
Original assignee: AT&T Corp
Current assignee: AT&T Corp
Priority date: 1996-02-26
Filing date: 1997-02-26
Publication date: 1999-04-27
Also published as: EP0954851A1; WO1997031367A1; EP0954851A4; MX9708203A; CA2219358A1

Abstract

(57) [Abstract] A speech compression system called "Transform Predictive Coding", or TPC, codes 7 kHz wideband speech sampled at 16 kHz at target bit rates of 16 or 32 kb/s, that is, 1 or 2 bits per sample. The system uses short-term and long-term prediction to remove redundancy. The prediction residual is transformed and coded in the frequency domain, as shown in the figure by (110), which accepts time-domain data from (60) and parameter input from (100), so that the spectrum is shaped for auditory perception. The TPC coder uses only open-loop quantization, as indicated by (70), and therefore has low complexity. Speech quality is transparent at 32 kb/s, very good at 24 kb/s, and acceptable at 16 kb/s.

Description

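[Detailed Description of the Invention]

Multi-stage speech coder by transform coding of prediction residual signal with quantization by auditory model

Field of the Invention

The present invention relates to the compression (coding) of audio signals, for example speech signals, using a predictive coding system.

Background of the Invention

As taught in the signal compression literature, speech and music waveforms are coded by very different techniques. Speech coding, such as coding of telephone-bandwidth (3.4 kHz) speech at 16 kb/s or below, has been dominated by time-domain predictive coders. These coders use a speech production model to predict the speech waveform to be coded. The predicted waveform is then subtracted from the actual (original) waveform to be coded, which reduces the redundancy in the original signal. Reduced signal redundancy yields coding gain. Examples of such predictive speech coders include adaptive predictive coding, multi-pulse linear predictive coding, and code-excited linear prediction (CELP) coding, all well known in the art of speech signal compression.

Wideband (0 to 20 kHz) music coding at 64 kb/s or above, on the other hand, has been dominated by frequency-domain transform or subband coders. These music coders differ fundamentally from the speech coders described above. The difference stems from the fact that music sources, unlike speech sources, vary too much to be readily predicted. Consequently, models of music sources are generally not used in music coding. Instead, music coders use elaborate models of human hearing to code only the perceptually relevant parts of the signal. That is, unlike speech coders, which generally use a speech production model, music coders use a hearing (sound reception) model to obtain coding gain.

In music coders, the hearing model is used to determine the noise masking capability of the music to be coded. The term "noise masking capability" refers to how much quantization noise can be introduced into a music signal without a listener noticing it. This noise masking capability is then used to set the quantizer resolution (for example, the quantizer step size). In general, the more "tone-like" the music, the poorer it is at masking quantization noise, and so the smaller the required step size, and vice versa. A smaller step size corresponds to a smaller coding gain, and vice versa. Examples of such music coders include AT&T's Perceptual Audio Coder (PAC) and the ISO MPEG audio coding standard.

Between telephone-bandwidth speech coding and wideband music coding lies wideband speech coding, in which the speech signal is sampled at 16 kHz and has a bandwidth of 7 kHz. The advantage of 7 kHz wideband speech is that the resulting speech quality is far better than that of telephone-bandwidth speech, while the bit rate required for coding is far lower than for a 20 kHz audio signal. Of the wideband speech coders proposed previously, some use time-domain predictive coding, some use frequency-domain transform or subband coding, and some use a mixture of time-domain and frequency-domain techniques.

The inclusion of perceptual criteria in predictive speech coding, wideband or otherwise, has been limited to the use of a perceptual weighting filter when selecting the best synthesized speech signal from among several candidates. See, for example, Atal et al., U.S. Patent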
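No. Re. 32,580. Such filters achieve a type of noise shaping that is useful for reducing noise in the coding process. One known coder attempts to improve on this technique by using a perceptual model in forming the perceptual weighting filter.

Summary of the Invention

Despite the efforts described above, no known speech or audio coder uses both a hearing model, to set quantizer resolution according to an analysis of the signal's noise masking capability, and a speech production model, for signal prediction. The present invention, by contrast, combines a predictive coding system with a quantization process that quantizes a signal based on a noise masking signal determined by a model of human auditory sensitivity to noise. The output of the predictive coding system is thus quantized by a quantizer whose resolution (for example, the step size of a uniform scalar quantizer, or the number of bits used to identify a vector in a vector quantizer) is a function of the noise masking signal determined according to the audio perception model.

According to the invention, a signal is generated that represents an estimate (or prediction) of a signal representing speech information. The term "original signal representing speech information" is broad enough to cover not only speech itself but also the speech signal derivatives commonly found in speech coding systems (for example, linear prediction and pitch prediction residual signals). The estimate signal is compared with the original signal to form a signal representing the difference between them. This difference signal is then quantized according to a perceptual noise masking signal generated by a model of human audio perception.

An embodiment of the invention called "Transform Predictive Coding", or TPC, codes 7 kHz wideband speech at target bit rates of 16 to 32 kb/s. As its name suggests, TPC combines transform coding and predictive coding techniques in a single coder. More specifically, the coder uses linear prediction to remove redundancy from the input speech waveform and then uses transform coding techniques to code the resulting prediction residual. The transformed prediction residual is quantized based on knowledge of human auditory perception, expressed as an auditory perception model, so as to code what is audible and discard what is inaudible.

One important feature of the embodiment concerns how the perceptual noise masking capability of the signal (for example, the perceptual threshold of "just noticeable distortion") is determined and how bits are subsequently allocated. Rather than determining the perceptual threshold from the unquantized input signal, as conventional music coders do, the embodiment determines the noise masking threshold and bit allocation from the frequency response of a quantized synthesis filter (in this embodiment, the quantized LPC synthesis filter). This feature gives the system the advantage that the coder need not transmit a bit allocation signal to the decoder in order for the decoder to replicate the perceptual threshold and bit allocation processing needed to decode the received coded wideband speech information. Instead, the synthesis filter coefficients, which are being transmitted for other purposes, are exploited to save bit rate.

Another important feature of the embodiment concerns how the TPC coder allocates bits among coder frequencies and how the decoder generates a quantized output signal based on the allocated bits. In some situations the TPC coder allocates bits to only part of the audio band (for example, only to coefficients between 0 and 4 kHz). No bits are allocated to represent coefficients between 4 kHz and 7 kHz, and so the decoder obtains no coefficients in that frequency range. This situation arises when the TPC coder must operate at a very low bit rate, for example 16 kb/s. Even though it has no bits representing the coded signal in the 4 kHz to 7 kHz range, the decoder must still synthesize a signal in that range if it is to provide a wideband response. According to this feature, the decoder generates (that is, synthesizes) coefficient signals in this range from other available information, namely the ratio of an estimate of the signal spectrum (obtained from the LPC parameters) to the noise masking threshold at frequencies in the range. This technique lets the decoder provide a wideband response without speech signal coefficients having to be transmitted for the whole band.

Potential applications of a wideband speech coder include ISDN video conferencing or audio conferencing, multimedia audio, "hi-fi" telephony, and, using modems at 28.8 kb/s or higher over dial-up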
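lines, simultaneous voice and data (SVD).

Brief Description of the Drawings

FIG. 1 shows an encoder embodiment of the invention.
FIG. 2 shows a decoder embodiment of the invention.
FIG. 3 shows a detailed block diagram of the LPC parameter processor of FIG. 1.

Detailed Description

A. Overview of the Embodiment

For clarity of explanation, the illustrated embodiment is presented as comprising individual functional blocks (including blocks labeled "processors"). The functions these blocks represent may be provided through shared or dedicated hardware, including but not limited to hardware capable of executing software. For example, the functions of the processors shown in FIGS. 1-4 may be provided by a single shared processor. (Use of the term "processor" should not be construed as referring exclusively to hardware capable of executing software.)

The illustrated embodiment may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, read-only memory (ROM) for storing software that performs the operations discussed below, and random access memory (RAM) for storing DSP results. Very large scale integration (VLSI) hardware embodiments, as well as custom VLSI circuitry in combination with a general-purpose DSP circuit, may also be provided.

According to the invention, the sequence of digital input speech samples is partitioned into consecutive 20 ms blocks called frames, and each frame is further divided into five equal subframes of 4 ms each. Assuming a sampling rate of 16 kHz, as is usual for wideband speech, this corresponds to a frame size of 320 samples and a subframe size of 64 samples. The TPC speech coder buffers and processes the input speech frame by frame, and within each frame certain coding operations are performed subframe by subframe.

FIG. 1 shows an illustrative embodiment of the TPC speech coder. Once per 20 ms frame, LPC parameter processor 10 derives line spectrum pair (LSP) parameters from the input speech signal s, quantizes them, interpolates them for each 4 ms subframe, and then converts them to the LPC predictor coefficient array a for each subframe. Short-term redundancy is removed from the input speech signal s by LPC prediction error filter 20. The resulting LPC prediction residual signal d still has some long-term redundancy owing to pitch periodicity in voiced speech.

Shaping filter coefficient processor 30 derives the shaping filter coefficients awc from the quantized LPC filter coefficients a. Shaping filter 40 filters the LPC prediction residual d to produce the perceptually weighted speech signal sw. Zero-input response processor 50 computes the zero-input response zir of the shaping filter. Subtraction unit 60 subtracts zir from sw to obtain tp, the target signal for pitch prediction.

Open-loop pitch extractor and interpolator 70 uses the LPC prediction residual d to extract a pitch period for each 20 ms frame and then computes an interpolated pitch period kpi for each 4 ms subframe. Closed-loop pitch tap quantizer and pitch predictor 80 uses kpi to select a set of three pitch predictor taps from a codebook of candidate tap sets. The selection is made so that, when the previously quantized LPC residual signal dt is filtered by the corresponding 3-tap pitch synthesis filter and then by the shaping filter with zero initial memory, the output signal hd is closest to the target signal tp in the mean-squared error (MSE) sense. Subtraction unit 90 subtracts hd from tp to obtain tt, the target signal for transform coding.

Shaping filter magnitude response processor 100 computes the signal mag, the magnitude of the frequency response of the shaping filter. Transform processor 110 performs a linear transform, such as a fast Fourier transform (FFT), on tt. It then normalizes the transform coefficients using mag and quantized versions of gain values computed over three different frequency bands. The result is the normalized transform coefficient signal tc. Transform coefficient quantizer 120 then quantizes tc using the adaptive bit allocation signal ba, which hearing model quantizer control processor 130 determines according to the time-varying perceptual importance of the transform coefficients at different frequencies.

At low bit rates such as 16 kb/s, processor 130 allocates bits only to the lower half of the frequency band (0 to 4 kHz). In that case, high-frequency synthesis processor 140 synthesizes the transform coefficients in the high band (4 to 8 kHz) and combines them with the quantized low-frequency transform coefficient signal dtc to produce the final quantized full-band transform coefficient signal qtc. At higher bit rates such as 24 or 32 kb/s, every transform coefficient in the whole band is allowed to receive bits in the adaptive bit allocation, although some may receive none for lack of available bits. In that case, processor 140 simply detects the frequencies in the 4 to 8 kHz band that receive no bits and fills these "spectral holes" with low-level noise to avoid the "swirling" type of distortion typical of adaptive transform coders.

Inverse transform processor 150 takes qtc and applies the inverse of the linear transform used in transform processor 110 (an inverse FFT in this particular embodiment). The result is the time-domain signal qtt, a quantized version of the transform coding target tt. Inverse shaping filter 160 then filters qtt to obtain the quantized excitation signal et. Adder 170 adds et to the signal dh produced by the pitch predictor in block 80 (dh is the pitch-predicted version of the LPC prediction residual d). The resulting signal dt is a quantized version of d. It is used to update the filter memory of the shaping filter inside zero-input response processor 50 and the memory of the pitch predictor in block 80. This completes the signal loop.

The codebook indices representing the LPC predictor parameters (IL), pitch predictor parameters (IP and IT), transform gain levels (IG), and quantized transform coefficients (IC) are multiplexed into one bit stream by multiplexer 180 and transmitted over a channel to the decoder. The channel may be any suitable communication channel, including wireless channels, computer and data networks, and telephone networks, and may also include storage such as solid-state memory (for example, semiconductor memory), optical memory systems (for example, CD-ROM), and magnetic memory (for example, disk memory).

FIG. 2 shows a TPC speech decoder embodiment. Demultiplexer 200 separates the codebook indices IL, IP, IT, IG, and IC. Pitch decoder and interpolator 205 decodes IP and computes the interpolated pitch period kpi. Pitch tap decoder and pitch predictor 210 decodes IT to obtain the pitch predictor tap array b and computes the signal dh, the pitch-predicted version of the LPC prediction residual d. LPC parameter decoder and interpolator 215 decodes IL and then computes the interpolated LPC filter coefficient array a. Blocks 220 through 255 perform exactly the same operations as their counterparts in FIG. 1 to produce the quantized LPC residual signal dt. Long-term postfilter 260 enhances the pitch periodicity in dt and outputs the filtered version fdt. This signal is passed through LPC synthesis filter 265, and the resulting signal st is further filtered by short-term postfilter 270, which produces the final filtered output speech signal fst.

To keep complexity low, TPC uses open-loop quantization wherever possible. Open-loop quantization means that the quantizer tries to minimize the difference between an unquantized parameter and its quantized version, without regard to the effect on output speech quality. This contrasts with CELP coders, in which the pitch predictor, gain, and excitation are usually closed-loop quantized. In closed-loop quantization of a coder parameter, the quantizer codebook search tries to minimize the distortion in the final reconstructed output speech. This generally gives better output speech quality, but at the price of higher codebook search complexity.

In the present invention, the TPC coder uses closed-loop quantization only for the three pitch predictor taps. The quantization operations leading to the quantized excitation signal et are basically similar to open-loop quantization, but their effect on the output speech is close to that of closed-loop quantization. The approach is similar in spirit to that used in the TCX coder by Lefebvre et al., "High quality coding of wideband audio signals using transform coded excitation (TCX)", Proc. IEEE International Conf.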
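Acoustics, Speech, Signal Processing, 1994, pp. I-193 to I-196. Features of the present invention not found in the TCX coder include, for example, normalization of the transform coefficients by the shaping filter magnitude response, adaptive bit allocation controlled by a hearing model, and the high-frequency synthesis and noise fill procedures.

B. Encoder Embodiment

1. LPC Analysis and Prediction

A detailed block diagram of LPC parameter processor 10 is shown in FIG. 3. Processor 10 comprises windowing and autocorrelation processor 310; spectral smoothing and white noise correction processor 315; Levinson-Durbin recursion processor 320; bandwidth expansion processor 325; LPC-to-LSP conversion processor 330 and LPC power spectrum processor 335; LSP quantizer 340; LSP sorting processor 345; LSP interpolation processor 350; and LSP-to-LPC conversion processor 355.

Windowing and autocorrelation processor 310 begins the LPC coefficient generation process. Once every 20 ms it generates, in conventional fashion, the autocorrelation coefficients r from which the LPC coefficients are subsequently computed, as discussed below. See Rabiner, L. R. et al., Digital Processing of Speech Signals, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978 (Rabiner et al.). The LPC frame size is 20 ms (320 speech samples at the 16 kHz sampling rate). Each 20 ms frame is further divided into five subframes of 4 ms (64 samples) each. The LPC analysis processor uses, in conventional fashion, a 24 ms Hamming window centered on the last 4 ms subframe of the current frame.

Several conventional signal conditioning techniques are used to alleviate potential ill-conditioning. A spectral smoothing technique (SST) and a white noise correction technique are applied by spectral smoothing and white noise correction processor 315 before LPC analysis. The SST, well known in the art (Tohkura, Y. et al., "Spectral smoothing technique in PARCOR speech analysis-synthesis", IEEE Trans. Acoust., Speech, Signal Processing, ASSP-26:589-596, December 1978 (Tohkura et al.)), multiplies the computed autocorrelation coefficient array (from processor 310) by a Gaussian window whose Fourier transform corresponds to the probability density function (pdf) of a Gaussian distribution with a standard deviation of 40 Hz. The white noise correction, also conventional (Chen, J.-H., "A robust low-delay CELP speech coder at 16 kbit/s", Proc. IEEE Global Comm. Conf., pp. 1237-1241, Dallas, TX, November 1989), increases the zero-lag autocorrelation coefficient (that is, the energy term) by 0.001%.

The coefficients produced by processor 315 are then provided to Levinson-Durbin recursion processor 320, which generates in conventional fashion the 16 LPC coefficients α_i, i = 1, 2, ..., 16 (the order of LPC prediction error filter 20 is 16). Bandwidth expansion processor 325 multiplies each α_i by the factor g^i, where g = 0.994, for further signal conditioning. This corresponds to a bandwidth expansion of 30 Hz (Tohkura et al.).

After this bandwidth expansion, the LPC predictor coefficients are converted to line spectrum pair (LSP) coefficients by LPC-to-LSP conversion processor 330 in conventional fashion. See Soong, F. K. et al., "Line spectrum pair (LSP) and speech data compression", Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1.10.1-1.10.4, March 1984 (Soong et al.), which is incorporated by reference as if fully set forth herein.

LSP quantizer 340 then applies vector quantization (VQ) to the resulting LSP coefficients. The particular VQ technique used by processor 340 is similar to the split VQ proposed in Paliwal, K. K. et al., "Efficient vector quantization of LPC parameters at 24 bits/frame", Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 661-664, Toronto, Canada, May 1991 (Paliwal et al.), which is incorporated by reference as if fully set forth herein. The 16-dimensional LSP vector is split into seven smaller vectors of dimensions 2, 2, 2, 2, 2, 3, 3, counting from the low-frequency end. Each of the seven sub-vectors is quantized to 7 bits (that is, using a VQ codebook of 128 codevectors). There are thus seven codebook indices IL(1) to IL(7), each 7 bits long, for a total of 49 bits per frame used in LPC parameter quantization. These 49 bits are provided to multiplexer 180 for transmission to the decoder as side information.

Processor 340 searches the VQ codebooks using a conventional weighted mean-squared error (WMSE) distortion measure, as described in Paliwal et al. LPC power spectrum processor 335 is used to compute the weights in this WMSE measure. The codebooks used in processor 340 are designed with conventional codebook generation techniques well known in the art. A conventional MSE distortion measure can be used in place of the WMSE measure to reduce coder complexity without significant degradation of output speech quality.

Normally, LSP coefficients increase monotonically. Quantization, however, may disrupt this ordering, which would make the LPC synthesis filter in the decoder unstable. To avoid this problem, LSP sorting processor 345 sorts the quantized LSP coefficients to restore the monotonically increasing order and ensure stability.

The quantized LSP coefficients are used in the last subframe of the current frame. LSP interpolation processor 350 performs, in conventional fashion, linear interpolation between these LSP coefficients and those of the last subframe of the previous frame to provide LSP coefficients for the first four subframes. The interpolated and quantized LSP coefficients are then converted back to LPC predictor coefficients by LSP-to-LPC conversion processor 355, in conventional fashion, for use in each subframe. This is done in both the encoder and the decoder. LSP interpolation is important for maintaining smooth reproduction of the output speech: it allows the LPC predictor coefficients to be updated smoothly once per subframe (4 ms). The resulting LPC predictor coefficient array a is used in LPC prediction error filter 20 to predict the coder's input signal. The difference between the input signal and its predicted version is the LPC prediction residual d.

2. Shaping Filter

Shaping filter coefficient processor 30 computes the first three autocorrelation coefficients of the LPC predictor coefficient array a and then uses the Levinson-Durbin recursion to solve for the coefficients c_j, j = 0, 1, 2, of the corresponding optimal second-order all-pole predictor. These predictor coefficients are then bandwidth-expanded by a factor of 0.7 (that is, the j-th coefficient c_j is replaced by c_j (0.7)^j). Processor 30 likewise bandwidth-expands the 16th-order all-pole LPC predictor coefficient array a, in this case with a factor of 0.8. Cascading these two bandwidth-expanded all-pole filters (second-order and 16th-order) gives the desired 18th-order shaping filter 40. The shaping filter coefficient array awc is computed by convolving the two bandwidth-expanded coefficient arrays to obtain a direct-form 18th-order filter.

When shaping filter 40 is cascaded with the LPC prediction error filter as shown in FIG. 1, the two filters in effect form a perceptual weighting filter whose frequency response is approximately the inverse of the desired coding noise spectrum. The output of shaping filter 40 is therefore called the perceptually weighted speech signal sw.

Zero-input response processor 50 contains a shaping filter. At the start of each 4 ms subframe it performs shaping filtering by feeding the filter 4 ms of zero-valued input. The corresponding output vector zir is generally non-zero because the filter generally has non-zero memory (except during the very first subframe after coder initialization, or when the coder's input signal has been exactly zero since the coder started). Processor 60 subtracts zir from the weighted speech vector sw. The resulting vector tp is the target vector for closed-loop pitch prediction.

3. Closed-Loop Pitch Prediction

Two kinds of parameters in pitch prediction must be quantized and transmitted to the decoder: the pitch period, corresponding to the period of the nearly periodic waveform of voiced speech, and three prediction coefficients (taps).

a. Pitch Period

The pitch period of the LPC prediction residual is determined by open-loop pitch extractor and interpolator 70 using a modified version of the efficient two-stage search technique discussed in U.S. Patent No. 5,327,520, entitled "Method of Use of Voice Message Coder/Decoder", which is incorporated by reference as if fully set forth herein. Processor 70 passes the LPC residual through a third-order elliptic lowpass filter to limit the bandwidth to about 700 Hz and then performs 8:1 decimation of the lowpass filter output. Using a pitch analysis window corresponding to the last three subframes of the current frame, the correlation coefficients of the decimated signal are computed for time lags 3 to 34, which correspond to time lags of 24 to 272 samples in the undecimated signal domain. The allowable range of the pitch period is thus 1.5 ms to 17 ms, or 59 Hz to 667 Hz in terms of pitch frequency. This is sufficient to cover the normal pitch range of most speakers, including low-pitched males and high-pitched children.

After the correlation coefficients of the decimated signal are computed, the first major peak of the correlation coefficients having the lowest time lag is identified. This is the first-stage search. Let the resulting time lag be t. This value is multiplied by 8 to obtain the time lag in the undecimated signal domain. The resulting lag 8t points to the neighborhood in which the true pitch period is most likely to lie. To retain the original time resolution in the undecimated signal domain, a second-stage pitch search is conducted in the range t-4 to t+4 (with t here understood as the undecimated lag 8t, since the bounds below are in undecimated samples). The correlation coefficients of the original undecimated LPC residual d are computed for time lags t-4 to t+4 (subject to a lower bound of 24 samples and an upper bound of 272 samples). The time lag corresponding to the maximum correlation coefficient in this range is then identified as the final pitch period. This pitch period is coded into 8 bits, and the 8-bit index IP is provided to multiplexer 180 for transmission to the decoder as side information. Eight bits suffice because only 272 - 24 + 1 = 249 integers can be selected as the pitch period. Only one such 8-bit pitch index is transmitted per 20 ms frame.

Processor 70 determines the pitch period kpi for each subframe as follows. If the difference between the extracted pitch period of the current frame and that of the previous frame is 20% or more, the extracted pitch period is used for every subframe in the current frame. If the relative pitch change is less than 20%, the extracted pitch period is used for the last three subframes of the current frame, and the pitch periods of the first two subframes are obtained by linear interpolation between the extracted pitch period of the previous frame and that of the current frame.

b. Pitch Predictor Taps

Closed-loop pitch tap quantizer and pitch predictor 80 performs the following operations subframe by subframe: (1) closed-loop quantization of the three pitch taps; (2) generation of dh, the pitch-predicted version of the LPC prediction residual d in the current frame; and (3) generation of hd, the closest match to the target signal tp.

Processor 80 has an internal buffer that stores previous samples of the signal dt, which can be regarded as a quantized version of the LPC prediction residual d. For each subframe, processor 80 uses the pitch period kpi to extract three 64-dimensional vectors from the dt buffer. These three vectors, called x_1, x_2, and x_3, are respectively kpi - 1, kpi, and kpi + 1 samples earlier than the current frame of dt. These three vectors are then filtered separately by the shaping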
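filter (with coefficient array awc) having zero initial filter memory. Call the three resulting 64-dimensional output vectors y_1, y_2, and y_3. Processor 80 must then search a codebook of 64 candidate sets of three pitch predictor taps b_1j, b_2j, b_3j (j = 1, 2, ..., 64) to find the optimal set b_1k, b_2k, b_3k that minimizes the distortion measure. This type of problem has been studied before, and an efficient search method can be found in U.S. Patent No. 5,327,520. The details are not presented here, but the basic idea is as follows.

It can be shown that minimizing this distortion measure is equivalent to maximizing the inner product of two 9-dimensional vectors. One of these 9-dimensional vectors contains only correlation coefficients of y_1, y_2, and y_3. The other contains only product terms derived from the set of three pitch predictor taps under evaluation. Because such a vector is signal-independent and depends only on the pitch tap codevector, there are only 64 possible such vectors (one per pitch tap codevector), and they can be precomputed and stored in a table, the VQ codebook. In the actual codebook search, the 9-dimensional correlation vector of y_1, y_2, and y_3 is computed first. Its inner product with each of the 64 precomputed, stored 9-dimensional vectors is then computed. The stored vector giving the largest inner product is the winner, and the three quantized pitch predictor taps are derived from it. Because the table holds 64 vectors, a 6-bit index IT(m) suffices to represent the three quantized pitch predictor taps for the m-th subframe. With five subframes per frame, a total of 30 bits per frame represents the three pitch taps used in all subframes. These 30 bits are provided to multiplexer 180 for transmission to the decoder as side information.

For each subframe, after the optimal set of three pitch taps b_1k, b_2k, b_3k has been selected by the codebook search above, the pitch-predicted version of d and the output signal vector hd are computed. The equations are not reproduced in this text; from the definitions of x_i, y_i, and b_ik above they are presumably the tap-weighted sums

$dh = \sum_{i=1}^{3} b_{ik} x_i$

$hd = \sum_{i=1}^{3} b_{ik} y_i$

This vector hd is subtracted from the vector tp by subtraction unit 90. The result is tt, the target vector for transform coding.

4. Transform Coding of the Target Vector

a. Shaping Filter Magnitude Response for Normalization

The target vector tt is coded subframe by subframe by blocks 100-150 using a transform coding approach. Shaping filter magnitude response processor 100 computes the signal mag as follows. It first takes the shaping filter coefficient array awc of the last subframe of the current frame, zero-pads it to 64 samples, and performs a 64-point FFT on the resulting 64-dimensional vector. It then computes the magnitudes of the 33 FFT coefficients corresponding to the frequency range 0 to 8 kHz. The resulting vector mag is the magnitude response of the shaping filter for the last subframe. To save computation, the mag vectors for the first four subframes are obtained by linear interpolation between the mag vector of the last subframe of the previous frame and that of the last subframe of the current frame.

b. Transform and Gain Normalization

Transform processor 110 performs several operations, as described below. First, it transforms the 64-dimensional vector tt in the current subframe using a 64-point FFT. This transform size of 64 samples (4 ms) avoids the so-called "pre-echo" distortion well known in the audio coding art. See Jayant, N. et al., "Signal compression based on models of human perception", Proc.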
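IEEE, pp. 1385-1422, October 1993, which is incorporated by reference as if fully set forth herein.

Each of the first 33 complex FFT coefficients is then divided by the corresponding element of the mag vector. The resulting normalized FFT coefficient vector is partitioned into three frequency bands: (1) a low-frequency band consisting of the first 6 normalized FFT coefficients (0 to 1250 Hz); (2) a mid-frequency band consisting of the next 10 normalized FFT coefficients (1500 to 3750 Hz); and (3) a high-frequency band consisting of the remaining 17 normalized FFT coefficients (4000 to 8000 Hz).

The total energy in each of the three bands is computed and converted to a dB value, called the log gain of the band. The log gain of the low-frequency band is quantized with a 5-bit scalar quantizer designed with the Lloyd algorithm, well known in the art. The quantized low-frequency log gain is subtracted from the log gains of the mid- and high-frequency bands. The resulting level-adjusted mid- and high-frequency log gains are concatenated to form a 2-dimensional vector, which is quantized by a 7-bit vector quantizer whose codebook is designed with the generalized Lloyd algorithm, also well known in the art. The quantized low-frequency log gain is then added back to the quantized level-adjusted mid- and high-frequency log gains to obtain the quantized log gains of the mid- and high-frequency bands. All three quantized log gains are then converted from the logarithmic (dB) domain to the linear domain. Each of the 33 normalized FFT coefficients (normalized by mag as described above) is further divided by the quantized linear gain of the frequency band in which it lies. After this second normalization stage, the result is the final normalized transform vector tc, containing 33 complex numbers representing frequencies from 0 to 8000 Hz.

During quantization of the log gains in the m-th subframe, transform processor 110 produces a 5-bit gain codebook index IG(m, 1) for the low-frequency log gain and a 7-bit codebook index IG(m, 2) for the mid- and high-frequency log gains. The three log gains are therefore coded at 12 bits per subframe, or 60 bits per frame. These 60 bits are provided to multiplexer 180 for transmission to the decoder as side information. Together with the 49 bits for the LSPs, 8 bits for the pitch period, and 30 bits for the pitch taps, these 60 gain bits form the side information, which totals 49 + 8 + 30 + 60 = 147 bits per frame.

c. Bit Stream

As described above, 49 bits/frame are allocated to coding the LPC parameters, 8 + (6 × 5) = 38 bits/frame to the 3-tap pitch predictor, and (5 + 7) × 5 = 60 bits/frame to the gains. The total number of side information bits is therefore 49 + 38 + 60 = 147 bits per 20 ms frame, or about 30 bits per 4 ms subframe. Consider that the coder may be used at one of three rates: 16, 24, and 32 kb/s. At the 16 kHz sampling rate, these three target rates translate to 1, 1.5, and 2 bits/sample, or 64, 96, and 128 bits/subframe, respectively. With 30 bits/subframe used for side information, the number of bits remaining for coding the main information (the FFT coefficients) is 34, 66, and 98 bits/subframe for the 16, 24, and 32 kb/s rates, respectively.

d. Adaptive Bit Allocation

In accordance with the principles of the invention, adaptive bit allocation is performed to assign these remaining bits to different parts of the frequency spectrum with different quantization accuracy, so as to enhance the perceptual quality of the output speech at the TPC decoder. This is done using a model of human sensitivity to noise in audio signals. Such models are known in the perceptual audio coding art. See, for example, Tobias, J. V., ed., Foundations of Modern Auditory Theory, Academic Press, New York and London, 1970. See also Schroeder, M. R. et al., "Optimizing digital speech coders by exploiting masking properties of the human ear", J. Acoust. Soc. Amer., 66:1647-1652, December 1979 (Schroeder et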
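al.), which is incorporated by reference as if fully set forth herein.

Hearing model and quantizer control processor 130 performs the adaptive bit allocation and produces an output vector ba that tells transform coefficient quantizer 120 how many bits to use to quantize each of the 33 normalized transform coefficients in tc. Adaptive bit allocation could be performed once per subframe, but in the illustrative embodiment it is performed once per frame to reduce computational complexity.

Rather than using the unquantized input signal to derive the noise masking threshold and bit allocation, as conventional music coders do, the embodiment determines them from the frequency response of the quantized LPC synthesis filter (often called the "LPC spectrum"). The LPC spectrum can be regarded as an approximation of the spectral envelope of the input signal within the 24 ms LPC analysis window. It is determined from the quantized LPC coefficients, which LPC parameter processor 10 provides to hearing model and quantizer control processor 130. Processor 130 determines the LPC spectrum as follows. The quantized LPC filter coefficients a are first transformed by a 64-point FFT. The power of each of the first 33 FFT coefficients is determined, and its reciprocal is then computed. The result is the LPC power spectrum with the frequency resolution of a 64-point FFT.

After the LPC power spectrum is determined, an estimated noise masking threshold TM is computed using a modified version of the method described in U.S. Patent No. 5,314,457, which is incorporated by reference as if fully set forth herein. Processor 130 scales the 33 samples of the LPC power spectrum by a frequency-dependent attenuation function determined empirically from subjective listening experiments. The attenuation function starts at 12 dB for the DC term of the LPC power spectrum, increases to about 15 dB between 700 and 800 Hz, and then decreases monotonically toward high frequencies, finally reaching 6 dB at 8000 Hz.

Each of the 33 attenuated LPC power spectrum samples is then used to scale a "basilar membrane spreading function" derived for that particular frequency in order to compute the masking threshold. The spreading function for a given frequency corresponds to the shape of the masking threshold in response to a single-tone masker signal at that frequency. Equation (5) of Schroeder et al. describes such spreading functions on the Bark frequency scale, or critical-band frequency scale, and is incorporated by reference as if fully set forth herein. The scaling process begins by converting the first 33 frequencies of the 64-point FFT (that is, 0 Hz, 250 Hz, 500 Hz, ..., 8000 Hz) to the Bark frequency scale. Then, for each of the 33 resulting Bark values, the corresponding spreading function is sampled at these 33 Bark values using equation (5) of Schroeder et al. The 33 resulting spreading functions are stored in a table, which can be done as part of an offline process. To compute the estimated masking threshold, each of the 33 spreading functions is multiplied by the corresponding sample of the attenuated LPC power spectrum, and the 33 resulting scaled spreading functions are summed. The result is the estimated masking threshold function. Note that this technique is not the only one available for estimating the masking threshold.

To keep complexity low, processor 130 uses a "greedy" algorithm for adaptive bit allocation. The technique is "greedy" in the sense that it allocates one bit at a time to the "neediest" frequency component, without regard to the potential effect on future bit allocation. At the outset, when no bits have yet been assigned, the corresponding output speech would be zero and the coding error signal would be the input speech itself. The LPC power spectrum is therefore initially taken to be the power spectrum of the coding noise. The noise loudness at each of the 33 frequencies of the 64-point FFT is then estimated using the masking threshold computed above and a simplified version of the noise loudness calculation method in Schroeder et al.

The simplified noise loudness at each of the 33 frequencies is computed as follows. First, from Scharf's chapter in Tobias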
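(Table 1), the listed critical bandwidths are linearly interpolated to compute the critical bandwidth B_i at the i-th frequency. The result is an approximation of the term df/dx in equation (3) of Schroeder et al. The 33 critical bandwidth values are precomputed and stored in a table. Next, for the i-th frequency, the noise power N_i is compared with the masking threshold M_i. If N_i ≤ M_i, the noise loudness L_i is set to zero. If N_i > M_i, the noise loudness is computed as

$L_i = B_i \left( \frac{N_i - M_i}{1 + (S_i / N_i)^2} \right)^{0.25}$

where S_i is the sample value of the LPC power spectrum at the i-th frequency.

Once the noise loudness has been computed for all 33 frequencies, the frequency with the largest noise loudness is identified and one bit is assigned to it. The noise power at that frequency is then reduced by a factor determined empirically from the signal-to-noise ratios (SNR) obtained during design of the VQ codebooks for quantizing the normalized FFT coefficients (example values of the reduction factor are 4 to 5 dB). The noise loudness at that frequency is then updated using the reduced noise power. Next, the maximum is again identified from the updated noise loudness array, and one bit is assigned to the corresponding frequency. This process continues until all available bits are exhausted.

For the 32 and 24 kb/s TPC coders, each of the 33 frequencies can receive bits during adaptive bit allocation. For the 16 kb/s TPC coder, however, better speech quality is achieved if the coder assigns bits only to the 0 to 4 kHz frequency range (that is, the first 16 FFT coefficients) and synthesizes the remaining FFT coefficients in the higher 4 to 8 kHz band using high-frequency synthesis processor 140.

Note that because the quantized LPC coefficients a are also available at the TPC decoder, there is no need to transmit the bit allocation information. It is determined by a replica of hearing model quantizer control processor 130 in the decoder. The TPC decoder can thus locally duplicate the encoder's adaptive bit allocation operation to obtain the bit allocation information.

e. Quantization of Transform Coefficients

Transform coefficient quantizer 120 quantizes the transform coefficients in tc using the bit allocation signal ba. The DC term of the FFT is real and is scalar quantized if it receives any bits during bit allocation. The maximum number of bits it can receive is 4. For the second through 16th FFT coefficients, a conventional 2-dimensional vector quantizer is used to quantize the real and imaginary parts jointly. The maximum number of bits for this 2-dimensional VQ is 6. For the remaining FFT coefficients, a conventional 4-dimensional vector quantizer is used to quantize jointly the real and imaginary parts of two adjacent FFT coefficients. After quantization of the transform coefficients, the resulting VQ codebook index array IC contains the main information of the TPC coder. This index array IC is provided to multiplexer 180, where it is combined with the side information bits. The result is the final bit stream transmitted over the communication channel to the TPC decoder.

Transform coefficient quantizer 120 also decodes the quantized values of the normalized transform coefficients. It then restores the original gain levels of these coefficients by multiplying each by the corresponding element of mag and the quantized linear gain of the corresponding frequency band. The result is the output vector dtc.

f. High-Frequency Synthesis and Noise Fill

For the 16 kb/s coder, adaptive bit allocation is restricted to the 0 to 4 kHz band, and processor 140 synthesizes the 4 to 8 kHz band. Before it does so, hearing model quantizer control processor 130 first computes, for frequencies in the 4 to 7 kHz band, the ratio between the LPC power spectrum and the masking threshold, or signal-to-masking-threshold ratio (SMR). The 17th through 29th FFT coefficients (4 to 7 kHz) are synthesized with random phase and with magnitude controlled by the SMR. For frequencies with SMR > 5 dB, the magnitude of the FFT coefficient is set to the quantized linear gain of the high-frequency band. For frequencies with SMR ≤ 5 dB, the magnitude is 2 dB below the quantized linear gain of the high-frequency band. For the 30th through 33rd FFT coefficients, the magnitude ramps down from 2 dB to 30 dB below the quantized linear gain of the high-frequency band, and the phase is again random.

For the 32 kb/s and 24 kb/s coders, bit allocation is performed over the whole frequency band, as described. Some frequencies in the 4 to 8 kHz band may nevertheless receive no bits. In that case, the high-frequency synthesis and noise fill procedure described above is applied only to the frequencies that receive no bits. After this high-frequency synthesis and noise fill is applied to the vector dtc, the resulting output vector qtc contains the quantized version of the transform coefficients before normalization.

g. Inverse Transform and Filter Memory Update

Inverse transform processor 150 performs an inverse FFT on the 64-element complex vector represented by the half-size 33-element vector qtc. The result is the output vector qtt, a quantized version of the time-domain target vector tt for transform coding. With zero initial filter state (filter memory), inverse shaping filter 160, an all-zero filter with awc as its coefficient array, filters the vector qtt to produce the output vector et. Adder 170 then adds dh to et to obtain the quantized LPC prediction residual dt. This dt vector is used to update the internal storage buffer in closed-loop pitch tap quantizer and pitch predictor 80. It is also used to excite the internal shaping filter inside zero-input response processor 50, establishing the correct filter memory for generating the zero-input response for the next subframe.

C. Decoder Embodiment

The decoder embodiment of the invention is shown in FIG. 2. For each frame, demultiplexer 200 separates all main and side information components from the received bit stream. The main information, that is, the transform coefficient index array IC, is provided to transform coefficient decoder 235. To decode this main information, adaptive bit allocation must be performed to determine how many main information bits are associated with each quantized transform coefficient.

The first step in adaptive bit allocation is generation of the quantized LPC coefficients (on which the allocation depends). Demultiplexer 200 provides the seven LSP codebook indices IL(1) to IL(7) to LPC parameter decoder 215, which performs table look-up from the seven LSP VQ codebooks to obtain the 16 quantized LSP coefficients. LPC parameter decoder 215 then performs the same sorting, interpolation, and LSP-to-LPC coefficient conversion operations as blocks 345, 350, and 355 of FIG. 3. With the LPC coefficient array a computed, hearing model quantizer control processor 220 determines the bit allocation for each FFT coefficient (based on the quantized LPC parameters) in the same way as processor 130 in the TPC encoder (FIG. 1). Similarly, shaping filter coefficient processor 225 and shaping filter magnitude response processor 230 are replicas of the corresponding processors 30 and 100 in the TPC encoder. Processor 230 produces the shaping filter magnitude response mag for use by transform coefficient decoder 235.

Once the bit allocation information has been derived, transform coefficient decoder 235 can correctly decode the main information and obtain the quantized normalized transform coefficients. Decoder 235 also decodes the gains using the gain index array IG. For each subframe there are two gain indices (5 bits and 7 bits), which are decoded into the quantized log gain of the low-frequency band and the quantized versions of the level-adjusted log gains of the mid- and high-frequency bands. The quantized low-frequency log gain is then added back to the quantized level-adjusted mid- and high-frequency log gains to obtain the quantized log gains of the mid- and high-frequency bands. All three quantized log gains are then converted from the logarithmic (dB) domain to the linear domain. Each of the three quantized linear gains multiplies the quantized normalized transform coefficients in the corresponding frequency band. Each of the 33 resulting gain-scaled quantized transform coefficients is then further multiplied by the corresponding element of the shaping filter magnitude response array mag. After these two scaling stages, the result is the decoded transform coefficient array dtc.

High-frequency synthesis processor 240, inverse transform processor 245, and inverse shaping filter 250 are again exact replicas of the corresponding blocks (140, 150, and 160) in the TPC encoder. Together they perform high-frequency synthesis, noise fill, inverse transformation, and inverse shaping filtering to produce the quantized excitation vector et.

Pitch decoder and interpolator 205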
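decodes the 8-bit pitch index IP to obtain the pitch period for the last three subframes and then interpolates the pitch period for the first two subframes in the same way as the corresponding block 70 of the TPC encoder. Pitch tap decoder and pitch predictor 210 decodes the pitch tap index IT for each subframe to obtain the three quantized pitch predictor taps b_1k, b_2k, b_3k. It then uses the interpolated pitch period kpi to extract the same three vectors x_1, x_2, and x_3 described in the encoder section. (These three vectors are respectively kpi - 1, kpi, and kpi + 1 samples earlier than the current frame of dt.) It then computes the pitch-predicted version of the LPC residual. The equation is not reproduced in this text; from the definitions above it is presumably the tap-weighted sum

$dh = \sum_{i=1}^{3} b_{ik} x_i$

Adder 255 adds dh to et to obtain dt, the quantized version of the LPC prediction residual d. This dt vector is fed back to the pitch predictor in block 210 to update its internal storage buffer for dt (the filter memory of the pitch predictor).

Long-term postfilter 260 is basically similar to the long-term postfilter used in the ITU-T G.728 standard 16 kb/s low-delay CELP coder. The main differences are that it uses the sum of the three quantized pitch taps as the voicing indicator, and that the scaling factor for the long-term postfilter coefficient is 0.4 rather than 0.15 as in G.728. If the voicing indicator is less than 0.5, the postfiltering operation is skipped and the output vector fdt is identical to the input vector dt. If the indicator is 0.5 or greater, the postfiltering operation is performed.

LPC synthesis filter 265 is a standard LPC filter, that is, an all-pole direct-form filter with the quantized LPC coefficient array a. It filters the signal fdt and produces the long-term postfiltered quantized speech vector st. This st vector is passed through short-term postfilter 270 to produce the final TPC decoder output speech signal fst. Again, short-term postfilter 270 is very similar to the short-term postfilter used in G.728. The only differences are the following. First, the pole control coefficient, zero control coefficient, and spectral tilt control coefficient are 0.7, 0.55, and 0.4, respectively, rather than the corresponding G.728 values of 0.75, 0.65, and 0.15. Second, the coefficient of the first-order spectral tilt compensation filter is linearly interpolated sample by sample between frames. This avoids the occasionally audible clicks caused by discontinuities at frame boundaries.

The long-term and short-term postfilters reduce the perceived level of coding noise in the output signal fst and thus enhance speech quality.

DETAILED DESCRIPTION OF THE INVENTION

Multi-stage speech coder by transform coding of prediction residual signal with quantization by auditory model

Field of the Invention

The present invention relates to the compression (coding) of audio signals, for example speech signals, using a predictive coding system.

Background of the Invention

As taught in the signal compression literature, speech and music waveforms are coded by very different techniques. Speech coding, such as coding of telephone-bandwidth (3.4 kHz) speech at 16 kb/s or below, has been dominated by time-domain predictive coders. These coders predict the speech waveform to be coded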
using a speech production model. The predicted waveform is then subtracted from the actual (original) waveform to be coded, which reduces the redundancy in the original signal. Reduced signal redundancy yields coding gain. Examples of such predictive speech coders include adaptive predictive coding, multi-pulse linear predictive coding, and code-excited linear prediction (CELP) coding, all well known in the art of speech signal compression.

Wideband (0 to 20 kHz) music coding at 64 kb/s or above, on the other hand, has been dominated by frequency-domain transform or subband coders. These music coders differ fundamentally from the speech coders described above. The difference stems from the fact that music sources, unlike speech sources, vary too much to be readily predicted. Consequently, models of music sources are generally not used in music coding. Instead, music coders use elaborate models of human hearing to code only the perceptually relevant parts of the signal. That is, unlike speech coders, which generally use a speech production model, music coders use a hearing (sound reception) model to obtain coding gain.

In music coders, the hearing model is used to determine the noise masking capability of the music to be coded. The term "noise masking capability" refers to how much quantization noise can be introduced into a music signal without a listener noticing it. This noise masking capability is then used to set the quantizer resolution (for example, the quantizer step size). In general, the more "tone-like" the music, the poorer it is at masking quantization noise, and so the smaller the required step size, and vice versa. A smaller step size corresponds to a smaller coding gain, and vice versa. Examples of such music coders include AT&T's Perceptual Audio Coder (PAC) and the ISO MPEG audio coding standard.
Between telephone-bandwidth speech coding and wideband music coding lies wideband speech coding, in which the speech signal is sampled at 16 kHz and has a bandwidth of 7 kHz. The advantage of 7 kHz wideband speech is that the resulting speech quality is far better than that of telephone-bandwidth speech, while the bit rate required for coding is far lower than for a 20 kHz audio signal. Of the wideband speech coders proposed previously, some use time-domain predictive coding, some use frequency-domain transform or subband coding, and some use a mixture of time-domain and frequency-domain techniques.

The inclusion of perceptual criteria in predictive speech coding, wideband or otherwise, has been limited to the use of a perceptual weighting filter when selecting the best synthesized speech signal from among several candidates. See, for example, Atal et al., U.S. Patent No. Re. 32,580. Such filters achieve a type of noise shaping that is useful for reducing noise in the coding process. One known coder attempts to improve on this technique by using a perceptual model in forming the perceptual weighting filter.

Summary of the Invention

Despite the efforts described above, no known speech or audio coder uses both a hearing model, to set quantizer resolution according to an analysis of the signal's noise masking capability, and a speech production model, for signal prediction. The present invention, by contrast, combines a predictive coding system with a quantization process that quantizes a signal based on a noise masking signal determined by a model of human auditory sensitivity to noise. The output of the predictive coding system is thus quantized by a quantizer whose resolution (for example, the step size of a uniform scalar quantizer, or the number of bits used to identify a vector in a vector quantizer) is a function of the noise masking signal determined according to the audio perception model.
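As a rough illustration of a quantizer whose resolution follows a noise masking signal, the sketch below picks the step size of a uniform scalar quantizer so that its noise power sits at a given masking threshold. The rule step = sqrt(12 × threshold) relies on the textbook uniform-noise approximation and is an assumption made here for illustration only. The function names are hypothetical, and this is not the quantizer design of the embodiment.

```python
import math

def step_size_for_threshold(mask_power):
    """Step size whose uniform-quantization noise power (step**2 / 12)
    equals the given masking threshold power (illustrative assumption)."""
    return math.sqrt(12.0 * mask_power)

def quantize(x, step):
    """Uniform mid-tread scalar quantizer: round x to the nearest multiple of step."""
    return step * round(x / step)

# A higher masking threshold tolerates a coarser step, hence fewer bits.
coarse = step_size_for_threshold(0.03)
fine = step_size_for_threshold(0.0003)
```

A frequency that masks noise well can thus be given a large step, while a poorly masking (tone-like) region needs a small one.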
According to the invention, a signal is generated that represents an estimate (or prediction) of a signal representing speech information. The term "original signal representing speech information" is broad enough to cover not only speech itself but also the speech signal derivatives commonly found in speech coding systems (for example, linear prediction and pitch prediction residual signals). The estimate signal is compared with the original signal to form a signal representing the difference between them. This difference signal is then quantized according to a perceptual noise masking signal generated by a model of human audio perception.

An embodiment of the invention called "Transform Predictive Coding", or TPC, codes 7 kHz wideband speech at target bit rates of 16 to 32 kb/s. As its name suggests, TPC combines transform coding and predictive coding techniques in a single coder. More specifically, the coder uses linear prediction to remove redundancy from the input speech waveform and then uses transform coding techniques to code the resulting prediction residual. The transformed prediction residual is quantized based on knowledge of human auditory perception, expressed as an auditory perception model, so as to code what is audible and discard what is inaudible.

One important feature of the embodiment concerns how the perceptual noise masking capability of the signal (for example, the perceptual threshold of "just noticeable distortion") is determined and how bits are subsequently allocated. Rather than determining the perceptual threshold from the unquantized input signal, as conventional music coders do, the embodiment determines the noise masking threshold and bit allocation from the frequency response of a quantized synthesis filter (in this embodiment, the quantized LPC synthesis filter).
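The frequency response referred to here can be evaluated directly from the quantized LPC coefficients. The description later does this with a 64-point FFT, taking the power of the first 33 coefficients and then the reciprocal. The sketch below follows those steps with a direct DFT from the Python standard library. The function name, the argument layout (the full prediction-error polynomial including its leading 1), and the sign convention are assumptions made for illustration.

```python
import cmath

def lpc_power_spectrum(a_poly, n_fft=64):
    """Return 1 / |A(e^{jw})|^2 at the first n_fft // 2 + 1 DFT frequencies.

    a_poly is assumed to be the prediction-error filter polynomial
    [1, a1, ..., ap]; the source does not fix a sign convention.
    Assumes A has no zero exactly on the unit circle."""
    padded = list(a_poly) + [0.0] * (n_fft - len(a_poly))
    spectrum = []
    for k in range(n_fft // 2 + 1):
        # Direct DFT of the zero-padded coefficient array at bin k.
        acc = sum(c * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                  for n, c in enumerate(padded))
        # Power of the bin, then its reciprocal.
        spectrum.append(1.0 / abs(acc) ** 2)
    return spectrum
```

With a 16 kHz sampling rate the 33 returned samples fall at 0 Hz, 250 Hz, ..., 8000 Hz. Because only transmitted coefficients are used, a decoder can repeat the same computation.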
This feature gives the system the advantage that the encoder need not transmit a bit allocation signal to the decoder in order for the decoder to replicate the perceptual threshold and bit allocation processing needed to decode the received coded wideband speech information. Instead, the synthesis filter coefficients, which are being transmitted for other purposes, are exploited to save bit rate.

Another important feature of the embodiment concerns how the TPC encoder allocates bits among the coder frequencies and how the decoder generates a quantized output signal based on the allocated bits. Under certain circumstances, the TPC encoder allocates bits to only a part of the audio band (for example, bits may be allocated only to coefficients between 0 and 4 kHz). No bits are allocated to represent coefficients between 4 kHz and 7 kHz, and the decoder therefore obtains no coefficients in this frequency range. Such a circumstance arises, for example, when the TPC encoder has to operate at a very low bit rate such as 16 kb/s. Even though no bits represent the coded signal in the 4 kHz to 7 kHz range, the decoder must still synthesize a signal in this range if it is to provide a wideband response. According to this feature of the embodiment, the decoder generates (that is, synthesizes) coefficient signals in this frequency range from other available information, namely an estimate of the signal spectrum (obtained from the LPC parameters) and the noise masking threshold at frequencies within that range. With this technique, the decoder can provide a wideband response without the speech signal coefficients for the entire band having to be transmitted.

Applications include ISDN video conferencing or audio conferencing, multimedia audio, "hi-fi" telephony, and simultaneous voice and data (SVD) at 28.8 kb/s or higher.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 shows an encoder embodiment of the present invention.
FIG. 2 shows a decoder embodiment of the present invention.
FIG. 3 shows a detailed block diagram of the LPC parameter processor of FIG. 1.

Detailed description

A. Overview of the embodiment

For clarity of explanation, the embodiments of the invention shown in the figures are presented as comprising individual functional blocks (including functional blocks labeled "processor"). The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. For example, the functions of the processors presented in the figures may be provided by a single shared processor. (Use of the term "processor" should not be construed to refer exclusively to hardware capable of executing software.)

The illustrated embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, read-only memory (ROM) for storing software that performs the operations discussed below, and random access memory (RAM) for storing DSP results. Very large scale integration (VLSI) hardware embodiments, as well as custom VLSI circuitry in combination with a general-purpose DSP circuit, may also be provided.

According to the present invention, the sequence of digital input speech samples is partitioned into consecutive 20 ms blocks called frames, and each frame is further divided into five equal subframes of 4 ms each. Assuming a sampling rate of 16 kHz, as is common for wideband speech signals, this corresponds to a frame size of 320 samples and a subframe size of 64 samples. The TPC speech coder buffers and processes the input speech signal frame by frame, and within each frame certain coding operations are performed subframe by subframe.

FIG. 1 shows an illustrative TPC speech encoder embodiment of the present invention; the following walk-through refers to that figure. Once every 20 ms frame, the LPC parameter processor 10 derives line spectrum pair (LSP) parameters from the input speech signal s.
These LSP parameters are then quantized, interpolated for each 4 ms subframe, and converted to the LPC predictor coefficient array a for each subframe. Short-term redundancy is removed from the input speech signal s by the LPC prediction error filter 20. The resulting LPC prediction residual signal d still has some long-term redundancy due to the pitch periodicity in voiced speech.

The shaping filter coefficient processor 30 derives the shaping filter coefficients awc from the quantized LPC filter coefficients a. The shaping filter 40 filters the LPC prediction residual signal d to produce the perceptually weighted speech signal sw. The zero-input response processor 50 calculates the zero-input response zir of the shaping filter. The subtraction unit 60 subtracts zir from sw to obtain tp, the target signal for pitch prediction.

The open-loop pitch extractor and interpolator 70 uses the LPC prediction residual d to extract a pitch period for each 20 ms frame, and then calculates an interpolated pitch period kpi for each 4 ms subframe. The closed-loop pitch tap quantizer and pitch predictor 80 uses this interpolated pitch period kpi to select one set of three pitch predictor taps from a codebook of candidate pitch tap sets. The selection is made such that, when the previously quantized LPC residual signal dt is filtered by the corresponding 3-tap pitch synthesis filter and then by the shaping filter with zero initial memory, the output signal hd is closest to the target signal tp in the mean square error (MSE) sense. The subtraction unit 90 subtracts hd from tp to obtain tt, the target signal for transform coding.

The shaping filter magnitude response processor 100 calculates the signal mag, the magnitude of the frequency response of the shaping filter.
The transform processor 110 performs a linear transform, such as a fast Fourier transform (FFT), on the signal tt. The processor then normalizes the transform coefficients using mag and the quantized versions of gain values calculated over three different frequency bands. The result is the normalized transform coefficient signal tc. The transform coefficient quantizer 120 then quantizes the signal tc using the adaptive bit allocation signal ba, which is determined by the hearing model quantizer control processor 130 in accordance with the time-varying perceptual importance of the transform coefficients at different frequencies.

At low bit rates, such as 16 kb/s, the processor 130 allocates bits only to the lower half of the frequency band (0 to 4 kHz). In this case, the high-frequency synthesis processor 140 synthesizes the transform coefficients in the high-frequency band (4 to 8 kHz) and combines them with the quantized low-frequency transform coefficient signal dtc to produce the final quantized full-band transform coefficient signal qtc. At higher bit rates, such as 24 or 32 kb/s, every transform coefficient in the entire frequency band is allowed to receive bits in the adaptive bit allocation process, although some coefficients may end up receiving no bits because of the scarcity of available bits. In that case, the high-frequency synthesis processor 140 simply detects the frequencies in the 4 to 8 kHz band that receive no bits at all and fills such "spectral holes" with low-level noise, in order to avoid the "swirling" type of distortion typically found in adaptive transform coders.

The inverse transform processor 150 takes the quantized transform coefficient signal qtc and applies a linear transform that is the inverse of the linear transform used in the transform processor 110 (an inverse FFT in the particular embodiment described here).
The result is the time-domain signal qtt, which is the quantized version of tt, the target signal for transform coding. The inverse shaping filter 160 then filters qtt to obtain the quantized excitation signal et. The adder 170 adds et to the signal dh produced by the pitch predictor in block 80 (dh is the pitch-predicted version of the LPC prediction residual d). The resulting signal dt is the quantized version of the LPC prediction residual d. It is used to update the filter memory of the shaping filter inside the zero-input response processor 50 and the memory of the pitch predictor in block 80. This completes the signal loop.

Codebook indices representing the LPC predictor parameters (IL), the pitch predictor parameters (IP and IT), the transform gain levels (IG), and the quantized transform coefficients (IC) are multiplexed into a single bit stream by the multiplexer 180 and transmitted over a channel to the decoder. The channel may be any suitable communication channel, including wireless channels, computer and data networks, and telephone networks, and it may include memory, such as solid state memory (for example, semiconductor memory), optical memory systems (for example, CD-ROM), magnetic memory (for example, disk memory), and the like.

FIG. 2 shows a TPC speech decoder embodiment of the present invention. The demultiplexer 200 separates the codebook indices IL, IP, IT, IG and IC. The pitch decoder and interpolator 205 decodes IP and calculates the interpolated pitch period kpi. The pitch tap decoder and pitch predictor 210 decodes IT to obtain the pitch predictor tap array b and calculates the signal dh, the pitch-predicted version of the LPC prediction residual d. The LPC parameter decoder and interpolator 215 decodes IL and then calculates the interpolated LPC filter coefficient array a. Blocks 220 through 255 perform exactly the same operations as their counterparts in FIG. 1 to produce the quantized LPC residual signal dt.
The long-term postfilter 260 enhances the pitch periodicity in dt and produces a filtered version fdt as its output. This signal is passed through the LPC synthesis filter 265, and the resulting signal st is further filtered by the short-term postfilter 270, which produces the final filtered output speech signal fst.

To keep complexity low, TPC uses open-loop quantization wherever possible. Open-loop quantization means that the quantizer tries to minimize the difference between the unquantized parameter and its quantized version, without regard to the effect on output speech quality. This is in contrast to, for example, CELP coders, in which the pitch predictor, the gain and the excitation are usually closed-loop quantized. In closed-loop quantization of a coder parameter, the quantizer codebook search tries to minimize the distortion in the final reconstructed output speech. Naturally, this generally leads to better output speech quality, but at the price of higher codebook search complexity.

In the present invention, the TPC encoder uses closed-loop quantization only for the three pitch predictor taps. The quantization operations that lead to the quantized excitation signal et are basically similar to open-loop quantization, but their effect on the output speech is close to that of closed-loop quantization. This approach is similar in spirit to the approach used in the TCX coder of Lefebvre et al., "High Quality Coding of Wideband Audio Signals Using Transform Coded Excitation (TCX)", Proc. IEEE International Conf. Acoustics, Speech, Signal Processing, 1994, pp. I-193 to I-196. Features of the present invention that are not found in the TCX coder include normalization of the transform coefficients by the shaping filter magnitude response, adaptive bit allocation controlled by a hearing model, and the high-frequency synthesis and noise filling procedures.

B. Encoder embodiment

1. LPC analysis and prediction

A detailed block diagram of the LPC parameter processor 10 is shown in FIG. 3. Processor 10 comprises a windowing and autocorrelation processor 310; a spectral smoothing and white noise correction processor 315; a Levinson-Durbin recursion processor 320; a bandwidth expansion processor 325; an LPC-to-LSP conversion processor 330; an LPC power spectrum processor 335; an LSP quantizer 340; an LSP sorting processor 345; an LSP interpolation processor 350; and an LSP-to-LPC conversion processor 355.

The windowing and autocorrelation processor 310 begins the process of LPC coefficient generation. Once every 20 ms, processor 310 generates, in conventional fashion, the autocorrelation coefficients r from which the LPC coefficients are subsequently calculated, as discussed below. See Rabiner, L. R. et al., Digital Processing of Speech Signals, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978 (Rabiner et al.). The LPC frame size is 20 ms (or 320 speech samples at a 16 kHz sampling rate). Each 20 ms frame is further divided into five subframes, each 4 ms (or 64 samples) long. The LPC analysis processor uses, in conventional fashion, a 24 ms Hamming window centered on the last 4 ms subframe of the current frame.

To alleviate potential ill-conditioning, certain conventional signal conditioning techniques are employed. A spectral smoothing technique (SST) and a white noise correction technique are applied by the spectral smoothing and white noise correction processor 315 before LPC analysis. The SST, which is well known in the art (Tohkura, Y. et al., "Spectral Smoothing Technique in PARCOR Speech Analysis-Synthesis", IEEE Trans. Acoust., Speech, Signal Processing, ASSP-26: 589-596, December 1978 (Tohkura et al.)), involves multiplying the calculated autocorrelation coefficient array (from processor 310) by a Gaussian window whose Fourier transform corresponds to the probability density function (pdf) of a Gaussian distribution with a standard deviation of 40 Hz.
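The windowing, autocorrelation and spectral smoothing steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the Hamming window formula is the common textbook one, and the Gaussian lag window is written in the usual closed form exp(-(2*pi*sigma*k/fs)^2 / 2), which is assumed to be what "a Gaussian window whose Fourier transform is a Gaussian pdf with a 40 Hz standard deviation" means.

```python
import math

FS = 16000           # sampling rate in Hz (from the text)
WINDOW_LEN = 384     # 24 ms analysis window at 16 kHz
LPC_ORDER = 16       # order of the LPC predictor (from the text)
SST_SIGMA_HZ = 40.0  # standard deviation of the Gaussian pdf (from the text)

def hamming(n):
    """Textbook Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def autocorrelation(x, order):
    """Autocorrelation r[0..order] of the sequence x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]

def gaussian_lag_window(order, sigma_hz=SST_SIGMA_HZ, fs=FS):
    """Lag-domain Gaussian whose Fourier transform is a Gaussian with std sigma_hz."""
    return [math.exp(-0.5 * (2.0 * math.pi * sigma_hz * k / fs) ** 2)
            for k in range(order + 1)]

def smoothed_autocorrelation(segment):
    """Hamming-window the segment, autocorrelate it, then apply the lag window."""
    windowed = [s * w for s, w in zip(segment, hamming(len(segment)))]
    r = autocorrelation(windowed, LPC_ORDER)
    return [ri * wi for ri, wi in zip(r, gaussian_lag_window(LPC_ORDER))]
```

Because the lag window equals 1 at lag 0 and decays only slightly over 16 lags, it leaves the signal energy untouched while gently smoothing sharp spectral peaks.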
The white noise correction, which is also conventional (Chen, J.-H., "A Robust Low-Delay CELP Speech Coder at 16 kbit/s", Proc. IEEE Global Comm. Conf., pp. 1237-1241, Dallas, TX, November 1989), increases the zero-lag autocorrelation coefficient (that is, the energy term) by 0.001%.

The coefficients generated by processor 315 are then provided to the Levinson-Durbin recursion processor 320, which generates, in conventional fashion, 16 LPC coefficients α_i for i = 1, 2, ..., 16 (the order of the LPC prediction error filter 20 is 16).

For further signal conditioning, the bandwidth expansion processor 325 multiplies each α_i by g^i, where g = 0.994. This corresponds to a bandwidth expansion of 30 Hz (Tohkura et al.).

After this bandwidth expansion, the LPC predictor coefficients are converted to line spectrum pair (LSP) coefficients by the LPC-to-LSP conversion processor 330 in conventional fashion. See Soong, F. K. et al., "Line Spectrum Pair (LSP) and Speech Data Compression", Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1.10.1-1.10.4, March 1984 (Soong et al.), which is incorporated by reference as if fully set forth herein.

The LSP quantizer 340 then applies vector quantization (VQ) to the resulting LSP coefficients. The specific VQ technique used by processor 340 is similar to the split VQ proposed in Paliwal, K. K. et al., "Efficient Vector Quantization of LPC Parameters at 24 bits/frame", Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 661-664, Toronto, Canada, May 1991 (Paliwal et al.), which is incorporated by reference as if fully set forth herein. The 16-dimensional LSP vector is split into seven smaller sub-vectors of dimensions 2, 2, 2, 2, 2, 3, 3, counted from the low-frequency end. Each of the seven sub-vectors is quantized to 7 bits (that is, using a VQ codebook of 128 code vectors).
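A rough sketch of the split VQ layout just described is given below. The sub-vector dimensions and the 7-bit codebook size come from the text; everything else is an assumption made for illustration: the function names are hypothetical, the codebooks would be trained offline in practice, and the search here uses a plain squared error rather than the weighted measure of Paliwal et al.

```python
SPLIT_DIMS = (2, 2, 2, 2, 2, 3, 3)   # sub-vector dimensions, low frequency first
BITS_PER_SUBVECTOR = 7               # 2**7 = 128 code vectors per codebook

def split_lsp(lsp):
    """Cut a 16-dimensional LSP vector into the seven sub-vectors."""
    parts, start = [], 0
    for dim in SPLIT_DIMS:
        parts.append(lsp[start:start + dim])
        start += dim
    return parts

def nearest_index(vec, codebook):
    """Index of the code vector closest to vec (unweighted squared error, a simplification)."""
    def dist(code_vec):
        return sum((v - c) ** 2 for v, c in zip(vec, code_vec))
    return min(range(len(codebook)), key=lambda j: dist(codebook[j]))

def quantize_lsp(lsp, codebooks):
    """Return seven codebook indices, one per sub-vector."""
    return [nearest_index(part, cb) for part, cb in zip(split_lsp(lsp), codebooks)]
```

Splitting keeps each search to 128 candidates of dimension 2 or 3, instead of one search over a 49-bit codebook, which would be infeasible.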
Thus, there are seven codebook indices IL(1) through IL(7), each 7 bits long, and a total of 49 bits per frame are used for LPC parameter quantization. These 49 bits are provided to the multiplexer 180 for transmission to the decoder as side information.

Processor 340 performs its search through the VQ codebooks using a conventional weighted mean square error (WMSE) distortion measure, as described in Paliwal et al. The LPC power spectrum processor 335 is used to calculate the weights in this WMSE distortion measure. The codebooks used in processor 340 are designed with conventional codebook generation techniques well known in the art. A conventional MSE distortion measure can also be used in place of the WMSE measure to reduce coder complexity without significant degradation in output speech quality.

Normally, the LSP coefficients increase monotonically. However, quantization may disrupt this order, and such a disruption makes the LPC synthesis filter in the decoder unstable. To avoid this problem, the LSP sorting processor 345 sorts the quantized LSP coefficients to restore the monotonically increasing order and ensure stability.

The quantized LSP coefficients are used in the last subframe of the current frame. Linear interpolation between these LSP coefficients and the LSP coefficients from the last subframe of the previous frame is performed by the LSP interpolation processor 350, in conventional fashion, to provide LSP coefficients for the remaining subframes. The interpolated and quantized LSP coefficients are then converted back to LPC predictor coefficients for use in each subframe by the LSP-to-LPC conversion processor 355, in conventional fashion. This is done in both the encoder and the decoder. LSP interpolation is important in maintaining smooth reproduction of the output speech: it allows the LPC predictor coefficients to be updated smoothly once every subframe (4 ms).
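The sorting and interpolation steps can be illustrated with a short sketch. The interpolation weights (m/5 for subframe m = 1..5) are an assumption: the text only says that the interpolation is linear and that the last subframe uses the quantized LSP coefficients of the current frame. The function names are hypothetical.

```python
def restore_lsp_order(lsp):
    """Sort quantized LSP coefficients back into increasing order."""
    return sorted(lsp)

def interpolate_lsp(prev_last, curr_last, n_subframes=5):
    """Linearly interpolate LSP vectors for each subframe of the current frame.

    prev_last: LSP vector of the last subframe of the previous frame.
    curr_last: quantized LSP vector assigned to the last subframe of this frame.
    Assumed weights: subframe m (1-based) uses w = m / n_subframes.
    """
    subframes = []
    for m in range(1, n_subframes + 1):
        w = m / n_subframes
        subframes.append([(1.0 - w) * p + w * c for p, c in zip(prev_last, curr_last)])
    return subframes
```

With these weights the fifth subframe reproduces the current frame's quantized vector exactly, which matches the statement that the quantized coefficients are used in the last subframe.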
The resulting LPC predictor coefficient array a is used in the LPC prediction error filter 20 to predict the encoder input signal. The difference between the input signal and its predicted version is the LPC prediction residual d.

2. Shaping filter

The shaping filter coefficient processor 30 calculates the first three autocorrelation values of the LPC predictor coefficient array a and then uses the Levinson-Durbin recursion to solve for the coefficients c_j, j = 0, 1, 2, of the corresponding optimal second-order all-pole predictor. These predictor coefficients are then bandwidth-expanded by a factor of 0.7 (that is, the j-th coefficient c_j is replaced by c_j (0.7)^j). Next, processor 30 also performs bandwidth expansion of the 16th-order all-pole LPC predictor coefficient array a, in this case by a factor of 0.8. Cascading these two bandwidth-expanded all-pole filters (2nd order and 16th order) gives the desired 18th-order shaping filter 40. The shaping filter coefficient array awc is calculated by convolving the two bandwidth-expanded coefficient arrays (2nd order and 16th order) to obtain a direct-form 18th-order filter.

When the shaping filter 40 is cascaded with the LPC prediction error filter as shown in FIG. 1, the two filters effectively form a perceptual weighting filter whose frequency response is roughly the inverse of the desired coding noise spectrum. The output of the shaping filter 40 is therefore called the perceptually weighted speech signal sw.

The zero-input response processor 50 contains a shaping filter. At the beginning of each 4 ms subframe, the processor performs shaping filtering by feeding the filter a 4 ms all-zero input signal.
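The construction of the 18th-order shaping filter coefficients described above can be sketched as below. Several details are assumptions, since the text does not pin them down: the input `lpc_poly` is taken to be the 17 coefficients of the prediction error polynomial with a leading 1, the autocorrelation is taken over that sequence, and the second-order section is written as [1, -c_1, -c_2]. All function names are hypothetical.

```python
def autocorrelation(seq, max_lag):
    """Autocorrelation values r[0..max_lag] of a coefficient sequence."""
    return [sum(seq[n] * seq[n - k] for n in range(k, len(seq)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations; returns predictor coefficients and final error."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err

def bandwidth_expand(poly, g):
    """Replace the j-th coefficient by coefficient * g**j."""
    return [c * g ** j for j, c in enumerate(poly)]

def convolve(p, q):
    """Polynomial product, i.e. the cascade of two direct-form sections."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def shaping_filter_coeffs(lpc_poly):
    """Return 19 coefficients (order 18) from a 16th-order polynomial (assumed layout)."""
    r = autocorrelation(lpc_poly, 2)              # first three autocorrelation values
    c, _ = levinson_durbin(r, 2)                  # optimal second-order predictor
    second_order = bandwidth_expand([1.0, -c[0], -c[1]], 0.7)
    sixteenth_order = bandwidth_expand(lpc_poly, 0.8)
    return convolve(second_order, sixteenth_order)
```

Convolving the two coefficient arrays is equivalent to running the two sections in cascade, but yields a single direct-form filter that is cheaper to apply per sample.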
In general, the corresponding output signal vector zir is non-zero because the filter generally has non-zero memory (except during the very first subframe after encoder initialization, or when the encoder input signal has been exactly zero since the encoder started). The subtraction unit 60 subtracts zir from the weighted speech vector sw. The resulting vector tp is the target vector for closed-loop pitch prediction.

3. Closed-loop pitch prediction

In pitch prediction, two kinds of parameters need to be quantized and transmitted to the decoder: the pitch period, which corresponds to the period of the nearly periodic waveform of voiced speech, and the three pitch predictor coefficients (taps).

a. Pitch period

The pitch period of the LPC prediction residual is determined by the open-loop pitch extractor and interpolator 70 using a modified version of the efficient two-stage search technique discussed in U.S. Pat. No. 5,327,520, entitled "Method of Use of Voice Message Coder/Decoder", which is incorporated by reference as if fully set forth herein. Processor 70 passes the LPC residual through a third-order elliptic low-pass filter to limit the bandwidth to about 700 Hz, and then performs 8:1 decimation of the low-pass filter output. Using a pitch analysis window corresponding to the last three subframes of the current frame, the correlation coefficients of the decimated signal are calculated for time lags of 3 to 34, which correspond to time lags of 24 to 272 samples in the undecimated signal domain. The allowable range for the pitch period is thus 1.5 ms to 17 ms, or 59 Hz to 667 Hz in terms of pitch frequency. This is sufficient to cover the normal pitch range of most speakers, including low-pitched men and high-pitched children.

After the correlation coefficients of the decimated signal have been calculated, the first major peak of the correlation coefficients, that is, the one with the lowest time lag, is identified.
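The lag range used in this first-stage search can be checked with a few lines of arithmetic. The constants are the ones stated in the text; the variable names are only illustrative.

```python
FS_HZ = 16000             # sampling rate
DECIMATION = 8            # 8:1 decimation of the low-pass filtered residual
DECIMATED_LAGS = (3, 34)  # lag range searched in the decimated domain

min_lag = DECIMATED_LAGS[0] * DECIMATION   # lag in undecimated samples
max_lag = DECIMATED_LAGS[1] * DECIMATION

min_period_ms = 1000.0 * min_lag / FS_HZ   # shortest pitch period
max_period_ms = 1000.0 * max_lag / FS_HZ   # longest pitch period
max_pitch_hz = FS_HZ / min_lag             # highest pitch frequency
min_pitch_hz = FS_HZ / max_lag             # lowest pitch frequency

print(min_lag, max_lag, min_period_ms, max_period_ms,
      round(min_pitch_hz), round(max_pitch_hz))
```

Searching 32 decimated lags instead of 249 full-resolution lags is what makes the first stage cheap.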
This is the first-stage search. Let t be the resulting time lag. This value t is multiplied by 8 to obtain the time lag in the undecimated signal domain. The resulting time lag, 8t, points to the neighborhood in which the true pitch period is most likely to lie. To retain the original time resolution in the undecimated signal domain, a second-stage pitch search is performed over the range 8t-4 to 8t+4. The correlation coefficients of the original undecimated LPC residual d are calculated for time lags from 8t-4 to 8t+4 (subject to a lower bound of 24 samples and an upper bound of 272 samples). The time lag corresponding to the maximum correlation coefficient in this range is then identified as the final pitch period. This pitch period is encoded with 8 bits, and the 8-bit index IP is provided to the multiplexer 180 for transmission to the decoder as side information. Because only 272 - 24 + 1 = 249 values can be selected as the pitch period, 8 bits are sufficient to represent it.

Only one such 8-bit pitch index is transmitted for each 20 ms frame. Processor 70 determines the pitch period kpi for each subframe as follows. If the difference between the extracted pitch period of the current frame and that of the last frame is greater than or equal to 20%, the extracted pitch period described above is used for every subframe in the current frame. If, on the other hand, this relative pitch change is less than 20%, the extracted pitch period is used for the last three subframes of the current frame, while the pitch periods of the first two subframes are obtained by linear interpolation between the extracted pitch period of the last frame and that of the current frame.

b. Pitch predictor taps

The closed-loop pitch tap quantizer and pitch predictor 80 performs the following operations for each subframe: (1) closed-loop quantization of the three pitch taps, (2) generation of dh, the pitch-predicted version of the LPC prediction residual d in the current subframe, and (3) generation of hd, the closest match to the target signal tp.

Processor 80 has an internal buffer that stores previous samples of the signal dt, which can be regarded as the quantized version of the LPC prediction residual d. For each subframe, processor 80 uses the pitch period kpi to extract three 64-dimensional vectors from the dt buffer. These three vectors, called x_1, x_2 and x_3, are respectively kpi-1, kpi and kpi+1 samples earlier than the current subframe of dt. The three vectors are then separately filtered by a shaping filter (with coefficient array awc) that has zero initial filter memory. The three resulting 64-dimensional output vectors are called y_1, y_2 and y_3. Next, processor 80 needs to search through a codebook of 64 candidate sets of three pitch predictor taps b_1j, b_2j, b_3j (j = 1, 2, ..., 64) and find the optimal set b_1k, b_2k, b_3k that minimizes the distortion measure

D_j = || tp - (b_1j y_1 + b_2j y_2 + b_3j y_3) ||^2.

This type of problem has been studied before, and an efficient search method can be found in U.S. Pat. No. 5,327,520. The details of this technique are not presented here, but the basic idea is as follows.

It can be shown that minimizing this distortion measure is equivalent to maximizing the inner product of two 9-dimensional vectors. One of these 9-dimensional vectors contains only correlation terms involving y_1, y_2 and y_3. The other 9-dimensional vector contains only product terms derived from the set of three pitch predictor taps under evaluation.
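The equivalence between minimizing the squared error and maximizing a 9-dimensional inner product can be made concrete with the sketch below. It expands the squared error between tp and the tap-weighted sum of y_1, y_2, y_3; the ordering of the nine components and the function names are assumptions chosen for illustration, not the layout used in U.S. Pat. No. 5,327,520.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def correlation_vector(tp, y):
    """Nine signal-dependent terms: cross-correlations, energies, cross terms."""
    y1, y2, y3 = y
    return [dot(tp, y1), dot(tp, y2), dot(tp, y3),
            dot(y1, y1), dot(y2, y2), dot(y3, y3),
            dot(y1, y2), dot(y2, y3), dot(y3, y1)]

def tap_product_vector(b):
    """Nine signal-independent terms for one candidate tap set (precomputable)."""
    b1, b2, b3 = b
    return [2 * b1, 2 * b2, 2 * b3,
            -b1 * b1, -b2 * b2, -b3 * b3,
            -2 * b1 * b2, -2 * b2 * b3, -2 * b3 * b1]

def search_taps(tp, y, table):
    """Pick the table entry with the largest inner product."""
    c = correlation_vector(tp, y)
    return max(range(len(table)), key=lambda j: dot(c, table[j]))

def brute_force_search(tp, y, codebook):
    """Reference search: directly minimize the squared error."""
    def err(b):
        return sum((t - b[0] * a1 - b[1] * a2 - b[2] * a3) ** 2
                   for t, a1, a2, a3 in zip(tp, *y))
    return min(range(len(codebook)), key=lambda j: err(codebook[j]))
```

The inner product equals the energy of tp minus the squared error, so the candidate that maximizes it is the one that minimizes the error, while the per-candidate cost drops from a 64-sample error computation to nine multiply-adds.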
Because such vectors do not depend on the signal and depend only on the pitch tap codebook, and because there are only 64 possible vectors (one for each pitch tap code vector), they can be pre-computed and stored in a table, the VQ codebook. In the actual codebook search, the 9-dimensional correlation vector of y_1, y_2 and y_3 is calculated first. Next, the inner product of this vector with each of the 64 pre-computed and stored 9-dimensional vectors is calculated. The vector in the stored table that gives the maximum inner product is the winner, and the three quantized pitch predictor taps are derived from it. Since there are 64 vectors in the stored table, a 6-bit index IT(m) for the m-th subframe is sufficient to represent the three quantized pitch predictor taps. Since there are five subframes in each frame, a total of 30 bits per frame are used to represent the three pitch taps used for all subframes. These 30 bits are provided to the multiplexer 180 for transmission to the decoder as side information.

For each subframe, after the optimal set of three pitch taps b_1k, b_2k, b_3k has been selected by the codebook search method described above, the pitch-predicted version of d is calculated as

dh = b_1k x_1 + b_2k x_2 + b_3k x_3.

The output signal vector hd is calculated as

hd = b_1k y_1 + b_2k y_2 + b_3k y_3.

This vector hd is subtracted from the vector tp by the subtraction unit 90. The result is tt, the target vector for transform coding.

4. Transform coding of the target vector

a. Shaping filter magnitude response for normalization

The target vector tt is encoded subframe by subframe by blocks 100 through 150 using a transform coding approach. The shaping filter magnitude response processor 100 calculates the signal mag in the following way. First, the processor takes the shaping filter coefficient array awc of the last subframe of the current frame, zero-pads it to 64 samples, and then performs a 64-point FFT on the resulting 64-dimensional vector.
Then the magnitudes of the 33 FFT coefficients that correspond to the frequency range of 0 to 8 kHz are calculated. The resulting vector mag is the magnitude response of the shaping filter for the last subframe. To save computation, the mag vectors for the first four subframes are obtained by linear interpolation between the mag vector of the last subframe of the last frame and that of the last subframe of the current frame.

b. Transform and gain normalization

The transform processor 110 performs a number of operations, as described below. First, it uses a 64-point FFT to transform the 64-dimensional vector tt of the current subframe. This transform size of 64 samples (or 4 ms) avoids the so-called "pre-echo" distortion known in the audio coding art. See Jayant, N. et al., "Signal Compression Based on Models of Human Perception", Proc. IEEE, pp. 1385-1422, October 1993, which is incorporated by reference as if fully set forth herein. Each of the first 33 complex FFT coefficients is then divided by the corresponding element of the mag vector. The resulting normalized FFT coefficient vector is partitioned into three frequency bands: (1) the low-frequency band (0 to 1250 Hz), consisting of the first 6 normalized FFT coefficients; (2) the mid-frequency band (1500 to 3750 Hz), consisting of the next 10 normalized FFT coefficients; and (3) the high-frequency band (4000 to 8000 Hz), consisting of the remaining 17 normalized FFT coefficients.

The total energy in each of the three bands is calculated and then converted to a dB value, called the log gain. The log gain of the low-frequency band is quantized using a 5-bit scalar quantizer designed with the Lloyd algorithm, which is well known in the art. The quantized low-frequency log gain is subtracted from the log gains of the mid- and high-frequency bands.
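The magnitude response and band log gains described so far can be sketched as follows. This is a plain-Python illustration under stated assumptions: a direct DFT stands in for the 64-point FFT, the dB conversion is taken to be 10*log10 of the total band energy, a tiny constant guards the logarithm, and all function names are hypothetical.

```python
import cmath
import math

N_FFT = 64
N_BINS = N_FFT // 2 + 1               # 33 bins, 250 Hz apart, covering 0 to 8000 Hz
BANDS = ((0, 6), (6, 16), (16, 33))   # low, mid, high bands as bin ranges

def dft_first_bins(x, n_fft=N_FFT, n_bins=N_BINS):
    """Zero-pad x to n_fft samples and return the first n_bins DFT coefficients."""
    padded = list(x) + [0.0] * (n_fft - len(x))
    return [sum(v * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n, v in enumerate(padded))
            for k in range(n_bins)]

def shaping_magnitude(awc):
    """Magnitudes of the 33 coefficients of the zero-padded awc array (the mag vector)."""
    return [abs(c) for c in dft_first_bins(awc)]

def band_log_gains(tt, mag):
    """Normalize the transform of tt by mag and return the three band gains in dB."""
    normalized = [c / m for c, m in zip(dft_first_bins(tt), mag)]
    gains_db = []
    for lo, hi in BANDS:
        energy = sum(abs(c) ** 2 for c in normalized[lo:hi])
        gains_db.append(10.0 * math.log10(energy + 1e-12))  # assumed dB formula
    return normalized, gains_db
```

The bin ranges reproduce the band edges in the text: bins 0 to 5 span 0 to 1250 Hz, bins 6 to 15 span 1500 to 3750 Hz, and bins 16 to 32 span 4000 to 8000 Hz.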
The resulting level-adjusted mid- and high-frequency log gains are concatenated to form a 2-dimensional vector, which is then quantized by a 7-bit vector quantizer whose codebook is designed with the generalized Lloyd algorithm, well known in the art. The quantized low-frequency log gain is then added back to the quantized versions of the level-adjusted mid- and high-frequency log gains to obtain the quantized log gains of the mid- and high-frequency bands. Next, all three quantized log gains are converted from the logarithmic (dB) domain to the linear domain. Each of the 33 normalized FFT coefficients (normalized by mag as described above) is then further divided by the quantized linear gain of the frequency band in which that FFT coefficient lies. After this second stage of normalization, the result is the final normalized transform coefficient vector tc, which contains 33 complex numbers representing frequencies from 0 to 8000 Hz.

During quantization of the log gains in the m-th subframe, the transform processor 110 produces a 5-bit gain codebook index IG(m, 1) for the low-frequency log gain and a 7-bit codebook index IG(m, 2) for the mid- and high-frequency log gains. The three log gains are thus encoded at a bit rate of 12 bits per subframe, or 60 bits per frame. These 60 bits are provided to the multiplexer 180 for transmission to the decoder as side information. Together with the 49 bits for the LSP, the 8 bits for the pitch period and the 30 bits for the pitch taps, these 60 gain bits form the side information, which totals 49 + 8 + 30 + 60 = 147 bits per frame.

c. Bit stream

As described above, 49 bits/frame are allocated to encode the LPC parameters, 8 + (6 × 5) = 38 bits/frame to the 3-tap pitch predictor, and (5 + 7) × 5 = 60 bits/frame to the gains. The total number of side information bits is therefore 49 + 38 + 60 = 147 bits per 20 ms frame, or about 30 bits per 4 ms subframe.
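The side-information tally above is simple arithmetic and can be reproduced directly. All figures come from the text; the variable names are illustrative.

```python
SUBFRAMES_PER_FRAME = 5

lsp_bits = 7 * 7                                   # seven 7-bit split-VQ indices
pitch_period_bits = 8                              # one index IP per frame
pitch_tap_bits = 6 * SUBFRAMES_PER_FRAME           # one 6-bit index IT(m) per subframe
gain_bits = (5 + 7) * SUBFRAMES_PER_FRAME          # IG(m, 1) and IG(m, 2) per subframe

side_bits_per_frame = lsp_bits + pitch_period_bits + pitch_tap_bits + gain_bits
side_bits_per_subframe = side_bits_per_frame / SUBFRAMES_PER_FRAME

print(side_bits_per_frame, side_bits_per_subframe)
```

The exact per-subframe average is 29.4 bits, which the text rounds to about 30.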
Consider that the encoder may be used at one of three different rates: 16, 24 and 32 kb/s. At a sampling rate of 16 kHz, these three target rates correspond to 1, 1.5 and 2 bits/sample, or 64, 96 and 128 bits/subframe, respectively. With 30 bits/subframe used for side information, the numbers of bits remaining for encoding the main information (the FFT coefficients) are 34, 66 and 98 bits/subframe for the three rates of 16, 24 and 32 kb/s, respectively.

d. Adaptive bit allocation

According to the principles of the present invention, adaptive bit allocation is performed to assign these remaining bits to different parts of the frequency spectrum with different quantization accuracy, in order to enhance the perceptual quality of the output speech at the TPC decoder. This is done by using a model of human sensitivity to noise in audio signals. Such models are well known in the art of perceptual audio coding. See, for example, Tobias, J. V., ed., Foundations of Modern Auditory Theory, Academic Press, New York and London, 1970. See also Schroeder, M. R. et al., "Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear", J. Acoust. Soc. Amer., 66: 1647-1652, December 1979 (Schroeder et al.), which is incorporated by reference as if fully set forth herein.

The hearing model and quantizer control processor 130 performs adaptive bit allocation and generates an output vector ba that tells the transform coefficient quantizer 120 how many bits to use to quantize each of the 33 normalized transform coefficients contained in tc. Although adaptive bit allocation could be performed once per subframe, the embodiment of the present invention performs bit allocation once per frame in order to reduce computational complexity.
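The bit budget that the adaptive allocation has to distribute follows from the rate and the subframe length. The sketch below reproduces the per-subframe figures quoted in this section; the function name is hypothetical and the 30-bit side-information figure is the rounded value from the text.

```python
SAMPLE_RATE_KHZ = 16
SUBFRAME_MS = 4
SIDE_BITS_PER_SUBFRAME = 30   # rounded side-information figure from the text

def main_bits_per_subframe(rate_kbps):
    """Bits left for the FFT coefficients in one 4 ms subframe."""
    total_bits = rate_kbps * SUBFRAME_MS   # kb/s times ms gives bits
    return total_bits - SIDE_BITS_PER_SUBFRAME

for rate in (16, 24, 32):
    print(rate, rate / SAMPLE_RATE_KHZ, rate * SUBFRAME_MS, main_bits_per_subframe(rate))
```

Because the exact side information is 147 bits per frame rather than 5 × 30 = 150, a frame-level tally leaves slightly more room than five times the per-subframe figure; how any such slack is used is not stated in the text.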
Rather than deriving the noise masking threshold and bit allocation from the unquantized input signal, as is done in conventional music coders, the embodiment determines them from the frequency response of the quantized LPC synthesis filter (often called the "LPC spectrum"). The LPC spectrum can be regarded as an approximation of the spectral envelope of the input signal within the 24 ms LPC analysis window. It is determined from the quantized LPC coefficients, which the LPC parameter processor 10 provides to the hearing model and quantizer control processor 130. Processor 130 determines the LPC spectrum as follows. The quantized LPC filter coefficients a are first transformed by a 64-point FFT. The power of each of the first 33 FFT coefficients is determined, and the reciprocal is then taken. The result is the LPC power spectrum at the frequency resolution of a 64-point FFT.
After the LPC power spectrum has been determined, the estimated noise masking threshold TM is calculated using a modified version of the method described in patent No. 5,314,457, which is incorporated by reference as if fully set forth here. Processor 130 scales the 33 samples of the LPC power spectrum by a frequency-dependent attenuation function determined empirically from subjective listening experiments. The attenuation function starts at 12 dB for the DC term of the LPC power spectrum, rises to about 15 dB between 700 and 800 Hz, and then decreases monotonically toward higher frequencies, finally reaching 6 dB at 8000 Hz. Each of the 33 attenuated LPC power spectrum samples is then used to scale a "basilar membrane spreading function" derived for that particular frequency in order to calculate the masking threshold.
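The LPC power spectrum computation described above (64-point transform of the quantized LPC coefficients, power of the first 33 bins, then the reciprocal) can be sketched as follows. This is a minimal illustration, assuming the coefficient array is passed with its leading 1, i.e. a = [1, a1, ..., aP] for the prediction error filter A(z); that convention and the function name are assumptions, and a direct DFT stands in for an FFT to keep the sketch dependency-free.

```python
import cmath


def lpc_power_spectrum(a, n_fft=64):
    """Return n_fft/2 + 1 samples of 1 / |A(e^{jw})|^2.

    `a` is assumed to hold the prediction error filter coefficients
    including the leading 1 (an illustrative convention).
    """
    n_bins = n_fft // 2 + 1          # 33 bins for a 64-point transform
    spectrum = []
    for k in range(n_bins):
        # Direct DFT of the zero-padded coefficient array at bin k.
        acc = 0j
        for n, coeff in enumerate(a):
            acc += coeff * cmath.exp(-2j * cmath.pi * k * n / n_fft)
        # Power of the bin, then its reciprocal. A stable LPC filter has
        # no zeros on the unit circle, so abs(acc) is nonzero.
        spectrum.append(1.0 / abs(acc) ** 2)
    return spectrum
```

With a 16 kHz sampling rate, bin k corresponds to k × 250 Hz, so the 33 samples cover 0 to 8000 Hz.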
The spreading function for any frequency corresponds to the shape of the masking threshold produced by a single-tone masker signal at that frequency. Equation (5) of Schroeder et al. describes such spreading functions in terms of the "Bark" frequency scale, or critical-band frequency scale, and is incorporated by reference as if fully set forth here. The scaling process begins by converting the first 33 frequencies of the 64-point FFT (i.e., 0 Hz, 250 Hz, 500 Hz, ..., 8000 Hz) to the Bark frequency scale. Next, for each of the 33 resulting Bark values, the corresponding spreading function is sampled at these 33 Bark values using equation (5) of Schroeder et al. The resulting 33 spreading functions are stored in a table, which may be done as part of an offline process. To calculate the estimated masking threshold, each of the 33 spreading functions is multiplied by the corresponding sample of the attenuated LPC power spectrum, and the resulting 33 scaled spreading functions are summed. The result is the estimated masking threshold function. It should be noted that this is not the only available technique for estimating the masking threshold.
To keep complexity low, processor 130 uses a "greedy" algorithm to perform adaptive bit allocation. The technique is greedy in the sense that it allocates one bit at a time to the neediest frequency component, without regard to the potential effect on later bit allocations. At the start, when no bits have been assigned, the corresponding output speech is zero and the coding error signal is the input speech itself. The LPC power spectrum is therefore initially taken to be the power spectrum of the coding noise. Then, using the masking threshold calculated above and a simplified version of the method of Schroeder et al. for calculating the perceived magnitude (loudness) of noise, the noise magnitude at each of the 33 frequencies of the 64-point FFT is estimated.
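The masking threshold estimate described above (convert the 33 bin frequencies to Bark, tabulate one spreading function per bin, then sum the spreading functions scaled by the attenuated LPC power spectrum) can be sketched as below. The Bark mapping f = 650 sinh(x/7) and the spreading function formula are quoted from memory of Schroeder et al. and should be checked against that paper before use; the function and variable names are illustrative, and the empirical attenuation function is assumed to have been applied already.

```python
import math

N_BINS = 33
BIN_SPACING_HZ = 250.0  # 16 kHz sampling, 64-point FFT


def hz_to_bark(f_hz):
    """Bark value x for frequency f, inverting f = 650 * sinh(x / 7)."""
    return 7.0 * math.asinh(f_hz / 650.0)


def spread_db(dx):
    """Spreading function in dB at a Bark distance dx from the masker."""
    y = dx + 0.474
    return 15.81 + 7.5 * y - 17.5 * math.sqrt(1.0 + y * y)


def build_spread_table():
    """table[j][i]: linear-power spread of a masker at bin j, sampled at bin i."""
    barks = [hz_to_bark(i * BIN_SPACING_HZ) for i in range(N_BINS)]
    return [[10.0 ** (spread_db(bi - bj) / 10.0) for bi in barks] for bj in barks]


def masking_threshold(attenuated_power, table):
    """Sum the spreading functions, each scaled by its attenuated power sample."""
    return [
        sum(attenuated_power[j] * table[j][i] for j in range(N_BINS))
        for i in range(N_BINS)
    ]
```

Because the table depends only on the bin frequencies, it can be built once offline, as the text notes, leaving a 33 × 33 multiply-accumulate per frame at run time.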
The simplified noise magnitude (loudness) at each of the 33 frequencies is calculated as follows. First, the critical bandwidth $B_i$ at the i-th frequency is calculated by linear interpolation of the critical bandwidths listed in Table 1 of Scharf's book chapter in Tobias. The result is an approximation of the term df/dx in equation (3) of Schroeder et al. The 33 critical bandwidth values are calculated in advance and stored in a table. Next, for the i-th frequency, the noise power $N_i$ is compared with the masking threshold $M_i$. If $N_i \le M_i$, the noise magnitude $L_i$ is set to zero. If $N_i > M_i$, the noise magnitude is calculated as
$$L_i = B_i \left( \frac{N_i - M_i}{1 + (S_i / N_i)^2} \right)^{0.25}$$
where $S_i$ is the sample value of the LPC power spectrum at the i-th frequency.
Once the noise magnitude has been calculated for all 33 frequencies, the frequency with the maximum noise magnitude is identified and one bit is assigned to it. The noise power at this frequency is then reduced by a factor determined empirically from the signal-to-noise ratio (SNR) obtained during design of the VQ codebooks used to quantize the normalized FFT coefficients (example values of the reduction factor are 4 to 5 dB). The noise magnitude at this frequency is then updated using the reduced noise power. Next, the maximum of the updated noise magnitude array is identified and one bit is assigned to the corresponding frequency. This process repeats until all available bits are exhausted.
For the 32 and 24 kb/s TPC encoders, each of the 33 frequencies can receive bits during adaptive bit allocation. For the 16 kb/s TPC encoder, on the other hand, better speech quality can be achieved if the encoder assigns bits only to the 0 to 4 kHz frequency range (i.e., the first 16 FFT coefficients) and synthesizes the remaining FFT coefficients in the higher 4 to 8 kHz band using the high frequency synthesis processor 140.
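The greedy allocation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the masking thresholds and critical bandwidths are taken as given inputs, a single reduction step of 4.5 dB per bit is picked from the 4 to 5 dB range the text mentions, and the per-coefficient bit caps and vector quantizer groupings described elsewhere in the text are left out. All names are illustrative.

```python
def noise_magnitude(n, m, s, b):
    """L_i = B_i * ((N_i - M_i) / (1 + (S_i / N_i)^2)) ** 0.25, or 0 if N_i <= M_i."""
    if n <= m:
        return 0.0
    return b * ((n - m) / (1.0 + (s / n) ** 2)) ** 0.25


def greedy_allocate(lpc_power, mask, bandwidth, total_bits, step_db=4.5):
    """Assign bits one at a time to the frequency with the largest noise magnitude."""
    count = len(lpc_power)
    noise = list(lpc_power)              # no bits yet: coding noise equals the signal
    bits = [0] * count
    reduction = 10.0 ** (-step_db / 10.0)  # power reduction per allocated bit
    magnitude = [
        noise_magnitude(noise[i], mask[i], lpc_power[i], bandwidth[i])
        for i in range(count)
    ]
    for _ in range(total_bits):
        i = max(range(count), key=magnitude.__getitem__)  # neediest frequency
        bits[i] += 1
        noise[i] *= reduction
        # Only the frequency that received the bit needs to be updated.
        magnitude[i] = noise_magnitude(noise[i], mask[i], lpc_power[i], bandwidth[i])
    return bits
```

For the 16 kb/s case, the same function could be called on the first 16 entries only, since the text restricts allocation to 0 to 4 kHz at that rate.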
Note that, because the quantized LPC coefficients a are also available at the TPC decoder, there is no need to transmit the bit allocation information. This information is determined by a replica of the hearing model quantizer control processor 50 in the decoder. The TPC decoder can thus locally replicate the adaptive bit allocation operation of the encoder to obtain the bit allocation information.
e. Transform coefficient quantization
The transform coefficient quantizer 120 quantizes the transform coefficients contained in tc using the bit allocation signal ba. The DC term of the FFT is a real number; if it receives any bits during bit allocation, it is scalar quantized. The maximum number of bits it can receive is four. For the 2nd to 16th FFT coefficients, a conventional two-dimensional vector quantizer is used to quantize the real and imaginary parts jointly. The maximum number of bits for this two-dimensional VQ is 6. For the remaining FFT coefficients, a conventional four-dimensional vector quantizer is used to quantize jointly the real and imaginary parts of two adjacent FFT coefficients. After quantization of the transform coefficients, the resulting VQ codebook index array IC contains the main information of the TPC encoder. This index array IC is supplied to multiplexer 180, where it is combined with the side information bits. The result is the final bit stream transmitted over a communication channel to the TPC decoder.
Transform coefficient quantizer 120 also decodes the quantized values of the normalized transform coefficients. It then restores the original gain levels of these coefficients by multiplying each of them by the corresponding element of mag and by the quantized linear gain of the corresponding frequency band. The result is the output vector dtc.
f.
High frequency synthesis and noise filling
For the 16 kb/s encoder, adaptive bit allocation is limited to the 0 to 4 kHz band, and processor 140 synthesizes the 4 to 8 kHz band. Before doing so, the hearing model quantizer control processor 130 first calculates, for frequencies in the 4 to 7 kHz band, the ratio between the LPC power spectrum and the masking threshold, i.e., the signal-to-masking-threshold ratio (SMR). The 17th to 29th FFT coefficients (4 to 7 kHz) are synthesized using a random phase and a magnitude controlled by the SMR. For frequencies where SMR > 5 dB, the magnitude of the FFT coefficient is set to the quantized linear gain of the high-frequency band. For frequencies where SMR ≤ 5 dB, the magnitude is 2 dB below the quantized linear gain of the high-frequency band. For the 30th to 33rd FFT coefficients, the magnitude ramps down from 2 dB to 30 dB below the quantized linear gain of the high-frequency band, and the phase is again random.
For the 32 kb/s and 24 kb/s encoders, bit allocation is performed over the entire frequency band, as described above. However, some frequencies in the 4 to 8 kHz band may still receive no bits. In this case, the high frequency synthesis and noise filling procedure described above is applied only to those frequencies that receive no bits.
After such high frequency synthesis and noise filling has been applied to the vector dtc, the resulting output vector qtc contains the quantized version of the transform coefficients before normalization.
g. Inverse transformation and filter memory update
The inverse transform processor 150 performs an inverse FFT on the 64-element complex vector represented by the half-size 33-element vector qtc. The result is the output vector qtt, which is the quantized version of the time-domain target vector tt for transform coding.
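The high frequency synthesis and noise filling rule from subsection f can be sketched as follows. This is an illustration under several assumptions that are not confirmed by the text: the roll-off over the 30th to 33rd coefficients is taken to run from 2 dB to 30 dB below the high-band gain and to be linear in dB, the gain is treated as an amplitude (20 log10 convention), and the function name and argument layout are hypothetical.

```python
import cmath
import math
import random


def fill_high_band(dtc, bits, smr_db, hf_gain, rng):
    """Synthesize every 4-8 kHz coefficient that received no bits.

    dtc     : 33 decoded complex coefficients (bin i is i * 250 Hz)
    bits    : bits allocated to each of the 33 bins
    smr_db  : signal-to-masking-threshold ratio per bin, used for 4-7 kHz
    hf_gain : quantized linear gain of the high-frequency band
    rng     : random.Random instance supplying the random phases
    """
    qtc = list(dtc)
    for i in range(16, 33):                 # 17th..33rd coefficients
        if bits[i] > 0:
            continue                        # keep coefficients that were coded
        if i <= 28:                         # 17th..29th: 4000..7000 Hz
            drop_db = 0.0 if smr_db[i] > 5.0 else 2.0
        else:                               # 30th..33rd: assumed linear-in-dB ramp
            drop_db = 2.0 + 28.0 * (i - 29) / 3.0
        magnitude = hf_gain * 10.0 ** (-drop_db / 20.0)
        phase = rng.uniform(0.0, 2.0 * math.pi)
        qtc[i] = cmath.rect(magnitude, phase)
    return qtc
```

At 16 kb/s no coefficient above the 16th receives bits, so the whole 4 to 8 kHz band is synthesized; at 24 and 32 kb/s only the unallocated bins are filled. The sketch leaves the 33rd (Nyquist) coefficient complex; an implementation that needs a real inverse FFT output would have to treat that bin separately.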
With zero initial filter state (filter memory), the inverse shaping filter 160, which is an all-zero filter with coefficient array awc, filters the vector qtt to produce the vector et. The adder 170 then adds dh to et to obtain the quantized LPC prediction residual dt. This dt vector is then used to update the internal storage buffer in the closed-loop pitch tap quantizer and pitch estimator 80. It is also used to excite the internal shaping filter of the zero-input response processor 50, so as to establish the correct filter memory in preparation for zero-input response generation in the next subframe.
C. Decoder embodiment
A decoder embodiment of the present invention is shown in the drawings. For each frame, demultiplexer 200 separates all the main and side information components from the received bit stream. The main information, i.e., the transform coefficient index array IC, is supplied to the transform coefficient decoder 235. To decode this main information, adaptive bit allocation must be performed to determine how many main information bits are associated with each quantized transform coefficient.
The first step in adaptive bit allocation is to generate the quantized LPC coefficients (on which the allocation depends). The demultiplexer 200 provides the seven LSP codebook indices IL(1) to IL(7) to the LPC parameter decoder 215, which performs table lookups from the seven LSP VQ codebooks to obtain the 16 quantized LSP coefficients. LPC parameter decoder 215 then performs the same sorting, interpolation and LSP-to-LPC coefficient conversion operations as blocks 345, 350 and 355 in the drawings.
With the LPC coefficient array a calculated, the hearing model quantizer control processor 220 determines the bit allocation for each FFT coefficient (based on the quantized LPC parameters) in the same manner as processor 130 in the TPC encoder (FIG. 1).
Similarly, the shaping filter coefficient processor 225 and the shaping filter magnitude response processor 230 are replicas of the corresponding processors 30 and 100 in the TPC encoder. Processor 230 generates the magnitude response mag of the shaping filter for use by the transform coefficient decoder 235.
Once the bit allocation information has been derived, the transform coefficient decoder 235 can correctly decode the main information and obtain the quantized versions of the normalized transform coefficients. The decoder 235 also decodes the gains using the gain index array IG. For each subframe there are two gain indices (5 and 7 bits), which are decoded into the quantized version of the low-frequency log gain and the quantized versions of the level-adjusted mid-frequency and high-frequency log gains. The quantized low-frequency log gain is then added back to the quantized versions of the level-adjusted mid-frequency and high-frequency log gains to obtain the quantized log gains of the mid-frequency and high-frequency bands. Next, all three quantized log gains are converted from the logarithmic (dB) domain to the linear domain. Each of the three quantized linear gains is used to multiply the quantized versions of the normalized transform coefficients in the corresponding frequency band. Each of the resulting 33 gain-scaled quantized transform coefficients is then further multiplied by the corresponding element of the shaping filter magnitude response array mag. After these two scaling stages, the result is the decoded transform coefficient array dtc.
High frequency synthesis processor 240, inverse transform processor 245 and inverse shaping filter 250 are again exact replicas of the corresponding blocks (140, 150 and 160) in the TPC encoder.
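The gain decoding and the two scaling stages described above can be sketched as follows. This is a minimal illustration, assuming the log gains are amplitude gains in dB (so that linear = 10^(dB/20)) and that the mapping from each of the 33 bins to its low, mid or high band is supplied by the caller; neither the dB convention nor the band edges are stated in this section, and the names are hypothetical.

```python
def decode_gains(low_db, adjusted_mid_db, adjusted_high_db):
    """Undo the level adjustment and convert the three log gains to linear."""
    mid_db = adjusted_mid_db + low_db      # add the low-band gain back
    high_db = adjusted_high_db + low_db
    # Assumed amplitude convention: 20 * log10(linear gain).
    return tuple(10.0 ** (g / 20.0) for g in (low_db, mid_db, high_db))


def rescale(normalized, mag, gains, band_of_bin):
    """Apply both scaling stages to the quantized normalized coefficients.

    band_of_bin[i] is 0, 1 or 2 (low, mid, high) for bin i; the band edges
    are not given in this section, so the mapping is left to the caller.
    """
    return [
        coeff * gains[band_of_bin[i]] * mag[i]
        for i, coeff in enumerate(normalized)
    ]
```

The same two multiplications appear in the encoder, where transform coefficient quantizer 120 restores the gain levels to form dtc, so encoder and decoder can share one routine.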
Together they perform high frequency synthesis and noise filling, inverse transformation and inverse shaping filtering to produce the quantized excitation vector et.
The pitch decoder and interpolator 205 decodes the 8-bit pitch index IP to obtain the pitch period for the last three subframes, and then interpolates the pitch period for the first two subframes in the same manner as in block 70 of the encoder. The pitch tap decoder and pitch predictor 210 decodes the pitch tap index IT for each subframe to obtain the three quantized pitch predictor taps $b_{1k}$, $b_{2k}$ and $b_{3k}$. It then uses the interpolated pitch period kpi to extract the same three vectors $x_1$, $x_2$ and $x_3$ as described in the encoder section. (These three vectors are respectively kpi-1, kpi and kpi+1 samples earlier than the current frame.) It then computes the pitch-predicted version of the LPC residual as follows:
$$dh = \sum_{j=1}^{3} b_{jk}\, x_j$$
The adder 255 adds dh to et to obtain dt, the quantized version of the LPC prediction residual d. This dt vector is fed back to the pitch predictor in block 210 to update its internal storage buffer for dt.
The long-term post-filter 260 is basically similar to the long-term postfilter used in the ITU-T G.728 standard 16 kb/s low-delay CELP coder. The main differences are that the sum of the three quantized pitch taps is used as the voicing indicator, and that the scaling factor for the long-term post-filter coefficient is 0.4 instead of 0.15 as in G.728. If this voicing indicator is less than 0.5, the post-filtering operation is skipped and the output vector fdt is identical to the input vector dt. If the indicator is greater than 0.5, the post-filtering operation is performed.
The LPC synthesis filter 265 is a standard LPC filter, i.e., an all-pole direct-form filter with the quantized LPC coefficient array a. It filters the signal fdt and generates the long-term post-filtered quantized speech vector st.
This st baek The tor is passed through a short-term post-filter 270 to output the final TPC decoder. A force audio signal fst is generated. Again, this short-term post-filter 270 Very similar to the short-term post-filter used in .728. The only difference Is in the following points. First, the polar control coefficient, the zero control coefficient, and the spectral tilt The dynamic control coefficients are 0.75, 0.65 and 0.15 in G.728, respectively. Instead of the corresponding values, they are 0.7, 0.55 and 0.4. Second, primary spectrum tilt The coefficients of the compensation filter are linearly interpolated between frames between samples. Thus, Possibly audible clicks due to discontinuities at frame boundaries Sounds can be avoided. The long-term and short-term post-filters are the perceived coding noise in the output signal fst This has the effect of lowering the sound level and thus increasing the voice quality.

Claims

1. A method for encoding a frame of a speech signal, the method comprising the steps of:
removing short-term correlation from the speech signal using a linear prediction filter to generate a prediction residual signal;
determining an open-loop pitch period estimate of the speech signal based on the prediction residual signal;
determining pitch filter tap weights for two or more subframes of the frame based on a quantized version of the prediction residual signal;
forming a pitch prediction residual signal based on the open-loop pitch period estimate, the pitch filter tap weights for the two or more subframes, and the prediction residual signal; and
quantizing the pitch prediction residual signal.