JP3566931B2

JP3566931B2 - Method and apparatus for assembling packet of audio signal code string and packet disassembly method and apparatus, program for executing these methods, and recording medium for recording program

Info

Publication number: JP3566931B2
Application number: JP2001018541A
Authority: JP
Inventors: 徹森永; 茂明佐々木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-01-26
Filing date: 2001-01-26
Publication date: 2004-09-15
Anticipated expiration: 2021-01-26
Also published as: JP2002221994A

Description

【０００１】
【発明の属する技術分野】
本発明は音声信号をパケット伝送するときに生じうるパケット消失により欠落した音声信号の品質劣化を抑えて補償する技術に用いられる音声信号の符号列のパケット組立方法、装置及びパケット分解方法、装置並びにこれらの方法を実行するプログラム、プログラムを記憶する記憶媒体に関する。
【０００２】
【従来の技術】
移動通信やＶｏＩＰ（ＶｏｉｃｅｏｖｅｒＩＰ）に代表されるように、パケット通信によって音声とデータを統合的に扱うことが可能となる。通常パケットのヘッダのオーバーヘッドを少なくするために１パケットには複数音声フレームを詰めて送信する場合が多い。
音声通信においては、一般的に音声信号を構成する信号系列が発生した順序が伝送後再生する場合においても維持されることが必須である反面、パケット通信では一定時間毎に発信されたパケットの伝送遅延が各々変動することから到着時刻に揺らぎが生じうる。その揺らぎを吸収して発信された順序にパケットに格納された符号により音声信号を再生するために揺らぎ吸収バッファを用いる。
【０００３】
パケット音声通信における問題点の一つとしてパケット消失があげられる。通信路が広帯域化、高速化されることにより、符号化による劣化、遅延は解消される。反面、パケット消失は通信容量が増えても避けられない問題である。
パケット消失が起こる原因として次のものがあげられる。まず、パケット数が多い場合、パケットどうしの衝突（コリジョン）によってパケットが完全に消失してしまう場合がある。また符号ビット誤りが伝送の過程で約５０％程度に達した場合、そのパケット情報は全て失われたものとし、パケット消失と判定されることがある。さらに、パケットの到着遅延が揺らぎ吸収バッファで補償されるよりも大きい場合にパケットが失われたものとしてパケット消失と判定されることがある。これらの原因によって音声の品質劣化が生じる。
【０００４】
品質の劣化によって聴覚に不快感を与えないために、失われたパケットの部分は別の何らかの信号で補償する必要がある。符号化方式によってはバッファの前後の情報を用いて符号化しているため、一度パケットが消失すると、復帰後しばらく品質が劣化することがある。その品質の劣化を聴感上抑制することもパケット消失補償に含まれる。
パケット消失により欠落した音声信号の品質劣化を抑えて補償する従来の技術を図９を参照して説明する。
【０００５】
欠落音声補間装置は、多重分離回路１０２、残差信号電力復号器１１９、逆量子化器１２０、長時間予測係数復号／選択器１１８、短時間予測係数復号器１０７、長時間合成／補間フィルタ１１７、短時間合成フィルタ１１０と入出力端子１０１，１１５，１１６を備え、音声符号化情報の欠落を検出した場合、すなわち、欠落検出信号が入力された場合には、接続遮断スイッチ１２１が開放されると共に長時間予測係数復号／選択器１１８から補間用の長時間予測係数（１．予め数値が設定されており、常に一定の値、もしくは２．長時間予測係数復号器から得た前フレームの長時間予測係数に応じた補間用の長時間予測係数）が長時間予測器１１２に出力される。また短時間予測器１１４には前フレームでの短時間予測係数をそのまま設定しておく。長時間合成／補間フィルタ１１７には何の入力もされずに自己駆動することにより出力信号を短時間合成フィルタ１１０に入力する。この短時間合成フィルタ１１０は通常の復号処理を行うことにより再生デジタル音声信号が補間される。（特開平５−８８６９７号公報参照）
この従来の技術は、復号器において、過去の信号からピッチ周期を解析し適当な波形を取り出し、それを繰り返すことによって、擬似的な信号を作る方法である。このピッチ周期繰り返し補償で最も劣化の原因となりやすいのは波形の不連続によるものである。その波形の不連続が発生しやすいのは、パケット消失間の補償信号とパケット消失から復帰後の信号の繋ぎ合わせの部分である。この不連続性を目立たなくするために、ピッチ周期を消失から復帰後と連続になるように調整する、あるいは、ＯＬＡ（Ｏｖｅｒｌａｐａｄｄ）によって、合成信号と復帰後の信号を徐々に変化させていくという方法や合成信号のパワーを徐々に減衰させることが提案されている。
【０００６】
低ビットレートの音声符号化に使用されるＣＥＬＰ（ＣｏｄｅＥｘｃｉｔｅｄＬｉｎｅａｒＰｒｅｄｉｃｔｉｏｎ：符号励振線形予測）方式のパケット消失補償では、パケット内の音声信号をあらかじめ周期性と非周期性に分類しておき、消失パケットのピッチ周波数が周期性であれば、適応符号帳の励振信号を用い、非周期性であれば白色雑音をランダムに使用するという方法が良く用いられる。
さらにその他の手法において、特徴的な処理として、合成フィルタ係数を反復させる、適応・固定コードブックゲインを減衰させる、ゲイン予測を減衰させるという手法があげられる。
【０００７】
これらの手法は聴覚に不快な信号を抑制する効果に関しては有効な手法であった。しかし、あくまで擬似的な合成信号の再生であり常に原音に近い音を再生することが困難である場合が多い。パケット間において、ピッチやパワーが急速に変動する場合、あるいはピッチ間隔の不一致による波形の不連続性や無理な調整によって音質が著しく劣化する場合があった。さらに、圧縮コーデックの場合は消失から復帰後に立ち上がりの部分が劣化するという問題点があった。
【０００８】
【発明が解決しようとする課題】
本発明では通信路の容量が十分に大きく、多少の補助情報を付加できることを前提として、従来のパケット消失補償技術の欠点を解消し、パケット消失による音声の品質劣化を改善することを課題としている。
従来技術ではパケットが消失している区間で、ピッチ周期、パワー等が変化する場合に劣化が顕著になる。本発明は、パケットに含ませる音声データ長が大きくても、パケットが消失している間の音声信号、消失から復帰後の音声信号の劣化を抑えることのできる符号化、および復号化方法、およびこれらの方法を実現する手段を提供することを課題とする。
【０００９】
【課題を解決するための手段】
上記課題を解決するために、本発明は、音声パケットの組立において、現フレームの音声信号を高品質な第１の符号化方法で符号化して第１の符号列を生成し、前記第１の符号列を復号して復号信号を生成し、現フレームより１個先からＮ（Ｎ：１以上の任意の整数）個先までのＮ個のフレームの音声信号を第２の符号化方法で符号化するために必要な内部状態を前記復号信号から算出して、前記算出した内部状態を用いて、前記現フレームより１個先からＮ個先までのＮ個のフレームの音声信号を高圧縮な第２の符号化方法で符号化して第２の符号列を生成し、前記第１の符号列と前記第２の符号列を結合して現フレーム時刻のパケットに格納し、また、音声パケットの分解において、フレーム毎にパケット消失の有無を判定し、現フレーム時刻のパケットが消失していないと判定された場合には現フレーム時刻のパケットに格納された符号列のうち現フレームの第１の符号列から第１の符号化方法と対応する第１の復号化方法で音声信号を復号し、現フレーム時刻のパケットが消失していると判定された場合には過去のフレーム時刻のパケットに格納された符号列のうち過去のフレームの第１の符号列から復号された音声信号より１個先からＮ個先までのＮ個のフレームの音声信号を高圧縮な第２の符号化方法に対応する第２の復号化方法で復号化するための内部状態を算出し、前記算出した内部状態を用いて現フレームの第２の符号化方法と対応する第２の復号化方法で音声信号を復号する。

【００１０】
【発明の実施の形態】
図１に示すようにＶｏＰ（ＶｏｉｃｅｏｖｅｒＰａｃｋｅｔ）では音声パケットをネットワークモジュール１で受信し、パケット毎に分解して揺らぎ吸収バッファ３に出力し、またパケット消失を判定してパケット消失フラグをパケット消失補償部２に出力する。
分解されたパケットは揺らぎ吸収バッファ３に蓄積し、しばらくパケットを待ってから再生を行う。
パケット消失の判定がされない場合には、揺らぎ吸収バッファ３の蓄積したパケットの現フレームのメインビットストリームを音声デコーダ４に出力する。パケット消失の判定がされた場合には、揺らぎ吸収バッファ３に届いている前後のパケットのメイン、サブビットストリームを使って効率の良いパケット消失補償を行うことができる。例えば、再生すべきパケット▲３▼が到着しない場合には、音声デコーダ４は分析係数を作成してパケット消失補償部２に出力し、パケット消失補償部２は分析係数と揺らぎ吸収バッファ３に届いている過去のパケット▲２▼のサブビットストリーム▲３▼を使って、消失補償データを音声デコーダ４に出力し、効率の良いパケット消失補償を行うことができる。音声デコーダ４はパラメータの補間や音量の制御を施すことにより、できるだけ劣化を抑えるように処理し、音声を出力することができる。
【００１１】
本発明においては通常使用している１つのメインコーデックに、パケットが消失した場合の補償手段として、複数のサブコーデックを組み合わせることによってパケット消失に対して耐性を持たせている。
メインコーデックは圧縮率の比較的低い、高品質の符号化方式を用い、また、サブコーデックはメインコーデックより高圧縮、かつ品質が十分良い符号化方式を選ぶ必要がある。このようにすることによりデータ量の増加を抑えることができる。
（エンコーダ）
図２，３を参照して本発明のエンコーダを説明する。
【００１２】
入力された音声データはフレーム（データ単位）に分割され、デコーダ側で通常再生すべき信号を、メインコーデック（エンコーダ）１１で第１の符号化方法により符号化してメインビットストリームをつくる。例えば、ＩＴＵ−Ｔが勧告した音声符号化方式で８ｋＨｚサンプリングの音声帯域信号を６４ｋｂ／ｓで伝送するパルス符号変調（ＰＣＭ）方式であるＧ．７１１符号化標準により符号化する。
また、１パケット分先の音声データをサブコーデック（エンコーダ）１４で第２の符号化方法により符号化してサブビットストリームを作る。例えば、ＩＴＵ−Ｔが勧告した１６ｋｂ／ｓ電話帯域音声符号化に関するＬＤ−ＣＥＬＰ方式であるＧ．７２８符号化標準により符号化する。２つ以上のサブビットストリームを含ませる場合は、それよりもさらに先読みした信号を符号化したビットストリームを含ませるとよい。
【００１３】
そして、メインビットストリームと１パケット分先のサブビットストリームは結合部１５で結合され、公知の通番付与回路等により時間順シーケンス番号、ビット誤り検出符号等を付加する処理をパケット化部１６で行いパケット化して出力される。
以上のような手法で、それぞれパケットにはメインバッファに蓄えられた信号をメインコーデックで符号化したメインビットストリーム、サブバッファに蓄えられたメインバッファより先の信号をサブコーデックで符号化したサブビットストリームを持つ。
【００１４】
図４に示すように、そのようにビットストリームを作成することによって、デコード側ではパケットが消失したと判断された時（パケット▲３▼）、消失する直前のパケット▲２▼に含まれているサブビットストリーム▲３▼の情報によって消失した区間の音声（復号音声信号▲３▼）を作ることができる。それは、サブビットストリームには先読み信号をサブコーデックで符号化した信号が含まれるからである。また、パケットのヘッダのオーバーヘッドを少なくするため、音声フレームは１パケットに複数個詰めて送信される場合が多い。従来のパケット消失補償では合成音声によって過去の信号の繰り返しで擬似的な音声信号を作成しているため、パケットに含まれる音声データが長ければ長いほど劣化が顕著になる。
【００１５】
本発明では、サブコーデックにはメインコーデックの１パケット分先の音声データを含ませるため、各パケットに含ませる音声データの長さに係わらず劣化の少ないパケット消失補償を行うことができる。
サブコーデックに使用する圧縮符号化によっては、注意をしなければならない点がある。
それは符号化、復号化に必要な分析係数（これはコーデックによって異なるが、例えば合成信号、フィルタ係数、予測係数等があげられる）を、前パケットから引き継いで復号化するコーデックでは、パケット消失が発生した時、通常エンコーダとデコーダで予測器や量子化器等の分析係数が異なってしまう。そのような場合でも分析係数を一致させるためには、エンコーダ側で分析係数の初期情報を符号化情報として送信する必要がある。
【００１６】
本発明においては、パケット内部に高品質符号化であるメインコーデックで符号化されたメインビットストリームと高圧縮符号化であるサブコーデックで符号化されたサブビットストリームがセットになって存在する。そこでサブコーデックの分析係数を、メインコーデックを復号した信号からメインコーデック（デコーダ）１２とサブコーデック（分析係数算出）１３により作成すれば、その情報を送らなくても良い。
例えば、Ｇ．７２８のようにエンコーダの分析係数が過去の合成信号から作られているような場合、合成信号の部分をメインコーデックで復号した高品質信号で置き換えることが可能となる。同様にデコーダでも合成信号の部分を高品質符号化で置き換える必要がある。そしてエンコーダとデコーダの内部状態を合わせることによって正しく復号することが可能となる。また、分析係数を合成信号でなく高品質信号で置き換えることによって復号化信号の品質も向上させることが可能となる。
【００１７】
分析係数は、例えば、Ｇ．７２８：ＬＤ−ＣＥＬＰ符号器（図示せず）のバックワード合成フィルタ適応器で求められる合成フィルタ係数と聴覚重み付けフィルタで求められる聴覚重み付けフィルタ係数を指す。
合成逆フィルタ
ｅ_ｎ＝ｘ_ｎ＋ａ_１ｘ_ｎー_１＋・・・＋ａ_ｎｘ_０
ａ：フィルタ係数ｘ：合成信号ｅ：残差信号
同様にして聴覚重み付けフィルタを高品質の信号より置き換えすることも可能となる。
【００１８】
聴覚重み付けフィルタ
ｗ_ｎ＝ａ_０ｘ_ｎ＋ａ_１ｘ_ｎー_１＋・・・＋ａ_ｎｘ_ｏ−（ｂ_０ｗ_ｎ＋ｂ_１ｗ_ｎー_１＋・・・＋ｂ_ｎｗ_ｏ）
ａ、ｂ：フィルタ係数ｘ：合成信号ｗ：聴覚重み付け信号
このようにして、高品質な信号と置き換えることによって品質の良い復号化をすることができる。
算出された分析係数をサブコーデック（デコーダ）１４に転送し、すなわちＧ．７２８：ＬＤ−ＣＥＬＰ符号器（図示せず）の最適コードブックデータ選択器からの出力としてコードブック（符号帳）中に格納される形状コードベクトル（波形）と利得レベルの中から最適なものに対応する符号が選択され、サブビットストリームが出力される。
（デコーダ）
図５，６を参照してデコーダを説明する。
【００１９】
デコーダ側では、まず受信した信号をデパケット化部２１でデパケット化し、メインコーデック／サブコーデック分配部で現フレーム時刻の音声パケットのうち、メインビットストリーム（現フレームのＧ．７１１音声符号化標準による符号列）とサブビットストリーム（先フレームのＧ．７２８音声符号化標準による符号列）に分配する。
メインビットストリームは、メインコーデック（デコーダ）２３で第１の復号化方法により音声信号に復号する。
【００２０】
そして、その復号した信号をサブコーデック（分析係数算出）２４で分析係数を算出しサブコーデック（デコーダ）２５の内部状態を作りあげる。あるいは前述したように、メインコーデックから直接サブコーデックの内部状態を作成する。
最後に、その内部状態を引き継いだ状態で、サブコーデック（デコーダ）２５によりサブビットストリームを第２の復号化方法により復号化して音声信号を出力する。
【００２１】
具体的には、ＬＤ−ＣＥＬＰ復号器（図示せず）にＧ．７２８音声符号化標準による符号化列として形状コードベクトル（波形）に対する符号と利得レベルに対する符号をそれぞれ入力し符号帳（励振ＶＷコードブック）から形状コードベクトルと利得ベクトルを選択し、また合成フィルタにおけるフィルタ係数として算出された分析係数を転送して用い復号音声を再生する。
音声パケットの消失有無の検出は、図５に示すデパケット化部２１の前段で行い、汎用されているパケット消失検出回路によりシーケンス番号の乱れ、もしくはビット誤りを検出して行う。
【００２２】
パケット消失信号有の判定がされない場合には、切換スイッチ２７をメインコーデック（デコーダ）２３側に切り換えて音声信号を出力する。また、パケット消失信号有と判定された場合には切換スイッチ２７をサブコーデック（デコーダ）２５側に切り換える。
メインコーデックがＡＤＰＣＭ（ＡｄａｐｔｉｖｅＰｕｌｓｅＣｏｄｅＭｏｄｕｌａｔｉｏｎ）のようにメインコーデックが過去の情報を必要とする、つまり内部状態を引き継ぐようなコーデックを使用する場合において、過去のパケットが消失した場合は、消失補償に使用したサブコーデックを復号化したものと、メインコーデックを復号化した音声信号のつながりの部分に劣化が生じる。そのような場合、サブコーデックが再生した信号から必要な情報を作成することによって補償後のメインコーデックの再生の劣化を抑えることができる。
【００２３】
サブコーデックが１つの時で、かつ連続してパケットが消失してしまった場合において、消失したパケット数だけのサブコーデックが無い場合、サブコーデックによる消失補償ができず、音声が劣化すると考えられる。そのような場合、図５に示された従来手法による波形繰り返し補償部２６を用いて図６に示された補償を行う。サブコーデックが使えない時のみ過去のピッチ周波数繰り返し消失補償等の合成信号を用いて波形を作るものとする。
バースト消失（パケット消失が２個以上）の場合の対処として図５のように従来手法により補償を行う例が示されているが、音声パケット構成において１つのパケットに先の２個フレーム以上の音声信号による符号列を格納し、パケット分解において先の２個以上のフレームにまたがる符号列から先のフレームにおける符号列をそれぞれ用いて音声信号を復号すればよい。
【００２４】
メインコーデック、サブコーデックのお互いの量子化雑音の歪具合の大差が無い場合、お互いを同期させ足し合わせることによって信号対量子化雑音比（ＳＮＲ：Ｓｉｇｎａｌ−ｔｏ−ｑｕａｎｔｉｚａｔｉｏｎＮｏｉｓｅＲａｔｉｏ）をあげることができる。それは異なったコーデックの場合、量子化雑音は無相関である場合が多く、足し合わせることにより相関のある音響部分と無相関の雑音のパワーの比率が音響部分の方が大きくなると考えられる所以である。
サブコーデックを増やせば増やすほどメインコーデックに対して先読み情報を多くもつことになる。そのことによってパケットが連続的に消失する場合においても耐性をもたせることが可能となる。
【００２５】
しかし、図７に示すようにサブコーデックを増やせば増やすほど、その分先読み情報が必要となり、結果として遅延が増加することになる。また、サブコーデックの数だけ情報量が多くなる。
ＶｏＩＰにおいて遅延の原因となるのは上述したもの以外にパケットの到着遅延などの揺らぎを吸収する、揺らぎ吸収バッファによる遅延が大きい。また、ＰＣ（ＰｅｒｓｏｎａｌＣｏｍｐｕｔｅｒ）上ではネットワークカード、サウンドカード等のバッファの影響により、大きな遅延が生じるが、専用ハードウエアの導入や、ＰＣの性能向上により解決される。リアルタイムの会話では、片方向での遅延時間の合計が２００ミリ秒以内であることが望ましい。よって、サブコーデックの数、揺らぎ吸収、その他の遅延の合計をその基準に合うように調整する必要がある。
【００２６】
移動通信、ＶｏＩＰは、通信速度が常に一定であるとは限らず、アプリケーションに使う情報量によっても音声通信に使うことができる情報量が変化すると考えられる。本発明では、通信速度、コンピュータの演算速度によってサブコーデックの品質、サブコーデックの数をフレキシブルに変更することによってネットワークに適した組み合わせを可能とすることを特徴としている。
図８に本手法を用いた時の波形の概略図を示す。
この図を参照すると、従来手法と比較すると本手法では原音により近いことが分かる。
【００２７】
また、本発明のパケット組立装置とパケット分解装置をＣＰＵやメモリ等を有するコンピュータと、アクセス主体となるユーザが利用する利用者端末と、ＣＤ−ＲＯＭ、磁気ディスク装置、半導体メモリ等の機械読み取り可能な記録媒体から構成することができる。
コンピュータに前述した動作を実行させる制御用プログラムを記録媒体に記憶させ、この制御用プログラムをコンピュータに読み取り、コンピュータの動作を制御してコンピュータ上に前述した実施の形態における各要素を実現することができる。
【００２８】
【発明の効果】
本発明によれば、従来の方式に比較して、パケット消失による品質の劣化を極力抑え、波形の不連続部分がなくなり、原音に忠実な消失部分の補償をすることができる。また、現フレーム時刻のパケットとして現フレームの符号列と先のフレームの符号列を結合しているので先のパケットが消失しても容易に補償することができる。
【図面の簡単な説明】
【図１】ＶｏＰの基本構成を示す図。
【図２】エンコーダの処理の説明図。
【図３】エンコーダの概略構成図。
【図４】パケットが消失した場合の処理の説明図。
【図５】デコーダの概略構成図。
【図６】バースト消失した場合の処理の説明図。
【図７】複数サブコーデックを持たせる場合の１パケットの構造を示す図。
【図８】原音に対する従来技術と本発明の手法の波形の比較図。
【図９】従来の欠落音声補間装置の構成図。
【符号の説明】
１ネットワークモジュール
２パケット消失補償部
３揺らぎ吸収バッファ
４音声デコーダ
１１メインコーデック（エンコーダ）
１２、２３メインコーデック（デコーダ）
１３、２４サブコーデック（分析係数算出）
１４サブコーデック（エンコーダ）
１５結合部
１６パケット化部
２１デパケット化部
２２メインコーデック・サブコーデック分配部
２５サブコーデック（デコーダ）
２６波形繰り返し補償部
２７切換スイッチ
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a method and an apparatus for assembling a packet of a code string of an audio signal, and a method and an apparatus for disassembling the packet, which are used in a technique for suppressing and compensating for the quality deterioration of an audio signal lost due to packet loss that may occur when transmitting an audio signal in packets. The present invention relates to a program that executes these methods and a storage medium that stores the program.
[0002]
[Prior art]
As typified by mobile communication and VoIP (Voice over IP), packet communication makes it possible to handle voice and data in an integrated manner. To reduce packet header overhead, a plurality of audio frames are usually packed into one packet for transmission.
In voice communication, the order in which the signal sequence forming an audio signal was generated must generally be preserved when the signal is reproduced after transmission. In packet communication, on the other hand, the transmission delay of packets sent at regular intervals varies from packet to packet, so their arrival times can fluctuate. A fluctuation absorbing buffer is used to absorb this fluctuation and to reproduce the audio signal from the codes stored in the packets in the order in which they were sent.
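The reordering role of the fluctuation absorbing buffer (often called a jitter buffer) can be sketched as follows. This is a minimal illustration and not the circuit of the patent: the class name `JitterBuffer`, the `depth` parameter, and the rule of treating late packets as lost are assumptions made for the example.

```python
import heapq


class JitterBuffer:
    """Holds packets briefly and releases them in sequence-number order."""

    def __init__(self, depth):
        self.depth = depth      # packets to collect before playback starts (assumed policy)
        self.heap = []          # min-heap keyed on sequence number
        self.next_seq = None    # sequence number due for playback next

    def push(self, seq, payload):
        # A packet that arrives after its playback slot is treated as lost.
        if self.next_seq is not None and seq < self.next_seq:
            return
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return (seq, payload); payload is None when that packet is missing."""
        if self.next_seq is None:
            if len(self.heap) < self.depth:
                return None     # still pre-buffering
            self.next_seq = self.heap[0][0]
        seq = self.next_seq
        self.next_seq += 1
        # Drop stale duplicates of packets that were already played.
        while self.heap and self.heap[0][0] < seq:
            heapq.heappop(self.heap)
        if self.heap and self.heap[0][0] == seq:
            return heapq.heappop(self.heap)
        return (seq, None)


buf = JitterBuffer(depth=3)
for seq in (2, 1, 3, 5):        # packets arrive out of order; packet 4 never arrives
    buf.push(seq, f"frame{seq}")
played = [buf.pop() for _ in range(5)]
```

Here `played` lists the frames in sending order, with the slot for packet 4 marked as missing so that a loss compensation step can fill it.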
[0003]
One of the problems in packet voice communication is packet loss. By increasing the bandwidth and speed of the communication channel, the degradation and delay due to coding are eliminated. On the other hand, packet loss is a problem that cannot be avoided even if the communication capacity increases.
Packet loss has the following causes. First, when the number of packets is large, a packet may be lost completely through collision with other packets. Second, when the code bit errors arising during transmission reach about 50%, all information in the packet is regarded as lost and the packet may be judged lost. Third, when the arrival delay of a packet exceeds what the fluctuation absorbing buffer can compensate for, the packet is regarded as lost and may be judged lost. These causes degrade the voice quality.
[0004]
To avoid causing the listener discomfort through degraded quality, the portion corresponding to the lost packet must be compensated with some other signal. Because some coding methods encode using information before and after the buffer, once a packet is lost the quality may remain degraded for a while after recovery. Suppressing that degradation perceptually is also part of packet loss compensation.
A conventional technique for suppressing and compensating for the quality degradation of an audio signal lost due to packet loss will be described with reference to FIG. 9.
[0005]
The missing speech interpolation device comprises a demultiplexing circuit 102, a residual signal power decoder 119, an inverse quantizer 120, a long-term prediction coefficient decoder/selector 118, a short-term prediction coefficient decoder 107, a long-term synthesis/interpolation filter 117, a short-term synthesis filter 110, and input/output terminals 101, 115, and 116. When loss of the coded speech information is detected, that is, when a loss detection signal is input, the connection cutoff switch 121 is opened and the long-term prediction coefficient decoder/selector 118 outputs a long-term prediction coefficient for interpolation to the long-term predictor 112. This coefficient is either (1) a preset value that is always constant, or (2) a value chosen according to the long-term prediction coefficient of the previous frame obtained from the long-term prediction coefficient decoder. The short-term predictor 114 keeps the short-term prediction coefficients of the previous frame. The long-term synthesis/interpolation filter 117 receives no input and, driving itself, feeds its output signal to the short-term synthesis filter 110. The short-term synthesis filter 110 performs its normal decoding processing, whereby the reproduced digital audio signal is interpolated (see JP-A-5-88697).
This conventional technique is a method in which the decoder analyzes the pitch period of the past signal, extracts a suitable waveform, and repeats it to generate a pseudo signal. In this pitch period repetition compensation, the most likely cause of degradation is waveform discontinuity. Discontinuities tend to arise where the compensation signal generated during the packet loss is joined to the signal received after recovery from the loss. To make the discontinuity less noticeable, it has been proposed to adjust the pitch period so that the waveform is continuous with the signal after recovery, to change gradually from the synthesized signal to the recovered signal by OLA (overlap-add), or to attenuate the power of the synthesized signal gradually.
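The conventional pitch repetition and overlap-add idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the autocorrelation pitch search, the lag range, and the linear cross-fade are choices made for the example, and the function names are hypothetical.

```python
import math


def estimate_pitch(history, min_lag, max_lag):
    """Return the lag whose last two cycles correlate best (normalised)."""
    best_lag, best_score = min_lag, -math.inf
    for lag in range(min_lag, max_lag + 1):
        a = history[-lag:]
        b = history[-2 * lag:-lag]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b)) or 1.0
        if num / den > best_score:
            best_lag, best_score = lag, num / den
    return best_lag


def conceal(history, frame_len, min_lag=20, max_lag=120):
    """Fill a lost frame by repeating the last pitch cycle of the past signal."""
    lag = estimate_pitch(history, min_lag, max_lag)
    cycle = history[-lag:]
    return [cycle[i % lag] for i in range(frame_len)]


def overlap_add(concealed_tail, recovered_head):
    """Cross-fade linearly from the concealment signal into the recovered signal."""
    n = len(concealed_tail)
    return [((n - i) * c + i * r) / n
            for i, (c, r) in enumerate(zip(concealed_tail, recovered_head))]


# A 100 Hz tone sampled at 8 kHz has a pitch period of 80 samples.
history = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(400)]
lost = conceal(history, frame_len=160)
```

For a perfectly periodic input such as this tone, the repeated cycle continues the waveform seamlessly. The degradation discussed in the text arises when the pitch or power changes during the lost interval, which a repetition scheme cannot follow.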
[0006]
In packet loss compensation for CELP (Code Excited Linear Prediction) coding, which is used for low bit rate speech coding, a commonly used method classifies the audio signal in each packet in advance as periodic or aperiodic; if the lost packet is periodic, the excitation signal of the adaptive codebook is used, and if it is aperiodic, white noise is used at random.
Other methods include, as characteristic processing, repeating the synthesis filter coefficients, attenuating the adaptive and fixed codebook gains, and attenuating the gain prediction.
[0007]
These methods were effective in suppressing signals that are unpleasant to the ear. However, they only reproduce a pseudo synthesized signal, and it is often difficult to reproduce a sound that is consistently close to the original. When the pitch or power changes rapidly between packets, or when the waveform becomes discontinuous because of a pitch interval mismatch or a forced adjustment, the sound quality could deteriorate significantly. Furthermore, with a compression codec, the onset portion after recovery from a loss is degraded.
[0008]
[Problems to be solved by the invention]
An object of the present invention is to overcome the drawbacks of conventional packet loss compensation techniques and to reduce the voice quality degradation caused by packet loss, on the premise that the communication path has sufficient capacity for some auxiliary information to be added.
In the prior art, degradation becomes pronounced when the pitch period, power, or the like changes during the interval in which packets are lost. A further object of the present invention is to provide encoding and decoding methods, and means for realizing them, that can suppress degradation of both the audio signal during the packet loss and the audio signal after recovery from the loss, even when each packet carries a long stretch of audio data.
[0009]
[Means for Solving the Problems]
To solve the above problems, the present invention assembles a voice packet as follows: the audio signal of the current frame is encoded by a high-quality first encoding method to generate a first code string; the first code string is decoded to generate a decoded signal; the internal state needed to encode, by a second encoding method, the audio signals of the N frames from one frame to N frames ahead of the current frame (N: an arbitrary integer of 1 or more) is calculated from the decoded signal; using the calculated internal state, the audio signals of those N frames are encoded by the high-compression second encoding method to generate a second code string; and the first code string and the second code string are combined and stored in the packet for the current frame time. In disassembling a voice packet, the presence or absence of packet loss is determined for each frame. When the packet for the current frame time is determined not to be lost, the audio signal is decoded from the first code string of the current frame, among the code strings stored in that packet, by a first decoding method corresponding to the first encoding method. When the packet for the current frame time is determined to be lost, the internal state for decoding the audio signals of the N frames from one frame to N frames ahead by a second decoding method corresponding to the high-compression second encoding method is calculated from the audio signal decoded from the first code string of a past frame, among the code strings stored in the packet for a past frame time, and, using the calculated internal state, the audio signal of the current frame is decoded by the second decoding method corresponding to the second encoding method.

[0010]
BEST MODE FOR CARRYING OUT THE INVENTION
As shown in FIG. 1, in VoP (Voice over Packet), the network module 1 receives voice packets, disassembles each packet and outputs it to the fluctuation absorbing buffer 3, and also determines packet loss and outputs a packet loss flag to the packet loss compensation unit 2.
The disassembled packets are accumulated in the fluctuation absorbing buffer 3, and playback starts after waiting a while for packets to arrive.
When no packet loss is determined, the main bitstream of the current frame of the packet held in the fluctuation absorbing buffer 3 is output to the audio decoder 4. When packet loss is determined, efficient packet loss compensation can be performed using the main and sub bitstreams of the preceding and following packets that have reached the fluctuation absorbing buffer 3. For example, when packet (3), which is due for playback, does not arrive, the audio decoder 4 creates analysis coefficients and outputs them to the packet loss compensation unit 2. The packet loss compensation unit 2 uses the analysis coefficients and sub-bitstream (3) of the past packet (2), which has reached the fluctuation absorbing buffer 3, to output loss compensation data to the audio decoder 4, so that efficient packet loss compensation is achieved. By interpolating parameters and controlling the volume, the audio decoder 4 can keep degradation to a minimum and output the audio.
[0011]
In the present invention, resistance to packet loss is provided by combining the single main codec in normal use with a plurality of sub-codecs that serve as compensation means when a packet is lost.
The main codec uses a high-quality coding method with a relatively low compression ratio, while the sub-codec must be a coding method that compresses more than the main codec yet still gives sufficiently good quality. This keeps the increase in data volume small.
(Encoder)
The encoder of the present invention will be described with reference to FIGS. 2 and 3.
[0012]
The input audio data is divided into frames (data units), and the signal that the decoder side should normally play back is encoded by the main codec (encoder) 11 using the first encoding method to create the main bitstream. For example, it is encoded according to the G.711 coding standard, a pulse code modulation (PCM) speech coding method recommended by the ITU-T that transmits a voice-band signal sampled at 8 kHz at 64 kb/s.
In addition, the audio data one packet ahead is encoded by the sub-codec (encoder) 14 using the second encoding method to create the sub-bitstream. For example, it is encoded according to the G.728 coding standard, the LD-CELP method recommended by the ITU-T for 16 kb/s telephone-band speech coding. When two or more sub-bitstreams are to be included, bitstreams obtained by encoding signals read still further ahead should be included.
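To give a feel for the data volume added by the sub-bitstream, the following sketch works out the payload sizes for the two example codecs. The bit rates are those stated above; the 20 ms frame length and the helper names are assumptions made for the illustration.

```python
MAIN_BPS = 64_000   # G.711 bit rate from the text
SUB_BPS = 16_000    # G.728 bit rate from the text
FRAME_MS = 20       # assumed frame length; the text does not fix one


def payload_bytes(bitrate_bps, frame_ms):
    """Bytes needed to carry one frame at the given bit rate."""
    return bitrate_bps * frame_ms // 8000


main_bytes = payload_bytes(MAIN_BPS, FRAME_MS)
sub_bytes = payload_bytes(SUB_BPS, FRAME_MS)


def relative_increase(n_sub):
    """Extra payload, relative to the main bitstream alone, for n_sub sub-bitstreams."""
    return n_sub * sub_bytes / main_bytes
```

Under these assumptions one sub-bitstream adds a quarter of the main payload per packet, and the ratio does not depend on the frame length chosen.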
[0013]
Then the main bitstream and the sub-bitstream for one packet ahead are combined by the combining unit 15, and the packetizing unit 16 adds a chronological sequence number, a bit error detection code, and the like using a known serial-number assigning circuit or the like, and outputs the result as a packet.
In this way, each packet carries a main bitstream, in which the signal held in the main buffer is encoded by the main codec, and a sub-bitstream, in which the signal held in the sub-buffer, which lies ahead of the signal in the main buffer, is encoded by the sub-codec.
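The packet layout described above can be sketched as follows. The two encoder functions are toy stand-ins (a pass-through and a coarse quantiser), not G.711 or G.728, and the `Packet` structure and its field names are hypothetical.

```python
from dataclasses import dataclass


def main_encode(frame):
    """Stand-in for the high-quality main codec: passes samples through."""
    return list(frame)


def sub_encode(frame, step=16):
    """Stand-in for the high-compression sub-codec: coarse quantisation."""
    return [round(x / step) for x in frame]


@dataclass
class Packet:
    seq: int       # chronological sequence number
    main: list     # main bitstream for frame `seq`
    subs: list     # subs[j] is the sub-bitstream for frame seq + 1 + j


def assemble(frames, n_sub=1):
    """Pair each frame's main bitstream with sub-bitstreams of the frames ahead."""
    packets = []
    for k, frame in enumerate(frames):
        ahead = frames[k + 1:k + 1 + n_sub]   # look-ahead frames, if available
        packets.append(Packet(k, main_encode(frame), [sub_encode(f) for f in ahead]))
    return packets


frames = [[0, 16], [32, 48], [64, 80]]
packets = assemble(frames)
```

Because packet k cannot be sent until frame k + 1 is available, the look-ahead translates directly into extra delay at the sender.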
[0014]
As shown in FIG. 4, because the bitstreams are created in this way, when the decoding side determines that a packet has been lost (packet (3)), the audio for the lost interval (decoded audio signal (3)) can be generated from the information in sub-bitstream (3) contained in packet (2), the packet immediately before the loss. This is possible because the sub-bitstream contains the look-ahead signal encoded by the sub-codec. Moreover, to reduce packet header overhead, a plurality of audio frames are often packed into one packet for transmission. Conventional packet loss compensation creates a pseudo audio signal by repeating the past signal as synthesized speech, so the longer the audio data contained in a packet, the more pronounced the degradation becomes.
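The receiver-side behaviour for the case of FIG. 4 can be sketched as follows, with packet (3) lost and its audio rebuilt from the sub-bitstream carried in packet (2). The decoders are toy stand-ins and the dictionary layout of `received` is an assumption made for the example.

```python
def main_decode(code):
    """Stand-in for the main decoder: passes samples through."""
    return list(code)


def sub_decode(code, step=16):
    """Stand-in for the sub decoder: undoes a coarse quantisation."""
    return [c * step for c in code]


def play_out(received, first_seq, last_seq):
    """received maps seq -> (main code, sub code for frame seq + 1)."""
    output = []
    for k in range(first_seq, last_seq + 1):
        if k in received:
            output.append(("main", main_decode(received[k][0])))
        elif k - 1 in received:
            # Packet k is lost: use the look-ahead code carried by packet k - 1.
            output.append(("sub", sub_decode(received[k - 1][1])))
        else:
            output.append(("none", None))
    return output


received = {
    1: ([10, 11], [1, 1]),
    2: ([20, 21], [2, 2]),    # its sub code describes frame 3
    4: ([40, 41], [3, 3]),    # packet 3 never arrived
}
result = play_out(received, 1, 4)
```

Frame 3 is reproduced at the lower sub-codec quality instead of being synthesised from past audio.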
[0015]
In the present invention, the sub-codec carries the audio data one packet ahead of the main codec, so packet loss compensation with little degradation can be performed regardless of the length of the audio data contained in each packet.
Depending on the compression coding used for the sub-codec, one point requires care.
In a codec that carries over from the previous packet the analysis coefficients needed for encoding and decoding (these differ from codec to codec; examples include the synthesized signal, filter coefficients, and prediction coefficients), a packet loss normally causes the analysis coefficients of the predictor, quantizer, and so on to differ between encoder and decoder. To make the analysis coefficients agree even in that case, the encoder would have to transmit the initial analysis coefficient information as coded information.
[0016]
In the present invention, each packet contains, as a set, a main bitstream encoded by the main codec (high-quality coding) and a sub-bitstream encoded by the sub-codec (high-compression coding). Therefore, if the analysis coefficients of the sub-codec are created from the decoded main codec signal by the main codec (decoder) 12 and the sub-codec (analysis coefficient calculation) 13, that information need not be transmitted.
For example, when the analysis coefficients of the encoder are derived from the past synthesized signal, as in G.728, the synthesized signal can be replaced with the high-quality signal decoded by the main codec. The decoder must likewise replace the synthesized signal with the high-quality decoded signal. Matching the internal states of the encoder and the decoder in this way makes correct decoding possible. In addition, deriving the analysis coefficients from the high-quality signal rather than the synthesized signal can also improve the quality of the decoded signal.
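The idea of deriving the sub-codec state from the decoded main signal on both sides can be illustrated with a toy backward-adaptive predictor. This is a sketch only and not G.728: the one-tap predictor, the uniform quantiser, and all function names are assumptions chosen to keep the example short.

```python
def analysis_state(decoded_main):
    """Toy internal state: a one-tap predictor coefficient and the last sample,
    both computed from the decoded main-codec signal of the current frame."""
    num = sum(a * b for a, b in zip(decoded_main[1:], decoded_main[:-1]))
    den = sum(a * a for a in decoded_main[:-1]) or 1.0
    return num / den, decoded_main[-1]


def sub_encode(frame, state, step=8):
    """Quantise the prediction residual of the frame ahead."""
    coef, prev = state
    codes = []
    for x in frame:
        pred = coef * prev
        q = round((x - pred) / step)
        codes.append(q)
        prev = pred + q * step      # track what the decoder will reconstruct
    return codes


def sub_decode(codes, state, step=8):
    """Rebuild the frame ahead from the residual codes and the same state."""
    coef, prev = state
    out = []
    for q in codes:
        prev = coef * prev + q * step
        out.append(prev)
    return out


frame_k = [100.0, 104.0, 108.0, 112.0]       # decoded main signal of frame k
frame_next = [116.0, 120.0, 118.0, 110.0]    # frame k + 1, to be sub-encoded

codes = sub_encode(frame_next, analysis_state(frame_k))   # sender side
rebuilt = sub_decode(codes, analysis_state(frame_k))      # receiver side
```

Both sides compute the state from the same decoded main signal, so no coefficients are transmitted and the decoder does not depend on any earlier sub-bitstream having been decoded.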
[0017]
The analysis coefficients are, for example, the synthesis filter coefficients obtained by the backward synthesis filter adapter of a G.728 LD-CELP encoder (not shown) and the perceptual weighting filter coefficients obtained by its perceptual weighting filter.
Inverse synthesis filter:
$$e_n = x_n + a_1 x_{n-1} + \cdots + a_n x_0$$
a: filter coefficient, x: synthesized signal, e: residual signal
Similarly, the perceptual weighting filter can also be derived from the high-quality signal instead.
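The inverse synthesis filter can be written directly as a short function. The sketch below assumes a fixed filter order given by the length of the coefficient list and treats samples before the start of the signal as zero; both are assumptions of the example, as is the function name.

```python
def residual(x, a):
    """Compute e[n] = x[n] + a[0]*x[n-1] + ... + a[p-1]*x[n-p].

    x is the synthesized (or substituted high-quality) signal and a holds the
    filter coefficients a_1 ... a_p. Samples before x[0] are taken as zero.
    """
    p = len(a)
    return [
        x[n] + sum(a[k - 1] * x[n - k] for k in range(1, p + 1) if n - k >= 0)
        for n in range(len(x))
    ]


# With a_1 = -1 the filter is a first difference.
e = residual([1, 2, 3, 4], [-1.0])
```

Substituting the decoded main-codec signal for `x` is what the text means by replacing the synthesized signal with the high-quality signal.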
[0018]
Perceptual weighting filter:
$$w_n = a_0 x_n + a_1 x_{n-1} + \cdots + a_n x_0 - (b_0 w_n + b_1 w_{n-1} + \cdots + b_n w_0)$$
a, b: filter coefficients, x: synthesized signal, w: perceptually weighted signal
Replacing the synthesized signal with the high-quality signal in this way enables high-quality decoding.
The calculated analysis coefficients are transferred to the sub-codec (encoder) 14. That is, the code corresponding to the optimum shape code vector (waveform) and gain level stored in the codebook is selected as the output of the optimum codebook data selector of the G.728 LD-CELP encoder (not shown), and the sub-bitstream is output.
(Decoder)
The decoder will be described with reference to FIGS. 5 and 6.
[0019]
On the decoder side, the received signal is first depacketized by the depacketizing unit 21, and the main codec/sub-codec distribution unit 22 splits the voice packet for the current frame time into the main bitstream (the G.711 code string of the current frame) and the sub-bitstream (the G.728 code string of the frame ahead).
The main bitstream is decoded into an audio signal by the main codec (decoder) 23 using the first decoding method.
[0020]
Then the sub-codec (analysis coefficient calculation) 24 calculates the analysis coefficients from the decoded signal and builds the internal state of the sub-codec (decoder) 25. Alternatively, as described above, the internal state of the sub-codec is created directly from the main codec.
Finally, with that internal state carried over, the sub-codec (decoder) 25 decodes the sub-bitstream using the second decoding method and outputs the audio signal.
[0021]
Specifically, the code for the shape code vector (waveform) and the code for the gain level are input to an LD-CELP decoder (not shown) as the G.728 code string, the shape code vector and the gain vector are selected from the codebook (excitation VW codebook), and the calculated analysis coefficients are transferred and used as the filter coefficients of the synthesis filter to reproduce the decoded speech.
Loss of a voice packet is detected in the stage preceding the depacketizing unit 21 shown in FIG. 5, by a commonly used packet loss detection circuit that detects a disturbance in the sequence numbers or a bit error.
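The two loss indications mentioned above, gaps in the sequence numbers and bit errors, can be sketched as follows. The text only says that a bit error detection code is used; CRC-32 from the Python standard library is an assumption made for the example, and the function names are hypothetical.

```python
import zlib


def missing_sequence_numbers(received_seqs):
    """Return the sequence numbers skipped between the first and last arrival."""
    seen = sorted(set(received_seqs))
    missing = []
    for prev, cur in zip(seen, seen[1:]):
        missing.extend(range(prev + 1, cur))
    return missing


def is_corrupt(payload: bytes, checksum: int) -> bool:
    """Flag a packet whose payload no longer matches its checksum."""
    return zlib.crc32(payload) != checksum


gaps = missing_sequence_numbers([1, 2, 4, 7])
```

A packet reported by either check would raise the packet loss flag and switch playback to the compensation path.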
[0022]
When no packet loss is determined, the changeover switch 27 is set to the main codec (decoder) 23 side and the audio signal is output. When packet loss is determined, the changeover switch 27 is set to the sub-codec (decoder) 25 side.
When the main codec needs past information, that is, when it carries over its internal state, as ADPCM (Adaptive Differential Pulse Code Modulation) does, and a past packet has been lost, degradation occurs at the junction between the audio decoded by the sub-codec for loss compensation and the audio decoded by the main codec. In that case, creating the necessary information from the signal reproduced by the sub-codec suppresses the degradation of main codec playback after the compensation.
[0023]
When there is only one sub-codec and packets are lost consecutively, so that there are fewer sub-codecs than lost packets, loss compensation by the sub-codec is not possible and the audio is expected to degrade. In that case, the waveform repetition compensation unit 26 of the conventional method, shown in FIG. 5, performs the compensation shown in FIG. 6. Only when the sub-codec cannot be used is the waveform created from a synthesized signal, for example by repeating the past pitch period.
FIG. 5 shows an example in which burst loss (two or more lost packets) is handled by the conventional method. Alternatively, code strings for the audio signals of two or more frames ahead may be stored in one packet when the voice packet is assembled, and, when the packet is disassembled, the audio signal may be decoded using the code string for each of those frames ahead.
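The selection among the main bitstream, a sub-bitstream from an earlier packet, and the conventional repetition fallback can be sketched as a small decision function. The return values and the function name are hypothetical and chosen only to make the three cases visible.

```python
def choose_source(k, received, n_sub):
    """Decide how frame k is reproduced.

    received is the set of sequence numbers that arrived; n_sub is the number
    of look-ahead sub-bitstreams carried in every packet.
    """
    if k in received:
        return ("main", k)
    for back in range(1, n_sub + 1):
        if k - back in received:
            # Packet k - back carries a sub-bitstream for the frame `back` ahead.
            return ("sub", k - back, back)
    return ("repeat",)   # fall back to conventional waveform repetition


received = {1, 2, 5}     # packets 3 and 4 lost in a burst
with_two_subs = [choose_source(k, received, n_sub=2) for k in (3, 4)]
with_one_sub = [choose_source(k, received, n_sub=1) for k in (3, 4)]
```

With two sub-bitstreams per packet both lost frames are recovered from packet 2, whereas with one sub-bitstream the second lost frame falls back to repetition.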
[0024]
When the distortion caused by quantization noise does not differ greatly between the main codec and the sub-codec, the signal-to-quantization-noise ratio (SNR) can be raised by synchronizing the two decoded signals and adding them together. The reason is that the quantization noise of different codecs is often uncorrelated, so that adding the signals increases the power of the correlated audio component relative to that of the uncorrelated noise.
The more sub-codecs are used, the more look-ahead information is available relative to the main codec, which provides resistance even when packets are lost consecutively.
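The SNR argument can be made concrete with a short derivation. It assumes, as an idealisation not stated in the text, that the two decoded signals are time-aligned, that their quantization noises n_1 and n_2 are zero-mean, mutually uncorrelated and uncorrelated with the signal s, and that both have the same power σ².

```latex
\begin{align*}
y_1 &= s + n_1, \qquad y_2 = s + n_2, \qquad
      E[n_1 n_2] = 0, \qquad E[n_1^2] = E[n_2^2] = \sigma^2 \\
y_1 + y_2 &= 2s + (n_1 + n_2) \\
\mathrm{SNR}_{\text{sum}}
  &= \frac{E[(2s)^2]}{E[(n_1 + n_2)^2]}
   = \frac{4\,E[s^2]}{2\sigma^2}
   = 2\,\frac{E[s^2]}{\sigma^2}
   = 2\,\mathrm{SNR}_{\text{single}}
\end{align*}
```

Under these assumptions the sum gains a factor of two in SNR, about 3 dB. The gain is smaller when the noise powers differ or the noises are partly correlated.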
[0025]
However, as shown in FIG. 7, the more sub-codecs are used, the more look-ahead information is needed, and the delay increases as a result. The amount of information also grows with the number of sub-codecs.
In VoIP, besides the factors described above, a major source of delay is the fluctuation absorbing buffer, which absorbs fluctuations such as packet arrival delay. On a PC (personal computer), the buffers of the network card, sound card, and so on also cause large delays, but these can be resolved by introducing dedicated hardware or by improvements in PC performance. For real-time conversation, the total one-way delay should be within 200 milliseconds. The number of sub-codecs, the fluctuation absorption, and the other delays must therefore be adjusted so that their total meets this criterion.
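The trade-off between the number of sub-codecs and the delay budget can be sketched with a simple additive model. The model itself (one frame of buffering for the main codec, one further frame of look-ahead per sub-codec, plus jitter buffering and other delays) and all the numbers in the example are assumptions; only the 200 ms target comes from the text.

```python
def one_way_delay_ms(frame_ms, n_sub, jitter_ms, other_ms):
    """Total one-way delay under the assumed additive model."""
    lookahead_ms = n_sub * frame_ms
    return frame_ms + lookahead_ms + jitter_ms + other_ms


def max_sub_codecs(frame_ms, jitter_ms, other_ms, budget_ms=200):
    """Largest number of sub-codecs that keeps the total within the budget."""
    spare_ms = budget_ms - frame_ms - jitter_ms - other_ms
    return max(spare_ms // frame_ms, 0)


# Example figures: 20 ms frames, 60 ms jitter buffer, 40 ms of other delays.
delay_with_two = one_way_delay_ms(20, 2, 60, 40)
limit = max_sub_codecs(20, 60, 40)
```

A calculation of this kind could be rerun whenever the network conditions change, in line with the flexible choice of sub-codec count described in the text.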
[0026]
In mobile communication and VoIP, the communication speed is not always constant, and it is considered that the amount of information that can be used for voice communication changes depending on the amount of information used for an application. The present invention is characterized in that a combination suitable for a network is made possible by flexibly changing the quality of a sub-codec and the number of sub-codecs according to a communication speed and a calculation speed of a computer.
FIG. 8 shows a schematic diagram of a waveform when this method is used.
As this figure shows, the present method yields a waveform closer to the original sound than the conventional method does.
[0027]
The packet assembling apparatus and the packet disassembling apparatus of the present invention can also be composed of a computer having a CPU, a memory, and the like, a user terminal used by a user who is the access subject, and a machine-readable recording medium such as a CD-ROM, a magnetic disk device, or a semiconductor memory.
A control program for causing the computer to execute the above-described operations is stored on the recording medium. By having the computer read this control program and controlling the operation of the computer accordingly, each element of the above-described embodiment can be realized on the computer.
[0028]
[Effect of the invention]
According to the present invention, as compared with the conventional method, the deterioration in quality due to packet loss is suppressed as much as possible, discontinuities in the waveform are eliminated, and the lost portion can be compensated with a signal faithful to the original sound. Further, since the code string of the current frame and the code strings of the frames ahead of it are combined into the packet at the current frame time, a loss of a subsequent packet can easily be compensated from a packet that has already been received.
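The packet layout set out in the claims, in which the packet at the current frame time carries the first code string of the current frame together with second code strings for the N frames ahead, can be sketched as follows. This is a toy illustration under stated assumptions: the stand-in codecs (`main_encode`, `sub_encode` and their decoders), the `Packet` class, and the quantizer step `SUB_STEP` are all hypothetical, and the calculation of the sub codec's internal state from the decoded main signal and the waveform repetition compensator are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Frame = List[int]
SUB_STEP = 16  # hypothetical coarse quantizer step of the stand-in sub codec

def main_encode(frame: Frame) -> Frame:
    """Stand-in for the high-quality first encoding method (plain copy)."""
    return list(frame)

def main_decode(code: Frame) -> Frame:
    return list(code)

def sub_encode(frame: Frame) -> Frame:
    """Stand-in for the highly compressed second encoding method."""
    return [(x + SUB_STEP // 2) // SUB_STEP for x in frame]

def sub_decode(code: Frame) -> Frame:
    return [q * SUB_STEP for q in code]

@dataclass
class Packet:
    t: int                       # current frame time
    main_code: Frame             # first code string (frame t)
    sub_codes: Dict[int, Frame]  # second code strings for frames t+1 .. t+N

def assemble(frames: List[Frame], n_sub: int) -> List[Packet]:
    """Build one packet per frame with sub codes for up to n_sub frames ahead."""
    packets = []
    for t, frame in enumerate(frames):
        ahead = range(t + 1, min(t + n_sub, len(frames) - 1) + 1)
        subs = {k: sub_encode(frames[k]) for k in ahead}
        packets.append(Packet(t, main_encode(frame), subs))
    return packets

def disassemble(received: List[Optional[Packet]], n_sub: int) -> List[Optional[Frame]]:
    """Decode each frame; None in `received` marks a lost packet."""
    out: List[Optional[Frame]] = []
    for t, pkt in enumerate(received):
        if pkt is not None:
            out.append(main_decode(pkt.main_code))  # packet arrived: main codec
            continue
        frame = None
        # Packet lost: look for a sub code of frame t in earlier packets.
        for back in range(1, n_sub + 1):
            if t - back < 0:
                break
            prev = received[t - back]
            if prev is not None and t in prev.sub_codes:
                frame = sub_decode(prev.sub_codes[t])
                break
        out.append(frame)  # None if no earlier packet covers frame t
    return out

frames = [[0, 16], [32, 48], [100, -20], [64, 80]]
pk = assemble(frames, n_sub=2)
print(disassemble([pk[0], None, None, pk[3]], n_sub=2))
```

With N = 2 both lost frames in the example are recovered from the first packet, whereas with N = 1 the second of two consecutive losses has no sub code to fall back on.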
[Brief description of the drawings]
FIG. 1 is a diagram showing a basic configuration of VoIP.
FIG. 2 is an explanatory diagram of processing of an encoder.
FIG. 3 is a schematic configuration diagram of an encoder.
FIG. 4 is an explanatory diagram of processing when a packet is lost.
FIG. 5 is a schematic configuration diagram of a decoder.
FIG. 6 is an explanatory diagram of processing when a burst loss occurs.
FIG. 7 is a diagram showing the structure of one packet when a plurality of sub-codecs are provided.
FIG. 8 is a diagram comparing the waveforms obtained by the conventional technique and by the technique of the present invention with the original sound.
FIG. 9 is a configuration diagram of a conventional missing voice interpolation device.
[Explanation of symbols]
1 Network module
2 Packet loss compensation part
3 Fluctuation absorption buffer
4 Audio decoder
11 Main codec (encoder)
12, 23 Main codec (decoder)
13, 24 Sub codec (calculation of analysis coefficient)
14 Sub codec (encoder)
15 Coupling unit
16 Packetization unit
21 Depacketization unit
22 Main codec / sub-codec distribution unit
25 Sub codec (decoder)
26 Waveform repetition compensator
27 Selector switch

Claims

A packet assembling method for encoding an audio signal frame by frame to generate code strings, storing the generated code strings, and generating a packet, the method comprising:
a step of encoding the audio signal of the current frame by a high-quality first encoding method to generate a first code string;
a step of decoding the first code string to generate a decoded signal, and calculating, from the decoded signal, the internal state required for encoding, by a second encoding method, the audio signals of the N frames from one frame ahead to N frames ahead of the current frame (N: any integer equal to or greater than 1);
a step of encoding, using the calculated internal state, the audio signals of the N frames from one frame ahead to N frames ahead of the current frame by the highly compressed second encoding method to generate a second code string; and
a step of combining the first code string and the second code string and storing them in a packet at the current frame time.

A packet assembling apparatus that encodes an audio signal frame by frame to generate code strings, stores the generated code strings, and generates a packet, the apparatus comprising:
a first encoder that encodes the audio signal of the current frame by a high-quality first encoding method to generate a first code string;
means for decoding the first code string to generate a decoded signal, and calculating, from the decoded signal, the internal state necessary for encoding, by a second encoding method, the audio signals of the N frames from one frame ahead to N frames ahead of the current frame (N: any integer equal to or greater than 1);
means for encoding, using the calculated internal state, the audio signals of the N frames from one frame ahead to N frames ahead of the current frame by the highly compressed second encoding method to generate a second code string; and
a combining unit that combines the first code string and the second code string and stores them in a packet at the current frame time.

A packet disassembly method comprising:
a step of inputting a packet at the current frame time, generated by the packet assembling method according to claim 1, in which a first code string of the current frame encoded by the high-quality first encoding method is combined with a second code string of the N frames from one frame ahead to N frames ahead (N: any integer equal to or greater than 1) encoded by the highly compressed second encoding method;
a step of determining the presence or absence of packet loss for each frame;
a first step of, if it is determined that the packet at the current frame time has not been lost, decoding the audio signal from the first code string of the current frame, among the code strings stored in the packet at the current frame time, by a first decoding method corresponding to the first encoding method; and
a second step of, if it is determined that the packet at the current frame time has been lost, calculating, from the audio signal decoded from the first code string of a past frame among the code strings stored in the packet at a past frame time, the internal state for decoding the audio signals of the N frames from one frame ahead to N frames ahead by a second decoding method corresponding to the highly compressed second encoding method, and decoding the audio signal, using the calculated internal state, by the second decoding method corresponding to the second encoding method of the current frame.

The packet disassembly method according to claim 3,
wherein, if it is determined that the packet at the current frame time has not been lost, the signal of the current frame and the signal of the current frame contained in a past packet are added together to generate the audio signal.

A packet decomposer comprising:
means for inputting a packet at the current frame time, generated by the packet assembling apparatus according to claim 2, in which a first code string of the current frame encoded by the high-quality first encoding method is combined with a second code string of the N frames from one frame ahead to N frames ahead (N: any integer equal to or greater than 1) encoded by the highly compressed second encoding method;
packet loss determining means for determining the presence or absence of packet loss for each frame; and
decoding means for, if it is determined that the packet at the current frame time has not been lost, decoding the audio signal from the first code string of the current frame, among the code strings stored in the packet at the current frame time, by a first decoding method corresponding to the first encoding method, and for, if it is determined that the packet at the current frame time has been lost, calculating, from the audio signal decoded from the first code string of a past frame among the code strings stored in the packet at a past frame time, the internal state for decoding the audio signals of the N frames from one frame ahead to N frames ahead by a second decoding method corresponding to the highly compressed second encoding method, and decoding the audio signal, using the calculated internal state, by the second decoding method corresponding to the second encoding method of the current frame.

The packet decomposer according to claim 5,
wherein the decoding means generates the audio signal by adding the signal of the current frame and the signal of the current frame contained in a past packet when it is determined that the packet at the current frame time has not been lost.

A computer-readable recording medium on which is recorded a program for causing a computer to execute:
a step of encoding the audio signal of the current frame by a high-quality first encoding method to generate a first code string;
a step of decoding the first code string to generate a decoded signal, and calculating, from the decoded signal, the internal state required for encoding, by a second encoding method, the audio signals of the N frames from one frame ahead to N frames ahead of the current frame (N: any integer equal to or greater than 1);
a step of encoding, using the calculated internal state, the audio signals of the N frames from one frame ahead to N frames ahead of the current frame by the highly compressed second encoding method to generate a second code string; and
a step of combining the first code string and the second code string and storing them in a packet at the current frame time.

A computer-readable recording medium on which is recorded a program for causing a computer to execute:
a procedure of inputting a packet at the current frame time, the packet having been generated by a step of encoding the audio signal of the current frame by a high-quality first encoding method to generate a first code string, a step of decoding the first code string to generate a decoded signal and calculating, from the decoded signal, the internal state necessary for encoding, by a second encoding method, the audio signals of the N frames from one frame ahead to N frames ahead of the current frame (N: any integer equal to or greater than 1), a step of encoding, using the calculated internal state, the audio signals of the N frames from one frame ahead to N frames ahead of the current frame by the highly compressed second encoding method to generate a second code string, and a step of combining the first code string and the second code string into the packet at the current frame time;
a procedure of determining the presence or absence of packet loss for each frame;
a first procedure of, if it is determined that the packet at the current frame time has not been lost, decoding the audio signal from the first code string of the current frame, among the code strings stored in the packet at the current frame time, by a first decoding method corresponding to the first encoding method; and
a second procedure of, if it is determined that the packet at the current frame time has been lost, calculating, from the audio signal decoded from the first code string of a past frame among the code strings stored in the packet at a past frame time, the internal state for decoding the audio signals of the N frames from one frame ahead to N frames ahead by a second decoding method corresponding to the highly compressed second encoding method, and decoding the audio signal, using the calculated internal state, by the second decoding method corresponding to the second encoding method of the current frame.

The computer-readable recording medium according to claim 8,
wherein the recorded program further causes the computer to execute a procedure of, when it is determined that the packet at the current frame time has not been lost, generating the audio signal by adding the signal of the current frame and the signal of the current frame contained in a past packet.