JP3977784B2

JP3977784B2 - Real-time packet processing apparatus and method

Info

Publication number: JP3977784B2
Application number: JP2003200000A
Authority: JP
Inventors: 登原田; 徹阪谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-07-22
Filing date: 2003-07-22
Publication date: 2007-09-19
Anticipated expiration: 2023-07-22
Also published as: JP2005043423A

Abstract

PROBLEM TO BE SOLVED: To provide a real-time packet processing apparatus and method that, according to the arrival state of voice packets, minimize the mismatch between the correlation data of the encoding processing unit and that of the decoding processing unit when a voice packet is lost or delayed, interpolate frames appropriately, and reproduce sound with reduced quality degradation.

SOLUTION: When the frames to be processed are not continuous, the receiver performs only decoding on N frames (N is a natural number) starting from the frame at which the discontinuity occurred, does not perform sound reproduction processing on the N decoded frames, and performs sound reproduction processing from the frame following the N frames. Decoding the N frames brings the correlation data used by the decoding processing unit into agreement with the correlation data in the encoding processing unit on the transmitter side, so that appropriate sound reproduction can be performed from the frame following the N frames.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、受信したパケットに含まれるフレームのデータをリアルタイムで処理するリアルタイムパケット処理装置及びその方法に関するものである。
【０００２】
【従来の技術】
従来、電子機器のディジタル化に伴い、情報通信においては転送対象となる情報をパケット化して転送することが一般的に行われている。例えば、音声信号を転送する場合には、送信側では、所定のサンプリング周波数にてサンプリングした音声データを所定量ずつ別個のパケットに分散して収納し、パケット単位で転送を行っている。受信側においては、受信したパケットから音声データを取りだし、取り出した音声データを繋ぎ合わせて再生処理したりミキシング処理したりする。
【０００３】
即ち、上記のようなパケット通信を行う電子機器では、送信側においては１パケット分のデータが得られた段階でパケットを形成して送信する処理を行い、受信側では受け取ったパケットに収納されているデータの再生に要する時間毎にパケット内のデータを読み出す処理を行っている。これにより、受信側では、例えば音声データのリアルタイム転送の場合、分割して受け取った複数のパケットから連続した音声を再生処理したりミキシング処理したりすることができる。
【０００４】
この様なパケット通信は、ほとんどの場合コンピュータ装置を使用して行っており、例えば、無線通信を利用した携帯型電話機やインターネット等の通信網を利用した周知のＩＰ電話、配信サーバから音楽などのコンテンツをユーザ端末装置に配信するシステム、及び遠隔会議システムなどに用いられている。
【０００５】
例えば、フレーム間予測を用いた音声符号化方式（予測符号化方式）には、ＩＴＵ標準のＧ．７２９、Ｇ．７２３．１、Ｇ．７２２．１等がある。
【０００６】
これらの符号化方式は、送信装置側の符号化処理部の内部バッファに格納されている相関データと、受信装置側の復号化処理部の内部バッファに格納されている相関データが一致していなければ正しい音響信号を復元できないという制約がある。尚、上記相関データとは、上記Ｇ．７２９、Ｇ．７２３．１、Ｇ．７２２．１に記載されている予測符号化方式に用いるデータである。
【０００７】
例えば、送信装置側の復号化処理部で音声フレーム０，１，２，３を符号化して、パケット０，１，２，３に各フレームを含めて送信した場合には、パケットの受信順序に関わらず受信装置側の復号化処理部でも、符号化処理部が符号化したのと同じようにフレーム０，１，２，３の順序で復号化処理を行わなければ、各フレームを符号化した時点の符号化処理部の相関データと、当該フレームを復号化する際の復号化処理部の相関データが一致しなくなり、正しい復号波形を得ることができない。
【０００８】
また、転送中にパケットが消失した場合には受信装置においてパケットの消失補償（ＰＬＣ：ＰａｕｑｔｔｅＬｏｓｓＣｏｎｃｅａｌｍｅｎｔ）処理が行われる場合がある。
【０００９】
尚、パケット消失補償処理としては、Ｇ．７１１Ａｐｐｅｎｄｉｘ１やＧ．７２９の標準でもたれている方式が知られている。
【００１０】
【特許文献１】
特開２０００−８３０５０号公報
【非特許文献１】
ＩＴＵ−ＴＲｅｃｏｍｍｅｎｄａｔｉｏｎＧ．７２９
【非特許文献２】
ＩＴＵ−ＴＲｅｃｏｍｍｅｎｄａｔｉｏｎＧ．７２３．１
【非特許文献３】
ＩＴＵ−ＴＲｅｃｏｍｍｅｎｄａｔｉｏｎＧ．７２２．１
【非特許文献４】
ＩＴＵ−ＴＲｅｃｏｍｍｅｎｄａｔｉｏｎＧ．７１１Ａｐｐｅｎｄｉｘ１
【００１１】
【発明が解決しようとする課題】
前述した音声符号化方式（予測符号化方式）における制約により、パケットの消失が起こった場合等には、消失したフレームを復号化処理部の入力として復号化処理部の相関データを更新することができないため、消失後正しく受信したパケット内のフレームを復号化する際に、送信装置側の符号化処理部の相関データと、受信装置側の復号化処理部の相関データが不一致となり、音声フレームを正しく復元できない場合がある。
【００１２】
上記の従来方式では、上記のようなパケット消失や遅延に起因する相関データの不一致に関して特に対処を行わないため、再生音声に知覚可能な品質の劣化を生じていた。
【００１３】
また、前述のような予測符号化方式を用いた場合に制約条件があるにもかかわらず、ＶｏＩＰ等のアプリケーションにおいては、送信装置側が送信した全ての音声パケットが正しく受信装置側に到着する保証はない。例えば、通信網等においてパケットが消失した場合には容易に符号化処理部と復号化処理部の相関データの不一致が生じるし、通信セッションの確立に際して、送信装置側が送信した先頭のパケットから正しく受信装置側に到着するとは限らないため、通信の初期の段階から符号化処理部と復号化処理部の相関データが不一致のまま通信が継続してしまうという問題がある。
【００１４】
本発明の目的は上記の問題点に鑑み、音声パケットの到着状況に応じてパケットが消失したときや遅延したときに符号化処理部と復号化処理部の相関データの不一致を最小限に抑えてフレームを適切に補間し、品質劣化を低減して音声再生することができるリアルタイムパケット処理装置及びその方法を提供することである。
【００１５】
【課題を解決するための手段】
一般的に符号化処理部、復号化処理部に用いられる予測は数フレームの間の相関を用いており、符号化処理部と復号化処理部の内部状態が不一致となった時点から数フレームの間は正しい音響信号を復元することができず、再生音声に知覚可能な品質の劣化が生じる。
【００１６】
しかし、符号化処理部と復号化処理部の相関データに不一致が生じた場合にも、受信フレームを正しい順序で数フレーム復号した後には相関データは次第に一致してくるため、結果として復号した波形の品質の劣化は、連続して正しいフレームを復号するに従って次第に収まってくる。
【００１７】
本発明では、符号化処理部、復号化処理部の相関データに不一致が生じた場合に、最初の数フレームの音声をあえて再生しないことによって品質が劣化した音声を再生せず、受聴品質を向上するリアルタイムパケット処理装置及びその方法を提案する。
【００１８】
本発明では、連続した入力音声信号を所定周期毎に切り取り、該切り取った信号を前記周期よりも短い所定のサンプリング時間毎にサンプリングして得られた複数のサンプリングデータを符号化処理部によって符号化してなるフレームを生成すると共に該フレーム毎に該フレームを含むパケットを生成して順次送信する送信装置から通信網を介して受信装置によって前記パケットを受信し、前記受信装置により、前記受信したパケットに含まれる前記フレームを復号化処理部により復号化し、該復号化したフレームに含まれるサンプリングデータに対して音声再生処理を施す予測符号化方式を用いたリアルタイムパケット処理において、前記受信装置は、前記送信装置から受信したパケットをバッファに格納して、前記バッファから入力したパケットに含まれるフレームを復号化する際に符号化されたフレームを分析し該分析結果を相関データとして保持し、前記復号化処理部による復号処理を行う際に処理対象となるフレームが連続していないときに、不連続となったフレームからＮ個のフレームに対して前記復号化処理部における復号化のみを行い、復号化したＮ個のフレームに対して前記音声再生処理を施さずに、前記Ｎ個のフレームの次のフレームから前記保持している相関データを用いて復号化を行って前記音声再生処理を施す。さらに、前記受信装置は、前記バッファに格納されているパケットの数が所定数を越えたときに、前記バッファに格納されている連続した所定数の破棄対象となるパケットのうちの最後のＮ個のフレーム以外のパケットを破棄すると共に、前記最後のＮ個のフレームは復号のみを行い、これに続くＭ個のパケットのフレームに対して無音状態から徐々に音量を増加させるフェードイン処理を施し、前記破棄したパケットの前のＭ個のパケットのフレーム対して徐々に無音まで音量を低下させるフェードアウト処理を施す。
【００１９】
本発明によれば、前記復号化処理部による復号処理を行う際に処理対象となるフレームが連続していないときは、不連続となったフレームからＮ個のフレームに対して前記復号化処理部における復号化のみが行われ、復号化したＮ個のフレームに対して前記音声再生処理を施さずに、前記Ｎ個のフレームの次のフレームから前記保持している相関データを用いて復号化を行って前記音声再生処理が施される。前記Ｎ個のフレームに対して復号化処理のみが行われることにより、復号化を行う際に復号化処理部において用いる相関データを、送信装置側の符号化処理部における相関データと完全に一致若しくはほぼ一致させることができ、前記Ｎ個のフレームの次のフレームから適切な音声再生処理を行うことができる。
【００２０】
また、本発明では、前記受信装置は、前記復号化処理部による復号処理を行う際に処理対象となるフレームが連続していないときに、不連続となったフレームからＮ個のフレームに対して前記復号化処理部における復号化を行い、復号化した前記Ｎ個のフレームに対して音量を低下させた前記音声再生処理を施す。
【００２１】
本発明によれば、前記復号化処理部による復号処理を行う際に処理対象となるフレームが連続していないときは、不連続となったフレームからＮ個のフレームに対して前記復号化処理部における復号化が行われた後、復号化した前記Ｎ個のフレームに対して音量を低下させた前記音声再生処理が施される。
【００２２】
このとき、上記と同様に、前記Ｎ個のフレームに対して復号化処理が行われることにより、復号化を行う際に復号化処理部において用いる相関データを、送信装置側の符号化処理部における相関データと完全に一致若しくはほぼ一致させることができる。さらに、前記Ｎ個のフレームに対して音量を低下させた音声再生処理が施されるので、この遷移部分で異音が生じることが無く、音声品質の劣化が低減される。
【００２３】
また、本発明では、前記受信装置は、前記送信装置から受信したパケットをバッファに格納して、前記バッファから入力したパケットに含まれるフレームを復号化する際に、前記バッファに格納されているパケットの数が所定数を越えたときに、前記バッファに格納されている連続した所定数の破棄対象となるパケットのうちの最後のＮ個のフレーム以外のパケットを破棄すると共に、前記最後のＮ個のフレームは復号化のみを行い、これに続くＭ個のパケットのフレームに対して無音状態から徐々に音量を増加させるフェードイン処理を施し、前記破棄したパケットの前のＭ個のパケットのフレーム対して徐々に無音まで音量を低下させるフェードアウト処理を施す。
【００２４】
本発明によれば、受信装置のバッファに蓄積されたパケットの数が所定数を越えたときは、音声再生処理に遅延が生じるため、バッファに格納されている連続した所定数の破棄対象となるパケットのうちの最後のＮ個のフレーム以外のパケットが破棄される。
【００２５】
これにより、前記Ｍ個のフレームに対して無音状態から徐々に音量を増加させるフェードイン処理が施されるので、受信装置において音声再生する際に、無音状態から有音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることが無く、音声品質の劣化が低減される。
【００２６】
さらに、不連続となる前のＭ個のフレーム対して徐々に無音まで音量を低下させるフェードアウト処理が施されるので、有音状態から無音状態になる部分の音声レベルが徐々に減少されるため、受信装置において音声再生する際に、有音状態から無音状態に遷移する部分で音声波形が不連続となることがないので、この遷移部分で異音が生じることがなく、音声品質の劣化が低減される。
【００２７】
さらにまた、破棄対象となるパケットのフレームのうちの最後のＮ個のフレームに対して復号化のみが施されるため、送信側の符号化処理部と受信側の復号化処理部において予測符号化方式で用いられる相関データを完全に一致或いはほぼ一致させることができる。
【００２８】
また、本発明では、前記受信装置は、前記パケットに含まれるシーケンス番号に基づいて、該シーケンス番号が不連続になったときに、前記フレームが不連続となったと判定する。
【００２９】
本発明によれば、受信装置により、前記パケットに含まれるシーケンス番号に基づいて、該シーケンス番号が不連続になったときに、前記フレームが不連続となったと判定される。
【００３０】
また、本発明では、前記受信装置は、前記フェードアウト処理を施したフレームと前記フェードイン処理を施したフレームとを重ねて音声再生する。
【００３１】
本発明によれば、前記フェードアウト処理を施したフレームと前記フェードイン処理を施したフレームとが重ねられて音声再生されるので、不連続になる部分において無音状態が生じることがなく、音声品質の劣化がさらに低減される。
【００３２】
【発明の実施の形態】
以下、図面に基づいて本発明の一実施形態を説明する。
【００３３】
（第１実施形態）
図１は本発明の第１実施形態におけるリアルタイムパケット処理装置の機能構成を示すブロック図、図２は本発明の第１実施形態における音声パケット送信装置による音声信号のパケット化を説明する図、図３は本発明の第１実施形態において用いているリアルタイム転送プロトコル（以下、ＲＴＰと称する）ヘッダを説明する図である。図において１は音声パケット送信装置（以下、単に送信装置と称する）、２は音声パケット受信装置（以下、単に受信装置と称する）、３はインターネット等の通信網である。本実施形態では、一例として、通信網３を介して送信装置１からＵＤＰ／ＩＰを用いて音声パケットをリアルタイムで受信装置２に転送する装置に関して説明する。
【００３４】
送信装置１は、周知のコンピュータ装置から構成され、予め設定されているプログラムによって動作し、音声入力部１１と、アナログ／ディジタル（Ａ／Ｄ）変換部１２、符号化処理部１３、パケット生成部１４、送信部１５とから構成されている。これらの送信装置１を構成する各部分は、ハードウェア及びソフトウェアの両方によって構成されている。
【００３５】
受信装置２は、周知のコンピュータ装置から構成され、予め設定されているプログラムによって動作し、受信部２１と、パケット解析部２２、復号化処理部２３、ディジタル／アナログ（Ｄ／Ａ）変換部２４、音声出力部２５とから構成されている。これらの受信装置２を構成する各部分は、ハードウェア及びソフトウェアの両方によって構成されている。
【００３６】
音声入力部１１は音声信号を図２に示すようなアナログ電気信号４に変換してＡ／Ｄ変換部１２に出力し、Ａ／Ｄ変換部１２によって所定のサンプリングタイムでディジタル信号に変換された音声データ（サンプル）が符号化処理部１３に備わるデータバッファ（図示せず）に順次格納される。
【００３７】
また、図２に示すように、符号化処理部１３のデータバッファに格納された音声データは、符号化処理部１３によって所定周期Ｔ毎に切り取られ音声データフレーム３１とされ、先頭から順に１フレームずつ順送りにパケット３０が生成されて送信される。
【００３８】
符号化処理部１３は、Ａ／Ｄ変換部１２から入力した符号化対象となる音声データフレームの符号化処理を行うが、符号化処理を行うに際して前のフレームを符号化した結果の内部状態を内部バッファ１３ａに保持し、過去からの予測を行うことで符号化利得を向上させている。
【００３９】
本実施例においては、パケット消失により送信装置１の符号化処理部１３と受信装置２の復号化処理部２３における相関データの不一致による品質劣化を低減するために、無音状態から有音状態に変化した場合に、符号化処理部１３の内部バッファ１３ａをリセットして初期値を用いることにより伝送誤りによる品質低下の発生を低減している。
【００４０】
さらに、符号化処理部１３は、分析結果に基づいて符号化対象となる音声データフレームを符号化してパケット生成部１４に送出する。
【００４１】
パケット生成部１４は、符号化処理部１３から入力した符号化された音声データを含むＲＴＰパケットを生成して送信部１５へ送出する。このときのＲＴＰパケットには図３に示すようなＲＴＰヘッダが付加される。
【００４２】
ＲＴＰヘッダには、周知のように、２ビットのＶｅｒｓｉｏｎ情報Ｖと、１ビットのＰａｄｄｉｎｇ情報Ｐ、１ビットのＥｘｔｅｎｓｉｏｎ情報Ｘ、３ビットのＣＳＲＣ−Ｃｏｕｎｔ情報ＣＣ、１ビットのＭａｒｋｅｒ情報（以下、マーカービットと称する）Ｍ、７ビットのＰａｙｌｏａｄ−Ｔｙｐｅ情報ＰＴ、１６ビットのシーケンス番号（順序番号：ＳｅｑｕｅｎｃｅＮｕｍｂｅｒ）、３２ビットのタイムスタンプ（Ｔｉｍｅｓｔａｍｐ）、３２ビットの同期信号元（ＳＳＲＣ）識別子、３２ビットの寄与送信元（ＣＳＲＣ）識別子等が含まれている。
【００４３】
また、本実施形態では、無音状態であってパケット送信を停止していた後に有音状態になって最初に送信するパケットのマーカービットＭを「１」に設定し、その他のパケットのマーカービットＭを「０」に設定する。
【００４４】
送信部１５は、パケット生成部１４から入力したＲＴＰパケットを通信網３を介して受信装置２に送信する。
【００４５】
一方、受信装置２の受信部２１は、通信網３を介して送信装置１から送信されたＲＴＰパケットを受信しパケット解析部２２に送出する。
【００４６】
パケット解析部２２は、受信部２１から入力したＲＴＰパケットを解析してヘッダ部と符号化された音声データフレームに分離すると共に、ヘッダ部の内容を解析し、ＲＴＰタイムスタンプに基づいて、送信された順番に符号化された音声データフレームを復号化処理部２３に出力する。さらに、パケット解析部２２は、ＲＴＰヘッダのマーカービットＭの値を復号化処理部２３に通知する。
【００４７】
復号化処理部２３は、パケット解析部２２から入力した符号化された音声データフレームを復号してディジタル音声データに変換し、このディジタル音声データをＤ／Ａ変換部２３に出力する。
【００４８】
また、復号化処理部２３は、復号化を行う際に、符号化された音声データフレームを分析しその分析結果を内部バッファ２３ａに一時記憶すると共に、データ分析を行う際に、内部バッファ２３ａに一時記憶されている分析結果或いは予め設定されている分析初期値を参照してデータ分析を行う。ここで、内部バッファ２３に一時記憶されている１フレーム前の分析結果を用いることにより前後のフレーム間の相関を考慮した最適な分析及び復号を行えるようにしている。
【００４９】
Ｄ／Ａ変換部２３は、復号化処理部２３によって復号して得られたディジタル音声データを入力してアナログ音声信号に変換して音声出力部２４に出力する。
【００５０】
音声出力部２４は、Ｄ／Ａ変換部２３から入力したアナログ音声データを音声に変換して出力する。
【００５１】
次に、上記構成よりなる本実施形態におけるリアルタイムパケット処理装置の動作を説明する。
【００５２】
ＶｏＩＰ通信において、受信装置２側の受け入れ準備が完了する前に送信装置１側が音声パケットの送出を始める場合がある。この様な場合には、受信装置２側では通信開始直後のパケットを正しく受信することができず、先頭の数パケットを取りこぼすことになる。
【００５３】
この場合には、送信装置１側の符号化処理部１３における内部バッファ１３ａに格納されている相関データと、受信装置２側の復号化処理部２３における内部バッファ２３ａに格納されている相関データが不一致となり、正しい音声信号を生成することができない。
【００５４】
本実施形態では、例えば符号化方式として前述したＧ．７２９を用いた場合を一例として説明する。この場合、１フレームが１０ｍｓであるので、１０ｍｓ分の音声１フレームを１パケットとした場合について説明する。また、以降の各実施形態でも同様の条件を例にとって記述する。
【００５５】
ＲＴＰ／ＲＴＣＰを用いてＶｏＩＰ音声パケット通信を行う場合に、送信装置１が最初に送ったパケットのシーケンス番号を知ることができないため、受信装置２側では最初に受け取ったパケットが、送信装置１が最初に送出したパケットであるかどうかわからない。
【００５６】
このため、送信装置１の符号化処理部１３における内部バッファの相関データをリセットした状態で生成した最初の符号化フレームが含まれているかどうか知ることができない。
【００５７】
受信装置２において、送信装置１が送出したパケットのうち、先頭の数パケットを受信できなかったにもかかわらず何も付加的な処理を行わず、受け取ったパケットに含まれる音声フレームをそのまま復号して再生すると、先頭部分で符号化処理部と復号化処理部の相関データの不一致に起因する再生音声の品質劣化が生じる場合がある。
【００５８】
この問題を回避するため、第１実施形態では、通信開始から数パケットについては、復号化処理は行うが、復号化処理したフレームの音声再生は行わず、フレームの信号波形が安定するまで数フレーム復号化処理を行った後でフェードイン処理を用いて再生している。
【００５９】
例えば、図４に示す一例では、送信装置１はシーケンス番号が０番のフレームを含むパケットから順に送信しているが、受信装置２側では受信を開始した後にシーケンス番号が３番のパケットから受信している。この場合、受信装置２は、受信できた最初のＮ個のパケットに含まれるＮ個のフレームについては復号化処理を行うのみとする。これにより、正常に復号化するための相関データを復号化処理部２３の内部バッファ２３ａに蓄積している。ここでは、Ｎ＝２としてシーケンス番号が３番と４番の２つのパケットについて復号化処理を行うのみで、音声再生を行わずに相関データを内部バッファ２３ａに蓄積している。
【００６０】
さらに、受信装置２は、上記シーケンス番号が３番と４番のフレームに続くＭ個のパケットのフレームについては復号化処理を施した後、音声再生する際にフェードイン処理を施す。ここでは、Ｍ＝２としてシーケンス番号が５番と６番のパケットのフレームに関してフェードイン処理を施している。シーケンス番号が７番以降のパケットのフレーム関しては通常通りの復号化処理と音声再生処理を施す。尚、以下の説明においてＮ個及びＭ個はそれぞれ０以上の数であり且つＮ＋Ｍが１以上となる数であればよい。
【００６１】
前述したように本実施形態によれば、先頭の音声パケットを受信できなかったときに、品質劣化を低減して音声再生することができる。
【００６２】
（第２実施形態）
次に、本発明の第２実施形態を説明する。
【００６３】
第２実施形態では、送信装置１の符号化処理部１３における内部バッファ１３ａと受信装置２の復号化処理部２３における内部バッファ２３ａに格納されている相関データの状態不一致が、ネットワーク通信網３におけるパケット消失に起因して生じる場合について説明する。尚、第２実施形態における装置構成は第１実施形態と同様である。
【００６４】
パケットが消失した場合には、消失したパケットに含まれるフレームを復号化処理することができないため、送信装置１側の符号化処理部１３の符号化器の内部バッファ１３ａと受信装置２側の復号化処理部２３の復号化器の内部バッファ２３ａに格納されている相関データに関して状態の不一致が生じる。
【００６５】
このような内部バッファ１３ａ，２３ａに格納されている相関データに関する状態の不一致を生じる場合、本実施形態では、パケット消失の直後のフレームを復号化してすぐに再生するのではなく、パケット消失後に受信した数パケットのフレームについては復号化処理は行うが音声再生は行わず、フレームの信号波形が安定するまで数フレーム符号化処理を行った後でフェードイン処理を行い再生している。
【００６６】
以下に上記の内容を実現するための動作に関して図５を参照して説明する。
【００６７】
受信装置２は、受信部２１及びパケット解析部２２において受信パケットに含まれるシーケンス番号を用いて、シーケンス番号が１番と２番のパケットの消失を知ることができる。
【００６８】
受信装置２は、シーケンス番号が１番と２番のパケットの消失を知った場合に、消失したこれらのパケットのフレーム分の無音を生成して再生する。
【００６９】
さらに、本実施形態では、受信装置２は消失したシーケンス番号が１番と２番のパケットのフレームの次に受信したパケットのフレームを復号化処理する前に、復号化処理部２３の内部バッファ２３ａを初期化（リセット）する。
【００７０】
次に、受信装置２は、受信したシーケンス番号が３番以降のパケットの復号化処理を開始するが、Ｎフレームの間は復号化するのみで再生は行わない。即ち、シーケンス番号が３番と４番の２つのパケットのフレームは復号化するのみで音声再生は行わない。
【００７１】
これに続くＭ個のフレームは復号化処理して得られた音声信号を音量０から次第に音量を増加させるフェードイン処理を施して再生する。即ち、シーケンス番号が３番と４番の２つのパケットのフレームについては、復号化処理して得られた音声信号を音量０から次第に音量を増加させるフェードイン処理を施して再生する。
【００７２】
Ｎ＋Ｍ個のフレームの後、すなわちシーケンス番号が７番のパケット以降のパケットのフレームは、復号化処理して得られた音声信号をそのまま通常通り再生する。
【００７３】
上記第２実施形態によれば、音声パケットの到着状況に応じてパケットが消失したときに、品質劣化を低減して音声再生することができる。
【００７４】
（第３実施形態）
次に、本発明の第３実施形態を説明する。尚、第３実施形態における装置構成は、前述した第１実施形態と同様である。
【００７５】
第３実施形態では、パケット通信において、ＩＰパケットが一度に大量に到着した場合には、受信装置の受信部２１にあるＦＩＦＯバッファ内にパケットが溜まりすぎて、受信パケットの一部を破棄する必要が生じる場合について説明する。
【００７６】
このようなときに受信パケットの一部を破棄し、受信した全てのパケットに含まれる音声フレームを復号処理しない場合には、送信装置１の符号化処理部１３における内部バッファ１３ａに格納されている相関データと受信装置２の復号化処理部２３における内部バッファ２３ａに格納されている相関データとの間に状態の不一致が生じる。
【００７７】
また、破棄する予定のフレームについても全て復号化処理を行うことは演算処理の負荷を考慮した場合には許容できない場合が多い。
【００７８】
第３実施形態では、上記のような場合に全てのフレームを復号化処理しないことに起因する音声品質劣化を低減する例を示す。
【００７９】
また、第３実施形態では、Ｍ個のフレームを境界フレームとして設け、クロスフェードすることで音声波形の不連続性をなくし、品質の劣化を低減している。
【００８０】
ここで、破棄するフレーム数が非常に少ない場合には、破棄するフレームも含めて全てのフレームを復号化処理しても良い。例えば、破棄すべきフレーム数Ｘ＜Ｎとなる場合に全てのフレームを復号化処理するようにすることができる。
【００８１】
次に、上記の処理の具体例について図６を参照して説明する。
【００８２】
図６に示す具体例では、受信部２２のＦＩＦＯバッファに多数のパケットが溜まりすぎたためシーケンス番号が１１番から１４番のパケットを破棄したいときの処理を示す。
【００８３】
このとき、シーケンス番号が１１番と１２番のパケットは破棄する。また、シーケンス番号が９番と１０番のパケットのフレームを境界フレームとして、これらのフレームの復号化処理を行った後、これらの境界フレーム対して徐々に無音まで音量を低下させるフェードアウト処理を施す。
【００８４】
さらに、シーケンス番号が１３番のパケットのフレームを復号化処理する前に、復号化処理部２３の内部バッファ２３ａを初期化（リセット）する。
【００８５】
また、シーケンス番号が１３番と１４番のパケットについては復号化処理を施して、復号化処理部２３の内部バッファ２３ａに格納されている相関データを更新する。しかし、復号化処理されたシーケンス番号が１３番と１４番のパケットのフレームについては音声再生しない。
【００８６】
また、これに続くＭ個のフレームを境界フレームとし、これらのフレームを復号化処理して得られた音声信号を音量０から次第に音量を増加させるフェードイン処理を施して再生する。即ち、シーケンス番号が１５番と１６番のパケットのフレームについては、復号化処理して得られた音声信号を音量０から次第に音量を増加させるフェードイン処理を施す。
【００８７】
さらに、シーケンス番号が９，１０番のフレームと１５，１６番のフレームとを重ねてクロスフェードした状態で音声再生する。
【００８８】
シーケンス番号が１７番のパケット以降のパケットのフレームは、復号化処理して得られた音声信号をそのまま通常通り再生する。
【００８９】
上記第３実施形態によれば、ＩＰパケットが一度に大量に到着し、受信装置２の受信部２１にあるＦＩＦＯバッファ内にパケットが溜まりすぎて、受信パケットの一部を破棄したときにも、品質劣化を低減して音声再生することができる。
【００９０】
（第４実施形態）
次に、本発明の第４実施形態を説明する。
【００９１】
図７は本発明の第４実施形態における音声パケット通信システムの機能構成を示すブロック図である。図において、前述した第１実施形態と同一構成部分は同一符号をもって表しその説明を省略する。また、第４実施形態と第１実施形態との相違点は、受信装置２に消失補償処理部２６と混合器２７を設けたことである。
【００９２】
第４実施形態では、第２実施形態に示したパケットが消失した場合の処理を拡張して、パケット消失補償処理を行う場合の例を説明する。
【００９３】
パケットが消失した場合には、消失したパケットに含まれるフレームを復号化処理して復号化器の内部状態を更新することができないため、送信装置１側の符号化処理部１３で当該フレームを符号化した時点の相関データと、当該フレームを復号化する復号化処理部２３の内部バッファ２３ａに格納されている相関データとの間に状態の不一致が生じる。
【００９４】
受信装置２は、受信部２１とパケット解析部２２において受信パケットに含まれるシーケンス番号を用いて、パケットの消失を知ることができる。本実施形態では、シーケンス番号が１番と２番のパケットが消失した場合を一具体例として図８を参照して説明する。
【００９５】
これらのパケットの消失を知った場合に、受信装置２は、消失補償処理部２６において第１再生処理を行うと共に復号化処理部２４において第２再生処理を行い、消失補償処理部２６からの出力信号と復号化処理部２４からの出力信号を混合機２７によって混合してＤ／Ａ変換部２４に入力する。
【００９６】
即ち、受信装置２は、パケットの消失を知った場合に、正常に受信した最後のパケットに含まれる最後のフレームを復号化した後の復号化処理部２３の内部バッファ２３ａに格納されている相関データを、消失補償処理部２６の内部バッファ２６ａにコピーする。
【００９７】
消失補償処理部２６では、第１再生処理を行う。この第１再生処理では、内部バッファ２６ａにコピーされた相関データを用いて、消失したフレームの代わりに再生されるべき音声波形を擬似生成して補間すると共に、補間したフレームに続くＮ個のフレームを擬似生成して復号化処理し、これに続くＭ個のフレームを擬似生成して復号化すると共にフェードアウト処理を施して、混合部２７を介してＤ／Ａ変換部２４に出力する。
【００９８】
復号化処理部２３では、第２再生処理を行う。この第２再生処理では、消失していない次のパケットのフレームを復号化処理する前に、内部バッファ２３ａを初期化（リセット）する。
【００９９】
次に、復号化処理部２３は、受信したパケットの復号化処理を開始するが、最初のＮ個のフレームの間は復号化するのみで再生は行わない。このＮ個のフレームに続くＭ個のフレームには復号化処理部２４において復号化処理して得られた音声信号を０から次第に音量を増加させるフェードイン処理を施す。
【０１００】
復号化処理部２４においてフェードイン処理を施されたものと、消失補償処理部で生成された音声波形をフェードアウト処理したものを合成し（クロスフェード）、再生する。
【０１０１】
また、復号化処理部２３においては、内部バッファ２３ａを初期化（リセット）した後のＮ＋Ｍ個のフレームに続くフレームに対しては、通常通りの復号化処理を施し、この復号化された音声信号はそのまま通常通り再生される。
【０１０２】
次に、上記の処理の具体例について図８を参照して説明する。
【０１０３】
図８に示す具体例では、受信装置２は、シーケンス番号が１番と２番のパケットの消失を知った場合に、正常に受信したシーケンス番号が０番のパケットに含まれるフレームを復号化した後の復号化処理部２３の内部バッファ２３ａに格納されている相関データを、消失補償処理部２６の内部バッファ２６ａにコピーする。
【０１０４】
消失補償処理部２６では、第１再生処理として、内部バッファ２６ａにコピーされた相関データを用いて、消失したシーケンス番号が１番と２番のパケットのフレームの代わりに再生されるべき１’番と２’番のフレーム及びこれに続く３’〜６’番のフレームの音声波形を擬似生成して補間すると共に、補間した１’〜４’番のフレームに対しては復号化処理のみを施し、これに続くシーケンス番号が５’番と６’番のパケットのフレームに対しては復号化すると共にフェードアウト処理を施して音声波形を生成し、混合部２７を介してＤ／Ａ変換部２４に出力する。
【０１０５】
復号化処理部２３では、第２再生処理として、受信したシーケンス番号が３番のパケットのフレームを復号化処理する前に、内部バッファ２３ａを初期化（リセット）する。
【０１０６】
次に、復号化処理部２３は、受信したシーケンス番号が３番と４番のパケットのフレームは復号化するのみで再生は行わない。これにより、復号化処理部２３の内部バッファ２３ａに格納されている相関データが正常なものとなる。
【０１０７】
また、シーケンス番号が５番と６番のパケットのフレームには復号化処理部２４において復号化処理して得られた音声信号を０から次第に音量を増加させるフェードイン処理を施して、混合部２７を介してＤ／Ａ変換部２４に出力する。
【０１０８】
これにより、混合部２７によって復号化処理部２３においてフェードイン処理を施されたものと、消失補償処理部２６においてフェードアウト処理を施されたものが合成（クロスフェード）され、再生される。
【０１０９】
また、復号化処理部２３においては、シーケンス番号が７番以降のパケットのフレームに対しては、通常通りの復号化処理を施す。この復号化された音声信号はそのまま通常通り再生される。
【０１１０】
上記第４実施形態によれば、通信網３においてパケットが消失したときにも、品質劣化を低減して音声再生することができる。
【０１１１】
（第５実施形態）
次に、本発明の第５実施形態を説明する。尚、第５実施形態における装置構成は前述した第４実施形態と同様である。
【０１１２】
第５実施形態では、図９に示すように、シーケンス番号が０番のパケットを受信した後、このパケットに続くシーケンス番号が１番のパケットが遅延したために、消失補償処理部２６において、１’番及び２’番のフレームを擬似生成して再生する場合の処理を説明する。
【０１１３】
受信装置２は、受信部２１、パケット解析部２２において、受信すべきパケットが遅延していることを知ることができる。本実施形態では、シーケンス番号が１番のパケット以降が遅延した場合を一具体例として図９を参照して説明する。
【０１１４】
パケットの遅延を契機に受信装置２は、消失補償処理部２６において第１再生処理を行うと共に復号化処理部２４において第２再生処理を行い、消失補償処理部２６からの出力信号と復号化処理部２４からの出力信号を混合機２７によって混合してＤ／Ａ変換部２４に入力する。
【０１１５】
即ち、受信装置２は、パケットの遅延を知った時に、正常に受信した最後のパケットに含まれる最後のフレームを復号化した後の復号化処理部２３の内部バッファ２３ａに格納されている相関データを、消失補償処理部２６の内部バッファ２６ａにコピーする。
【０１１６】
消失補償処理部２６では、第１再生処理を行う。この第１再生処理では、内部バッファ２６ａにコピーされた相関データを用いて、遅延して受信できていないフレームの代わりに遅延時間内に存在すべきフレームの音声波形を生成して補間すると共に、補間したフレームに続くＮ個のフレームを擬似生成すると共にこれらＮ個のフレームに対してフェードアウト処理を施して、混合部２７を介してＤ／Ａ変換部２４に出力する。
【０１１７】
復号化処理部２３では、第２再生処理を行う。この第２再生処理では、遅延して受信したシーケンス番号が１番以降のパケットの復号化処理を開始する。このとき、先頭のＮ個のパケットのフレームは復号化した後に、復号化処理して得られた音声信号を０から次第に音量を増加させるフェードイン処理を施す。
【０１１８】
さらに、復号化処理部２３においてフェードイン処理を施されたものと、消失補償処理部２６で生成されたフェードアウト処理した音声波形を混合部２７によって合成し（クロスフェード）、再生する。
【０１１９】
また、復号化処理部２３においては、シーケンス番号が３番以降のパケットのフレームに対しては、通常通りの復号化処理を施し、この復号化された音声信号はそのまま通常通り再生される。
【０１２０】
次に、上記の処理の具体例について図９を参照して説明する。
【０１２１】
図９に示す具体例では、受信装置２は、シーケンス番号が１番以降のパケットの遅延を知った場合に、復号化処理部２３によって、正常に受信したシーケンス番号が０番のパケットに含まれるフレームを復号化した後の復号化処理部２３の内部バッファ２３ａに格納されている相関データを、消失補償処理部２６の内部バッファ２６ａにコピーする。
【０１２２】
消失補償処理部２６は、内部バッファ２６ａに格納されている相関データを用いて遅延時間内に存在しなければならないパケットに含まれるフレームの代わりに再生すべきフレーム１’，２’の音声信号を擬似生成する。
【０１２３】
次に、パケット１，２が遅れて到着する。
【０１２４】
復号化処理部２３では、シーケンス番号が０番のパケットに含まれるフレームを復号化処理した直後の内部バッファ２３ａに相関データが保持されているため、この相関データを用いてシーケンス番号が１番と２番のパケットに含まれるフレームを復号化処理すれば、符号化処理部１３の内部バッファ１３ａに格納されている相関データと復号化処理部２３の内部バッファ２３ａに格納されている相関データの不一致は生じない。
【０１２５】
さらに、符号化処理部２３は、シーケンス番号が１番と２番のパケットのフレームを復号化した後に、復号化処理して得られた音声信号を０から次第に音量を増加させるフェードイン処理を施して混合部２７を介してＤ／Ａ変換部２４に出力する。
【０１２６】
一方、消失補償処理部２６では、再生される音声信号波形が不連続になることに起因する音声品質の劣化を避けるために、Ｎ個（ここではＮ＝２）の擬似フレーム３’，４’の音声信号を生成して混合部２７を介してＤ／Ａ変換部２４に出力する。
【０１２７】
これにより、符号化処理部２３から出力されたシーケンス番号が１番と２番のパケットに含まれるフレームを復号した音声信号波形と、消失補償処理部２６から出力された擬似フレーム３’，４’の音声信号とが混合部２７によって混合されてクロスフェード処理が施され、これがＤ／Ａ変換部２４に出力される。
【０１２８】
また、復号化処理部２３においては、シーケンス番号が３番以降のパケットのフレームに対しては、通常通りの復号化処理を施す。この復号化された音声信号はそのまま通常通り再生される。
【０１２９】
上記第５実施形態によれば、通信網３においてパケットが遅延したときにも、品質劣化を低減して音声再生することができる。
【０１３０】
（第６実施形態）
次に、本発明の第６実施形態を図１０を参照して説明する。尚、第６実施形態における装置構成は前述した第４，５実施形態と同様である。
【０１３１】
第６実施形態では、図１０に示すように、前述した第５実施形態の処理に代えて、復号化処理部２３において、シーケンス番号が１番と２番のパケットを破棄し、シーケンス番号が３番と４番のパケットに含まれるフレームを復号化処理してさらにフェードイン処理を施し、このフェードイン処理した音声信号と、消失補償処理部２６によって生成した擬似フレーム３’，４’の音声信号とを混合部２７によって合成することによりクロスフェードして出力するようにした。
【０１３２】
上記第６実施形態によっても第５実施形態と同様に、通信網３においてパケットが遅延したときにも、品質劣化を低減して音声再生することができる。
【０１３３】
（第７実施形態）
次に、本発明の第７実施形態を説明する。
【０１３４】
図１１は本発明の第７実施形態における音声パケット通信システムの受信装置を示すブロック図である。図において、前述した第４実施形態と同一構成部分は同一符号をもって表しその説明を省略する。また、第７実施形態と第４実施形態との相違点は、第４実施形態における消失補償処理部２６に代えて内部バッファ状態保持部２８を設けると共に混合部２８を除去したことである。
【０１３５】
上記構成によっても第５実施形態と同様の処理を行うことができる。即ち、Ｇ．７２９等の場合のように復号化処理部２３と消失補償処理部２６が実質的に同一であるような場合には、復号化処理部２３の内部バッファ２３ａに格納されている相関データを、内部バッファ状態保持部２８に一時的にコピーして保存しておき、遅延してきた次のフレームを復号する場合には、保持しておいた相関データを用いて復号を始めることで、同様の処理を行うことができる。
【０１３６】
この場合にはシーケンス番号が０番のパケットに含まれるフレームを復号化処理した直後に、復号化処理部２３の内部バッファ２３ａに格納されている相関データを内部バッファ状態保持部２８にコピーして保持し、上記消失補償処理部２６の処理と同様に１’番から４’番までの擬似フレームを生成する。
【０１３７】
次いで、遅延して受信したシーケンス番号が１番のパケットに含まれるフレームを復号化処理する場合には、内部バッファ状態保持部２８に保持されている相関データを復号化処理部２６の内部バッファ２３ａにコピーして復帰してから復号化処理を行う。
【０１３８】
上記第７実施形態によっても第５実施形態と同様に、通信網３においてパケットが遅延したときにも、品質劣化を低減して音声再生することができる。
【０１３９】
尚、前述した各実施形態は本発明の一具体例であって、本発明が上記実施形態にのみ限定されることはない。
【０１４０】
また、前後のパケットが入れ替わった状態でパケットを受信し、これらのパケットのフレームを復号化せざるおえない場合にも、本発明の手法を適用可能であることは言うまでもないことである。
【０１４１】
【発明の効果】
以上説明したように本発明のリアルタイムパケット処理装置及びその方法によれば、復号化処理部による復号処理を行う際に処理対象となるフレームが連続していないときは、不連続となったフレームからＮ個のフレームに対して復号化処理部における復号化のみが行われて復号化したＮ個のフレームに対して音声再生処理を施さずに前記Ｎ個のフレームの次のフレームから音声再生処理が施される、或いは、不連続となったフレームからＮ個のフレームに対して復号化が行われ該復号化した前記Ｎ個のフレームに対して音量を低下させた音声再生処理が施されるため、復号化を行う際に復号化処理部において用いる相関データを、送信装置側の符号化処理部における相関データと一致させることができ、前記Ｎ個のフレームの次のフレームから適切な音声再生処理を行うことができるという非常に優れた効果を奏するものである。
【図面の簡単な説明】
【図１】本発明の第１実施形態におけるリアルタイムパケット処理装置の機能構成を示すブロック図
【図２】本発明の第１実施形態における音声パケット送信装置による音声信号のパケット化を説明する図
【図３】本発明の第１実施形態において用いているリアルタイム転送プロトコルヘッダを説明する図
【図４】本発明の第１実施形態におけるリアルタイムパケット処理装置の処理を説明するタイミングチャート
【図５】本発明の第２実施形態におけるリアルタイムパケット処理装置の処理を説明するタイミングチャート
【図６】本発明の第３実施形態におけるリアルタイムパケット処理装置の処理を説明するタイミングチャート
【図７】本発明の第４実施形態におけるリアルタイムパケット処理装置の機能構成を示すブロック図
【図８】本発明の第４実施形態におけるリアルタイムパケット処理装置の処理を説明するタイミングチャート
【図９】本発明の第５実施形態におけるリアルタイムパケット処理装置の処理を説明するタイミングチャート
【図１０】本発明の第６実施形態におけるリアルタイムパケット処理装置の処理を説明するタイミングチャート
【図１１】本発明の第７実施形態におけるリアルタイムパケット処理装置の受信装置の機能構成を示すブロック図
【符号の説明】
１…送信装置、２…受信装置、３…通信網、１１…音声入力部、１２…アナログ／ディジタル（Ａ／Ｄ）変換部、１３…符号化処理部、１３ａ…内部バッファ、１４…パケット生成部、１５…送信部、２１…受信部、２２…パケット解析部、２３…復号化処理部、２３ａ…内部バッファ、２４…ディジタル／アナログ（Ｄ／Ａ）変換部、２５…音声出力部、２６…消失補償処理部、２６ａ…内部バッファ、２７…混合部、２８…内部バッファ状態保持部。
[0001]
[Technical field of the invention]
  The present invention relates to a real-time packet processing apparatus and method for processing frame data contained in a received packet in real time.
[0002]
[Prior art]
  Conventionally, with the digitization of electronic devices, information to be transferred has generally been packetized for transfer in information communication. For example, when an audio signal is transferred, the transmitting side distributes audio data sampled at a predetermined sampling frequency into separate packets, a predetermined amount at a time, and transfers it packet by packet. The receiving side extracts the audio data from the received packets and joins the extracted audio data together for playback or mixing.
[0003]
  That is, in an electronic device that performs packet communication as described above, the transmitting side forms and transmits a packet as soon as one packet's worth of data has been obtained, and the receiving side reads out the data in a packet at intervals equal to the time required to play back the data stored in a received packet. In the case of real-time transfer of audio data, for example, this allows the receiving side to play back or mix continuous audio from a plurality of packets received separately.
[0004]
  Such packet communication is in most cases performed using computer devices. It is used in, for example, mobile phones using wireless communication, well-known IP telephones using communication networks such as the Internet, systems that distribute content such as music from a distribution server to user terminal devices, and remote conference systems.
[0005]
  For example, speech coding schemes that use inter-frame prediction (predictive coding schemes) include the ITU standards G.729, G.723.1, and G.722.1.
[0006]
  These coding schemes have the constraint that a correct acoustic signal cannot be restored unless the correlation data stored in the internal buffer of the encoding processing unit on the transmitting device side matches the correlation data stored in the internal buffer of the decoding processing unit on the receiving device side. Here, the correlation data is the data used by the predictive coding schemes described in G.729, G.723.1, and G.722.1.
[0007]
  For example, when the encoding processing unit on the transmitting device side encodes audio frames 0, 1, 2, and 3 and sends them in packets 0, 1, 2, and 3, the decoding processing unit on the receiving device side must decode frames 0, 1, 2, and 3 in the same order in which they were encoded, regardless of the order in which the packets arrive. Otherwise, the correlation data of the encoding processing unit at the time each frame was encoded no longer matches the correlation data of the decoding processing unit at the time that frame is decoded, and a correct decoded waveform cannot be obtained.
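The ordering requirement above can be illustrated with a small reordering step in front of the decoder. The following is a minimal sketch, not the method of this publication: the function name `release_in_order`, the `(seq, frame)` tuple layout, and the unbounded `pending` dictionary are assumptions made for illustration. A real receiver would also bound how long it waits for a missing packet.

```python
def release_in_order(arrivals, first_seq=0):
    """Return (seq, frame) pairs in sequence order, given pairs in arrival order.

    Hypothetical helper: frames that arrive early are parked in `pending`
    until every earlier sequence number has been released.
    """
    pending = {}
    expected = first_seq
    released = []
    for seq, frame in arrivals:
        pending[seq] = frame
        # Release every frame that is now contiguous with what was released.
        while expected in pending:
            released.append((expected, pending.pop(expected)))
            expected += 1
    return released


# Packets 1 and 0 are swapped in transit, as are 3 and 2.
arrivals = [(1, "f1"), (0, "f0"), (3, "f3"), (2, "f2")]
ordered = release_in_order(arrivals)
```

The decoder is then fed `ordered`, so frame 1 is never decoded before frame 0 even though packet 1 arrived first.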
[0008]
  In addition, when a packet is lost during transfer, packet loss concealment (PLC: Packet Loss Concealment) processing may be performed in the receiving device.
[0009]
  Known packet loss concealment methods include those used in G.711 Appendix I and in the G.729 standard.
[0010]
[Patent Document 1]
  JP 2000-83050 A
[Non-Patent Document 1]
  ITU-T Recommendation G.729
[Non-Patent Document 2]
  ITU-T Recommendation G.723.1
[Non-Patent Document 3]
  ITU-T Recommendation G.722.1
[Non-Patent Document 4]
  ITU-T Recommendation G.711 Appendix I
[0011]
[Problems to be solved by the invention]
  Because of the constraint of the speech coding schemes (predictive coding schemes) described above, when a packet is lost, the lost frame cannot be fed to the decoding processing unit to update its correlation data. Consequently, when a frame in a packet correctly received after the loss is decoded, the correlation data of the encoding processing unit on the transmitting device side and that of the decoding processing unit on the receiving device side do not match, and the audio frame may not be restored correctly.
[0012]
  In the conventional methods described above, no particular measure is taken against the correlation data mismatch caused by packet loss or delay, so perceptible quality degradation occurred in the reproduced audio.
[0013]
  Moreover, despite this constraint of predictive coding schemes, in applications such as VoIP there is no guarantee that every voice packet sent by the transmitting device arrives correctly at the receiving device. For example, when packets are lost in the communication network, the correlation data of the encoding processing unit and the decoding processing unit easily become mismatched. In addition, when a communication session is established, the receiving device does not necessarily receive packets correctly from the first packet sent by the transmitting device, so communication may continue from its initial stage with the correlation data of the two units mismatched.
[0014]
  In view of the above problems, an object of the present invention is to provide a real-time packet processing apparatus and method that, according to the arrival state of voice packets, minimize the mismatch between the correlation data of the encoding processing unit and the decoding processing unit when a packet is lost or delayed, interpolate frames appropriately, and reproduce audio with reduced quality degradation.
[0015]
[Means for Solving the Problems]
  In general, the prediction used in the encoding and decoding processing units exploits the correlation across several frames. For several frames after the internal states of the two units become mismatched, a correct acoustic signal cannot be restored, and perceptible quality degradation occurs in the reproduced audio.
[0016]
  However, even when the correlation data of the encoding and decoding processing units become mismatched, the correlation data gradually come back into agreement once several received frames have been decoded in the correct order. As a result, the degradation of the decoded waveform gradually subsides as correct frames continue to be decoded.
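The convergence described here can be seen in a toy model. The sketch below is an illustration under stated assumptions, not G.729 or any codec named in this document: it uses a hypothetical first-order predictor with coefficient `a = 0.5`, treats each sample as one "frame", and treats the single value `state` as the "correlation data". With such a predictor, a decoder that starts from the wrong state has an error that shrinks by the factor `a` on every correctly received frame.

```python
def encode(samples, a=0.5):
    """Toy predictive encoder: emits the residual against a * previous sample."""
    state = 0.0  # encoder-side stand-in for the correlation data
    residuals = []
    for x in samples:
        residuals.append(x - a * state)
        state = x
    return residuals


def decode(residuals, state=0.0, a=0.5):
    """Toy predictive decoder: rebuilds each sample from its own running state."""
    out = []
    for r in residuals:
        y = a * state + r
        out.append(y)
        state = y  # decoder-side stand-in for the correlation data
    return out


signal = [8.0] * 8
residuals = encode(signal)

# The first two residuals are lost, so the decoder starts from a stale state.
decoded = decode(residuals[2:], state=0.0)
errors = [abs(y - x) for y, x in zip(decoded, signal[2:])]
# errors == [4.0, 2.0, 1.0, 0.5, 0.25, 0.125]
```

The first frames after the gap carry the largest error, which is the motivation for decoding them to update the state while not playing them.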
[0017]
  The present invention proposes a real-time packet processing apparatus and method that, when the correlation data of the encoding and decoding processing units become mismatched, deliberately do not reproduce the audio of the first few frames, so that degraded audio is not played and listening quality is improved.
[0018]
  In the present invention, a transmitting device cuts a continuous input audio signal at predetermined periods, samples each cut signal at a predetermined sampling interval shorter than the period, encodes the resulting plurality of sampling data with an encoding processing unit to generate a frame, generates for each frame a packet containing that frame, and transmits the packets sequentially. A receiving device receives the packets from the transmitting device via a communication network, decodes the frame contained in each received packet with a decoding processing unit, and performs audio reproduction processing on the sampling data contained in the decoded frame. In this real-time packet processing using a predictive coding scheme, the receiving device stores the packets received from the transmitting device in a buffer and, when decoding a frame contained in a packet input from the buffer, analyzes the encoded frame and holds the analysis result as correlation data. When the frames to be processed are not continuous at the time of decoding by the decoding processing unit, the receiving device performs only decoding in the decoding processing unit on N frames starting from the frame at which the discontinuity occurred, does not perform the audio reproduction processing on the N decoded frames, and, from the frame following the N frames, performs decoding using the held correlation data and performs the audio reproduction processing. Furthermore, when the number of packets stored in the buffer exceeds a predetermined number, the receiving device discards, from among a predetermined number of consecutive packets in the buffer that are targeted for discarding, the packets other than those of the last N frames, performs only decoding on the last N frames, applies fade-in processing that gradually increases the volume from silence to the frames of the M packets that follow, and applies fade-out processing that gradually lowers the volume to silence to the frames of the M packets preceding the discarded packets.
[0019]
  According to the present invention, when the frames to be processed are not continuous at the time of decoding by the decoding processing unit, only decoding in the decoding processing unit is performed on N frames starting from the frame at which the discontinuity occurred; the audio reproduction processing is not performed on the N decoded frames, and from the frame following the N frames, decoding is performed using the held correlation data and the audio reproduction processing is performed. Because the N frames are decoded, the correlation data used by the decoding processing unit can be made to match, completely or almost completely, the correlation data in the encoding processing unit on the transmitting device side, and appropriate audio reproduction can be performed from the frame following the N frames.
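The decode-only rule can be sketched as a per-frame decision. This is a minimal illustration under assumptions, not the claimed implementation: the function name `plan_actions` and the action labels are hypothetical, a discontinuity is taken to be any sequence number that is not the previous one plus one, and wraparound of the sequence number is ignored.

```python
def plan_actions(seq_numbers, n):
    """Label each received frame "play" or "decode_only".

    Hypothetical helper: after a gap in the sequence numbers, the next n
    frames are decoded (to update the decoder state) but not played.
    """
    actions = []
    prev = None
    remaining = 0  # frames still to be decoded without playback
    for seq in seq_numbers:
        if prev is not None and seq != prev + 1:
            remaining = n  # discontinuity detected at this frame
        if remaining > 0:
            actions.append((seq, "decode_only"))
            remaining -= 1
        else:
            actions.append((seq, "play"))
        prev = seq
    return actions


# Packets 1 and 2 are lost; with N = 2, frames 3 and 4 are decoded silently.
plan = plan_actions([0, 3, 4, 5, 6], n=2)
```

Here `plan` marks frame 0 as played, frames 3 and 4 as decode-only, and frames 5 and 6 as played.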
[0020]
  Further, in the present invention, when the frames to be processed are not continuous at the time of decoding by the decoding processing unit, the receiving device decodes, in the decoding processing unit, N frames starting from the frame at which the discontinuity occurred, and performs the audio reproduction processing on the N decoded frames at reduced volume.
[0021]
  According to the present invention, when the frames to be processed are not continuous at the time of decoding by the decoding processing unit, N frames starting from the frame at which the discontinuity occurred are decoded in the decoding processing unit, and the audio reproduction processing is then performed on the N decoded frames at reduced volume.
[0022]
  At this time, as above, decoding the N frames allows the correlation data used by the decoding processing unit to match, completely or almost completely, the correlation data in the encoding processing unit on the transmitting device side. Furthermore, because the N frames are reproduced at reduced volume, no abnormal noise occurs at this transition and degradation of audio quality is reduced.
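The reduced-volume variant can be sketched as a gain applied to the N decoded frames. The following is only an illustration: the function name `attenuate_after_gap`, the representation of a frame as a list of samples, and the gain value 0.25 are assumptions, since the text does not state how far the volume is lowered.

```python
def attenuate_after_gap(frames, first_bad, n, gain=0.25):
    """Scale the n frames starting at index first_bad; leave other frames as-is.

    Hypothetical helper: `gain` is an illustrative value, not one given
    in the description.
    """
    out = []
    for i, frame in enumerate(frames):
        g = gain if first_bad <= i < first_bad + n else 1.0
        out.append([g * s for s in frame])
    return out


decoded_frames = [[4.0, 4.0], [4.0, -4.0], [8.0, 8.0], [2.0, 2.0]]
# Frames at index 1 and 2 follow the discontinuity (N = 2).
played = attenuate_after_gap(decoded_frames, first_bad=1, n=2)
```

Frames 1 and 2 are played at a quarter of their decoded amplitude, while frames 0 and 3 are unchanged.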
[0023]
  In the present invention, the receiving device stores packets received from the transmitting device in a buffer and decodes the frames contained in packets read from the buffer. When the number of packets stored in the buffer exceeds a predetermined number, the receiving device selects a predetermined number of consecutive packets in the buffer as discard targets, discards all of them except the last N frames, and only decodes those last N frames. It applies fade-in processing, which gradually raises the volume from silence, to the frames of the M packets that follow, and applies fade-out processing, which gradually lowers the volume to silence, to the frames of the M packets that precede the discarded packets.
[0024]
  According to the present invention, when the number of packets accumulated in the buffer of the receiving device exceeds a predetermined number, audio reproduction would be delayed. A predetermined number of consecutive packets stored in the buffer are therefore selected as discard targets, and all of them except the last N frames are discarded.
[0025]
  Because fade-in processing, which gradually raises the volume from silence, is applied to the M frames, the audio waveform does not become discontinuous at the transition from silence to sound when the receiving device reproduces the audio. No abnormal noise occurs at this transition, and degradation of voice quality is reduced.
[0026]
  Furthermore, because fade-out processing, which gradually lowers the volume to silence, is applied to the M frames preceding the discontinuity, the level falls gradually over the transition from sound to silence. When the receiving device reproduces the audio, the waveform does not become discontinuous at this transition, so no abnormal noise is generated and degradation of voice quality is reduced.
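The fade-in and fade-out processing described above can be pictured with a minimal sketch. The text does not specify the shape of the volume ramp, so the linear ramp below is an assumption, and the function name `linear_fade` is hypothetical. The input is taken to be the decoded samples of the M boundary frames concatenated into one list.

```python
def linear_fade(samples, fade_in):
    """Scale samples by a linear gain ramp over the whole span.

    fade_in=True  : gain rises from 0.0 (silence) to 1.0.
    fade_in=False : gain falls from 1.0 to 0.0 (silence).
    The linear shape is an assumption; the source does not fix the curve.
    """
    n = len(samples)
    faded = []
    for i, sample in enumerate(samples):
        # Position within the span, 0.0 at the first sample, 1.0 at the last.
        ramp = i / (n - 1) if n > 1 else 1.0
        gain = ramp if fade_in else 1.0 - ramp
        faded.append(sample * gain)
    return faded
```

For example, `linear_fade([100] * 5, fade_in=True)` ramps a constant signal up from 0, and the same call with `fade_in=False` ramps it down to 0.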
[0027]
  Furthermore, because only the last N frames of the packets to be discarded are decoded, the correlation data used for predictive coding in the encoding processing unit on the transmitting side and in the decoding processing unit on the receiving side can be made to match completely or almost completely.
[0028]
  In the present invention, the receiving device determines that the frame is discontinuous when the sequence number becomes discontinuous based on the sequence number included in the packet.
[0029]
  According to the present invention, based on the sequence number included in the packet, the receiving device determines that the frame is discontinuous when the sequence number becomes discontinuous.
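A minimal sketch of this discontinuity check follows. It assumes packets arrive in order (reordered or duplicated packets would need extra handling that is not shown) and relies on the fact that RTP sequence numbers are 16 bits wide and wrap from 65535 to 0. The function names are hypothetical.

```python
SEQ_MODULUS = 1 << 16  # RTP sequence numbers are 16-bit values

def missing_count(prev_seq, seq):
    """Number of packets missing between two consecutively received packets."""
    return (seq - prev_seq - 1) % SEQ_MODULUS

def is_discontinuous(prev_seq, seq):
    """True when the sequence numbers are not consecutive."""
    return missing_count(prev_seq, seq) != 0
```

With the example used in the embodiments, receiving sequence number 3 right after sequence number 0 gives `missing_count(0, 3) == 2`, that is, packets 1 and 2 are missing.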
[0030]
  In the present invention, the receiving apparatus reproduces audio by superimposing the frame subjected to the fade-out process and the frame subjected to the fade-in process.
[0031]
  According to the present invention, because the frames subjected to fade-out processing and the frames subjected to fade-in processing are overlapped when the audio is reproduced, no silent interval occurs at the discontinuity, and degradation of sound quality is further reduced.
[0032]
DETAILED DESCRIPTION OF THE INVENTION
  Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
[0033]
  (First embodiment)
  FIG. 1 is a block diagram showing the functional configuration of a real-time packet processing apparatus according to the first embodiment of the present invention. FIG. 2 is a diagram for explaining packetization of a voice signal by a voice packet transmitting apparatus according to the first embodiment of the present invention. FIG. 3 is a diagram for explaining the header of the Real-time Transport Protocol (hereinafter referred to as RTP) used in the first embodiment of the present invention. In the figures, 1 is a voice packet transmitting device (hereinafter simply referred to as a transmitting device), 2 is a voice packet receiving device (hereinafter simply referred to as a receiving device), and 3 is a communication network such as the Internet. In the present embodiment, an apparatus that transfers voice packets from the transmitting device 1 to the receiving device 2 in real time using UDP/IP via the communication network 3 is described as an example.
[0034]
  The transmission device 1 is a well-known computer device that operates according to a preset program. It includes an audio input unit 11, an analog/digital (A/D) conversion unit 12, an encoding processing unit 13, a packet generation unit 14, and a transmission unit 15. Each of these units of the transmission device 1 is implemented by a combination of hardware and software.
[0035]
  The receiving device 2 is likewise a known computer device that operates according to a preset program. It includes a receiving unit 21, a packet analysis unit 22, a decoding processing unit 23, a digital/analog (D/A) conversion unit 24, and an audio output unit 25. Each of these units of the receiving device 2 is implemented by a combination of hardware and software.
[0036]
  The audio input unit 11 converts the voice into an analog electric signal 4, as shown in FIG. 2, and outputs it to the A/D conversion unit 12. The A/D conversion unit 12 converts the signal into digital form at a predetermined sampling interval, and the resulting audio data (samples) are stored sequentially in a data buffer (not shown) provided in the encoding processing unit 13.
[0037]
  As shown in FIG. 2, the encoding processing unit 13 cuts the audio data stored in its data buffer every predetermined period T to form audio data frames 31, from which packets 30 are generated and transmitted sequentially.
[0038]
  The encoding processing unit 13 encodes each audio data frame input from the A/D conversion unit 12. When doing so, it holds the internal state resulting from encoding the previous frame in the internal buffer 13a and predicts from the past, which improves coding gain.
[0039]
  In this embodiment, to reduce the quality degradation that arises when packet loss causes the correlation data in the encoding processing unit 13 of the transmitting device 1 and in the decoding processing unit 23 of the receiving device 2 to disagree, the internal buffer 13a of the encoding processing unit 13 is reset to its initial values whenever the signal changes from the silent state to the voiced state. This reduces quality degradation caused by transmission errors.
[0040]
  Further, the encoding processing unit 13 encodes the audio data frame to be encoded based on the analysis result and sends it to the packet generation unit 14.
[0041]
  The packet generator 14 generates an RTP packet including the encoded voice data input from the encoding processor 13 and sends it to the transmitter 15. An RTP header as shown in FIG. 3 is added to the RTP packet at this time.
[0042]
  As is well known, the RTP header includes 2-bit version information V, 1-bit padding information P, 1-bit extension information X, 4-bit CSRC count information CC, 1-bit marker information M (hereinafter referred to as the marker bit), 7-bit payload type information PT, a 16-bit sequence number, a 32-bit timestamp, a 32-bit synchronization source (SSRC) identifier, and 32-bit contributing source (CSRC) identifiers.
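As an illustration of this header layout, the following sketch unpacks the 12-byte fixed RTP header and the optional CSRC list, using the field widths defined in RFC 3550 (where the CSRC count occupies 4 bits). The function name `parse_rtp_header` and the dictionary keys are hypothetical, and header extensions and padding are deliberately not handled.

```python
import struct

def parse_rtp_header(packet):
    """Split an RTP packet into header fields and payload (no extension/padding handling)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two 1-byte fields, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = b0 & 0x0F
    csrc_end = 12 + 4 * csrc_count
    if len(packet) < csrc_end:
        raise ValueError("truncated CSRC list")
    csrc = list(struct.unpack("!%dI" % csrc_count, packet[12:csrc_end]))
    return {
        "version": b0 >> 6,            # V, 2 bits
        "padding": (b0 >> 5) & 1,      # P, 1 bit
        "extension": (b0 >> 4) & 1,    # X, 1 bit
        "csrc_count": csrc_count,      # CC, 4 bits
        "marker": b1 >> 7,             # M, 1 bit
        "payload_type": b1 & 0x7F,     # PT, 7 bits
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc": csrc,
        "payload": packet[csrc_end:],
    }
```

A receiver along the lines of the packet analysis unit 22 would read `sequence_number`, `timestamp`, and `marker` from the result and pass `payload` on for decoding.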
[0043]
  In the present embodiment, the marker bit M is set to “1” in the first packet transmitted on entering the voiced state after packet transmission was stopped in the silent state, and to “0” in all other packets.
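The marker-bit rule can be sketched as follows. The input is a hypothetical per-frame list of voiced/silent flags, and the sketch assumes that the very first voiced frame of a session is also treated as the start of a voiced period, which the text does not state explicitly. The function name `marker_bits` is an assumption.

```python
def marker_bits(voiced_flags):
    """Return (frame_index, marker_bit) for every frame that is transmitted.

    Silent frames are not transmitted. The first frame sent after a silent
    period gets marker bit 1; all other transmitted frames get 0.
    """
    transmitted = []
    previous_voiced = False  # assumption: the session starts in the silent state
    for index, voiced in enumerate(voiced_flags):
        if voiced:
            transmitted.append((index, 0 if previous_voiced else 1))
        previous_voiced = voiced
    return transmitted
```

For the flags `[True, True, False, False, True, True]`, frames 0 and 4 carry marker bit 1, frames 1 and 5 carry 0, and frames 2 and 3 are not sent.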
[0044]
  The transmission unit 15 transmits the RTP packet input from the packet generation unit 14 to the reception device 2 via the communication network 3.
[0045]
  On the other hand, the reception unit 21 of the reception device 2 receives the RTP packet transmitted from the transmission device 1 via the communication network 3 and sends it to the packet analysis unit 22.
[0046]
  The packet analysis unit 22 analyzes the RTP packet input from the reception unit 21, separates it into a header portion and an encoded audio data frame, analyzes the contents of the header portion, and outputs the encoded audio data frames to the decoding processing unit 23 in the order specified by the RTP timestamp. The packet analysis unit 22 also notifies the decoding processing unit 23 of the value of the marker bit M in the RTP header.
[0047]
  The decoding processing unit 23 decodes the encoded audio data frame input from the packet analysis unit 22, converts it into digital audio data, and outputs the digital audio data to the D/A conversion unit 24.
[0048]
  When decoding, the decoding processing unit 23 analyzes the encoded audio data frame and temporarily stores the analysis result in the internal buffer 23a. The analysis itself refers to the analysis result already stored in the internal buffer 23a or to preset initial analysis values. Using the analysis result of the previous frame held in the internal buffer 23a allows analysis and decoding that take the correlation between consecutive frames into account.
[0049]
  The D/A conversion unit 24 receives the digital audio data obtained by decoding in the decoding processing unit 23, converts it into an analog audio signal, and outputs it to the audio output unit 25.
[0050]
  The audio output unit 25 converts the analog audio signal input from the D/A conversion unit 24 into sound and outputs it.
[0051]
  Next, the operation of the real-time packet processing apparatus according to the present embodiment having the above configuration will be described.
[0052]
  In VoIP communication, the sending device 1 side may start sending voice packets before the receiving device 2 side is ready to accept. In such a case, the receiving apparatus 2 cannot correctly receive the packet immediately after the start of communication, and misses the first few packets.
[0053]
  In this case, the correlation data stored in the internal buffer 13a of the encoding processing unit 13 on the transmitting device 1 side and the correlation data stored in the internal buffer 23a of the decoding processing unit 23 on the receiving device 2 side become inconsistent, and a correct audio signal cannot be generated.
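The effect of such a state mismatch can be illustrated with a toy predictive codec. This is not G.729 or any codec named in the text; it is a first-order predictor with a hypothetical coefficient `A`, chosen only to show that a decoder starting from the wrong state produces errors that shrink as more frames are decoded. All names below are assumptions.

```python
A = 0.5  # hypothetical predictor coefficient, not taken from any real codec

def encode_frame(samples, state):
    """Encode samples as prediction errors; state is the previous input sample."""
    residuals = []
    for x in samples:
        residuals.append(x - A * state)
        state = x
    return residuals, state

def decode_frame(residuals, state):
    """Rebuild samples from prediction errors; state is the previous output sample."""
    decoded = []
    for r in residuals:
        state = r + A * state
        decoded.append(state)
    return decoded, state

def errors_when_first_frame_missed(frames):
    """Per-frame maximum decoding error when the receiver never sees frames[0]."""
    encoder_state, coded = 0.0, []
    for frame in frames:
        residuals, encoder_state = encode_frame(frame, encoder_state)
        coded.append(residuals)
    decoder_state, errors = 0.0, []  # decoder starts from its initial state
    for original, residuals in zip(frames[1:], coded[1:]):
        decoded, decoder_state = decode_frame(residuals, decoder_state)
        errors.append(max(abs(d - o) for d, o in zip(decoded, original)))
    return errors
```

With the frames `[[8, 16, 24, 32], [40, 48, 56, 64], [64] * 4, [64] * 4]`, the per-frame maximum error falls from 16 to 1 to 0.0625. In this toy model the decoder state converges toward the encoder state as frames are decoded, which is the kind of behaviour that decoding N frames without reproducing them is meant to exploit.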
[0054]
  In the present embodiment, the case where G.729 is used is described as an example. Since one G.729 frame is 10 ms, it is assumed that each packet carries one frame, that is, 10 ms of audio. The same conditions apply to the embodiments that follow.
[0055]
  When VoIP voice packet communication is performed using RTP/RTCP, the receiving device 2 cannot know the sequence number of the first packet sent by the transmitting device 1, and so cannot tell whether the first packet it receives is the first packet the transmitting device 1 sent.
[0056]
  Consequently, the receiving device 2 cannot know whether what it has received includes the first encoded frame, that is, the frame generated with the correlation data in the internal buffer of the encoding processing unit 13 of the transmitting device 1 in its reset state.
[0057]
  If the receiving device 2, having failed to receive the first several packets sent from the transmitting device 1, decodes and reproduces the voice frames in the received packets as they are without any additional processing, the quality of the reproduced speech at the head of the stream may be degraded by the mismatch of correlation data between the encoding processing unit and the decoding processing unit.
[0058]
  To avoid this problem, in the first embodiment the frames of the first several packets after the start of communication are decoded but not reproduced. After several frames have been decoded and the signal waveform has stabilized, reproduction begins with fade-in processing.
[0059]
  For example, in the example shown in FIG. 4, the transmitting apparatus 1 transmits packets in order starting from the packet containing the frame with sequence number 0, but the receiving apparatus 2, after starting reception, receives packets only from sequence number 3 onward. In this case, the receiving apparatus 2 only decodes the N frames contained in the first N packets it receives, so that correlation data suitable for normal decoding accumulates in the internal buffer 23a of the decoding processing unit 23. Here, N = 2: the two packets with sequence numbers 3 and 4 are only decoded, and the correlation data is stored in the internal buffer 23a without audio reproduction.
[0060]
  Further, the receiving apparatus 2 decodes the frames of the M packets following the frames with sequence numbers 3 and 4 and applies fade-in processing when reproducing them. Here, M = 2, and fade-in processing is applied to the frames of the packets with sequence numbers 5 and 6. The frames of packets with sequence number 7 and later undergo normal decoding and audio reproduction. In the following description, N and M may each be any integer of 0 or more, provided that N + M is 1 or more.
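The handling of the first received frames in this embodiment can be summarised in a short sketch that labels each frame with the treatment it receives. The function name `plan_after_gap` and the label strings are hypothetical; N and M default to 2 as in the example of FIG. 4.

```python
def plan_after_gap(first_seq, count, n=2, m=2):
    """Label `count` consecutive frames starting at sequence number `first_seq`.

    The first n frames are decoded but not reproduced, the next m frames are
    decoded and reproduced with fade-in, and later frames are handled normally.
    """
    plan = []
    for offset in range(count):
        if offset < n:
            action = "decode_only"
        elif offset < n + m:
            action = "fade_in"
        else:
            action = "normal"
        plan.append((first_seq + offset, action))
    return plan
```

Calling `plan_after_gap(3, 5)` labels frames 3 and 4 as decode-only, frames 5 and 6 as fade-in, and frame 7 as normal, matching the example above.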
[0061]
  As described above, according to the present embodiment, when the leading voice packet cannot be received, the voice can be reproduced with reduced quality deterioration.
[0062]
  (Second Embodiment)
  Next, a second embodiment of the present invention will be described.
[0063]
  The second embodiment describes the case where a mismatch between the correlation data stored in the internal buffer 13a of the encoding processing unit 13 of the transmitting device 1 and that stored in the internal buffer 23a of the decoding processing unit 23 of the receiving device 2 is caused by packet loss in the communication network 3. The apparatus configuration in the second embodiment is the same as that of the first embodiment.
[0064]
  When a packet is lost, the frame contained in the lost packet cannot be decoded, so a state mismatch arises between the correlation data stored in the internal buffer 13a of the encoder in the encoding processing unit 13 on the transmitting device 1 side and that stored in the internal buffer 23a of the decoder in the decoding processing unit 23 on the receiving device 2 side.
[0065]
  When such a state mismatch arises between the correlation data in the internal buffers 13a and 23a, this embodiment does not decode and reproduce the frame immediately after the packet loss right away. Instead, the frames of the first several packets received after the loss are decoded but not reproduced, and once several frames have been decoded and the signal waveform has stabilized, reproduction resumes with fade-in processing.
[0066]
  The operation for realizing the above contents will be described below with reference to FIG.
[0067]
  Using the sequence numbers included in the received packets in the receiving unit 21 and the packet analyzing unit 22, the receiving device 2 can know the disappearance of the packets with the sequence numbers 1 and 2.
[0068]
  On detecting the loss of the packets with sequence numbers 1 and 2, the receiving device 2 generates and reproduces silence for the frames of these lost packets.
[0069]
  Further, in the present embodiment, the receiving device 2 initializes (resets) the internal buffer 23a of the decoding processing unit 23 before decoding the frame of the packet received next after the lost packets with sequence numbers 1 and 2.
[0070]
  Next, the receiving device 2 starts decoding the received sequence number 3 and subsequent packets, but only decodes and does not reproduce during N frames. That is, the frames of the two packets with the sequence numbers 3 and 4 are only decoded, and the audio is not reproduced.
[0071]
  The subsequent M frames are reproduced after applying fade-in processing, which gradually raises the volume from 0, to the audio signal obtained by decoding. That is, for the frames of the two packets with sequence numbers 5 and 6, the decoded audio signal is reproduced with fade-in processing that gradually raises the volume from 0.
[0072]
  After N + M frames, that is, for the frames of packets with sequence number 7 and subsequent packets, the audio signal obtained by the decoding process is reproduced as usual.
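The receiver behaviour of the second embodiment can be sketched as a function that turns a list of received sequence numbers into an ordered list of events. It assumes in-order arrival (reordered or duplicated packets are not handled), and the function name `schedule` and the event labels are hypothetical.

```python
def schedule(received, n=2, m=2):
    """Return (sequence_number, event) pairs for in-order received packets."""
    events = []
    prev = None
    since_gap = None  # frames handled since the last loss; None when not recovering
    for seq in received:
        if prev is not None:
            missing = (seq - prev - 1) % 65536  # 16-bit RTP sequence numbers
            if missing:
                for k in range(1, missing + 1):
                    events.append(((prev + k) % 65536, "silence"))
                # Reset the decoder's correlation data before decoding resumes.
                events.append((seq, "reset_decoder"))
                since_gap = 0
        if since_gap is not None and since_gap >= n + m:
            since_gap = None  # recovery finished
        if since_gap is None:
            action = "normal"
        elif since_gap < n:
            action = "decode_only"
        else:
            action = "fade_in"
        if since_gap is not None:
            since_gap += 1
        events.append((seq, action))
        prev = seq
    return events
```

For `schedule([0, 3, 4, 5, 6, 7])` the result reproduces the sequence described for FIG. 5: silence for frames 1 and 2, a decoder reset before frame 3, decode-only for frames 3 and 4, fade-in for frames 5 and 6, and normal reproduction from frame 7.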
[0073]
  According to the second embodiment, when a packet is lost depending on the arrival state of a voice packet, it is possible to reproduce voice while reducing quality degradation.
[0074]
  (Third embodiment)
  Next, a third embodiment of the present invention will be described. The apparatus configuration in the third embodiment is the same as that in the first embodiment described above.
[0075]
  The third embodiment describes the case where, in packet communication, a large number of IP packets arrive at once, the packets accumulate in the FIFO buffer in the receiving unit 21 of the receiving device 2, and some of the received packets need to be discarded.
[0076]
  In such a case, if some of the received packets are discarded and the audio frames in the received packets are therefore not all decoded, a state mismatch arises between the correlation data stored in the internal buffer 13a of the encoding processing unit 13 of the transmitting device 1 and the correlation data stored in the internal buffer 23a of the decoding processing unit 23 of the receiving device 2.
[0077]
  Also, it is often unacceptable to perform the decoding process for all frames to be discarded in consideration of the processing load.
[0078]
  In the third embodiment, an example is shown in which voice quality deterioration caused by not decoding all frames in the above case is reduced.
[0079]
  In the third embodiment, M frames are provided as boundary frames, and crossfading is performed to eliminate the discontinuity of the speech waveform and reduce quality degradation.
[0080]
  Here, when the number of discarded frames is very small, all the frames including the discarded frames may be decoded. For example, when the number of frames to be discarded X <N, all frames can be decoded.
[0081]
  Next, a specific example of the above processing will be described with reference to FIG.
[0082]
  The specific example shown in FIG. 6 shows the processing when the packets with sequence numbers 11 to 14 are to be discarded because a large number of packets have accumulated in the FIFO buffer of the receiving unit 21.
[0083]
  At this time, packets with sequence numbers 11 and 12 are discarded. Further, the frames of the packets with the sequence numbers of 9 and 10 are used as boundary frames, and after these frames are decoded, fade-out processing for gradually reducing the sound volume to silence is performed on these boundary frames.
[0084]
  Further, the internal buffer 23a of the decoding processing unit 23 is initialized (reset) before the frame of the packet with the sequence number 13 is decoded.
[0085]
  The packets with sequence numbers 13 and 14 are decoded, and the correlation data stored in the internal buffer 23a of the decoding processing unit 23 is updated. However, audio reproduction is not performed for the decoded frames of the packets with sequence numbers 13 and 14.
[0086]
  Further, M frames subsequent to this are used as boundary frames, and an audio signal obtained by decoding these frames is subjected to fade-in processing for gradually increasing the volume from volume 0 and reproduced. That is, for the frames of the packets with sequence numbers 15 and 16, a fade-in process for gradually increasing the volume of the audio signal obtained by the decoding process from the volume 0 is performed.
[0087]
  Further, the frames with sequence numbers 9 and 10 and the frames with sequence numbers 15 and 16 are overlapped and cross-faded when the sound is reproduced.
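The overlap of the faded-out and faded-in boundary frames can be sketched as a sample-by-sample mix. The linear gain ramps are an assumption, since the text does not specify the fade curve, and the function name `crossfade` is hypothetical. Both inputs are the decoded samples of the M boundary frames on each side, concatenated into lists of equal length.

```python
def crossfade(outgoing, incoming):
    """Mix two equal-length sample spans: outgoing fades out while incoming fades in."""
    if len(outgoing) != len(incoming):
        raise ValueError("both spans must contain the same number of samples")
    n = len(outgoing)
    mixed = []
    for i in range(n):
        # ramp runs from 0.0 at the first sample to 1.0 at the last.
        ramp = i / (n - 1) if n > 1 else 1.0
        mixed.append(outgoing[i] * (1.0 - ramp) + incoming[i] * ramp)
    return mixed
```

Because the two gains always sum to 1, a signal that is identical on both sides passes through unchanged, and the output never drops to silence in the middle of the overlap.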
[0088]
  For the frames of packets with sequence number 17 and later, the audio signal obtained by decoding is reproduced as usual.
[0089]
  According to the third embodiment, even when a large number of IP packets arrive at once, too many packets accumulate in the FIFO buffer in the receiving unit 21 of the receiving device 2, and some of the received packets are discarded, audio can be reproduced with reduced quality degradation.
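The treatment of each frame around a discarded range can be sketched as a lookup table built from the first and last sequence numbers of the range. The function name `plan_discard` and the labels are hypothetical, sequence-number wraparound is ignored for brevity, and the sketch assumes at least N frames are to be discarded, since the text suggests decoding every frame when fewer than N would be dropped.

```python
def plan_discard(first_discard, last_discard, n=2, m=2):
    """Map sequence numbers around a discard range to their treatment."""
    if last_discard - first_discard + 1 < n:
        raise ValueError("fewer than n frames to discard; decode all frames instead")
    plan = {}
    for seq in range(first_discard - m, first_discard):
        plan[seq] = "decode_fade_out"   # boundary frames before the gap
    for seq in range(first_discard, last_discard - n + 1):
        plan[seq] = "discard"           # dropped without decoding
    for seq in range(last_discard - n + 1, last_discard + 1):
        plan[seq] = "decode_only"       # rebuild correlation data, no playback
    for seq in range(last_discard + 1, last_discard + 1 + m):
        plan[seq] = "decode_fade_in"    # boundary frames after the gap
    return plan
```

For the range 11 to 14 with N = M = 2 this yields fade-out for frames 9 and 10, discard for 11 and 12, decode-only for 13 and 14, and fade-in for 15 and 16, as in the example of FIG. 6. The decoder reset described above would take place before the first decode-only frame.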
[0090]
  (Fourth embodiment)
  Next, a fourth embodiment of the present invention will be described.
[0091]
  FIG. 7 is a block diagram showing a functional configuration of a voice packet communication system according to the fourth embodiment of the present invention. In the figure, the same components as those in the first embodiment are denoted by the same reference numerals, and the description thereof is omitted. Further, the difference between the fourth embodiment and the first embodiment is that the reception device 2 is provided with an erasure compensation processing unit 26 and a mixer 27.
[0092]
  The fourth embodiment describes an example in which packet loss compensation is performed by extending the packet-loss processing shown in the second embodiment.
[0093]
  When a packet is lost, the frame contained in the lost packet cannot be decoded and the internal state of the decoder cannot be updated. Therefore, a mismatch arises between the correlation data used when the encoding processing unit 13 on the transmitting device 1 side encoded a frame and the correlation data stored in the internal buffer 23a of the decoding processing unit 23 that decodes that frame.
[0094]
  The receiving device 2 can know the packet loss by using the sequence number included in the received packet in the receiving unit 21 and the packet analyzing unit 22. In the present embodiment, a case where packets with sequence numbers 1 and 2 are lost will be described with reference to FIG. 8 as a specific example.
[0095]
  On detecting the loss of these packets, the receiving apparatus 2 performs a first reproduction process in the erasure compensation processing unit 26 and a second reproduction process in the decoding processing unit 23. The output signal from the erasure compensation processing unit 26 and the output signal from the decoding processing unit 23 are mixed by the mixer 27 and input to the D/A conversion unit 24.
[0096]
  That is, when the receiving device 2 knows the loss of a packet, the correlation stored in the internal buffer 23a of the decoding processing unit 23 after decoding the last frame included in the last packet received normally. Data is copied to the internal buffer 26 a of the erasure compensation processing unit 26.
[0097]
  The erasure compensation processing unit 26 performs the first reproduction process. In this process, the correlation data copied to the internal buffer 26a is used to pseudo-generate and interpolate the audio waveform to be reproduced in place of the lost frames. The N frames following the interpolated frames are pseudo-generated and decoded, and the subsequent M frames are pseudo-generated, decoded, and subjected to fade-out processing. The result is output to the D/A conversion unit 24 via the mixer 27.
[0098]
  The decoding processing unit 23 performs the second reproduction process. In this second reproduction process, the internal buffer 23a is initialized (reset) before decoding the frame of the next packet that has not been lost.
[0099]
  Next, the decoding processing unit 23 starts decoding the received packets, but the first N frames are only decoded and not reproduced. The M frames following the N frames are subjected, in the decoding processing unit 23, to fade-in processing that gradually raises the volume of the decoded audio signal from 0.
[0100]
  A signal that has been subjected to the fade-in process in the decoding processing unit 24 and a signal that has been subjected to the fade-out process on the speech waveform generated by the erasure compensation processing unit are combined (cross-fade) and reproduced.
[0101]
  Further, the decoding processing unit 23 performs normal decoding on the frames that follow the N + M frames after the initialization (reset) of the internal buffer 23a, and the decoded audio signal is reproduced as usual.
[0102]
  Next, a specific example of the above processing will be described with reference to FIG.
[0103]
  In the specific example shown in FIG. 8, when the receiving device 2 knows that the packets with the sequence numbers 1 and 2 are lost, the receiving device 2 decodes the frame included in the packet with the sequence number 0 that has been normally received. The correlation data stored in the internal buffer 23 a of the subsequent decoding processing unit 23 is copied to the internal buffer 26 a of the erasure compensation processing unit 26.
[0104]
  In the erasure compensation processing unit 26, as the first reproduction process, the correlation data copied to the internal buffer 26a is used to pseudo-generate and interpolate the speech waveforms of frames 1′ and 2′, which are reproduced in place of the frames of the lost packets with sequence numbers 1 and 2, and of the subsequent frames 3′ to 6′. The interpolated frames 1′ to 4′ are decoded only. Frames 5′ and 6′ are then decoded and subjected to fade-out processing to generate a speech waveform, which is output to the D/A conversion unit 24 via the mixer 27.
[0105]
  In the decoding processing unit 23, the internal buffer 23a is initialized (reset) before decoding the frame of the received packet with the sequence number 3 as the second reproduction processing.
[0106]
  Next, the decoding processing unit 23 only decodes the frames of the received sequence numbers 3 and 4 and does not reproduce them. Thereby, the correlation data stored in the internal buffer 23a of the decoding processing unit 23 becomes normal.
[0107]
  In addition, the frames of the packets with sequence numbers 5 and 6 are subjected, in the decoding processing unit 23, to fade-in processing that gradually raises the volume of the decoded audio signal from 0, and are output to the D/A conversion unit 24 via the mixing unit 27.
[0108]
  As a result, the mixing unit 27 combines (cross-fades) the signal faded in by the decoding processing unit 23 with the signal faded out by the erasure compensation processing unit 26, and the combined signal is reproduced.
[0109]
  Further, the decoding processing unit 23 performs a normal decoding process on the frames of packets with sequence numbers of 7 and later. The decoded audio signal is reproduced as usual.
[0110]
  According to the fourth embodiment, even when a packet is lost in the communication network 3, it is possible to reproduce sound while reducing quality deterioration.
[0111]
  (Fifth embodiment)
  Next, a fifth embodiment of the present invention will be described. The apparatus configuration in the fifth embodiment is the same as that of the fourth embodiment described above.
[0112]
  The fifth embodiment describes, as shown in FIG. 9, the processing for the case where, after the packet with sequence number 0 is received, the following packets with sequence number 1 and later are delayed, and frames 1′ and 2′ are pseudo-generated and reproduced in their place.
[0113]
  The receiving device 2 can know in the receiving unit 21 and the packet analyzing unit 22 that the packet to be received is delayed. In the present embodiment, a case where a packet having a sequence number of 1 and thereafter is delayed will be described with reference to FIG. 9 as a specific example.
[0114]
  In response to the packet delay, the receiving apparatus 2 performs a first reproduction process in the erasure compensation processing unit 26 and a second reproduction process in the decoding processing unit 23. The output signal from the erasure compensation processing unit 26 and the output signal from the decoding processing unit 23 are mixed by the mixer 27 and input to the D/A conversion unit 24.
[0115]
  That is, when the receiving device 2 knows the delay of the packet, the correlation data stored in the internal buffer 23a of the decoding processing unit 23 after decoding the last frame included in the last packet received normally. Is copied to the internal buffer 26 a of the erasure compensation processing unit 26.
[0116]
  The erasure compensation processing unit 26 performs the first reproduction process. In this process, the correlation data copied to the internal buffer 26a is used to pseudo-generate and interpolate the voice waveform of the frames that should have arrived within the delay time, in place of the frames that could not be received because of the delay. The N frames following the interpolated frames are also pseudo-generated, subjected to fade-out processing, and output to the D/A conversion unit 24 via the mixing unit 27.
[0117]
  The decoding processing unit 23 performs the second reproduction process. In this process, decoding starts from the packets received after the delay. After the frames of the first N packets are decoded, fade-in processing that gradually raises the volume of the decoded audio signal from 0 is applied.
[0118]
  Further, the audio waveform subjected to the fade-in process in the decoding processing unit 23 and the audio waveform subjected to the fade-out process generated by the erasure compensation processing unit 26 are synthesized (cross-fade) by the mixing unit 27 and reproduced.
[0119]
  Further, the decoding processing unit 23 performs normal decoding on the frames of the packets with sequence number 3 and later, and the decoded audio signal is reproduced as usual.
[0120]
  Next, a specific example of the above processing will be described with reference to FIG.
[0121]
  In the specific example shown in FIG. 9, when the receiving apparatus 2 detects the delay of the packets with sequence number 1 and later, the correlation data stored in the internal buffer 23a of the decoding processing unit 23 after decoding the frame contained in the normally received packet with sequence number 0 is copied to the internal buffer 26a of the erasure compensation processing unit 26.
[0122]
  The erasure compensation processing unit 26 uses the correlation data stored in the internal buffer 26a to pseudo-generate the audio signals of frames 1′ and 2′, which are reproduced in place of the frames contained in the packets that should have arrived within the delay time.
[0123]
  Next, packets 1 and 2 arrive late.
[0124]
  In the decoding processing unit 23, the internal buffer 23a still holds the correlation data from immediately after decoding the frame contained in the packet with sequence number 0. If the frames contained in the packets with sequence numbers 1 and 2 are decoded using this correlation data, no mismatch arises between the correlation data stored in the internal buffer 13a of the encoding processing unit 13 and that stored in the internal buffer 23a of the decoding processing unit 23.
[0125]
  Furthermore, after decoding the frames of the packets with sequence numbers 1 and 2, the decoding processing unit 23 applies a fade-in process that gradually increases the volume of the decoded audio signal, and outputs the result to the D/A conversion unit 24 via the mixing unit 27.
[0126]
  Meanwhile, to avoid the degradation in audio quality that a discontinuity in the reproduced audio signal waveform would cause, the erasure compensation processing unit 26 generates N (here, N = 2) pseudo frames 3' and 4', applies the fade-out process to them, and outputs them to the D/A conversion unit 24 via the mixing unit 27.
[0127]
  As a result, the audio signal waveform obtained by decoding the frames included in the packets with sequence numbers 1 and 2, output from the decoding processing unit 23, and the pseudo frames 3' and 4', output from the erasure compensation processing unit 26, are mixed by the mixing unit 27 and thereby cross-faded, and the result is output to the D/A conversion unit 24.
[0128]
  Further, the decoding processing unit 23 performs the normal decoding process on the frames of the packets with sequence numbers 3 and later. The decoded audio signal is reproduced as usual.
[0129]
  According to the fifth embodiment, even when a packet is delayed in the communication network 3, it is possible to reproduce sound while reducing quality deterioration.
[0130]
  (Sixth embodiment)
  Next, a sixth embodiment of the present invention will be described with reference to FIG. 10. The apparatus configuration of the sixth embodiment is the same as that of the fourth and fifth embodiments described above.
[0131]
  In the sixth embodiment, as shown in FIG. 10, instead of the process of the fifth embodiment described above, the decoding processing unit 23 discards the packets with sequence numbers 1 and 2, decodes the frames included in the packets with sequence numbers 3 and 4, and applies the fade-in process to them. The resulting signal is mixed by the mixing unit 27 with the output of the erasure compensation processing unit 26, and the cross-faded result is output.
[0132]
  Also in the sixth embodiment, as in the fifth embodiment, even when a packet is delayed in the communication network 3, it is possible to reproduce sound while reducing quality degradation.
[0133]
  (Seventh embodiment)
  Next, a seventh embodiment of the present invention will be described.
[0134]
  FIG. 11 is a block diagram showing a receiving apparatus of a voice packet communication system according to the seventh embodiment of the present invention. In the figure, the same components as those of the fourth embodiment described above are denoted by the same reference numerals, and description thereof is omitted. The seventh embodiment differs from the fourth embodiment in that an internal buffer state holding unit 28 is provided in place of the erasure compensation processing unit 26 of the fourth embodiment, and the mixing unit 27 is removed.
[0135]
  With the above configuration as well, the same processing as in the fifth embodiment can be performed. That is, when the decoding processing unit 23 and the erasure compensation processing unit 26 are substantially identical, as in the case of G.729, the correlation data stored in the internal buffer 23a of the decoding processing unit 23 is temporarily saved in the internal buffer state holding unit 28, and when the delayed next frame is decoded, decoding is started using the saved correlation data. The same processing can thereby be carried out.
[0136]
  In this case, immediately after decoding the frame included in the packet with sequence number 0, the correlation data stored in the internal buffer 23a of the decoding processing unit 23 is copied to the internal buffer state holding unit 28. Pseudo frames 1' through 4' are then generated in the same manner as in the processing of the erasure compensation processing unit 26.
[0137]
  Next, when decoding the frame included in the packet with sequence number 1, which was received with a delay, the correlation data held in the internal buffer state holding unit 28 is written back to the internal buffer 23a of the decoding processing unit 23, and the decoding process is then performed.
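  The save-and-restore sequence of the seventh embodiment can be sketched as follows. The classes `ToyDecoder` and `StateHolder`, the dictionary used as correlation data, and the halving concealment rule are all hypothetical stand-ins chosen for illustration; they are not the G.729 algorithm.

```python
import copy


class ToyDecoder:
    """Toy decoder; the 'correlation' dict stands in for internal buffer 23a."""

    def __init__(self):
        self.correlation = {"last": 0}

    def decode(self, residual):
        self.correlation["last"] += residual
        return self.correlation["last"]

    def conceal(self):
        # Toy concealment: decay the last output. Note that this overwrites
        # the correlation data, which is why it must be saved beforehand.
        self.correlation["last"] //= 2
        return self.correlation["last"]


class StateHolder:
    """Plays the role of the internal buffer state holding unit 28."""

    def __init__(self):
        self._saved = None

    def save(self, decoder):
        self._saved = copy.deepcopy(decoder.correlation)

    def restore(self, decoder):
        decoder.correlation = copy.deepcopy(self._saved)


decoder = ToyDecoder()
holder = StateHolder()

decoder.decode(8)                                # frame 0 decoded normally
holder.save(decoder)                             # copy 23a into unit 28
pseudo = [decoder.conceal() for _ in range(4)]   # pseudo frames 1' to 4'
holder.restore(decoder)                          # write unit 28 back to 23a
late = decoder.decode(3)                         # delayed frame 1
```

  Here a single decoder generates the pseudo frames and decodes the late frame, so no separate erasure compensation unit or mixer is involved; the saved copy is what keeps the late frame decoding from the correct state.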
[0138]
  Also in the seventh embodiment, as in the fifth embodiment, even when a packet is delayed in the communication network 3, it is possible to reproduce sound with reduced quality degradation.
[0139]
  Each embodiment mentioned above is an example of the present invention, and the present invention is not limited only to the above-mentioned embodiment.
[0140]
  It goes without saying that the method of the present invention can also be applied when packets are received out of order, with preceding and succeeding packets interchanged, and the frames of these packets have to be decoded.
[0141]
[Effects of the invention]
  As described above, according to the real-time packet processing apparatus and method of the present invention, when the frames to be processed are not continuous during the decoding process by the decoding processing unit, one of two procedures is applied. In the first, the decoding processing unit performs only decoding on N frames from the discontinuous frame, and the audio reproduction processing is performed from the frame following the N frames without being performed on the decoded N frames. In the second, decoding is performed on the N frames from the discontinuous frame, and the audio reproduction processing is performed on the decoded N frames at reduced volume. In either case, the correlation data used in the decoding processing unit during decoding can be matched with the correlation data in the encoding processing unit on the transmitting apparatus side, which provides the excellent effect that appropriate audio reproduction processing can be performed from the frame following the N frames.
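  The control flow summarized above can be sketched as follows. The function `play_with_resync`, its parameters, and the use of a gain of 0.0 to represent frames that are decoded but not reproduced are assumptions made for illustration; sequence-number wrap-around is not handled in this sketch.

```python
def play_with_resync(packets, decode, n_resync, resync_gain=0.0):
    """Decode every frame, but suppress or attenuate N frames after a gap.

    packets:     list of (sequence_number, frame) in processing order
    decode:      callable returning a list of samples for a frame
    n_resync:    N, the number of frames used to re-align the correlation data
    resync_gain: 0.0 models "decode only"; a value between 0 and 1 models
                 reproduction at reduced volume
    """
    output = []
    expected = None
    remaining = 0
    for seq, frame in packets:
        if expected is not None and seq != expected:
            remaining = n_resync        # discontinuity detected
        samples = decode(frame)         # always decode so the state keeps updating
        if remaining > 0:
            gain = resync_gain
            remaining -= 1
        else:
            gain = 1.0
        output.append((seq, [s * gain for s in samples]))
        expected = seq + 1
    return output
```

  For example, with packets 0, 1, 4, 5, 6 and N = 2, frames 4 and 5 are decoded but emitted at `resync_gain`, and normal reproduction resumes at frame 6.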
[Brief description of the drawings]
FIG. 1 is a block diagram showing a functional configuration of a real-time packet processing apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram for explaining packetization of a voice signal by the voice packet transmitting apparatus according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining a real-time transfer protocol header used in the first embodiment of the present invention.
FIG. 4 is a timing chart illustrating processing of the real-time packet processing device according to the first embodiment of the present invention.
FIG. 5 is a timing chart for explaining processing of the real-time packet processing device according to the second embodiment of the present invention.
FIG. 6 is a timing chart for explaining processing of the real-time packet processing device according to the third embodiment of the present invention.
FIG. 7 is a block diagram showing a functional configuration of a real-time packet processing device according to a fourth embodiment of the present invention.
FIG. 8 is a timing chart illustrating processing of a real-time packet processing device according to the fourth embodiment of the present invention.
FIG. 9 is a timing chart illustrating processing of the real-time packet processing device according to the fifth embodiment of the present invention.
FIG. 10 is a timing chart illustrating processing of a real-time packet processing device according to a sixth embodiment of the present invention.
FIG. 11 is a block diagram showing a functional configuration of a receiving device of a real-time packet processing device according to a seventh embodiment of the present invention.
[Explanation of symbols]
  1 ... Transmitting apparatus, 2 ... Receiving apparatus, 3 ... Communication network, 11 ... Voice input unit, 12 ... Analog/digital (A/D) conversion unit, 13 ... Encoding processing unit, 13a ... Internal buffer, 14 ... Packet generation unit, 15 ... Transmitting unit, 21 ... Receiving unit, 22 ... Packet analysis unit, 23 ... Decoding processing unit, 23a ... Internal buffer, 24 ... Digital/analog (D/A) conversion unit, 25 ... Audio output unit, 26 ... Erasure compensation processing unit, 26a ... Internal buffer, 27 ... Mixing unit, 28 ... Internal buffer state holding unit.

Claims

A real-time packet processing apparatus using a predictive coding method, in which a transmitting device cuts a continuous input audio signal at every predetermined cycle, samples the cut signal at every predetermined sampling time shorter than the cycle, encodes the resulting plurality of sampling data by an encoding processing unit to form a frame, generates a packet including the frame for each frame, and sequentially transmits the packets, and in which a receiving device receives the packets via a communication network, decodes the frame included in each received packet by a decoding processing unit, and performs audio reproduction processing on the sampling data included in the decoded frame, the apparatus comprising:
a buffer for storing packets received from the transmitting device;
means for analyzing the encoded frame when decoding a frame included in a packet input from the buffer, and holding the analysis result as correlation data;
means for determining whether or not the frames to be processed are continuous when the decoding process is performed by the decoding processing unit;
means for, when the determination shows that the frames to be decoded are not continuous, causing the decoding processing unit to perform only decoding on N frames from the discontinuous frame, without performing the audio reproduction processing on the decoded N frames, and performing the audio reproduction processing from the frame following the N frames by decoding using the held correlation data;
means for, when the number of packets stored in the buffer exceeds a predetermined number, discarding, among a predetermined number of consecutive packets to be discarded that are stored in the buffer, the packets other than those of the last N frames;
means for performing only decoding on the last N frames and applying, to the frames of the M packets that follow them, a fade-in process that gradually increases the volume from a silent state; and
means for applying, to the frames of the M packets preceding the discarded packets, a fade-out process that gradually decreases the volume to a silent state.

A real-time packet processing apparatus using a predictive coding method, in which a transmitting device cuts a continuous input audio signal at every predetermined cycle, samples the cut signal at every predetermined sampling time shorter than the cycle, encodes the resulting plurality of sampling data by an encoding processing unit to form a frame, generates a packet including the frame for each frame, and sequentially transmits the packets, and in which a receiving device receives the packets via a communication network, decodes the frame included in each received packet by a decoding processing unit, and performs audio reproduction processing on the sampling data included in the decoded frame, the apparatus comprising:
a buffer for storing packets received from the transmitting device;
means for analyzing the encoded frame when decoding a frame included in a packet input from the buffer, and holding the analysis result as correlation data;
means for determining whether or not the frames to be processed are continuous when the decoding process is performed by the decoding processing unit;
means for, when the determination shows that the frames to be decoded are not continuous, causing the decoding processing unit to decode N frames from the discontinuous frame, performing the audio reproduction processing at reduced volume on the decoded N frames, and performing the audio reproduction processing from the frame following the N frames by decoding using the held correlation data;
means for, when the number of packets stored in the buffer exceeds a predetermined number, discarding, among a predetermined number of consecutive packets to be discarded that are stored in the buffer, the packets other than those of the last N frames;
means for performing only decoding on the last N frames and applying, to the frames of the M packets that follow them, a fade-in process that gradually increases the volume from a silent state; and
means for applying, to the frames of the M packets preceding the discarded packets, a fade-out process that gradually decreases the volume to a silent state.

The real-time packet processing apparatus according to claim 2, wherein the determining means includes means for determining, based on the sequence number included in the packet, that the frames have become discontinuous when the sequence number becomes discontinuous.

The real-time packet processing apparatus according to claim 1, further comprising a unit that reproduces audio by superimposing the frame subjected to the fade-out process and the frame subjected to the fade-in process.

A real-time packet processing method using a predictive coding method, in which a transmitting device cuts a continuous input audio signal at every predetermined cycle, samples the cut signal at every predetermined sampling time shorter than the cycle, encodes the resulting plurality of sampling data by an encoding processing unit to form a frame, generates a packet including the frame for each frame, and sequentially transmits the packets, and in which a receiving device receives the packets via a communication network, decodes the frame included in each received packet by a decoding processing unit, and performs audio reproduction processing on the sampling data included in the decoded frame, wherein
the receiving device stores the packets received from the transmitting device in a buffer; analyzes the encoded frame when decoding a frame included in a packet input from the buffer, and holds the analysis result as correlation data; when the frames to be processed are not continuous during the decoding process by the decoding processing unit, causes the decoding processing unit to perform only decoding on N frames from the discontinuous frame, does not perform the audio reproduction processing on the decoded N frames, and performs the audio reproduction processing from the frame following the N frames by decoding using the held correlation data; when the number of packets stored in the buffer exceeds a predetermined number, discards, among a predetermined number of consecutive packets to be discarded that are stored in the buffer, the packets other than those of the last N frames; performs only decoding on the last N frames and applies, to the frames of the M packets that follow them, a fade-in process that gradually increases the volume from a silent state; and applies, to the frames of the M packets preceding the discarded packets, a fade-out process that gradually decreases the volume to silence.

A real-time packet processing method using a predictive coding method, in which a transmitting device cuts a continuous input audio signal at every predetermined cycle, samples the cut signal at every predetermined sampling time shorter than the cycle, encodes the resulting plurality of sampling data by an encoding processing unit to form a frame, generates a packet including the frame for each frame, and sequentially transmits the packets, and in which a receiving device receives the packets via a communication network, decodes the frame included in each received packet by a decoding processing unit, and performs audio reproduction processing on the sampling data included in the decoded frame, wherein
the receiving device stores the packets received from the transmitting device in a buffer; analyzes the encoded frame when decoding a frame included in a packet input from the buffer, and holds the analysis result as correlation data; when the frames to be processed are not continuous during the decoding process by the decoding processing unit, causes the decoding processing unit to decode N frames from the discontinuous frame, performs the audio reproduction processing at reduced volume on the decoded N frames, and performs the audio reproduction processing from the frame following the N frames by decoding using the held correlation data; when the number of packets stored in the buffer exceeds a predetermined number, discards, among a predetermined number of consecutive packets to be discarded that are stored in the buffer, the packets other than those of the last N frames; performs only decoding on the last N frames and applies, to the frames of the M packets that follow them, a fade-in process that gradually increases the volume from a silent state; and applies, to the frames of the M packets preceding the discarded packets, a fade-out process that gradually decreases the volume to silence.

The real-time packet processing method according to claim 5 or claim 6, wherein the receiving device determines, based on the sequence number included in the packet, that the frames have become discontinuous when the sequence number becomes discontinuous.

The real-time packet processing method according to claim 5, wherein the receiving device reproduces audio by superimposing the frame subjected to the fade-out process and the frame subjected to the fade-in process.