JPWO2002095731A1 - Audio signal processing device

Info

Publication number: JPWO2002095731A1
Application number: JP2002592111A
Authority: JP
Inventors: 金山　靖隆; 靖隆金山; 佐藤　輝幸; 輝幸佐藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2001-05-22
Filing date: 2001-05-22
Publication date: 2004-09-09
Anticipated expiration: 2021-05-22
Also published as: JP4426186B2; WO2002095731A1

Abstract

リニアＰＣＭデータから得られる音声波形について一周期程度にわたって、自己相関値を求め、該自己相関値に基づいて音声波形のピッチ周期を抽出する。次に、目的の音声波形のサンプル値の近傍について、目的の音声波形のサンプル値の一周期前からの音声波形を予測波形として、実際の音声波形と予測波形との相関値を求め、予測波形と実際の音声波形の相関の大きさから実際の音声波形の不連続点の存在を検出する。不連続点が検出された場合には、予測波形と実際の音声波形との補間演算により、不連続点近傍では予測波形に似ており、次第に実際の波形に近づく補正音声波形を形成する。
An autocorrelation value is computed over roughly one period of the audio waveform obtained from linear PCM data, and the pitch period of the waveform is extracted from that autocorrelation value. Next, in the neighborhood of the target sample of the audio waveform, the waveform starting one period before the target sample is taken as the predicted waveform, a correlation value between the actual waveform and the predicted waveform is computed, and the presence of a discontinuity in the actual waveform is detected from the magnitude of that correlation. When a discontinuity is detected, interpolation between the predicted waveform and the actual waveform forms a corrected audio waveform that resembles the predicted waveform near the discontinuity and gradually approaches the actual waveform.

Description

技術分野
本発明は、通信ネットワークや端末機におけるリニアＰＣＭ音声データなどのデジタル音声データを対象とした音声信号処理装置に関する。
背景技術
今日の情報通信社会において、様々な情報がネットワークを通してやりとりされているが、一昔前に比べるとその扱っているデータは非常に大きくなっており、そしてまた多様化している。今後もこの傾向は続くと思われる。
ネットワークはそのような増え続ける情報量に対応しなければならないが、最近では、そのためのキーワードとして「ブロードバンド化」、「ＩＰ化」などの言葉を良く目にする。
「ブロードバンド化」とは、通信経路の伝送能力を高くし、巨大なデータを速やかに伝送できるようにすることであり、「ＩＰ化」とはデータをＩＰパケット単位で送るというものである。パケット交換は回線を占有するわけではないため、データの量に応じた従量課金となり、巨大なデータを扱う今日において非常に重要な方式と言える。
ところで、音声はというと、現在のところ音声情報は回線交換方式で伝送されており、回線を占有している時間に応じた時間課金となっている。回線を占有するわけなので、その間の品質についてはかなり高いものが要求され、また、実際品質はある程度高いと言える。
しかし、時代の流れの中で、音声もＩＰパケットにより伝送することが検討されており、近い将来「ＶｏＩＰ」とよばれるサービスが始まると考えられる。つまり、音声データも他のデータと同様にパケット交換によって情報のやりとりを行うことになる。
その際音声データは、音声以外のデータのサイズと比較して非常に小さいため、伝送フォーマットは特に圧縮は行わず、現在のＡＴＭ網で使われているＧ．７１１ＰＣＭフォーマットとなることが予想される。
しかし、ＩＰパケット交換はエラーが発生してもパケットの再送が可能なデータなどに適した伝送方法であり、音声データのように再送のきかないリアルタイム系の情報においてはある程度の品質劣化が起こると考えられる。
このような品質劣化に起因する音声波形の不連続点があると聴覚上大きな品質劣化が起こることは良く知られているが、音声波形の不連続というのはいろいろな原因で起こるものである。
例えば、最近の移動体通信で使われている音声コーデックはＣＥＬＰ方式などが主であるが、この方式の場合リニアＰＣＭデータをフレーム単位で処理する。フレームからはスペクトル包絡情報や音源情報などのパラメータを抽出し、高い圧縮率での符号化を可能にしている。しかし、フレーム単位で符号化されたデータを復号する場合、フレームとフレームの境目には不連続点が生じやすい。このような不連続点が生じないようにするために、代表的なパラメータ（ピッチ周期など）を用い、重み付けを用いてフレームの境目付近で音声波形の補間を行っている。
他にも、聴覚上の音質を改善するためにフィルタ処理を行う方法などが知られている。また、無線区間における符号化データフレーム（パケット）の消失やデータエラーなどによっても不連続点は生じる。そのときはエラーが起こったことを外部チェックにより伝えてもらい、音声データのレベルを落とす処理などを行うことで聴覚上の品質劣化を抑えている。このような手法の例としては、特開平７−１０５６３７や特開平６−３２６６２２がある。
上記のようにフレームの境目で補間を行う場合やデータのエラーが生じた場合などは不連続が起こった場所、あるいは不連続が起こる可能性がある場所が予め分かった上での処理であり、主に音声符号化や復号とあわせて行われるものである。しかし、ＰＣＭデータをパケット単位で伝送するＡＴＭ網やＩＰ網において、パケットが消失したり原因不明のビットエラーが起こった場合、そこで生じた不連続点はどこらかもチェックを受けることなく品質劣化の原因を抱えたまま伝送されることになる。
特に、ＩＰ網ではパケットの伝送ルートが可変であるため、ルーティングの状態によっては時間的に後に発行されたパケットが先に発行されたパケットを追い越してしまう状況も考えられ、その場合にも不連続点は生じる。
図１は、ＩＰ網におけるパケットルーティングの様子を示した図である。
同図では、３つのパケットがそれぞれ順に送信された場合を示している。第１のパケットの後に、第２、第３のパケットが順次送信されても、第２のパケットは、ＶｏＩＰルータ２を通ってからＶｏＩＰルータ１に送信されている。一方、第３のパケットは、直接ＶｏＩＰルータ１に送信されているため、後から送信された第３のパケットが、第２のパケットを追い越して、送信先に到着することになる。
また、ＩＭＴ−２０００向けの移動体通信網においては、端末機同士での接続の場合にＴＦＯ（ＴａｎｄｅｍＦｒｅｅＯｐｅｒａｔｉｏｎ）と呼ばれる方式を使うことが検討されている。この方式はタンデム接続による品質の劣化を回避する目的があるが、タンデム接続からＴＦＯへの移行、あるいはその逆が行われる時、方式的に不連続点が生じる可能性がある。しかし、それをチェックし、補正する技術はない。
発明の開示
本発明の課題では、不特定に生じる音声波形の不連続点をデジタル音声データを調べることにより検出し、不連続点に起因する品質劣化を補償する音声信号処理装置、特には、リニアＰＣＭデータをチェックし、不連続点を検出し、不連続点を判定された部分には即座に補正をかけ、聴覚上の品質劣化を回避することの出来る音声信号処理装置を提供することである。
本発明の音声信号処理装置は、通信ネットワークにおけるデジタル音声データの処理を行う音声信号処理装置において、入力波形の周期を検出し、該周期から受信する波形を予測する波形予測手段と、該予測された波形と実際に受信された波形との相関値から波形の不連続点を検出する不連続点検出手段と、該不連続点が検出された場合に、該予測された波形と該実際に受信された波形とを用いて不連続点のない補正波形を生成する補正波形生成手段とを備えることを特徴とする。
本発明によれば、受信した波形を直接調べて、不連続点の有無を検出するので、予測できないような原因で不連続点が生じても、不連続点を見つけて、これを補正した波形を生成することが出来る。従って、フレームのつなぎ目など、システムの構成から予測される位置に不連続点が生じた場合のみではなく、波形の任意の位置に発生した不連続点による音声品質の劣化を補償することが出来る。
これにより、本発明では、パケット交換方式による通信ネットワークを介して音声を送受信しても品質の良い音声通信を提供することが出来る。
発明を実施するための最良の形態
本発明では、過去の入力データから周期を求める手段と、求められた周期から未来の音声波形を予測する手段と、予測波形と実際の波形を比較し、補正が必要であるかどうかを判断する手段と、補正が必要な不連続点に対して、重み付けなどの手段を用いて波形を補正する手段を具備する。
図２〜５は、本発明の実施形態の原理を説明する図である。
音声波形を観察すると、有音部分においてはある一定の周期をもって類似した波形が連続して現れることが知られている。これはピッチと呼ばれるものであり、音声を高圧縮する際のパラメータの１つとして、最近の音声符号化方式でも使われている重要なパラメータである。本発明の実施形態では、目的とする音声波形の補正に、このピッチ周期を利用する。図２は、音声波形の例を示しており、ｋがピッチ周期に相当する。
ピッチ周期は自己相関係数の計算などの方法を使うことで抽出が可能である。自己相関がある程度高い数値を出している場合、ピッチ周期を用いることにより未来の波形（期待する波形）をある程度の誤差の範囲で予測することが可能である。図２で言えば、ピッチ周期がｋと求められている場合、ｋサンプル前のリニアＰＣＭデータの値を現在の値として用いることで予測波形を求めることが出来る。
通常の音声波形では図２のようにきちんとピッチ周期が現れている場合は、実際の波形が予測波形から大きく外れることはあまりない。しかし、図３に示すように、予測波形に対し実際の波形が著しく異なる場合、それは音声波形上の不連続点となり、聴覚上大きく品質を落とす可能性を含むことになる。そのため本実施形態では、毎サンプルで実際の波形と予測波形を比較して不連続点を検出し、不連続点周辺を予測波形を用いて補間する。
実際の波形と予測波形との比較方法として、局所的な相関係数の計算などが挙げられる。図３の点ａ_０の近傍を拡大したのが、図４である。図４においてａ_０の近傍ａ_２、．．．．、ａ_２とｂ_０の近傍ｂ_２、．．．．．、ｂ_２について局所的な相関を求めることで著しく波形が乱れたかどうかをチェックする事が出来る。
不連続点と判断されたサンプルについては補正がかけられるが、補正方法としては重み付けを用いる方法などがあるが、特に、本実施形態においては、重み付けの方法を使用することを説明する。図５は重み付けにより予測波形から実際の波形へと徐々に補正されていく様子を示している。すなわち、実線でしめされる実際の波形に不連続点ｄがある場合、破線で示される予測波形との補間を行う。補間の仕方は、太線で示される補正波形が、不連続点ｄに近いところでは、予測波形に近い形状となり、徐々に、実際の波形に近づいていくようにする。
図６は、本発明の実施形態の音声信号処理装置の処理ブロック図である。また、図７は、本発明の実施形態の音声信号処理装置の全体の処理フローを示す図である。
本発明の実施形態を図６と図７を用いて説明する。
リニアＰＣＭデータのサンプル列をａ（−ｉ）、・・・、ａ（０）、・・・、ａ（４）とする。ａ（０）が補正すべきかどうかが判断されるサンプルであり、ａ（−１）がそのひとつ過去のサンプルである。また、ａ（１）、・・・、ａ（４）はａ（０）の後のサンプル値である。本実施形態では、補正すべきサンプル値ａ（０）より時間的に４サンプル後のサンプル値まで必要となるので、実際の処理においては、補正すべきサンプル値の４サンプル後の値まで読み込んでから処理を行う。
まず、ａ（０）のサンプルが含まれている部分の波形が周期性を持っているかどうかを調べるために図６の周期検出部１０ではａ（０）の前の数十サンプル（ここでは４０サンプル）でセグメントを形成し、以下の計算を行う。
なお、ａ（０）の前の数十サンプルは、記憶部１４に、入力からの過去のリニアＰＣＭデータが記憶されており、ここから、データを周期検出部１０に読み込むようにする。また、周期性の検出のために必要なサンプル数は、ここでは、４０サンプルとしているが、実際には、音声データのピッチの一周期を周期検出に使用できるようにサンプル数を決定すべきである。通常、音声データのピッチ周期の検出には、４０サンプル程度有れば十分である。サンプリング周波数が異なる場合などにおいては、その周波数に応じて適当なサンプル数を使用するようにする。

この計算でＳが最大となる時のｋの値とＳの値を求める。ただし、逆位相の波形やパワーが小さい波形が補正に影響を及ぼさないようにするため、分子は括弧の中が正であり、かつ、分母で掛け合わされている２つの項がそれぞれある閾値を超えている場合のみを対象とする。すなわち、分子は、２乗されているために、常に正の数であるが、分子の括弧の中の式は、波形が一致している場合に正の大きい値を示し、波形が逆位相となっている場合には、負であって絶対値の大きな値を示す。従って、波形が逆位相となっている場合には、波形の一致が見られないにも関わらず、上記Ｓが大きな値となってしまうので、これを取り除くため、分子の括弧の中が正の場合に限定する。また、分母の各項の大きさが所定の閾値以上とするのは、音声のパワーが小さい場合を取り除く意味である。分母の各項は、音声のパワーを計算する式となっており、これらの値を所定値以上とすることによって、パワーの小さい音声波形を除去することが出来る。パワーの小さい音声波形を取り除くのは、パワーが小さい音声波形の場合、雑音の影響を受ける可能性が高く、実際の音声波形は、過去の波形と異なるのに、雑音の影響で、上記式で計算した結果、偶然に波形が一致すると判断されてしまう場合を避けるためである。なお、上記閾値は、実験的に本実施形態を利用しようとする各当業者によって適宜決定されるべきものである。
次に、Ｓがある閾値を超えているかどうかを判断する。超えていた場合は周期的な波形となっていると判断され、周期であるｋの値を決定し、図６の予測部１１へ送る。超えていない場合は周期的でないと判断され、予測部１１や判定部１２、補正部１３の処理は行わない。なお、Ｓの判断のための閾値も、実験などを行って、当業者によって適宜設定されるべきものである。
予測部１１ではａ（０）の近傍がａ（−ｋ）の近傍のようになっていると予測する。ここでは、ａ（０）の近傍をａ（−２）、・・・、ａ（４）、ａ（−ｋ）の近傍をａ（−ｋ−２）、・・・、ａ（−ｋ＋４）とする。予測部１１は予測波形を比較判定部１２に送る。ここで、予測波形は、ａ（０）の近傍と同様になっていると判断されたａ（−ｋ）の近傍のａ（−ｋ−２）〜ａ（−ｋ＋４）のサンプルからなる波形である。そして、予測波形（ａ（−ｋ）の近傍）と実際の波形（ａ（０）の近傍）について短区間で以下の計算を行う。なお、ここでの計算は、ａ（０）とａ（−ｋ）のそれぞれの近傍の７サンプルについて行っている。これは、音声波形の一周期よりは十分小さいが、１サンプル毎の雑音的な変化を平均化できる程度に大きい近傍を選択して計算するものである。すなわち、あまり計算するサンプル数が大きすぎると、波形の局所的な不連続を検出することができなくなり、あまりサンプル数が小さすぎると、雑音的なサンプル値の変化でも波形の不連続点と判断してしまうなど、サンプル値の変化に対して敏感になりすぎてしまうので、７サンプル程度がちょうど良いと考えたものである。しかし、本実施形態では、このサンプル数は、必ずしも７サンプルに限定するものではなく、実験などによって当業者が適宜定めるべきものである。

次に、このＴが、ある閾値を超えているかどうかを判断し、超えていない場合はその点で著しく波形が乱れたと判断し、補正部１３に対して比較判定部１２から補正指示が出される。ただし、この場合も分母でかけ合わされている２つの項がある閾値よりも小さい場合は除くようにする。分子の括弧の中が負の場合は−Ｔとする。ここでも、分母の各項が所定閾値よりも大きい場合のみを使用することにより、音声パワーが小さい場合を取り除き、また、分子の括弧の中が負の場合は、−Ｔとして、Ｔの値が負になるようにして、閾値よりも大きくならないようにしている。すなわち、分子の括弧の中が負の場合、つまり、波形が逆位相になっている場合を排除する意味である。また、上述の各閾値は、やはり、実験などにより当業者が適宜決定すべきものである。
補正指示を受けた補正部１３では以下に示すような重み付けにより補間を行い、ｓ（補正後の音声データサンプル値）を出力する。一度補正指示がでたらｎサンプル（補正後の波形が十分滑らかに実際の波形にほぼ一致するようになるように：このｎの値も当業者によって適宜設定されるべきものである）について補正を行い、その間は周期検出、予測、比較判定の機能は停止する。

ここで、ｏｆｆｓｅｔとは補正指示がでたときのａ（−１）−ａ（−ｋ−１）の値であり、補正を行う時に１周期（ｋサンプル）前の値（予測波形）と補正後の波形を滑らかにつなぐために必要な量である。
補正指示がでていない場合は、

となる。
補正部の処理が終わった後、記憶部１４はａ（４）→ａ（３）、ａ（３）→ａ（２）、ａ（ｉ）→ａ（ｉ−１）という具合に値を更新する。なお、ｓ→ａ（−１）とし、補正結果を記憶部１４に記憶される過去の波形データに反映させる。
なお、図６の構成においては、入力からはリニアＰＣＭデータの１サンプルデータが順次入力され、最新のサンプル値は、直接比較判定部１２及び補正部１３に入力される。記憶部１４からは、最新のサンプル値以前の過去のサンプル値が所定数（例えば、４０サンプル程度）出力される。例えば、上述の例で言えば、ａ（４）は、入力から直接比較判定部１２、補正部１３に入力されるが、ａ（３）〜ａ（−４０）は、記憶部１４からそれぞれの部に入力される。
図７は、本発明の実施形態の全体の処理を示すフローチャートである。
まず、ステップＳ１において、自己相関係数を計算する。ここでの計算は、上述の説明におけるＳの算出にあたる。そして、ステップＳ２において、周期性があるか否かを判断する。この周期性の判断は、前述の通り、Ｓの値が所定閾値よりも大きいか否かを判断することにより行い、周期ｋを決定する。ｋとは、音声波形の１周期の長さをサンプル数で示したものである。周期性が無いと判断された場合には、ステップＳ７に進む。この場合、ステップＳ７では、ｓ＝ａ（０）となり、何ら補正をせずに音声波形のサンプル値を出力する。そして、ステップＳ８において、新しいサンプル値１つを記憶部１４に格納すると共に、一番古いサンプル値を１つ破棄する。
ステップＳ２において、周期性があると判断された場合には、ステップＳ３において、波形予測、すなわち、一周期前の過去の波形を予測波形として取得し、ステップＳ４において、現在の波形と予測波形とを比較する。このステップＳ４における演算は、前述のＴを算出することであり、目的のサンプル値の近傍の少ないサンプル数について、現在の波形と予測波形の相関値を求め、その相関値が所定閾値より大きいか否かを判断することであるが、ステップＳ４の処理を「比較」と称している。従って、ステップＳ４の「比較」を行うことによって、現在の波形に不連続点があるか否かが判断される。
そして、ステップＳ４の比較の結果、現在の音声波形に不連続点があるか否かに従って、ステップＳ５において、波形の補正が必要か否かを判断する。音声波形に不連続点が無い場合には、補正が必要ないとして、ステップＳ７、Ｓ８に進み、ステップＳ２において、周期性がない場合と同様の処理を行う。
ステップＳ５において、補正が必要と判断された場合には、ステップＳ６において、前述の重み付け演算により、音声波形のサンプル値の補正を行い、これをステップＳ７において出力し、ステップＳ８において、補正後のサンプル値を記憶部１４に格納すると共に、最も古いサンプル値を破棄する。
図８は、本発明の実施形態に従った音声信号処理装置の適用部分とネットワークを説明する図である。
公衆回線網２２は、ネットワーク２０を介して移動体網２３に接続される。なお、移動体網２３は、別の公衆回線網であってもよいし、公衆回線網２２が別の移動体網であってもよい。ネットワーク２０は、インターネットなどＩＰパケット交換方式によるネットワークなどである。この場合、ネットワーク２０を介して音声を送受するために、ＶｏＩＰという方式が採用される。ネットワーク２０と公衆回線網２２との境界装置としてゲートウェイ２１が設けられる。また、同様に、移動体網２３とネットワーク２０の境界装置としてゲートウェイ２１が設けられる。
本発明の実施形態に従った音声信号処理装置は、これら境界装置としてのゲートウェイ２１に搭載される。すなわち、例えば、公衆回線網２２からゲートウェイ２１に入力された音声信号は、リニアＰＣＭデータに変換された後、本発明の実施形態の音声信号処理を施され、ネットワーク２０にＶｏＩＰのフォーマットで送信される。ネットワーク２０に送出された音声データを受信したゲートウェイ２１は、受信した音声信号をリニアＰＣＭデータに変換し、やはり、本発明の実施形態の音声信号処理を施し、移動体網２３に送出する。
移動体網２３から公衆回線網２２に音声信号を送信する場合も同様である。
また、上記説明では、本発明の実施形態の音声信号処理装置の適用箇所としてゲートウェイを挙げたが、実際には、これには限定されない。すなわち、移動体網２３の携帯端末などの移動機内において、受信した音声を再生する場合にも適用可能であるし、移動体網２３の基地局、あるいは、公衆回線網２２の電話機内に設けて、リニアＰＣＭデータの状態にした音声信号に本発明の実施形態の音声信号処理を行うことも可能である。
産業上の利用可能性
以上本発明によれば、音声波形における不連続点の生じる原因によらず、聴覚上の品質劣化を抑えることが出来る。また、大きな遅延を伴わずに処理を行うことが出来る。
【図面の簡単な説明】
図１は、ＩＰ網におけるパケットルーティングの様子を示した図である。
図２は、本発明の実施形態の原理を説明する図（その１）である。
図３は、本発明の実施形態の原理を説明する図（その２）である。
図４は、本発明の実施形態の原理を説明する図（その３）である。
図５は、本発明の実施形態の原理を説明する図（その４）である。
図６は、本発明の実施形態の音声信号処理装置の処理ブロック図である。
図７は、本発明の実施形態の音声信号処理装置の全体の処理フローを示す図である。
図８は、本発明の実施形態に従った音声信号処理装置の適用部分とネットワークを説明する図である。
TECHNICAL FIELD
The present invention relates to an audio signal processing device for digital audio data, such as linear PCM audio data, in a communication network or a terminal.
BACKGROUND ART
In today's information and communication society, many kinds of information are exchanged over networks, and the data handled has become far larger and more diverse than it was some years ago. This trend is expected to continue.
Networks must cope with this ever-increasing amount of information, and terms such as "broadband" and "IP-based" have recently become common keywords for doing so.
"Going broadband" means raising the transmission capacity of communication paths so that huge amounts of data can be transmitted quickly, and "going IP" means sending data in units of IP packets. Because packet switching does not occupy a line, charging is based on the amount of data, which makes it a very important method now that huge amounts of data are handled.
Voice information, by contrast, is currently transmitted by circuit switching and is charged according to the time for which the line is occupied. Because the line is occupied, fairly high quality is required during that time, and in practice the quality is reasonably high.
However, transmitting voice in IP packets is also under study, and a service called "VoIP" is expected to start in the near future. Voice data will then be exchanged by packet switching in the same way as other data.
Because voice data is very small compared with non-voice data, the transmission format is expected to be the G.711 PCM format used in current ATM networks, without any particular compression.
However, IP packet switching is a transmission method suited to data for which packets can be retransmitted when an error occurs. For real-time information that cannot be retransmitted, such as voice data, some degree of quality degradation is expected.
It is well known that a discontinuity in the audio waveform resulting from such degradation causes a large perceptual loss of quality, but waveform discontinuities arise from a variety of causes.
For example, the speech codecs used in recent mobile communication are mainly of the CELP type, which processes linear PCM data frame by frame. Parameters such as spectral envelope information and excitation information are extracted from each frame, which enables encoding at a high compression ratio. When data encoded frame by frame is decoded, however, discontinuities tend to occur at the boundaries between frames. To prevent this, the audio waveform is interpolated near the frame boundaries using representative parameters (such as the pitch period) and weighting.
Methods that apply filtering to improve perceived sound quality are also known. Discontinuities are also caused by the loss of encoded data frames (packets) or by data errors in the wireless section. In that case the occurrence of the error is reported by an external check, and perceptual quality degradation is suppressed by processing such as lowering the level of the audio data. Examples of such methods include JP-A-7-105637 and JP-A-6-326622.
Interpolation at frame boundaries and the handling of data errors described above both assume that the place where a discontinuity has occurred, or may occur, is known in advance, and they are mainly performed together with speech encoding and decoding. However, in an ATM or IP network that transmits PCM data in packets, when a packet is lost or a bit error of unknown cause occurs, the resulting discontinuity is not checked anywhere, and the data is transmitted onward still carrying the cause of quality degradation.
In particular, because the route taken by a packet is variable in an IP network, a packet sent later may overtake a packet sent earlier depending on the routing state, and a discontinuity arises in that case as well.
FIG. 1 is a diagram showing a state of packet routing in an IP network.
The figure shows a case where three packets are transmitted in order. Even if the second packet and the third packet are sequentially transmitted after the first packet, the second packet is transmitted to the VoIP router 1 after passing through the VoIP router 2. On the other hand, since the third packet is transmitted directly to the VoIP router 1, the third packet transmitted later overtakes the second packet and arrives at the destination.
In mobile communication networks for IMT-2000, the use of a scheme called TFO (Tandem Free Operation) is being studied for connections between terminals. The purpose of this scheme is to avoid the quality degradation caused by tandem connections, but when a connection switches from tandem operation to TFO, or the reverse, a discontinuity may arise as a consequence of the scheme itself. There is, however, no technique for checking for it and correcting it.
DISCLOSURE OF THE INVENTION
An object of the present invention is to provide an audio signal processing device that detects discontinuities occurring at unpredictable positions in an audio waveform by examining the digital audio data, and that compensates for the quality degradation they cause; in particular, a device that checks linear PCM data, detects discontinuities, immediately corrects the portion judged to be a discontinuity, and thereby avoids perceptual quality degradation.
The audio signal processing device of the present invention processes digital audio data in a communication network and comprises: waveform prediction means for detecting the period of an input waveform and predicting the waveform to be received from that period; discontinuity detection means for detecting a discontinuity in the waveform from a correlation value between the predicted waveform and the actually received waveform; and corrected waveform generation means for generating, when a discontinuity is detected, a corrected waveform free of discontinuities using the predicted waveform and the actually received waveform.
According to the present invention, the received waveform is examined directly to detect whether a discontinuity is present, so even a discontinuity with an unpredictable cause can be found and a corrected waveform generated. Degradation of voice quality can therefore be compensated not only when a discontinuity occurs at a position that can be predicted from the system configuration, such as a frame boundary, but also when it occurs at an arbitrary position in the waveform.
Thus, according to the present invention, it is possible to provide high-quality voice communication even when voice is transmitted and received via a packet-switched communication network.
BEST MODE FOR CARRYING OUT THE INVENTION
The present invention comprises means for obtaining a period from past input data, means for predicting the future audio waveform from the obtained period, means for comparing the predicted waveform with the actual waveform and judging whether correction is necessary, and means for correcting the waveform at a discontinuity that needs correction, using weighting or similar means.
FIGS. 2 to 5 are diagrams for explaining the principle of the embodiment of the present invention.
Observation of speech waveforms shows that, in voiced portions, similar waveforms recur continuously with a certain period. This period is called the pitch, and it is an important parameter that recent speech coding schemes also use as one of the parameters for high compression of speech. The embodiment of the present invention uses this pitch period to correct the target audio waveform. FIG. 2 shows an example of a speech waveform, where k corresponds to the pitch period.
The pitch period can be extracted by methods such as calculating an autocorrelation coefficient. When the autocorrelation is reasonably high, the pitch period can be used to predict the future waveform (the expected waveform) within a certain margin of error. In FIG. 2, if the pitch period has been found to be k, the predicted waveform is obtained by using the linear PCM value from k samples earlier as the current value.
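As a minimal sketch of this prediction step, the Python function below repeats the sample found k positions earlier. The name `predict_from_pitch` and its arguments are illustrative assumptions, not names taken from the patent.

```python
from typing import List, Sequence


def predict_from_pitch(history: Sequence[float], k: int, length: int) -> List[float]:
    """Predict `length` upcoming samples by repeating the last pitch period.

    history[-1] is the most recent known sample; k is the pitch period in samples.
    """
    if k <= 0 or k > len(history):
        raise ValueError("pitch period k must satisfy 0 < k <= len(history)")
    buf = list(history)
    predicted = []
    for _ in range(length):
        value = buf[-k]          # sample exactly k positions before the one being predicted
        predicted.append(value)
        buf.append(value)        # lets the prediction continue past one full period
    return predicted
```

For a perfectly periodic history the prediction simply continues the pattern; for real speech it is only an estimate, which is why the later steps compare it against the samples actually received.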
In the case of a normal voice waveform, when the pitch period appears properly as shown in FIG. 2, the actual waveform does not largely deviate from the predicted waveform. However, as shown in FIG. 3, if the actual waveform is significantly different from the predicted waveform, it will be a discontinuity on the audio waveform, which will have the potential to impair audio quality. Therefore, in this embodiment, the discontinuous point is detected by comparing the actual waveform and the predicted waveform at each sample, and the vicinity of the discontinuous point is interpolated using the predicted waveform.
One way to compare the actual waveform with the predicted waveform is to calculate a local correlation coefficient. FIG. 4 is an enlarged view of the vicinity of point a_0 in FIG. 3. In FIG. 4, computing a local correlation between the samples in the neighborhood of a_0 (up to a_2) and the samples in the neighborhood of b_0 (up to b_2) makes it possible to check whether the waveform has been significantly disturbed.
A sample judged to be a discontinuity is corrected. Weighting is one possible correction method, and it is the method described for the present embodiment. FIG. 5 shows how weighting moves the corrected waveform gradually from the predicted waveform to the actual waveform. That is, when the actual waveform, shown by the solid line, has a discontinuity at point d, it is interpolated with the predicted waveform, shown by the broken line. The interpolation is arranged so that the corrected waveform, shown by the bold line, is close in shape to the predicted waveform near the discontinuity d and gradually approaches the actual waveform.
FIG. 6 is a processing block diagram of the audio signal processing device according to the embodiment of the present invention. FIG. 7 is a diagram showing an overall processing flow of the audio signal processing device according to the embodiment of the present invention.
An embodiment of the present invention will be described with reference to FIGS. 6 and 7.
Let the sample sequence of the linear PCM data be a(-i), ..., a(0), ..., a(4). a(0) is the sample for which it is determined whether correction is needed, and a(-1) is the sample immediately before it. a(1), ..., a(4) are the samples that follow a(0). Because this embodiment needs samples up to four samples later than the sample a(0) to be corrected, actual processing starts only after the samples up to four after the one to be corrected have been read.
First, to check whether the part of the waveform containing a(0) is periodic, the period detection unit 10 of FIG. 6 forms a segment from the several tens of samples (40 here) preceding a(0) and performs the following calculation.
The several tens of samples preceding a(0) come from the storage unit 14, which stores the past linear PCM data from the input, and are read from there into the period detection unit 10. The number of samples used to detect periodicity is 40 here, but in practice it should be chosen so that one pitch period of the audio data is available for period detection. About 40 samples is normally sufficient for detecting the pitch period of audio data. When the sampling frequency differs, a number of samples appropriate to that frequency is used.

From this calculation, the value of k at which S is largest, and that value of S, are obtained. However, so that waveforms in opposite phase and waveforms with low power do not affect the correction, only those cases are considered in which the expression inside the parentheses of the numerator is positive and each of the two terms multiplied together in the denominator exceeds a certain threshold. The numerator as a whole is never negative because it is squared, but the expression inside its parentheses takes a large positive value when the waveforms match and a negative value of large magnitude when they are in opposite phase. In the opposite-phase case S would therefore be large even though the waveforms do not match, and restricting the calculation to cases where the parenthesized expression is positive removes this case. Requiring each term of the denominator to be at or above a predetermined threshold removes cases where the audio power is low: each denominator term computes the power of the audio, so thresholding these terms excludes low-power waveforms. Low-power waveforms are excluded because they are easily affected by noise, and the above formula could then indicate by chance that the waveforms match even though the actual waveform differs from the past waveform. These thresholds should be determined experimentally by those skilled in the art who use this embodiment.
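The formula for S is not reproduced in this text. A form consistent with the description above (a squared sum in the numerator, and a denominator that is the product of two power terms) would be the normalized squared cross-correlation below. The summation range over an N-sample segment (N = 40 in the embodiment) and the index convention are assumptions, not taken from the patent.

```latex
S(k) = \frac{\left( \sum_{i=1}^{N} a(-i)\, a(-i-k) \right)^{2}}
            {\left( \sum_{i=1}^{N} a(-i)^{2} \right) \left( \sum_{i=1}^{N} a(-i-k)^{2} \right)}
```

With this form S(k) lies between 0 and 1 and equals 1 when the segment and its copy k samples earlier are proportional, which matches the statement that S is largest when the waveforms coincide.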
Next, it is determined whether S exceeds a certain threshold. If it does, the waveform is judged to be periodic, and the period k is fixed and sent to the prediction unit 11 of FIG. 6. If it does not, the waveform is judged not to be periodic, and the processing of the prediction unit 11, the comparison and determination unit 12, and the correction unit 13 is skipped. The threshold for S should also be set by those skilled in the art through experiments and the like.
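The period search and the periodicity decision might be sketched in Python as follows. This is a sketch under stated assumptions rather than the patent's implementation: the function name `detect_period`, the lag range `k_min`..`k_max`, and the default values of `power_floor` and `s_threshold` are all hypothetical, and S is computed as a normalized squared cross-correlation, which is the form the description suggests.

```python
from typing import Optional, Sequence, Tuple


def detect_period(
    history: Sequence[float],
    seg_len: int = 40,
    k_min: int = 20,
    k_max: int = 40,
    power_floor: float = 1e-3,
    s_threshold: float = 0.5,
) -> Optional[Tuple[int, float]]:
    """Return (k, S) for the best lag, or None if the segment is not periodic.

    history[-1] is a(-1), the sample just before the one under test.
    """
    best: Optional[Tuple[int, float]] = None
    end = len(history)
    for k in range(k_min, k_max + 1):
        if seg_len + k > end:
            break                                   # not enough past samples for this lag
        cur = history[end - seg_len:end]
        past = history[end - seg_len - k:end - k]
        cross = sum(x * y for x, y in zip(cur, past))
        p_cur = sum(x * x for x in cur)
        p_past = sum(y * y for y in past)
        # Skip opposite-phase matches and low-power segments, as the text requires.
        if cross <= 0 or p_cur <= power_floor or p_past <= power_floor:
            continue
        s = cross * cross / (p_cur * p_past)
        if best is None or s > best[1]:
            best = (k, s)
    if best is None or best[1] <= s_threshold:
        return None                                 # judged not periodic
    return best
```

Returning `None` corresponds to the branch in which prediction, comparison, and correction are skipped. The thresholds would need tuning on real PCM data, whose sample values are much larger than the floating-point defaults assumed here.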
The prediction unit 11 predicts that the neighborhood of a(0) resembles the neighborhood of a(-k). Here the neighborhood of a(0) is a(-2), ..., a(4), and the neighborhood of a(-k) is a(-k-2), ..., a(-k+4). The prediction unit 11 sends the predicted waveform to the comparison and determination unit 12. The predicted waveform consists of the samples a(-k-2) to a(-k+4) around a(-k), which is judged to resemble the neighborhood of a(0). The following calculation is then performed over this short interval on the predicted waveform (the neighborhood of a(-k)) and the actual waveform (the neighborhood of a(0)). The calculation uses seven samples around each of a(0) and a(-k). The neighborhood is chosen to be much shorter than one period of the audio waveform but long enough to average out noise-like sample-to-sample variation. If too many samples are used, a local discontinuity in the waveform can no longer be detected; if too few are used, the test becomes oversensitive and even noise-like changes in sample values are judged to be discontinuities. About seven samples was considered appropriate, but the number is not limited to seven in this embodiment and should be determined by those skilled in the art through experiments and the like.

Next, it is determined whether T exceeds a certain threshold. If it does not, the waveform is judged to be significantly disturbed at that point, and the comparison and determination unit 12 issues a correction instruction to the correction unit 13. As before, cases in which either of the two terms multiplied together in the denominator is below a certain threshold are excluded, which removes low-power audio. When the expression inside the parentheses of the numerator is negative, the value is taken as -T, so that T becomes negative and cannot exceed the threshold; this excludes the case where the waveforms are in opposite phase. Each of these thresholds should likewise be determined by those skilled in the art through experiments and the like.
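The formula for T is likewise missing from this text. The sketch below assumes that T has the same normalized squared-correlation form as S, evaluated over the short neighborhood, and applies the sign rule and the power check described above. The names `local_match` and `needs_correction` and the default thresholds are hypothetical.

```python
from typing import Optional, Sequence


def local_match(
    actual: Sequence[float],
    predicted: Sequence[float],
    power_floor: float = 1e-3,
) -> Optional[float]:
    """Signed local correlation T, or None when either neighborhood has too little power."""
    cross = sum(x * y for x, y in zip(actual, predicted))
    p_actual = sum(x * x for x in actual)
    p_predicted = sum(y * y for y in predicted)
    if p_actual <= power_floor or p_predicted <= power_floor:
        return None                      # low-power case is excluded from the test
    t = cross * cross / (p_actual * p_predicted)
    return t if cross >= 0 else -t       # opposite phase is treated as -T


def needs_correction(
    actual: Sequence[float],
    predicted: Sequence[float],
    t_threshold: float = 0.5,
    power_floor: float = 1e-3,
) -> bool:
    """True when T does not exceed the threshold, i.e. a discontinuity is assumed."""
    t = local_match(actual, predicted, power_floor)
    return t is not None and t <= t_threshold
```

Here `actual` would hold a(-2)..a(4) and `predicted` would hold a(-k-2)..a(-k+4), seven samples each in the embodiment.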
Upon receiving the correction instruction, the correction unit 13 performs interpolation by weighting as shown below and outputs s, the corrected audio sample value. Once a correction instruction has been issued, correction is applied to n samples, where n is chosen so that the corrected waveform merges sufficiently smoothly into the actual waveform (the value of n should also be set by those skilled in the art), and the period detection, prediction, and comparison functions are suspended during that time.

Here, offset is the value of a(-1) - a(-k-1) at the time the correction instruction is issued. It is the amount needed, when correcting, to join the value one period (k samples) earlier (the predicted waveform) smoothly to the corrected waveform.
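The weighting formula itself is not reproduced in this text, so the following Python sketch is only one plausible reading: a linear cross-fade from the offset-shifted predicted waveform to the actual waveform over n samples. The linear weights and the name `blend_correction` are assumptions, not the patent's formula.

```python
from typing import List, Sequence


def blend_correction(
    actual: Sequence[float],
    predicted: Sequence[float],
    offset: float,
) -> List[float]:
    """Cross-fade from (predicted + offset) to actual over n = len(actual) samples.

    actual[j] stands for a(j) and predicted[j] for a(j - k), with j = 0 .. n-1.
    offset is a(-1) - a(-k-1), measured when the correction instruction is issued.
    """
    n = len(actual)
    if len(predicted) != n:
        raise ValueError("actual and predicted must have the same length")
    corrected = []
    for j in range(n):
        w = (n - j) / n      # assumed linear weight: 1 at the discontinuity, falling toward 0
        corrected.append(w * (predicted[j] + offset) + (1.0 - w) * actual[j])
    return corrected
```

With these weights the first corrected sample equals the shifted prediction, and each later sample moves a fixed step closer to the actual waveform, which is the behavior FIG. 5 is described as showing.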
If no correction instruction has been issued, the output is s = a(0).
After the correction unit has finished, the storage unit 14 shifts its values: a(4) → a(3), a(3) → a(2), and in general a(i) → a(i-1). In addition, s → a(-1), so that the correction result is reflected in the past waveform data held in the storage unit 14.
In the configuration of FIG. 6, the linear PCM data is input one sample at a time, and the latest sample value is input directly to the comparison and determination unit 12 and the correction unit 13. The storage unit 14 outputs a predetermined number (for example, about 40) of past sample values preceding the latest one. In the example above, a(4) is input directly to the comparison and determination unit 12 and the correction unit 13, while a(3) to a(-40) are input to the respective units from the storage unit 14.
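The storage update can be sketched as a fixed-length sliding window. The class below is an illustrative assumption (the name `SampleStore` and its methods are not from the patent); it keeps a(-past) .. a(lookahead) and writes the output s back so that it becomes a(-1) after the shift.

```python
from collections import deque


class SampleStore:
    """Sliding window over a(-past) .. a(lookahead); a(0) is the sample under test."""

    def __init__(self, past: int = 40, lookahead: int = 4) -> None:
        self.past = past
        self.lookahead = lookahead
        size = past + lookahead + 1
        self.buf = deque([0.0] * size, maxlen=size)

    def a(self, i: int) -> float:
        """Return a(i) for -past <= i <= lookahead."""
        return self.buf[self.past + i]

    def advance(self, s: float, new_sample: float) -> None:
        """Store the output s in place of a(0), then shift the window by one sample."""
        self.buf[self.past] = s          # after the shift this value is a(-1)
        self.buf.append(new_sample)      # maxlen drops the oldest sample; newest becomes a(lookahead)
```

Feeding the corrected value back into the window means later period detection and prediction see the repaired waveform rather than the original discontinuity.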
FIG. 7 is a flowchart showing the overall processing of the embodiment of the present invention.
First, in step S1, an autocorrelation coefficient is calculated. The calculation here corresponds to the calculation of S in the above description. Then, in step S2, it is determined whether or not there is periodicity. As described above, the periodicity is determined by determining whether or not the value of S is greater than a predetermined threshold, thereby determining the period k. k indicates the length of one cycle of the audio waveform by the number of samples. If it is determined that there is no periodicity, the process proceeds to step S7. In this case, in step S7, s = a (0), and the sample value of the audio waveform is output without any correction. Then, in step S8, one new sample value is stored in the storage unit 14, and one oldest sample value is discarded.
If step S2 finds periodicity, step S3 performs waveform prediction, that is, it acquires the past waveform one period earlier as the predicted waveform, and step S4 compares the current waveform with the predicted waveform. The operation in step S4 is the calculation of T described above: for a small number of samples around the target sample, the correlation value between the current waveform and the predicted waveform is obtained and checked against a predetermined threshold. This processing is what step S4 calls "comparison", and it determines whether the current waveform contains a discontinuity.
Then, in step S5, it is determined from the result of the comparison in step S4, that is, from whether the current audio waveform contains a discontinuity, whether the waveform needs correction. If there is no discontinuity, no correction is needed and the process proceeds to steps S7 and S8, where the same processing is performed as when step S2 finds no periodicity.
If step S5 determines that correction is needed, step S6 corrects the sample value of the audio waveform by the weighting calculation described above, step S7 outputs it, and step S8 stores the corrected sample value in the storage unit 14 and discards the oldest sample value.
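The branching of steps S1 to S7 can be summarized in a short Python skeleton. It is a sketch only: the three callables stand in for the period detection, comparison, and correction calculations, their signatures are assumptions made for illustration, and the n-sample correction run and the storage update of step S8 are left out.

```python
from typing import Callable, Optional, Sequence


def process_sample(
    window: Sequence[float],
    zero: int,
    find_period: Callable[[Sequence[float]], Optional[int]],
    is_discontinuous: Callable[[Sequence[float], Sequence[float]], bool],
    correct: Callable[[Sequence[float], int, int], float],
    before: int = 2,
    after: int = 4,
) -> float:
    """Return the output s for the sample a(0), which is stored at window[zero]."""
    k = find_period(window[:zero])                                 # S1, S2
    if k is None or zero - k - before < 0:
        return window[zero]                                        # S7: s = a(0)
    actual = window[zero - before:zero + after + 1]                # a(-2) .. a(4)
    predicted = window[zero - k - before:zero - k + after + 1]     # S3: a(-k-2) .. a(-k+4)
    if not is_discontinuous(actual, predicted):                    # S4, S5
        return window[zero]                                        # S7: no correction
    return correct(window, zero, k)                                # S6, then output in S7
```

Keeping the three calculations as parameters makes the control flow easy to check against FIG. 7 independently of how S, T, and the weighting are actually computed.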
FIG. 8 is a diagram illustrating an application portion and a network of the audio signal processing device according to the embodiment of the present invention.
The public network 22 is connected to a mobile network 23 via the network 20. The mobile network 23 may be another public network, or the public network 22 may be another mobile network. The network 20 is a network based on the IP packet switching system such as the Internet. In this case, a method called VoIP is adopted for transmitting and receiving voice via the network 20. A gateway 21 is provided as a boundary device between the network 20 and the public network 22. Similarly, a gateway 21 is provided as a boundary device between the mobile network 23 and the network 20.
The audio signal processing device according to the embodiment of the present invention is mounted on the gateways 21 serving as these boundary devices. That is, for example, an audio signal input to the gateway 21 from the public network 22 is converted into linear PCM data, subjected to the audio signal processing of the embodiment of the present invention, and transmitted to the network 20 in the VoIP format. The gateway 21 that receives the voice data transmitted over the network 20 converts the received voice signal into linear PCM data, performs the audio signal processing according to the embodiment of the present invention, and transmits the data to the mobile network 23.
The same applies when an audio signal is transmitted from the mobile network 23 to the public network 22.
In the above description, the gateway is given as the place where the audio signal processing device according to the embodiment of the present invention is applied, but the application is not limited to this. For example, the device can be applied where received voice is reproduced in a mobile device such as a portable terminal of the mobile network 23, or it can be provided in a base station of the mobile network 23 or in a telephone set of the public network 22, so that the audio signal processing according to the embodiment of the present invention is performed on the audio signal while it is in the form of linear PCM data.
INDUSTRIAL APPLICABILITY
According to the present invention, deterioration in auditory quality can be suppressed regardless of the cause of the occurrence of a discontinuous point in an audio waveform. Further, processing can be performed without a large delay.
[Brief description of the drawings]
FIG. 1 is a diagram showing a state of packet routing in an IP network.
FIG. 2 is a diagram (part 1) for explaining the principle of the embodiment of the present invention.
FIG. 3 is a diagram (part 2) for explaining the principle of the embodiment of the present invention.
FIG. 4 is a diagram (part 3) for explaining the principle of the embodiment of the present invention.
FIG. 5 is a diagram (part 4) for explaining the principle of the embodiment of the present invention.
FIG. 6 is a processing block diagram of the audio signal processing device according to the embodiment of the present invention.
FIG. 7 is a diagram illustrating an overall processing flow of the audio signal processing device according to the embodiment of the present invention.
FIG. 8 is a diagram illustrating an application portion and a network of the audio signal processing device according to the embodiment of the present invention.

Claims

An audio signal processing device for processing digital audio data in a communication network, comprising:
waveform prediction means for detecting a cycle of an input waveform and predicting, from the cycle, a waveform to be received;
discontinuous point detecting means for detecting a discontinuous point of the waveform from a correlation value between the predicted waveform and the actually received waveform; and
correction waveform generating means for generating, when the discontinuous point is detected, a correction waveform having no discontinuous point by using the predicted waveform and the actually received waveform.

The audio signal processing device according to claim 1, wherein the cycle of the input waveform is detected by detecting that an autocorrelation value of the input waveform is equal to or greater than a predetermined value.

The audio signal processing device according to claim 2, wherein the autocorrelation value is calculated for substantially one cycle of the input waveform.

The audio signal processing device according to claim 1, wherein the waveform to be received is predicted by using, as the predicted waveform, the waveform one cycle before the waveform to be predicted.

The audio signal processing device according to claim 1, wherein the discontinuous point is detected by calculating a correlation value between the predicted waveform and the actually received waveform over several sample points before and after a sample point, and determining from the correlation value whether or not a discontinuous point exists.

The audio signal processing device according to claim 1, wherein the correction waveform is generated by performing a weighted interpolation operation on the sample value of the predicted waveform and the sample value of the actually received waveform.

The audio signal processing device according to claim 6, wherein the weighted interpolation operation is performed after adding an offset amount to a sample value of the predicted waveform, so that the correction waveform and a waveform actually received in the past are connected continuously.

The audio signal processing device according to claim 7, wherein the offset amount is calculated based on two sample values calculated from a cycle of the input waveform.

The audio signal processing device according to claim 1, wherein the communication network transmits the audio signal by a packet switching method.

The audio signal processing device according to claim 9, wherein the communication network is an ATM network or an IP network.

The audio signal processing device according to claim 1, wherein the digital audio data is linear PCM data.

An audio signal processing method for processing digital audio data in a communication network, comprising:
a waveform prediction step of detecting a cycle of an input waveform and predicting, from the cycle, a waveform to be received;
a discontinuous point detecting step of detecting a discontinuous point of the waveform from a correlation value between the predicted waveform and the actually received waveform; and
a correction waveform generating step of generating, when the discontinuous point is detected, a correction waveform having no discontinuous point by using the predicted waveform and the actually received waveform.

The audio signal processing method according to claim 12, wherein the cycle of the input waveform is detected by detecting that an autocorrelation value of the input waveform is equal to or greater than a predetermined value.

The audio signal processing method according to claim 13, wherein the autocorrelation value is calculated for substantially one cycle of the input waveform.

The audio signal processing method according to claim 12, wherein the waveform to be received is predicted by using, as the predicted waveform, the waveform one cycle before the waveform to be predicted.

The audio signal processing method according to claim 12, wherein the discontinuous point is detected by calculating a correlation value between the predicted waveform and the actually received waveform over several sample points before and after a sample point, and determining from the correlation value whether or not a discontinuous point exists.

The audio signal processing method according to claim 12, wherein the correction waveform is generated by performing a weighted interpolation operation on the sample value of the predicted waveform and the sample value of the actually received waveform.

The audio signal processing method according to claim 17, wherein the weighted interpolation operation is performed after adding an offset amount to a sample value of the predicted waveform, so that the correction waveform and a waveform actually received in the past are connected continuously.

The audio signal processing method according to claim 18, wherein the offset amount is calculated based on two sample values calculated from a cycle of the input waveform.

The audio signal processing method according to claim 12, wherein the communication network transmits the audio signal by a packet switching method.

The audio signal processing method according to claim 20, wherein the communication network is an ATM network or an IP network.

The audio signal processing method according to claim 12, wherein the digital audio data is linear PCM data.