JP3617503B2

JP3617503B2 - Speech decoding method

Info

Publication number: JP3617503B2
Application number: JP2002123468A
Authority: JP
Inventors: 山浦 正 (Tadashi Yamaura); 田崎 裕久 (Hirohisa Tasaki)
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1996-10-18
Filing date: 2002-04-25
Publication date: 2005-02-09
Anticipated expiration: 2017-03-14
Also published as: JP2003029799A

Description

【０００１】
【発明の属する技術分野】
この発明はディジタル信号に圧縮符号化された音声信号を復号化する復号化方法に関し、特に通信に適用する際に伝送路誤りによる品質劣化の少ない音声を再生するための音声復号化方法に関する。
【０００２】
【従来の技術】
従来、通信用の高能率音声符号化方法としては、符号励振線形予測符号化（Ｃｏｄｅ−ＥｘｃｉｔｅｄＬｉｎｅａｒＰｒｅｄｉｃｔｉｏｎｃｏｄｉｎｇ：ＣＥＬＰ）、多帯域励振符号化（Ｍｕｌｔｉ−ＢａｎｄＥｘｃｉｔａｔｉｏｎｃｏｄｉｎｇ：ＭＢＥ）といった手法が代表的である。それぞれの技術については、「Ｃｏｄｅ−ｅｘｃｉｔｅｄｌｉｎｅａｒｐｒｅｄｉｃｔｉｏｎ（ＣＥＬＰ）：Ｈｉｇｈ−ｑｕａｌｉｔｙｓｐｅｅｃｈａｔ８ｋｂｐｓ」（Ｍ．Ｒ．ＳｈｒｏｅｄｅｒａｎｄＢ．Ｓ．Ａｔａｌ著、ＩＣＡＳＳＰ’８５，ｐｐ．９３７−９４０，１９８５）、及び「Ａｒｅａｌ−ｔｉｍｅｉｍｐｌｅｍｅｎｔａｔｉｏｎｏｆｔｈｅｉｍｐｒｏｖｅｄＭＢＥｓｐｅｅｃｈｃｏｄｅｒ」（Ｍ．Ｓ．Ｂｒａｎｄｓｔｅｉｎ，Ｐ．Ａ．Ｍｏｎｔａ，Ｊ．Ｃ．ＨａｒｄｗｉｃｋａｎｄＪ．Ｓ．Ｌｉｍ著、ＩＣＡＳＳＰ’９０，ｐｐ．５−８，１９９０）に述べられている。
【０００３】
ここでは、ＣＥＬＰ系音声復号化について説明する。ＣＥＬＰ系音声復号化では、５〜５０ｍｓ程度を１フレームとして、そのフレームの音声をスペクトル情報と音源情報に分けて符号化された符号を入力し音声信号を復号化する。以下、ＣＥＬＰ系音声復号化について図５を用い説明する。
Ｓ２は符号化された音声信号の符号入力端子で、適応符号、雑音符号、線形予測パラメータの符号、ゲインの符号を入力する。２は復号化部、４は分離手段、Ｓ３は再生音声信号の出力端子である。復号化部２は線形予測パラメータ復号化手段１５、合成フィルタ１６、適応符号帳１７、雑音符号帳１８、ゲイン復号化手段１９より構成されている。
適応符号帳１７には過去の駆動音源ベクトルが記憶されており、適応符号に対応して過去の駆動音源ベクトルを周期的に繰り返した時系列ベクトルを出力する。
また、雑音符号帳１８は、例えばランダム雑音から生成した複数の時系列ベクトルが記憶されており、雑音符号に対応した時系列ベクトルを出力する。
【０００４】
復号化部２において、入力端子２に入力された音声符号は分離手段４により適応符号、雑音符号、線形予測パラメータの符号、ゲインの符号に分離される。線形予測パラメータ復号化手段１５は線形予測パラメータの符号から線形予測パラメータを復号化し、合成フィルタ１６の係数として設定する。次に、適応符号帳１７は、適応符号に対応して、過去の駆動音源ベクトルを周期的に繰り返した時系列ベクトルを出力し、また雑音符号帳１８は雑音符号に対応した時系列ベクトルを出力する。これらの時系列ベクトルは、ゲイン復号化手段１９でゲインの符号から復号化したそれぞれのゲインに応じて重み付けして加算され、その加算結果が駆動音源ベクトルとして合成フィルタ１６へ供給され出力音声Ｓ３が得られる。
【０００５】
ここで実際上移動体通信のような、符号誤りの発生する応用分野に適用される音声復号化方法では、誤り訂正符号化技術を用いて符号誤りによる符号化音声の品質劣化を押さえている。また誤りを訂正しきれない場合には、波形修復処理を行い、符号誤りの影響を押さえる工夫がなされている。
【０００６】
これまでの修復方法としては、次の２種類がある。すなわち一番目の修復方法は、「ＣｈａｎｎｅｌｃｏｄｉｎｇｆｏｒｄｉｇｉｔａｌｓｐｅｅｃｈｔｒａｎｓｍｉｓｓｉｏｎｉｎｔｈｅＪａｐａｎｅｓｅｄｉｇｉｔａｌｃｅｌｌｕｌａｒｓｙｓｔｅｍ」（Ｍ．Ｊ．ＭｃＬａｕｇｈｌｉｎ著、電子情報通信学会無線通信システム研究会ＲＣＳ９０−２７，ｐｐ．４１−４５，１９９０）に示すように、現在のフレームが符号誤りのあるフレームの場合に、過去のフレームのパラメータを繰り返し用いて再生音声を生成する、また再生音声のパワーを徐々に抑圧していく方法である。
【０００７】
また２番目の修復方法は、特開平６−１２０９５号公報に開示されているように、過去のフレーム、現在のフレーム及び将来のフレームのそれぞれの符号誤り検出情報を用い、各フレーム誤り検出状態に応じて現在のフレームの音声を再生修復するものである。この場合には、将来のフレームの情報も用いて補間を行うので、過去のフレームの情報のみを用いる場合に比較して、歪みの小さい補間を行うことができる。
【０００８】
また波形修復が十分に行えない場合にも、聴感上の音声品質を維持する伝送誤り補償方法として、特開平７−３６４９６号公報に開示されたものがある。これは、伝送誤りの状態により再生音声を出力せず、代わりに雑音信号を出力するものである。これにより、適当な波形修復が行えない場合にも、異音となることを避けることができる。
【０００９】
【発明が解決しようとする課題】
従来の符号誤りが発生した場合の波形修復方法では、過去のフレームのパラメータを繰り返して用いているが、まだ再生音声の品質が低いという問題があった。例えば有声の立ち上がりのフレームではピッチ周期性が非定常なため、必ずしも後続の有声定常部に適したピッチ情報のパラメータが伝送されてはいないことが多く、後続フレームで符号誤りが発生した場合、そのピッチ情報のパラメータを繰り返して用いても良好な再生音声は得られなかった。
【００１０】
また無声部でも局所的には周期性があるフレームがあり、これに後続する無声フレームで符号誤りが発生した場合、パラメータの繰り返しにより一定周期の信号が連続するため再生音声がブザー音となり、再生音声の品質劣化が大きかった。また符号誤りが発生したフレームで、前フレームのパラメータを繰り返して生成したスペクトルが不適当であった場合には、再生音声品質が大きく劣化していた。これは特に、高域に鋭いホルマントピークがある等、高域のパワーが大きい場合に、その高域での異音感が顕著となり、聴感上の劣化が大きかった。
【００１１】
また将来のフレームの情報も用いて補間を行う従来の波形修復方法では、修復のために現フレームの再生に必要な遅延が大きくなるとういう問題があった。この音声再生に必要な遅延が大きくなると、通信に適用した場合、自然な対話ができず、通話に支障を来たすので、遅延はできるだけ小さいことが望ましい。
また従来の符号誤りが発生して波形修復が十分に行えない場合にも、聴感上の音声品質を維持する伝送誤り補償方法では、再生音声と雑音信号を切り替えて出力している。しかし、音声から雑音あるいは雑音から音声への移行に不連続感が伴うため、これがかえって、聴感上の再生音声品質の劣化につながるという問題があった。さらに伝送誤りが発生した場合、常に雑音信号を出力しているため、本来は無音である部分でも雑音が出力され、これも聴感上の劣化につながっていた。
【００１２】
【課題を解決するための手段】
この発明の音声復号化方法は、符号化された音声のスペクトル情報と音源情報を、フレーム単位に復号し、音声を再生する音声復号化方法において、符号が正しく復号できなかった事を検出した場合に、前のフレームの再生音声の状態を判定し、その判定に応じて現在のフレームの音声を再生修復するようにした。
【００１３】
また、次の発明の音声復号化方法は、音源情報はピッチ情報を含み、前のフレームの再生音声が有声の立ち上がりの場合、過去のフレームの情報より現在のフレームについてのピッチ情報を求め、当該ピッチ情報に基づいて現在のフレームの音声を再生修復するようにした。
【００１４】
さらに次の発明の音声復号化方法は、前のフレームの再生音声が無声である場合、ピッチ周期性を抑制するように現在のフレームの音声を再生修復するようにした。
【００１５】
さらに次の発明の音声復号化方法は、符号化された音声のスペクトル情報と音源情報を、フレーム単位に復号し、音声を再生する音声復号化方法において、符号が正しく復号できなかった事を検出した場合に、前のフレームのスペクトルの低域のホルマント構造を保ち、高域のホルマントピークを抑制したスペクトルとなるように現在のフレームの音声を再生修復するようにした。
【００１６】
【００１７】
また次の発明の音声復号化方法は、符号化された音声の情報を復号し、音声を再生する音声復号化方法において、符号が正しく復号できなかった事を検出しその検出状態に応じて再生音声に雑音を重畳するようにした。
【００１８】
さらに次の発明の音声復号化方法は、符号が正しく復号できなかった事を検出し、その検出状態に応じて再生音声に重畳する雑音のパワーを制御するようにした。
【００１９】
さらに次の発明の音声復号化方法はさらに、再生する音声の状態を推定し、その推定結果に応じて再生音声に雑音を重畳するようにした。
【００２０】
さらに次の発明の音声復号化方法は、再生する音声の状態を推定し、その推定結果に応じて再生音声に重畳する雑音のパワーを制御するようにした。
【００２１】
【発明の実施の形態】
以下図面を参照しながら、この発明の実施の形態について説明する。
【００２２】
実施の形態１．
図５との対応部分に同一符号を付けた図１は、この発明による音声復号化方法の実施の形態１の構成を示し、図中３０は符号の伝送誤りが検出された場合に線形予測パラメータを修復する線形予測パラメータ修復手段、３１は再生音声を分析して前フレームが有声の立ち上がりか否かを判定する立ち上がり判定手段、３２は再生音声を分析しピッチを求めるピッチ抽出手段、３３は符号の伝送誤りが検出された場合に適応符号を修復する適応符号修復手段である。また３４は再生音声を分析して前フレームが無声か否かを判定する無声判定手段、３５は符号の伝送誤りが検出された場合にゲインを修復するゲイン修復手段、３６は符号の伝送誤り検出状況に応じて再生音声のパワーを制御する再生音声パワー制御手段である。さらに３７は符号の伝送誤りが検出された場合に現フレームの音声の状態を推定する音声状態推定手段、３８は雑音信号を生成する雑音生成手段、３９は符号の伝送誤り検出状況に応じて雑音信号のパワーを制御する雑音パワー制御手段である。
【００２３】
復号化部２において、線形予測パラメータ復号化手段１５は線形予測パラメータの符号から線形予測パラメータを復号化する。線形予測パラメータ修復手段３０は、符号の伝送誤りが検出されていないフレームでは復号化された線形予測パラメータを、伝送誤りが検出されたフレームでは前フレームで復号化された線形予測パラメータから、低域のホルマント構造を保ち、高域のホルマントピークを抑制したスペクトルとなるような線形予測パラメータを、例えば線形予測パラメータを線スペクトル対パラメータとして表したときに、その低域のパラメータのみを用いるとして求め、合成フィルタ１６の係数として設定する。図２に前フレームの線形予測パラメータが表すスペクトル包絡と、線形予測パラメータ修復処理によりその低域の構造を保ち高域を抑制したスペクトル包絡の例を示す。
【００２４】
立ち上がり判定手段３１は符号の伝送誤りが検出されたフレームで、前フレームの再生音声を分析して、前フレームが有声の立ち上がりであるか否かを判定し、判定結果をピッチ分析手段３２、適応符号修復手段３３に出力する。ここで、有声立ち上がりか否かの判定は、例えば前フレームにおいて、フレーム後半部の音声のパワーがフレーム前半部のそれと比較してある閾値以上大きい場合には有声の立ち上がりとする。ピッチ抽出手段３２は、符号の伝送誤りが検出されたフレームで、前フレームが有声立ち上がりと判定された場合に、過去の再生音声を分析して少なくとも１つ以上のピッチ周期の候補を抽出し、適応符号修復手段３３に出力する。
【００２５】
適応符号修復手段３３は、符号の伝送誤りが検出されていないフレームでは、分離手段４から入力される適応符号を、符号誤りが検出されても前フレームが有声立ち上がりではない場合には、前フレームの適応符号を、符号誤りが検出され、かつ前フレームが有声立ち上がりの場合には、前記ピッチ抽出手段３２から入力されたピッチ周期の候補の中から、例えば過去に正しく伝送された適応符号の出現頻度分布と比較し、出現頻度の高い符号に対応するピッチ周期に最も近いものを選択し、この選択されたピッチ周期に対応する適応符号を適応符号帳１７に出力する。適応符号帳１７は適応符号に対応して、過去の駆動音源ベクトルを周期的に繰り返した時系列ベクトルを出力する。図３（Ａ）にこの実施の形態１１における適応符号修復処理の例を、また図３（Ｂ）に従来の前フレームの適応符号を繰り返して用いる修復処理の例を示す。図に示すように、前フレームの適応符号に対応するピッチ周期が現フレームに不適な場合でも、過去のフレームの情報より現フレームに適したピッチ周期を求めることにより、良好な波形修復処理が可能であることが分かる。
【００２６】
雑音符号帳１８は雑音符号に対応した時系列ベクトルを出力する。ゲイン復号化手段１９はゲインの符号から、前記適応符号帳１７及び雑音符号帳１８より出力された時系列ベクトルに対するゲインを復号化する。無声判定手段３４は、符号の伝送誤りが検出されたフレームで、前フレームの再生音声を分析して、前フレームが無声であるか否かを判定し、判定結果をゲイン修復手段３５に出力する。ゲイン修復手段３５は、符号の伝送誤りが検出されていないフレームでは、ゲイン復号化手段１９から入力されたゲインを、符号誤りが検出され、かつ、前フレームが無声でない場合は前フレームのゲインを、符号誤りが検出され、かつ、前フレームが無声の場合は、前フレームのゲインに対し、適応符号帳からの時系列ベクトルに対するゲインをα倍、雑音符号帳からの時系列ベクトルに対するゲインをβ倍して出力する。ここでα、βは、例えば、０≦α＜β≦１とする。
【００２７】
適応符号帳１７、雑音符号帳１８からの各時系列ベクトルは、ゲイン修復手段３５から出力されたそれぞれのゲインに応じて重み付けして加算され、その加算結果を駆動音源ベクトルとして合成フィルタ１６へ供給され再生音声が得られる。再生音声パワー制御手段３６は、符号の伝送誤り検出状況に応じて、例えば誤りが連続するに従い徐々に抑圧量を強めるとして、再生音声のパワーを抑圧する。音声状態推定手段３７は、符号の伝送誤りが検出された場合に、過去の再生音声及び現フレームの線形予測パラメータの符号から現フレームの有音／無音を判定し、その判定結果を雑音生成手段３８に出力する。
【００２８】
雑音生成手段３８は、符号の伝送誤り検出状況及び音声状態判定手段３７から入力された有音／無音判定結果に応じて、例えば誤りが検出され、かつ、有音と推定されたフレームで雑音を生成する。雑音パワー制御手段３９は、例えば雑音の重畳始めは徐々にパワーを増大し、重畳終わりには徐々にパワーを減少させるとして、雑音のパワーを制御する。図４に再生音声及び雑音信号のパワー制御処理の例を示す。再生音声パワー制御手段３６から出力された再生音声と雑音パワー制御手段３９から出力された雑音は加算され、出力音声Ｓ３が得られる。
【００２９】
この実施の形態１によれば、符号が正しく復号できなかったことを検出した場合に、前のフレームの再生音声の状態を判定し、この判定に応じて現在のフレームの音声を再生修復することにより、遅延を増大することなく、伝送路誤りによる品質劣化の少ない音声を再生することができる。また、符号誤りが発生した場合に、再生する音声の状態に応じて、再生音声に雑音を重畳することにより、聴感上の音声品質を維持することができる。
【００３０】
実施の形態２．
上述の実施の形態１では、線形予測パラメータ修復手段３０において、伝送誤りが発生した際は常に前フレームの線形予測パラメータから高域のホルマントピークを抑制したスペクトルとなるような線形予測パラメータを求め、合成フィルタ１６の係数としているが、これに代え、前フレームの線形予測パラメータが表すスペクトルの高域に鋭いホルマントピークがある場合にのみ抑制処理を行い、その他の場合は、前フレームの線形予測パラメータをそのまま用いるとするなど、状態に応じて選択的に処理を行っても良い。
【００３１】
この実施の形態２によれば、線形予測パラメータの修復処理を行う際に、高域に鋭いホルマントピークがあり、聴感上の品質劣化につながる可能性が高い場合にのみその抑制処理を行い、その他の場合前フレームの線形予測パラメータを繰り返して用いるので、異音を発生すること無く合成音声の連続性が向上し、聴感上の音声品質を向上することができる。
【００３２】
実施の形態３．
上述の実施の形態１では、伝送誤り検出時前フレームが無声の場合、ゲイン修復手段３５においてゲインを調整することにより駆動音源ベクトルのピッチ周期性を抑制を実現しているが、これに代え、例えば適応符号帳からの時系列ベクトルを用いないとするなど、別の手段を用いてピッチ周期性の抑制を実現しても良い。この実施の形態３によれば、ピッチ周期性の抑制を適応符号帳からの時系列ベクトルを使用する／使用しないを切り替えるだけで良く、ゲインを調整してピッチ周期性を抑制する場合に比較し簡易に実現できる。
【００３３】
実施の形態４．
上述の実施の形態１では、音声状態推定手段３７で、過去の再生音声及び現フレームの線形予測パラメータの符号から現フレームの有音／無音を判定しているが、これに代え、適応符号やゲインの符号も判定のためのパラメータとして用いても良い。この実施の形態４によれば、過去の再生音声及び線形予測パラメータの符号からだけでなく、その他の情報も用いて有音／無音判定するので、より精度の高い判定が可能となる。
【００３４】
実施の形態５．
上述の実施の形態１では、ＣＥＬＰ系音声符号化に伝送誤り時の修復処理、雑音重畳処理を適用しているが、これに代え、ＭＢＥ系音声符号化をはじめ他の音声符号化方式に適用しても、同様に伝送誤り時の再生音声の聴感上の音声品質を向上することができる。
【００３５】
【発明の効果】
以上詳述したように、
【００３６】
この発明によれば、復号化方法で、符号が正しく復号できなかったことを検出した場合に、前のフレームの再生音声の状態を判定し、この判定に応じて現在のフレームの音声を再生修復するようにしたので、遅延を増大することなく効果的な再生修復処理ができ、伝送路誤りによる品質劣化の少ない音声を再生することができる。
【００３７】
またこの発明によれば、前フレームの再生音声が有声立ち上がりの場合、過去のフレームの情報より現在のフレームについてのピッチ情報を求め、このピッチ情報に基づいて現在のフレームの音声を再生修復するようにしたので、前フレームで伝送されたピッチ情報が現フレームには不適なものであっても、現フレームに適したピッチ周期を求めて再生音声を生成することができ、伝送誤りによる品質劣化の少ない音声を再生することができる。
【００３８】
またこの発明によれば、前フレームの再生音声が無声である場合、ピッチ周期性を抑制するように現在のフレームの音声を再生修復するようにしたので、無声である部分で不自然な周期性が発生することを回避することができ、伝送誤りによる品質劣化の少ない音声を再生することができる。
【００３９】
またこの発明によれば、符号が正しく復号できなかったことを検出した場合に前のフレームのスペクトルの低域のホルマント構造を保ち、高域のホルマントピークを抑制したスペクトルとなるように現在のフレームの音声を再生修復することにより、再生音声の連続性を保ちつつ高域の異音感を軽減することができ、伝送路誤りによる品質劣化の少ない音声を再生することができる。
【００４０】
またこの発明によれば、音声復号化方法で、符号が正しく復号できなかったことを検出し、この検出状態に応じて再生音声に雑音を重畳することにより、再生音声と雑音間の移行を滑らかにすることができ、また、本来有音である部分にのみ雑音を出力し、無音部分では雑音を出力しないので、聴感上の音声品質を維持することができる。
【図面の簡単な説明】
【図１】この発明による音声復号化方法の実施の形態１の構成を示すブロック図である。
【図２】実施の形態１におけるスペクトル修復処理の動作の説明に供する略線図である。
【図３】実施の形態１における適応符号修復処理の動作の説明に供する信号波形図である。
【図４】実施の形態１における再生音声及び雑音信号のパワー制御処理の一例の説明に供する略線図である。
【図５】従来のＣＥＬＰ系音声符号化復号化方法の全体構成を示すブロック図である。
【符号の説明】
２復号化部
４分離手段
１６合成フィルタ
１７適応符号帳
１８雑音符号帳
１５線形予測パラメータ復号化手段
１９ゲイン復号化手段
３０線形予測パラメータ修復手段
３１立ち上がり判定手段
３２ピッチ抽出手段
３３適応符号修復手段
３４無声判定手段
３５ゲイン修復手段
３６再生音声パワー制御手段
３７音声状態推定手段
３８雑音生成手段
３９雑音パワー制御手段

[0001]
[Technical field of the invention]
The present invention relates to a decoding method for decoding a voice signal compressed and encoded into a digital signal, and more particularly to a voice decoding method for reproducing a voice with little quality deterioration due to a transmission path error when applied to communication.
[0002]
[Prior art]
Conventional high-efficiency speech coding methods for communication are typified by code-excited linear prediction coding (CELP) and multi-band excitation coding (MBE). These techniques are described, respectively, in "Code-excited linear prediction (CELP): High-quality speech at 8 kbps" (M. R. Schroeder and B. S. Atal, ICASSP '85, pp. 937-940, 1985) and "A real-time implementation of the improved MBE speech coder" (M. S. Brandstein, P. A. Monta, J. C. Hardwick and J. S. Lim, ICASSP '90, pp. 5-8, 1990).
[0003]
Here, CELP speech decoding will be described. In CELP speech decoding, one frame is about 5 to 50 ms long, and the decoder receives codes in which the speech of each frame has been encoded separately as spectrum information and sound source (excitation) information, and decodes the speech signal from them. Hereinafter, CELP speech decoding will be described with reference to FIG. 5.
S2 is a code input terminal for the encoded speech signal, and inputs an adaptive code, a noise code, a linear prediction parameter code, and a gain code. Reference numeral 2 denotes a decoding unit, 4 denotes separation means, and S3 denotes an output terminal for a reproduced audio signal. The decoding unit 2 includes a linear prediction parameter decoding unit 15, a synthesis filter 16, an adaptive codebook 17, a noise codebook 18, and a gain decoding unit 19.
The adaptive codebook 17 stores past drive excitation vectors, and outputs a time series vector in which past drive excitation vectors are periodically repeated corresponding to the adaptive code.
The noise codebook 18 stores a plurality of time series vectors generated from random noise, for example, and outputs a time series vector corresponding to the noise code.
[0004]
In the decoding unit 2, the speech code input to the input terminal S2 is separated by the separation means 4 into an adaptive code, a noise code, a linear prediction parameter code, and a gain code. The linear prediction parameter decoding means 15 decodes the linear prediction parameters from the linear prediction parameter code and sets them as the coefficients of the synthesis filter 16. Next, the adaptive codebook 17 outputs a time series vector in which past drive excitation vectors are periodically repeated in accordance with the adaptive code, and the noise codebook 18 outputs a time series vector corresponding to the noise code. These time series vectors are weighted by the respective gains that the gain decoding means 19 decodes from the gain code and are then added together. The sum is supplied to the synthesis filter 16 as the drive excitation vector, and the output speech S3 is obtained.
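The decoding flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names, the direct-form filter, and the convention A(z) = 1 + sum(lpc[k] * z^-(k+1)) are assumptions made for the example, and the noise codebook lookup is replaced by a vector passed in by the caller.

```python
import numpy as np

def synthesize(excitation, lpc, state):
    """All-pole synthesis filter 1/A(z); state[k] holds the output y[n-1-k]."""
    out = np.empty(len(excitation))
    for n, e in enumerate(excitation):
        y = e - np.dot(lpc, state)
        state = np.concatenate(([y], state[:-1]))
        out[n] = y
    return out, state

def adaptive_vector(past_excitation, lag, length):
    """Adaptive codebook output: repeat the last `lag` samples periodically."""
    segment = np.asarray(past_excitation, dtype=float)[-lag:]
    reps = -(-length // lag)  # ceiling division
    return np.tile(segment, reps)[:length]

def decode_frame(past_excitation, lag, noise_vec, g_adaptive, g_noise, lpc, state):
    """Weight the two codebook vectors by their gains, add, and filter the sum."""
    noise_vec = np.asarray(noise_vec, dtype=float)
    adaptive = adaptive_vector(past_excitation, lag, len(noise_vec))
    excitation = g_adaptive * adaptive + g_noise * noise_vec
    speech, state = synthesize(excitation, np.asarray(lpc, dtype=float), state)
    return speech, excitation, state
```

The returned excitation would be appended to the past excitation buffer so that the adaptive codebook of the next frame can draw on it.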
[0005]
In practice, speech decoding methods used in applications where code errors occur, such as mobile communication, rely on error correction coding to limit the degradation of coded speech quality caused by code errors. When the errors cannot be fully corrected, waveform repair processing is performed to reduce the influence of the code errors.
[0006]
Two types of repair method have been used so far. The first, as shown in "Channel coding for digital speech transmission in the Japanese digital cellular system" (M. J. McLaughlin, IEICE Technical Group on Radio Communication Systems, RCS90-27, pp. 41-45, 1990), generates the reproduced speech by repeatedly using the parameters of a past frame when the current frame contains a code error, and gradually suppresses the power of the reproduced speech.
[0007]
The second repair method, as disclosed in Japanese Patent Laid-Open No. 6-12095, uses the code error detection information of the past frame, the current frame, and the future frame, and reproduces and repairs the speech of the current frame according to the error detection state of each frame. Because interpolation in this case also uses information from the future frame, it can achieve lower distortion than interpolation that uses only information from past frames.
[0008]
Japanese Patent Laid-Open No. 7-36496 discloses a transmission error compensation method that maintains perceived speech quality even when waveform repair cannot be performed adequately. Depending on the transmission error state, this method does not output the reproduced speech and outputs a noise signal instead. As a result, abnormal sounds can be avoided even when no suitable waveform repair is possible.
[0009]
[Problems to be solved by the invention]
The conventional waveform repair method for code errors repeatedly uses the parameters of a past frame, but the quality of the reproduced speech is still low. For example, because pitch periodicity is non-stationary in a voiced rise (onset) frame, the pitch information parameter transmitted in that frame is often not suitable for the stationary voiced part that follows. When a code error occurs in the subsequent frame, repeating that pitch information parameter does not yield good reproduced speech.
[0010]
In addition, even unvoiced parts contain frames that are locally periodic. When a code error occurs in an unvoiced frame that follows such a frame, repeating the parameters produces a signal with a constant period, so the reproduced speech sounds like a buzzer and its quality is greatly degraded. Likewise, when the spectrum generated by repeating the parameters of the previous frame is inappropriate for the frame in which the code error occurred, the reproduced speech quality is greatly degraded. This is especially so when the high-band power is large, for example when there is a sharp formant peak in the high band: the abnormal sound in the high band becomes prominent and the perceived degradation is severe.
[0011]
Further, the conventional waveform restoration method in which interpolation is performed also using information of a future frame has a problem that a delay necessary for reproduction of the current frame is increased for restoration. If the delay required for audio reproduction becomes large, when applied to communication, natural conversation cannot be performed and the call is disturbed. Therefore, it is desirable that the delay be as small as possible.
Further, the conventional transmission error compensation method, which maintains perceived speech quality even when a code error occurs and waveform repair cannot be performed adequately, switches between the reproduced speech and a noise signal at the output. The transition from speech to noise or from noise to speech sounds discontinuous, which instead degrades the perceived quality of the reproduced speech. Moreover, because a noise signal is always output when a transmission error occurs, noise is output even in portions that are originally silent, which also degrades perceived quality.
[0012]
[Means for Solving the Problems]
The speech decoding method of the present invention decodes the spectrum information and sound source information of encoded speech in units of frames and reproduces the speech. When it detects that a code could not be decoded correctly, it determines the state of the reproduced speech of the previous frame and reproduces and repairs the speech of the current frame according to that determination.
[0013]
In the speech decoding method of the next invention, the sound source information includes pitch information, and when the reproduced speech of the previous frame is a voiced rise, pitch information for the current frame is obtained from the information of past frames, and the speech of the current frame is reproduced and repaired based on that pitch information.
[0014]
Furthermore, in the audio decoding method of the next invention, when the reproduced audio of the previous frame is unvoiced, the audio of the current frame is reproduced and repaired so as to suppress the pitch periodicity.
[0015]
Furthermore, the speech decoding method of the next invention decodes the spectrum information and sound source information of encoded speech in units of frames and reproduces the speech. When it detects that a code could not be decoded correctly, it reproduces and repairs the speech of the current frame so that the spectrum keeps the low-band formant structure of the previous frame's spectrum while the high-band formant peaks are suppressed.
[0016]
[0017]
The speech decoding method of the next invention decodes encoded speech information and reproduces the speech. It detects that a code could not be decoded correctly and superimposes noise on the reproduced speech according to the detection state.
[0018]
Furthermore, the speech decoding method of the next invention detects that the code has not been correctly decoded, and controls the power of noise superimposed on the reproduced speech in accordance with the detected state.
[0019]
Furthermore, the speech decoding method according to the next invention further estimates the state of the speech to be reproduced, and superimposes noise on the reproduced speech according to the estimation result.
[0020]
In the speech decoding method of the next invention, the state of the speech to be reproduced is estimated, and the power of noise superimposed on the reproduced speech is controlled according to the estimation result.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0022]
Embodiment 1.
FIG. 1, in which parts corresponding to those in FIG. 5 carry the same reference numerals, shows the configuration of Embodiment 1 of the speech decoding method according to the present invention. In the figure, 30 is linear prediction parameter repair means for repairing the linear prediction parameters when a code transmission error is detected, 31 is rise determination means for analyzing the reproduced speech and determining whether the previous frame is a voiced rise, 32 is pitch extraction means for analyzing the reproduced speech and obtaining the pitch, and 33 is adaptive code repair means for repairing the adaptive code when a code transmission error is detected. Further, 34 is unvoiced determination means for analyzing the reproduced speech and determining whether the previous frame is unvoiced, 35 is gain repair means for repairing the gains when a code transmission error is detected, and 36 is reproduced speech power control means for controlling the power of the reproduced speech according to the code transmission error detection status. Finally, 37 is speech state estimation means for estimating the speech state of the current frame when a code transmission error is detected, 38 is noise generation means for generating a noise signal, and 39 is noise power control means for controlling the power of the noise signal according to the code transmission error detection status.
[0023]
In the decoding unit 2, the linear prediction parameter decoding means 15 decodes the linear prediction parameters from the linear prediction parameter code. The linear prediction parameter repair means 30 then sets the coefficients of the synthesis filter 16. In a frame in which no code transmission error is detected, it uses the decoded linear prediction parameters. In a frame in which a transmission error is detected, it derives, from the linear prediction parameters decoded in the previous frame, linear prediction parameters whose spectrum keeps the low-band formant structure while suppressing the high-band formant peaks. This can be done, for example, by expressing the linear prediction parameters as line spectrum pair parameters and using only the low-band ones. FIG. 2 shows an example of the spectrum envelope represented by the linear prediction parameters of the previous frame, together with the spectrum envelope obtained by the repair processing, in which the low-band structure is kept and the high band is suppressed.
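One way to read the line spectrum pair example is sketched below in Python. The patent only says that the low-band parameters are used; replacing the high-band line spectral frequencies with evenly spaced values is an assumption of this sketch, chosen because closely spaced line spectral frequencies are what usually mark a sharp spectral peak. The function name and the `n_keep` parameter are hypothetical.

```python
import numpy as np

def repair_lsf(prev_lsf, n_keep):
    """Keep the n_keep lowest line spectral frequencies (radians, ascending)
    and spread the remaining ones evenly between the last kept value and pi."""
    prev_lsf = np.asarray(prev_lsf, dtype=float)
    repaired = prev_lsf.copy()
    n_high = len(prev_lsf) - n_keep
    if n_high > 0:
        start = prev_lsf[n_keep - 1] if n_keep > 0 else 0.0
        step = (np.pi - start) / (n_high + 1)
        # Even spacing removes the close pairs that would form a high-band peak.
        repaired[n_keep:] = start + step * np.arange(1, n_high + 1)
    return repaired
```

The repaired frequencies would still have to be converted back to filter coefficients before being set in the synthesis filter; that conversion is omitted here.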
[0024]
In a frame in which a code transmission error is detected, the rise determination means 31 analyzes the reproduced speech of the previous frame, determines whether the previous frame is a voiced rise, and outputs the result to the pitch extraction means 32 and the adaptive code repair means 33. For example, the previous frame is judged to be a voiced rise when the speech power in the latter half of that frame exceeds the power in its first half by at least a certain threshold. When a code transmission error is detected and the previous frame has been judged to be a voiced rise, the pitch extraction means 32 analyzes the past reproduced speech, extracts one or more pitch period candidates, and outputs them to the adaptive code repair means 33.
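The rise decision and the candidate extraction can be illustrated with the following Python sketch. The power ratio test, the default threshold of 4.0, and the use of normalized autocorrelation to rank lags are assumptions made for the example; the patent does not specify how the threshold is applied or how pitch candidates are found.

```python
import numpy as np

def is_voiced_rise(prev_frame, ratio_threshold=4.0):
    """Judge a voiced rise when second-half power exceeds first-half power
    by the given ratio (hypothetical threshold)."""
    frame = np.asarray(prev_frame, dtype=float)
    half = len(frame) // 2
    p_first = np.mean(np.square(frame[:half]))
    p_second = np.mean(np.square(frame[half:]))
    return bool(p_second > ratio_threshold * max(p_first, 1e-12))

def pitch_candidates(past_speech, min_lag, max_lag, n_candidates=3):
    """Return the lags with the highest normalized autocorrelation, best first."""
    x = np.asarray(past_speech, dtype=float)
    scores = []
    for lag in range(min_lag, max_lag + 1):
        a, b = x[lag:], x[:-lag]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
        scores.append(np.dot(a, b) / denom)
    best = np.argsort(scores)[::-1][:n_candidates]
    return [min_lag + int(i) for i in best]
```

Returning several candidates rather than a single lag leaves the final choice to a later selection step, as the text describes.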
[0025]
The adaptive code repair means 33 outputs an adaptive code to the adaptive codebook 17 as follows. In a frame in which no code transmission error is detected, it outputs the adaptive code input from the separation means 4. If a code error is detected but the previous frame is not a voiced rise, it outputs the adaptive code of the previous frame. If a code error is detected and the previous frame is a voiced rise, it selects one of the pitch period candidates input from the pitch extraction means 32, for example by comparing them with the appearance frequency distribution of the adaptive codes correctly transmitted in the past and choosing the candidate closest to the pitch period of a frequently occurring code, and outputs the adaptive code corresponding to the selected pitch period. The adaptive codebook 17 outputs a time series vector in which past drive excitation vectors are periodically repeated in accordance with the adaptive code. FIG. 3(A) shows an example of the adaptive code repair process in Embodiment 1, and FIG. 3(B) shows an example of the conventional repair process that reuses the adaptive code of the previous frame. As the figure shows, even when the pitch period corresponding to the adaptive code of the previous frame is unsuitable for the current frame, good waveform repair is possible by obtaining a pitch period suited to the current frame from the information of past frames.
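The three-way selection described above can be sketched as follows in Python. For simplicity the sketch treats the adaptive code as if it were the pitch lag itself, which is an assumption (a real codec maps codes to lags through a table), and it uses only the single most frequent past lag as the reference; all names are hypothetical.

```python
from collections import Counter

def select_pitch(candidates, past_correct_lags):
    """Pick the candidate closest to the most frequent correctly received lag."""
    if not past_correct_lags:
        return candidates[0]
    most_common_lag, _ = Counter(past_correct_lags).most_common(1)[0]
    return min(candidates, key=lambda lag: abs(lag - most_common_lag))

def repair_adaptive_code(error, prev_is_rise, received_lag, prev_lag,
                         candidates, past_correct_lags):
    """Choose the lag handed to the adaptive codebook for the current frame."""
    if not error:
        return received_lag            # no error: use the transmitted code
    if not prev_is_rise:
        return prev_lag                # error, no rise: repeat the previous code
    return select_pitch(candidates, past_correct_lags)  # error after a rise
```

For example, with candidates `[20, 41, 80]` and a history dominated by lag 40, the sketch picks 41 rather than the half or double period.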
[0026]
The noise codebook 18 outputs a time series vector corresponding to the noise code. The gain decoding means 19 decodes, from the gain code, the gains for the time series vectors output from the adaptive codebook 17 and the noise codebook 18. In a frame in which a code transmission error is detected, the unvoiced determination means 34 analyzes the reproduced speech of the previous frame, determines whether the previous frame is unvoiced, and outputs the result to the gain repair means 35. In a frame in which no code transmission error is detected, the gain repair means 35 outputs the gains input from the gain decoding means 19. If a code error is detected and the previous frame is not unvoiced, it outputs the gains of the previous frame. If a code error is detected and the previous frame is unvoiced, it takes the gains of the previous frame, multiplies the gain for the time series vector from the adaptive codebook by α and the gain for the time series vector from the noise codebook by β, and outputs the results. Here α and β satisfy, for example, 0 ≤ α < β ≤ 1.
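The gain repair rule maps directly onto a small Python function. The default values of α and β below are arbitrary placeholders that merely satisfy 0 ≤ α < β ≤ 1; the patent gives no concrete values, and the function and parameter names are hypothetical.

```python
def repair_gains(error, prev_unvoiced, decoded, previous, alpha=0.2, beta=0.9):
    """Return (adaptive_gain, noise_gain) for the current frame.

    decoded  -- gains decoded from the current frame's gain code
    previous -- gains used in the previous frame
    """
    if not 0.0 <= alpha < beta <= 1.0:
        raise ValueError("expected 0 <= alpha < beta <= 1")
    if not error:
        return decoded
    g_adaptive, g_noise = previous
    if prev_unvoiced:
        # Scale the adaptive gain down more strongly than the noise gain,
        # which weakens the pitch periodicity of the excitation.
        return (alpha * g_adaptive, beta * g_noise)
    return (g_adaptive, g_noise)
```

Because α is smaller than β, the periodic component from the adaptive codebook shrinks relative to the noise component, which is what suppresses the buzzer-like sound in unvoiced segments.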
[0027]
The time series vectors from the adaptive codebook 17 and the noise codebook 18 are weighted by the respective gains output from the gain repair means 35 and added together, and the sum is supplied to the synthesis filter 16 as the drive excitation vector to obtain the reproduced speech. The reproduced speech power control means 36 suppresses the power of the reproduced speech according to the code transmission error detection status, for example by gradually increasing the amount of suppression as errors continue. When a code transmission error is detected, the speech state estimation means 37 determines whether the current frame is speech or silence from the past reproduced speech and the linear prediction parameter code of the current frame, and outputs the result to the noise generation means 38.
[0028]
The noise generation means 38 generates noise according to the code transmission error detection status and the speech/silence determination result input from the speech state estimation means 37, for example in frames in which an error is detected and which are estimated to contain speech. The noise power control means 39 controls the noise power, for example by gradually increasing the power at the start of noise superposition and gradually decreasing it at the end. FIG. 4 shows an example of the power control processing for the reproduced speech and the noise signal. The reproduced speech output from the reproduced speech power control means 36 and the noise output from the noise power control means 39 are added to obtain the output speech S3.
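A minimal Python sketch of the two power controls and the final addition follows. The geometric attenuation factor, the linear ramp step, and the noise level are all hypothetical values chosen for illustration; FIG. 4 of the patent is not reproduced here, so the exact curves are an assumption.

```python
import numpy as np

def speech_gain(consecutive_errors, step=0.7):
    """Attenuate the reproduced speech more strongly as errors continue."""
    return step ** consecutive_errors

def noise_gain(prev_gain, noise_active, ramp=0.25):
    """Ramp the noise gain toward 1 while noise is wanted, toward 0 otherwise."""
    target = 1.0 if noise_active else 0.0
    if prev_gain < target:
        return min(target, prev_gain + ramp)
    return max(target, prev_gain - ramp)

def mix_frame(speech, noise, consecutive_errors, prev_noise_gain,
              speech_present, noise_level=0.1):
    """Add power-controlled speech and noise; return the output and new noise gain."""
    noise_active = consecutive_errors > 0 and speech_present
    g_n = noise_gain(prev_noise_gain, noise_active)
    out = (speech_gain(consecutive_errors) * np.asarray(speech, dtype=float)
           + g_n * noise_level * np.asarray(noise, dtype=float))
    return out, g_n
```

Carrying the noise gain from frame to frame is what makes the noise fade in and out instead of switching abruptly, and gating it on the speech estimate keeps silent portions free of added noise.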
[0029]
According to the first embodiment, when it is detected that the code has not been correctly decoded, the state of the reproduced sound of the previous frame is determined, and the sound of the current frame is reproduced and restored according to this determination. As a result, it is possible to reproduce audio with little quality degradation due to transmission path errors without increasing the delay. In addition, when a code error occurs, the audio quality on hearing can be maintained by superimposing noise on the reproduced sound according to the state of the reproduced sound.
[0030]
Embodiment 2.
In Embodiment 1 described above, whenever a transmission error occurs, the linear prediction parameter repair means 30 derives from the linear prediction parameters of the previous frame a set of linear prediction parameters whose spectrum has its high-band formant peaks suppressed, and uses them as the coefficients of the synthesis filter 16. Instead, the processing may be applied selectively according to the state: for example, the suppression may be performed only when the spectrum represented by the linear prediction parameters of the previous frame has a sharp formant peak in the high band, and in other cases the linear prediction parameters of the previous frame may be used as they are.
[0031]
According to Embodiment 2, when the linear prediction parameters are repaired, the suppression is performed only when there is a sharp formant peak in the high band and perceived quality is therefore likely to degrade; in other cases the linear prediction parameters of the previous frame are reused. The continuity of the synthesized speech is thus improved without generating abnormal sounds, and the perceived speech quality can be improved.
[0032]
Embodiment 3.
In Embodiment 1 described above, when a transmission error is detected and the previous frame is unvoiced, the gain repair means 35 suppresses the pitch periodicity of the drive excitation vector by adjusting the gains. Instead, the pitch periodicity may be suppressed by other means, for example by not using the time series vector from the adaptive codebook. According to Embodiment 3, the pitch periodicity is suppressed simply by switching between using and not using the time series vector from the adaptive codebook, which is easier to implement than suppressing it by adjusting the gains.
[0033]
Embodiment 4.
In Embodiment 1 described above, the speech state estimation means 37 determines whether the current frame is speech or silence from the past reproduced speech and the linear prediction parameter code of the current frame. Instead, the adaptive code and the gain code may also be used as parameters for the determination. According to Embodiment 4, the speech/silence determination uses not only the past reproduced speech and the linear prediction parameter code but also other information, so a more accurate determination is possible.
[0034]
Embodiment 5.
In Embodiment 1 described above, the repair processing and the noise superimposition processing for transmission errors are applied to CELP speech coding. They may instead be applied to other speech coding schemes, including MBE speech coding, and the perceived quality of the reproduced speech under transmission errors can be improved in the same way.
[0035]
[Effects of the invention]
As detailed above,
[0036]
According to the present invention, when it is detected by the decoding method that the code has not been correctly decoded, the state of the reproduced sound of the previous frame is determined, and the sound of the current frame is reproduced and restored according to this determination. As a result, it is possible to perform an effective reproduction and repair process without increasing the delay, and it is possible to reproduce audio with less quality degradation due to transmission path errors.
[0037]
According to the present invention, when the reproduced speech of the previous frame is a voiced rise, pitch information for the current frame is obtained from the information of past frames, and the speech of the current frame is reproduced and repaired based on that pitch information. Therefore, even if the pitch information transmitted in the previous frame is unsuitable for the current frame, the reproduced speech can be generated with a pitch period suited to the current frame, and speech with little quality degradation due to transmission errors can be reproduced.
[0038]
Further, according to the present invention, when the reproduced speech of the previous frame is unvoiced, the speech of the current frame is reproduced and repaired so as to suppress the pitch periodicity. Unnatural periodicity in unvoiced portions can therefore be avoided, and speech with little quality degradation due to transmission errors can be reproduced.
[0039]
Further, according to the present invention, when it is detected that the code has not been correctly decoded, the current frame is maintained so that the low-band formant structure of the previous frame spectrum is maintained and the high-band formant peak is suppressed. By reproducing and restoring the sound, it is possible to reduce high-frequency noise while maintaining the continuity of the reproduced sound, and it is possible to reproduce the sound with less quality deterioration due to transmission path errors.
[0040]
Further, according to the present invention, the speech decoding method detects that a code could not be decoded correctly and superimposes noise on the reproduced speech according to the detection state, which smooths the transition between reproduced speech and noise. In addition, noise is output only in portions that originally contained sound and not in silent portions, so the perceived speech quality can be maintained.
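The noise power control can be sketched as a per-frame gain that rises while errors coincide with estimated speech and falls gradually otherwise. The function name `next_noise_gain`, the linear ramp, and the values of `target` and `step` are assumptions chosen for illustration; the embodiment's actual control (FIG. 4) is not reproduced here.

```python
def next_noise_gain(gain, error_detected, speech_present,
                    target=1.0, step=0.25):
    """Update the gain applied to the superimposed noise for one frame.

    gain:           noise gain used in the previous frame.
    error_detected: True if the code of this frame was not decoded correctly.
    speech_present: True if the state estimation judges the frame as sounded.
    The gain ramps up toward target only while errors occur in sounded
    portions, and ramps down toward zero in all other cases.
    """
    if error_detected and speech_present:
        return min(target, gain + step)
    return max(0.0, gain - step)


def superimpose(speech, noise, gain):
    """Add gain-scaled noise to the reproduced speech, sample by sample."""
    return [s + gain * n for s, n in zip(speech, noise)]
```

Starting from a gain of zero, a run of erroneous frames in a silent portion leaves the gain at zero, so no noise is output there, while a silent portion that follows a noisy sounded portion sees the noise fade out over several frames rather than stop abruptly.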
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a first embodiment of a speech decoding method according to the present invention.
FIG. 2 is a schematic diagram for explaining the operation of the spectrum restoration processing in the first embodiment.
FIG. 3 is a signal waveform diagram for explaining the operation of the adaptive code restoration processing in the first embodiment.
FIG. 4 is a schematic diagram for explaining an example of the power control processing of the reproduced speech and noise signals in the first embodiment.
FIG. 5 is a block diagram showing the overall configuration of a conventional CELP speech coding/decoding method.
[Explanation of symbols]
2 Decoding unit
4 Separation unit
15 Linear prediction parameter decoding unit
16 Synthesis filter
17 Adaptive codebook
18 Noise codebook
19 Gain decoding unit
30 Linear prediction parameter restoration unit
31 Rising judgment unit
32 Pitch extraction unit
33 Adaptive code restoration unit
34 Silence determination means
35 Gain restoration means
36 Reproduction voice power control means
37 Voice state estimation means
38 Noise generation means
39 Noise power control means

Claims

A speech decoding method for decoding encoded speech spectrum information and sound source information in units of frames and reproducing speech, characterized in that, when it is detected that a code could not be decoded correctly, the state of the reproduced speech of the previous frame is determined and, when the reproduced speech of the previous frame is voiced, pitch information for the current frame is obtained from the information of past frames by selecting the pitch information corresponding to a pitch period that was correctly transmitted with high frequency in the past, and the speech of the current frame is reproduced and repaired based on that pitch information.

A speech decoding method for decoding encoded speech spectrum information and sound source information in units of frames and reproducing speech, characterized in that, when it is detected that a code could not be decoded correctly, the speech of the current frame is reproduced and repaired so as to obtain a spectrum in which the low-band formant structure of the spectrum of the previous frame is maintained and the high-band formant peaks are suppressed.

A speech decoding method for decoding encoded speech information and reproducing speech, characterized in that, when it is detected that a code could not be decoded correctly, the state of the speech to be reproduced is estimated, noise is superimposed on the reproduced speech according to the estimation result, and the power of the noise is controlled so that it is gradually reduced in a silent portion at the end of the superimposition.

The speech decoding method according to claim 3, wherein, when it is detected that a code could not be decoded correctly, the state of the speech to be reproduced is estimated, noise is superimposed on the reproduced speech according to the estimation result, and the power of the noise is controlled so that it is gradually reduced in a silent portion at the end of the superimposition, so that no noise is output in silent portions.