JP3612260B2

JP3612260B2 - Speech encoding method and apparatus, and speech decoding method and apparatus

Info

Publication number: JP3612260B2
Application number: JP2000054994A
Authority: JP
Inventors: 勝美土谷; 公生三関; 皇天田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2000-02-29
Filing date: 2000-02-29
Publication date: 2005-01-19
Anticipated expiration: 2020-02-29
Also published as: JP2001242899A

Description

【０００１】
【発明の属する技術分野】
本発明は、電話帯域の音声、広帯域音声及びオーディオ信号等の音声信号の圧縮符号化方法及び装置並びに復号方法及び装置に関する。
【０００２】
【従来の技術】
低ビットレートでも比較的高音質の音声を再生できる音声符号化方式として、ＣＥＬＰ（Ｃｏｄｅ Ｅｘｃｉｔｅｄ Ｌｉｎｅａｒ Ｐｒｅｄｉｃｔｉｏｎ）方式が知られている。ＣＥＬＰ方式の詳細は例えばM. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates," in Proc. ICASSP '85, pp. 937-939, 1985（文献１）に示されている。ＣＥＬＰ方式の構成を図１５に示す。図１５に示されるように、ＣＥＬＰ方式では聴覚重みフィルタを用いて符号化による音声に混入する雑音（符号化雑音）の評価を行い、符号化雑音が現フレームの音声のスペクトルから決まる形状のマスキング特性にマスクされる原理（同時マスキング）を用いて雑音が聞こえにくくなるような音源の符号を選択することを特徴としている。一般に、ＣＥＬＰに用いる聴覚重みフィルタはホルマント重みフィルタとピッチ重みフィルタの縦続接続で構成される。ホルマント重みフィルタは入力音声のホルマントによるマスキング特性を利用し、ピッチ重みフィルタは入力音声の調和構造（ハーモニクス）によるマスキング特性を利用している。聴覚重みフィルタの伝達関数Ｗ（ｚ）は、ホルマント重みフィルタの伝達関数Ｗｓ（ｚ）及びピッチ重みフィルタの伝達関数Ｗｐ（ｚ）を用いて
【０００３】
【数１】
$$W(z) = W_s(z)\,W_p(z)$$

【０００４】
と表される。ピッチ重みフィルタはピッチ調和周波数成分に小さな重み、調和周波数間の成分に大きな重みをそれぞれかけることにより、符号化雑音のスペクトルを入力音声と同じピッチの調和構造に整形する働きをする。ここで、ピッチ重みフィルタの伝達関数Ｗｐ（ｚ）はピッチ周期Ｔ０及びピッチ予測により求められたピッチ予測係数βｉを用いて
【０００５】
【数２】
$$W_p(z) = 1 - \gamma \sum_{i=-M}^{M} \beta_i\, z^{-(T_0+i)}$$

【０００６】
と表される。ただし、Ｍはピッチ予測次数を制御する定数、γは雑音整形の度合を制御する定数である。
【０００７】
このようにして求めたピッチ重みフィルタの周波数特性を図１６に示す。図１６において、ピッチ重みフィルタの周波数特性はＷ（ｆ）、音声の周波数特性はＳ（ｆ）で表される。この図からも分かるように、ピッチ重みフィルタはピッチ調和周波数では谷の特性を持ち、調和周波数間では山の特性を持つ。従って、符号化雑音をピッチ重みフィルタで重み付けを行うことにより、音声のピッチ調和周波数では小さな重みを付け、逆に調和周波数間では大きな重みを付けて評価することができる。
【０００８】
このようにフレーム内で周波数毎の相対的な重み付けを用いて、音源の符号選択を行うことにより、符号化により生じる符号化雑音のスペクトルを図１６のＥ（ｆ）に示すように音声と同じピッチ周期の調和構造にすることができる。こうすると、符号化雑音は音声のスペクトルの凹凸にマスクされて聞こえにくいものとなる。このようにピッチ重みフィルタは比較的簡単な分析により得られ、かつ、主観的な符号化雑音を抑えた音声符号化を行うことができるため、ＣＥＬＰで用いられてきた。
【０００９】
また、ＣＥＬＰ方式では復号音声の主観品質を向上させるために、音声を復号した後にポストフィルタが用いられることが多い。一般に、ＣＥＬＰに用いるポストフィルタはホルマント強調フィルタとピッチ強調フィルタの縦続接続で構成される。ポストフィルタ伝達関数Ｈｐｆ（ｚ）は、ホルマント強調フィルタの伝達関数Ｈｓ（ｚ）及びピッチ強調フィルタの伝達関数Ｈｐ（ｚ）を用いて
【００１０】
【数３】
$$H_{pf}(z) = H_s(z)\,H_p(z)$$

【００１１】
と表される。ここで、ピッチ強調フィルタの伝達関数Ｈｐ（ｚ）はピッチ周期Ｔ０及びピッチ予測係数λを用いて、
【００１２】
【数４】

【００１３】
と表される。ただし、λはピッチ強調の度合を制御する定数である。
【００１４】
【発明が解決しようとする課題】
しかし、実際の音声は帯域によって調和構造の強さが異なっており、図１７のＳ（ｆ）のように調和構造が弱い帯域が存在することもある。従来のピッチ重みフィルタを用いたピッチ重み付けでは、図１７のＷ（ｆ）のように全帯域で整形の強さが同じであるピッチ重みフィルタを使用するためにＥ（ｆ）に示される符号化雑音の調和構造と入力音声の調和構造とが異なり、復号音声の音質が劣化するという問題があった。
【００１５】
また、ポストフィルタ処理におけるピッチ強調においても同様で、式５に示す伝達関数のフィルタを用いた従来のピッチ強調では、全帯域でピッチ強調の強さが同じであるためピッチ強調の不要な帯域に対してもピッチ強調が行われ、復号音声の音質が劣化するという問題があった。
【００１６】
本発明は、このような問題点を解消し、図１８に示すように、符号化雑音の調和構造を入力音声の調和構造に近づけることで復号音声の音質を向上させる音声符号化及び復号方法並びに音声符号化及び復号化装置を提供することを目的とする。
【００１７】
【課題を解決するための手段】
第１の本発明は、入力音声情報信号とこの入力音声情報信号に対応する合成音声情報信号との差を表す誤差信号を生成し、周波数に従って前記誤差信号に対するピッチ重み付けの度合いを変えて重み付け信号を生成し、この重み付け信号に基づきインデックス情報を生成することを特徴とする音声符号化方法を提供する。
【００１８】
このようにピッチ重み付けの度合を周波数によって変化させることにより、各周波数に適したピッチ重み付けを行い、符号化雑音の調和構造を各周波数で制御することが可能となり、復号音声の音質を向上させることができる。
【００１９】
また、第２の発明は、第１の発明に係る音声符号化方法において、入力音声の特性に従って各周波数のピッチ重み付けの度合を変化させることを特徴とする音声符号化方法を提供する。
【００２０】
このように、各周波数のピッチ重み付けの度合を入力信号の特性に従って変化させることにより、符号化雑音の調和構造を入力音声の調和構造に対応して変化させることが可能となり、復号音声の音質を向上させることができる。
【００２１】
また、第３の発明は、第２の発明に係る音声符号化方法において、入力音声を分析して各周波数の有声度を求め、有声度に従って各周波数のピッチ重み付けの度合を変化させることを特徴とする音声符号化方法を提供する。
【００２２】
このように、各周波数のピッチ重み付けの度合を入力信号の各周波数の有声度に従って変化させることにより、符号化雑音の調和構造を入力音声の調和構造に対応して変化させることが可能となり、復号音声の音質を向上させることができる。
【００２３】
また、第４の発明は、第３の発明に係る音声符号化方法において、有声度が高い周波数ではピッチ重み付けの度合を強くし、有声度が低い周波数ではピッチ重み付けの度合を弱くすることを特徴とする音声符号化方法を提供する。
【００２４】
このような重み付けを行うことで、符号化雑音の調和構造を入力音声の調和構造に近づけることができ、復号音声の音質を向上させることができる。
【００２５】
また、第５の発明は、入力音声情報信号とこの入力音声情報信号に対応する合成音声情報信号との差を表す誤差信号を生成し、前記入力音声情報信号を少なくとも２つの周波数帯域に分割し、該周波数帯域毎に前記誤差信号に対するピッチ重み付けの度合いを変えて重み付け信号を生成し、この重み付け信号に基づきインデックス情報を生成することを特徴とする音声符号化方法を提供する。
【００２６】
このように、ピッチ重み付けの度合を帯域毎に変化させることにより、各帯域に適したピッチ重み付けを行うことができ、符号化雑音の調和構造を帯域毎に制御し、復号音声の音質を向上させることができる。
【００２７】
また、第６の発明は、第５の発明に係る音声符号化方法において、入力音声を分析して各帯域の有声度を求め、有声度に従って各帯域のピッチ重み付けの度合を変化させることを特徴とする音声符号化方法を提供する。
【００２８】
このように、各帯域のピッチ重み付けの度合を入力信号の各帯域の有声度に従って変化させることにより、符号化雑音の調和構造を入力音声の調和構造に対応して変化させることができ、復号音声の音質を向上させることができる。
【００２９】
また、第７の発明は、第６の発明に係る音声符号化方法において、有声度が高い帯域ではピッチ重み付けの度合を強くし、有声度が低い帯域ではピッチ重み付けの度合を弱くすることを特徴とする音声符号化方法を提供する。
【００３０】
このような重み付けを行うことで、符号化雑音の調和構造を入力音声の調和構造に近づけることができ、復号音声の音質を向上させることができる。
【００３１】
また、第８の発明は、第５の発明に係る音声符号化方法において、入力音声を分析して各帯域の有声／無声判定を行い、有声と判定された帯域に対してはピッチ重み付けを行い、無声と判定された帯域に対してはピッチ重み付けを行わないことを特徴とする音声符号化方法を提供する。
【００３２】
このように、帯域によってピッチ重み付けの度合を変化させることによって符号化雑音の調和構造を入力音声の調和構造に近づけることができるようになり、復号音声の品質を向上させることができる。
【００３３】
ここで、ピッチ重み付けの度合とは、雑音のピッチ整形の強さを指し、雑音のピッチ整形の強さは、例えば、ピッチ重みフィルタのフィルタ係数によって制御することができる。
【００３４】
また、第９の発明は、符号化音声情報からインデックス情報を抽出し、このインデックス情報に基づき復号音声信号を生成し、周波数に応じてピッチ強調の度合を変化させて前記復号音声信号にピッチ強調処理を行うことを特徴とする音声復号方法を提供する。
【００３５】
このように、ポストフィルタのピッチ強調の度合を周波数によって変化させることにより、各周波数に適したピッチ強調を行うことができ、復号音声の品質を向上させることができる。
【００３６】
また、第１０の発明は、第９の発明に係る音声復号方法において、復号音声の特性に従って各周波数のピッチ強調の度合を変化させることを特徴とする音声復号方法を提供する。
【００３７】
このように、復号音声の特性に従って各周波数のピッチ強調の度合を変化させることで、復号音声にあったピッチ強調を行うことができる。
【００３８】
また、第１１の発明は、第１０の発明に係る音声復号方法において、復号音声の各周波数の有声度に従って各周波数のピッチ強調の度合を変化させることを特徴とする音声復号方法を提供する。
【００３９】
また、第１２の発明は、第１１の発明に係る音声復号方法において、有声度が高い周波数ではピッチ強調の度合を強くし、有声度が低い周波数ではピッチ強調の度合を弱くすることを特徴とする音声復号方法を提供する。
【００４０】
また、第１３の発明は、符号化音声情報からインデックス情報を抽出し、このインデックス情報に基づき復号音声信号を生成し、前記復号音声信号を少なくとも２つの周波数帯域に分割し、周波数帯域毎にピッチ強調の度合を変化させて前記復号音声信号にピッチ強調処理を行うことを特徴とする音声復号方法を提供する。
【００４１】
また、第１４の発明は、第１３の発明に係る音声復号方法において、復号音声の各帯域の有声度に従って各帯域のピッチ強調の度合を変化させることを特徴とする音声復号方法を提供する。
【００４２】
また、第１５の発明は、第１４の発明に係る音声復号方法において、有声度が高い帯域ではピッチ強調の度合を強くし、有声度が弱い帯域ではピッチ強調の度合を弱くすることを特徴とする音声復号方法を提供する。
【００４３】
また、第１６の発明は、第１３の発明に係る音声復号方法において、復号音声の各帯域の有声／無声判定を行い、有声と判定された帯域に対してはピッチ強調を行い、無声と判定された帯域に対してはピッチ強調を行わないことを特徴とする音声復号方法を提供する。
【００４４】
この第１６の発明によれば、必要な帯域に対してのみピッチ強調を行うことができるので、復号音声の品質を向上させることができる。
【００４５】
ここで、ピッチ強調の度合とは、復号音声のピッチ整形の強さを指し、ピッチ整形の強さは、例えば、ピッチ強調フィルタのフィルタ係数によって制御することができる。
【００４６】
また、第１７の発明は、入力音声情報信号とこの入力音声情報信号に対応する合成音声情報信号との差を表す誤差信号を生成する合成フィルタ手段と、周波数に従って前記誤差信号に対するピッチ重み付けの度合いを変えて重み付け信号を生成する重み付けフィルタ手段と、この重み付け信号に基づきインデックス情報を生成するインデックス情報発生手段とにより構成されることを特徴とする音声符号化装置を提供する。
【００４７】
また、第１８の発明は、入力音声情報信号とこの入力音声情報信号に対応する合成音声情報信号との差を表す誤差信号を生成する合成フィルタ手段と、前記入力音声情報信号を少なくとも２つの周波数帯域に分割する帯域分割手段と、該周波数帯域毎に前記誤差信号に対するピッチ重み付けの度合いを変えて重み付け信号を生成する重み付けフィルタ手段と、この重み付け信号に基づきインデックス情報を生成するインデックス情報発生手段とにより構成されることを特徴とする音声符号化装置を提供する。
【００４８】
また、第１９の発明は、符号化音声情報からインデックス情報を抽出する分離手段と、このインデックス情報に基づき復号音声信号を生成する合成フィルタ手段と、周波数に応じてピッチ強調の度合を変化させて前記復号音声信号にピッチ強調処理を行うポストフィルタ手段とで構成されることを特徴とする音声復号装置を提供する。
【００４９】
また、第２０の発明は、符号化音声情報からインデックス情報を抽出し、このインデックス情報に基づき復号音声信号を生成する合成フィルタ手段と、前記復号音声信号を少なくとも２つの周波数帯域に分割し、周波数帯域毎にピッチ強調の度合を変化させて前記復号音声信号にピッチ強調処理を行うポストフィルタ手段とにより構成されることを特徴とする音声復号装置を提供する。
【００５０】
【発明の実施の形態】
（第１の実施形態）
本発明の音声符号化法をＣＥＬＰ方式に適用した第１の実施形態について説明する。ＣＥＬＰ方式の符号化は、音声のスペクトル包絡情報の符号化と音源信号の符号化に大きく分けることができる。聴覚重みフィルタは音源信号の符号化に用いる。ＣＥＬＰ方式ではフレーム単位に音声の分析・符号化を行う。方式によっては、フレームをさらに小さなサブフレームに分割し、サブフレーム毎に音源信号の符号化を行う方法もあるが、ここでは説明の簡単のために音源信号の符号化もフレーム単位で行うことにする。
【００５１】
図１に、本実施形態に係る音声符号化方法を適用した音声符号化システムの構成を示す。この音声符号化システムによると、入力音声１００は、線形予測係数１０１を計算する線形予測分析部１０並びに帯域分割部の高域通過フィルタ２０及び低域通過フィルタ２１に入力される。高域通過フィルタ２０及び低域通過フィルタ２１の出力は各帯域のピッチ重みフィルタ係数１１２、１１３を求めるピッチ重みフィルタ係数算出部２２，２３にそれぞれ接続される。ピッチ重みフィルタ係数算出部２２，２３の出力は聴覚重み付けフィルタ３３のピッチ重みフィルタ２９，３０にそれぞれ接続される。
【００５２】
線形予測分析部１０の出力は線形予測係数１０１を符号化する線形予測係数符号化部１７及び入力音声１００と復号音声１０７の差信号１０８にホルマント重み付けを行うホルマント重みフィルタ２５に接続される。線形予測係数符号化部１７の出力は駆動音源１０５から復号音声１０７を生成する合成フィルタ１８及びマルチプレクサ３４に接続される。ホルマント重みフィルタ２５の出力は高域通過フィルタ２６及び低域通過フィルタ２７を介してピッチ重みフィルタ２９，３０にそれぞれ接続される。帯域分割されたホルマント重み付きの差信号１１５、１１６にピッチ重み付けを行うピッチ重みフィルタ２９，３０の出力は加算器３１に入力され、この加算器３１の出力は歪み計算部３２に接続される。この歪み計算部３２の出力は音声のピッチ周期成分を符号化するための適応符号帳１１，音声のピッチ周期以外の成分を符号化するための雑音符号帳１２及び適応符号帳１１から出力された適応符号ベクトル１０２及び雑音符号帳１２から出力された雑音符号ベクトル１０３のゲインを符号化するためのゲイン符号帳１３に接続されると共にマルチプレクサ３４に接続される。
【００５３】
適応符号帳１１及び雑音符号帳１２の出力はゲイン符号帳１３の出力と共にゲイン乗算器１４，１５にそれぞれ接続される。ゲイン乗算器１４，１５の出力は加算器１６に接続され、この加算器１６の出力は線形予測係数符号化部１７の出力と共に合成フィルタ１８に接続される。この合成フィルタ１８の出力は入力音声と共に加算器１９に入力される。加算器１９の出力はホルマント重みフィルタ２５に接続される。
【００５４】
即ち、この実施形態では、図１５に示す従来の音声符号化システムに対して更に高域成分を求める高域通過フィルタ２０及び２６、低域成分を求める低域通過フィルタ２１及び２７が追加されている。この構成において、帯域毎に算出されたピッチ重み係数１１２及び１１３を用いてピッチ重み付けを行う点が大きく異る。
【００５５】
この音声符号化システムでは、まず入力音声１００が５〜２０ｍｓ程度の一定間隔のフレーム単位に分割されて入力される。フレーム単位の入力音声は線形予測分析部１０に入力され、その周波数スペクトルの包絡形状を表す線形予測係数１０１が計算される。線形予測係数１０１は線形予測係数符号化部１７で符号化された後、合成フィルタ１８にフィルタ係数１０６として与えられる。また、線形予測係数１０１はホルマント重み付けを行うためにホルマント重みフィルタ２５にも供給される。
【００５６】
線形予測係数１０１の符号化の後、音源信号の符号化が行われる。音源信号の符号化では、適応符号帳１１から選択された適応符号ベクトル１０２と雑音符号帳１２から選択された雑音符号ベクトル１０３の各々にゲイン符号帳１３から選択されたゲイン１０４が乗じられて足し合わされることによって駆動音源１０５が生成される。このようにして生成された駆動音源１０５は、線形予測係数符号化部１７の出力により特徴づけられた合成フィルタ１８に入力され復号音声１０７が生成される。
【００５７】
入力音声１００と復号音声１０７の差信号１０８が計算される。差信号１０８は、先ず、ホルマント重みフィルタ２５に入力され、ホルマント重み付けが行われる。ホルマント重みフィルタ２５は、線形予測分析部１０で求められた線形予測係数１０１から算出されるホルマント重みフィルタ係数により特徴づけられる。例えば、ホルマント重みフィルタの伝達関数Ｗｓ（ｚ）は、線形予測分析部１０で求められたＬＰＣ係数から構成される予測フィルタの伝達関数Ａ（ｚ）を用いて
【００５８】
【数５】
$$W_s(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}$$

【００５９】
と表される。定数γ１，γ２の値としては、例えばγ１＝０．９、γ２＝０．４を用いることができる。なお、γ１，γ２はこの値に限定される必要はなく、異なる値を用いても良い。
【００６０】
次に、ホルマント重み付けされた差信号１１４は高域通過フィルタ２６及び低域通過フィルタ２７に入力され、２つの帯域に分割された後、各帯域のピッチ重みフィルタ２９、３０に入力される。一方、入力音声１００も高域通過フィルタ２０及び低域通過フィルタ２１に入力され、２つの帯域に分割された後、各帯域成分１１０、１１１はそれぞれピッチ重みフィルタ係数算出部２２、２３に入力される。ピッチ重みフィルタ係数算出部２２、２３では、入力された信号をピッチ予測して、ピッチ予測係数１１２、１１３が算出される。算出されたピッチ予測係数１１２、１１３はピッチ重みフィルタ２９、３０に供給される。
【００６１】
ピッチ重みフィルタでは、各帯域成分に対してそれぞれ異るピッチ重み付けが行われる。ピッチ重みフィルタはピッチ重みフィルタ係数算出部で求められたピッチ重みフィルタ係数によって特徴づけられる。例えば、高域のピッチ重みフィルタの伝達関数ＷＨｐ、及び低域のピッチ重みフィルタの伝達関数ＷＬｐは、ピッチ周期及びピッチ予測係数β_Ｈｉ，β_Ｌｉを用いて、
【００６２】
【数６】

【００６３】
と表される。ただし、Ｍはピッチ予測次数を制御する定数、γは雑音整形の度合を制御する定数である。定数γ_Ｈ，γ_Ｌの値としては、例えばγ_Ｈ＝γ_Ｌ＝０．４を用いることができる。なお、γ_Ｈ，γ_Ｌは別々の値を設定しても構わないし、γ_Ｈ，γ_Ｌを各帯域のピッチ強度Ｓ_Ｈ，Ｓ_Ｌの関数として定義し、ピッチ強度を用いて各帯域毎に制御することもできる。例えば、
【００６４】
【数７】

【００６５】
と定義することができる。ただし、ζ_Ｈ，ζ_Ｌは定数である。また、ピッチ強度Ｓ_Ｈ，Ｓ_Ｌは予測係数β_Ｈｉ，β_Ｌｉを用いて
【００６６】
【数８】

【００６７】
と定義することができる。ただし、ピッチ強度Ｓ_Ｈ，Ｓ_Ｌは上式に限定されず、信号のピッチ周期の強さを示すパラメータであれば良い。
【００６８】
次に、ピッチ重み付けされた高域成分１１７及び低域成分１１８は加算部３１で加算され、歪み計算部３２に入力される。歪み計算部３２では、歪みが最小となる適応符号ベクトル、雑音符号ベクトル及びゲインベクトルが選択され、これらのベクトルを表すインデックスがマルチプレクサ３４に入力される。また、マルチプレクサ３４には歪み計算部３２から入力されるインデックスとともに、線形予測係数符号化部１７からも線形予測係数を符号化して得られるインデックスが入力される。マルチプレクサ３４では、入力されたインデックスから符号化ビットストリーム１２２が生成され、この符号化ビットストリーム１２２が伝送路または蓄積媒体を経て復号側に伝送される。
【００６９】
上述したように、本実施形態では帯域毎にピッチ重み付けの度合を制御できるので、入力音声が図２のＳ（ｆ）に示す周波数特性を持つ場合でも、低域ではピッチ重み付けの度合を強くし、高域ではピッチ重み付けの度合を弱くすることで、符号化雑音の周波数特性を図２のＥ（ｆ）のような形にすることができる。このように、符号化雑音の調和構造を入力音声の調和構造に近づけることが可能となり、復号音声の音質を向上させることができる。
【００７０】
（第２の実施形態）
本発明の音声符号化法をＣＥＬＰ方式に適用した第２の実施形態について説明する。図３に本実施形態に係る音声符号化方法を適用した音声符号化システムの構成を示す。図３に示される本実施形態の音声符号化システムは、図１に示した第１の実施形態の音声符号化システムに有声／無声判定部４０、４１と切り替え部４４、４５が追加された構成となっている。図３において図１と同一の番号が付されている部分は同じ動作をするものとして、ここでは本実施形態の特徴的な部分を中心に説明する。
【００７１】
本実施形態では、高域と低域に分割された入力音声は、それぞれ各帯域の有声／無声判定部４０、４１とピッチ重みフィルタ係数算出部２２、２３に入力され、有声／無声判定部４０、４１では入力された帯域制限された信号１１０、１１１を分析して、その帯域の信号が有声であるか無声であるかを判定する。有声／無声の判定は、例えばＩＭＢＥ（Improved Multi-Band Excitation vocoder）で用いられているアルゴリズムを使用することで実現できる。なお、ＩＭＢＥの詳細は、例えばD. W. Griffin and J. S. Lim, "Multiband Excitation Vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-36, pp. 1223-1235, Aug. 1988（文献２）に示されている。有声／無声の判定結果はピッチ重みフィルタ係数算出部２２、２３と切り替え部４４、４５に送られる。
【００７２】
有声／無声の判定結果１４０、１４１が有声の場合、ピッチ重みフィルタ係数算出部２２、２３では入力信号を分析して、ピッチ重みフィルタ係数１１２、１１３が算出され、ピッチ重みフィルタ係数がピッチ重みフィルタに入力される。逆に、有声／無声の判定結果１４０、１４１が無声の場合、ピッチ重みフィルタ係数算出部２２、２３ではピッチ重みフィルタ係数１１２、１１３の算出は行われない。
【００７３】
一方、切り替え部４４、４５では有声／無声の判定結果１４２、１４３に従って、出力の切り替えが行われる。有声／無声の判定結果が有声の場合、切り替え部の出力はピッチ重みフィルタ２９，３０に入力される。逆に、有声／無声の判定結果が無声の場合、切り替え部の出力はそのまま加算部４６、４７に入力される。このようにして各帯域でピッチ重み付けの有／無が制御される。
【００７４】
ピッチ重み付けされた高域成分及び低域成分は加算部３１で加算され、歪み計算部３２に入力される。歪み計算部３２では、歪みが最小となる適応符号ベクトル、雑音符号ベクトル及びゲインベクトルが選択され、これらのベクトルを表すインデックスがマルチプレクサ３４に入力される。
【００７５】
また、マルチプレクサ３４には歪み計算部３２から入力されるインデックスとともに、線形予測係数符号化部１７からも線形予測係数を符号化して得られるインデックスが入力される。マルチプレクサ３４では、入力されたインデックスから符号化ビットストリーム１２２が生成され、この符号化ビットストリーム１２２が伝送路または蓄積媒体を経て復号側に伝送される。
【００７６】
上述したように、本実施形態では帯域毎にピッチ重み付けの有／無を制御できるので、入力音声が図４のＳ（ｆ）に示す周波数特性を持つ場合でも、低域のみピッチ重み付けを行い、高域ではピッチ重み付けを行わないようにすることで、符号化雑音の周波数特性を図４のＥ（ｆ）のような形にすることができる。このように、符号化雑音の調和構造を入力音声の調和構造に近づけることが可能となり、復号音声の音質を向上させることができる。
【００７７】
なお、本発明の第２の実施形態は帯域毎にピッチ重み付けの有／無の制御を行う部分が特徴的な部分であり、帯域毎にピッチ重み付けの有／無の制御が行えるような構成であれば良く、図３の構成に限定されない。例えば、図５に示すように、図３から切り替え部４４、４５を取り除いた構成で、ピッチ重みフィルタ係数算出部２２，２３において、有声／無声判定結果に基づいてピッチ重みフィルタ係数を求めるように変更することもできる。
【００７８】
ここで、無声の場合はピッチ重み付けを行わないピッチ重みフィルタ係数を出力するようにしておくことで、ピッチ重み付けの有／無の切り替えと同様の操作を行うことができる。
【００７９】
（第３の実施形態）
本発明の音声符号化法をＣＥＬＰ方式に適用した第３の実施形態について説明する。図６に本実施形態に係る音声符号化方法を適用した音声符号化システムの構成を示す。この音声符号化システムは、図１５に示す従来のＣＥＬＰ方式と異なって、聴覚重み付け部分にピッチ重み制御フィルタ６０、６１、加算部６２及び減算部６３が追加された構成となっている。なお、ここでは本実施形態の特徴的な部分を中心に説明する。
【００８０】
ホルマント重み付けされた差信号１１４はピッチ重みフィルタ５０、ピッチ重み制御フィルタ６１及び減算部６３に入力される。ピッチ重みフィルタ５０ではホルマント重み付けされた差信号１１４に対してピッチ重み付けが行われ、処理された信号１５１がピッチ重み制御フィルタ６０に入力される。ピッチ重み制御フィルタ６０では入力された信号１５１をフィルタ処理した後、信号１５２として加算部６２に供給する。
【００８１】
一方、減算部６３では、ホルマント重み付けされた差信号１１４とホルマント重み付けされた差信号１１４をピッチ重み制御フィルタ６１でフィルタ処理した信号１５３の差信号１５４が求められ、この信号１５４が加算部６２に入力される。加算部６２では入力された２つの信号が加算され、加算された信号１５５が歪み計算部３２に入力される。歪み計算部３２では、歪みが最小となる適応符号ベクトル、雑音符号ベクトル及びゲインベクトルが選択され、これらのベクトルを表すインデックスがマルチプレクサ３４に入力される。また、マルチプレクサ３４には歪み計算部３２から入力されるインデックスとともに、線形予測係数符号化部１７からも線形予測係数を符号化して得られるインデックスが入力される。マルチプレクサ３４では、入力されたインデックスから符号化ビットストリーム１２２が生成され、この符号化ビットストリーム１２２が伝送路または蓄積媒体を経て復号側に伝送される。
【００８２】
第３の本実施形態では、ピッチ重み制御フィルタ６０、６１は周波数に対してピッチの重み付けの度合を滑らかに変化させる役割をしている。例えば、ピッチ重みフィルタの周波数特性が図７のＷｐ（ｆ）で表され、ピッチ重み制御フィルタの周波数特性が図８のＨ（ｆ）で表されるような低域通過特性となるとき、変形ピッチ重み付けフィルタの周波数特性は図９のＷ（ｆ）のように周波数が高くなるに従ってピッチ重み付けの度合が弱くなっている。このような重み付けを行った場合、符号化により生じる符号化雑音のスペクトルは図９のＥ（ｆ）に示すように周波数が高くなるに従って調和構造が弱くなる。また、ピッチ重みフィルタの周波数特性が図７のＷｐ（ｆ）で表され、ピッチ制御フィルタの周波数特性が図１０のＨ（ｆ）で表されるような特性となるとき、変形ピッチ重み付けフィルタの周波数特性は図１１のＷ（ｆ）のように中域の周波数でピッチ重み付けの度合が弱くなっている。このような重み付けを行った場合、符号化により生じる符号化雑音のスペクトルは図１１のＥ（ｆ）に示すように中域の周波数で調和構造が弱くなる。
【００８３】
このように、ピッチ重み制御フィルタを用いることで、変形ピッチ重み付けフィルタのピッチ重み付けの度合を周波数で滑らかに変化させることができる。また、入力音声の特性に応じてピッチ重み制御フィルタの特性を変化させることもできる。例えば、入力音声を分析して周波数に対する調和構造の強さを求め、周波数に対する調和構造の強さを基にピッチ重み制御フィルタの特性を決定する。ピッチ制御フィルタの特性を調和構造が弱い周波数を減衰させるような特性にすることで、符号化雑音の調和構造を入力音声の調和構造に近づけることが可能となり、復号音声の音質を更に向上させることができる。
【００８４】
（第４の実施形態）
本発明の音声復号方法をＣＥＬＰ方式に適用した実施形態を説明する。図１２には、第４の実施形態に係る音声復号方法を適用した音声復号システムの構成が示されている。この音声復号システムでは、デマルチプレクサ７０の出力が、適応符号帳１１、雑音符号帳１２及びゲイン符号帳１３並びに線形予測係数復号部７１に接続される。
【００８５】
適応符号帳１１及び雑音符号帳１２の出力はゲイン符号帳１３の出力と共にゲイン乗算部１４、１５にそれぞれ接続される。ゲイン乗算部１４，１５の出力は加算部１６に接続される。この加算部１６の出力は適応符号帳１１に帰還され、更に線形予測係数復号部７１の出力と共に合成フィルタ１８に接続される。線形予測係数復号部７１の出力はポストフィルタ７８に接続される。
【００８６】
ポストフィルタ７８は、ホルマント強調フィルタ７２及び変形ピッチ強調フィルタ７７から構成されており、変形ピッチ強調フィルタ７７はピッチ強調フィルタ７３、ピッチ強調制御フィルタ７４、７５及び加算部７６から構成されている。
【００８７】
この音声復号システムでは、先ず、伝送路または蓄積媒体から得られたビットストリーム１７０がデマルチプレクサ７０に入力される。デマルチプレクサ７０では、入力されたビットストリーム１７０から線形予測係数を表す線形予測係数インデックス１７１、適応符号ベクトルを表す適応符号ベクトルインデックス１７２、雑音符号ベクトルを表す雑音符号ベクトルインデックス１７３、及びゲインベクトルを表すインデックス１７４が分離生成される。これらのインデックスのうち、線形予測係数インデックス１７１は線形予測係数復号部７１に、適応符号ベクトルインデックス１７２は適応符号帳１１に、雑音符号ベクトルインデックス１７３は雑音符号帳１２に、ゲインインデックス１７４はゲイン符号帳１３にそれぞれ入力される。
【００８８】
線形予測係数復号部７１では、入力された線形予測係数インデックス１７１から線形予測係数が復号され、これが合成フィルタ１８にフィルタ係数として与えられる。また、適応符号ベクトルインデックス１７２に従って適応符号帳１１から適応符号ベクトル１０２が選択され出力される。また、雑音符号ベクトルインデックス１７３に従って雑音符号帳１２から雑音符号ベクトル１０３が選択され出力される。
【００８９】
さらに、ゲインインデックス１７４に従ってゲイン符号帳１３から適応符号ベクトル及び雑音符号ベクトルに乗じるべきゲイン１０４が選択され出力される。このゲインが乗算部１４、１５で適応符号ベクトル１０２及び雑音符号ベクトル１０３に乗じられた後、これら２つのベクトルが加算部１６で足し合わされることによって復号残差波形信号１０５が生成され、この信号が駆動音源信号として合成フィルタ１８及び適応符号帳１１に入力される。
【００９０】
線形予測係数復号部７１で復号された線形予測係数により決定された合成フィルタ１８が駆動音源信号により駆動され、復号音声信号１０７が生成される。その後、復号音声１０７の主観品質を向上させるために復号音声１０７に対してポストフィルタ処理が行われる。従来のポストフィルタはホルマント強調フィルタとピッチ強調フィルタの縦続接続で構成されているが、本実施形態におけるポストフィルタ７８はホルマント強調フィルタ７２と変形ピッチ強調フィルタ７７の縦続接続で構成されている。変形ピッチ強調フィルタ７７は図１２に示されるように、ピッチ強調の度合を周波数毎に制御できるように、ピッチ強調フィルタ７３、ピッチ強調制御フィルタ７４、７５及び加算部７６から構成されている。この場合、変形ピッチ強調フィルタ７７の伝達関数Ｈ’ｐ（ｚ）は、ピッチ強調フィルタ７３の伝達関数Ｈｐ（ｚ）、ピッチ強調制御フィルタ７４、７５の伝達関数Ｈ（ｚ）を用いて、
【００９１】
【数９】

【００９２】
と表される。なお、ホルマント強調フィルタ７２は公知の技術を用いて構成できる。
【００９３】
ここで、ピッチ強調フィルタ７３の伝達関数は式５で表され、その特性が図１３であり、また、ピッチ強調制御フィルタ７４、７５の特性が図８に示されるような低域通過の特性であるとき、変形ピッチ強調フィルタ７７の周波数特性は、図１４のＨ’ｐ（ｚ）に示されるような、高域ほど山谷の小さいものになる。このような変形ピッチ強調フィルタを用いれば、低域で強く高域で弱いピッチ強調を行うことができ、強いピッチ強調を行っても高域のスペクトルが変形しにくくなり、高域の品質の劣化を抑えたピッチ強調を行うことができる。
【００９４】
図１２に戻りポストフィルタ７８の動作を説明する。合成フィルタ１８から出力された復号音声１０７はホルマント強調フィルタ７２に入力され、ホルマント強調フィルタ７２でホルマント強調された復号音声１７５は加算部７６、ピッチ強調フィルタ７３及びピッチ強調制御フィルタ７４に入力される。ピッチ強調フィルタ７３に入力されたホルマント強調された復号音声１７５は、ピッチ強調フィルタ７３でピッチ強調された後、ピッチ強調制御フィルタ７５で処理され加算部７６に入力される。
【００９５】
また、ピッチ強調制御フィルタ７４に入力されたホルマント強調された復号音声１７５はピッチ強調制御フィルタ処理され、加算部７６に入力される。加算部７６では供給された３つの信号１７５、１７６、１７８が加算され、その結果が最終的な復号音声１７９となって出力される。
【００９６】
上述したように、本実施形態におけるポストフィルタ７８は、従来のポストフィルタにピッチ強調制御フィルタ７４を追加することでピッチ強調の度合を周波数毎に制御できるようにしたものである。ピッチ強調制御フィルタ７４はその特性を変化させることでピッチ強調の度合を自由に変化させることができ、復号音声の特性に従いピッチ強調制御フィルタの特性を変化させれば、復号音声の周波数にあった強さのピッチ強調を行うことができ、復号音声の品質を更に向上させることができる。
【００９７】
なお、本発明の特徴的な部分はポストフィルタのピッチ強調に関する部分であって、音声復号方式はＣＥＬＰ方式に限定される必要はなく、他の復号方式を用いても構わない。
【００９８】
また、ここで述べたピッチ強調方法を音声符号化の駆動音源信号を生成する部分に適用することも可能である。
【００９９】
以上、本発明の実施形態を幾つか説明したが、本発明は上述した実施形態に限定される必要はなく、種々変形して実施が可能である。
【０１００】
例えば、上述した第１の実施形態及び第２の実施形態では簡単のため高域と低域の２つの帯域に分割しているが、分割される帯域の数は２つに限定される必要はなく、２つ以上であれば構わない。また、帯域分割部は図１〜図５に示した構成に限定されない。帯域分割する方法として、信号を一旦ＦＦＴして、ＦＦＴ上で周波数分割した後に逆ＦＦＴする方法や、ＱＭＦフィルタを用いて帯域分割する方法などを用いても構わない。
【０１０１】
さらに、本実施形態では入力音声と再生音声の差信号に対して聴覚重み付けフィルタ処理を行い聴覚重み付け歪みを求めているが、入力音声及び再生音声それぞれに聴覚重み付けを行った後に差信号を求め、聴覚重み付け歪みを求めるような構成に変形することも可能である。
【０１０２】
【発明の効果】
以上詳述したように、本発明によれば符号化雑音の調和構造を入力音声に類似させることができるようになり、再生音声の品質を向上させることができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態における音声符号化方法を用いた音声符号化システムの構成を示す図。
【図２】本発明の第１の実施形態における符号化雑音の周波数特性を示す図。
【図３】本発明の第２の実施形態における音声符号化方法を用いた音声符号化システムの構成を示す図。
【図４】本発明の第２の実施形態における符号化雑音の周波数特性を示す図。
【図５】本発明の第２の実施形態における音声符号化方法を用いた他の音声符号化システムの構成を示す図。
【図６】本発明の第３の実施形態における音声符号化方法を用いた音声符号化システムの構成を示す図。
【図７】本発明の第３の実施形態におけるピッチ重みフィルタの周波数特性を示す図。
【図８】本発明の第３の実施形態におけるピッチ重み制御フィルタの周波数特性を示す図。
【図９】本発明の第３の実施形態における符号化雑音の周波数特性を示す図。
【図１０】本発明の第３の実施形態におけるピッチ重み制御フィルタの周波数特性を示す図。
【図１１】本発明の第３の実施形態における符号化雑音の周波数特性を示す図。
【図１２】本発明の第４の実施形態における音声復号方法を用いた音声復号化システムの構成を示す図。
【図１３】本発明の第４の実施形態におけるピッチ強調フィルタの周波数特性を示す図。
【図１４】本発明の第４の実施形態における変形ピッチ強調フィルタの周波数特性を示す図。
【図１５】従来の音声符号化の構成を示す図である。
【図１６】従来の音声符号化における符号化雑音の周波数特性を示す第１の図。
【図１７】従来の音声符号化における符号化雑音の他の周波数特性を示す図。
【図１８】本発明の音声符号化における符号化雑音の周波数特性を示す図。
【符号の説明】
１０…線形予測分析部
１１…適応符号帳
１２…雑音符号帳
１３…ゲイン符号帳
１４、１５…ゲイン乗算部
１６…加算器
１７…線形予測係数符号化部
１８…合成フィルタ
１９…加算器
２０…高域通過フィルタ
２１…低域通過フィルタ
２２、２３…ピッチ重みフィルタ係数算出部
２４…帯域分割部
２５…ホルマント重みフィルタ
２６…高域通過フィルタ
２７…低域通過フィルタ
２８…帯域分割部
２９、３０…ピッチ重みフィルタ
３１…加算器
３２…歪み計算部
３３…聴覚重み付けフィルタ
３４…マルチプレクサ
４０、４１…有声／無声判定部
４４，４５…切り替え部
７１…線形予測係数復号部
７２…ホルマント強調フィルタ
７３…ピッチ強調フィルタ
７４…ピッチ強調制御フィルタ
７５…ピッチ強調制御フィルタ
７６…加算器
７７…変形ピッチ強調フィルタ
７８…ポストフィルタ

[0001]
[Technical field of the invention]
The present invention relates to a method and apparatus for compression encoding, and a method and apparatus for decoding, of speech signals such as telephone-band speech, wideband speech, and audio signals.
[0002]
[Prior art]
The CELP (Code Excited Linear Prediction) method is known as a speech coding method capable of reproducing relatively high-quality speech even at low bit rates. Details of the CELP method are given in, for example, M. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates," in Proc. ICASSP '85, pp. 937-939, 1985 (Reference 1). The configuration of the CELP method is shown in FIG. 15. As shown in FIG. 15, the CELP method uses an auditory weight filter to evaluate the noise that encoding mixes into the speech (coding noise). It is characterized by selecting the excitation code that makes this noise hard to hear, exploiting the principle (simultaneous masking) that coding noise is masked by a masking characteristic whose shape is determined by the speech spectrum of the current frame. In general, the auditory weight filter used in CELP is a cascade connection of a formant weight filter and a pitch weight filter. The formant weight filter exploits the masking characteristic due to the formants of the input speech, and the pitch weight filter exploits the masking characteristic due to the harmonic structure (harmonics) of the input speech. Using the transfer function Ws(z) of the formant weight filter and the transfer function Wp(z) of the pitch weight filter, the transfer function W(z) of the auditory weight filter is expressed as
[0003]
[Expression 1]
$$W(z) = W_s(z)\,W_p(z)$$
[0004]
The pitch weight filter shapes the spectrum of the coding noise into a harmonic structure with the same pitch as the input speech by applying a small weight to the pitch harmonic frequency components and a large weight to the components between the harmonic frequencies. Using the pitch period T0 and the pitch prediction coefficients βi obtained by pitch prediction, the transfer function Wp(z) of the pitch weight filter is expressed as
[0005]
[Expression 2]
$$W_p(z) = 1 - \gamma \sum_{i=-M}^{M} \beta_i\, z^{-(T_0+i)}$$
[0006]
Here, M is a constant that controls the pitch prediction order, and γ is a constant that controls the degree of noise shaping.
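The behaviour of the pitch weight filter can be checked numerically. The following is a minimal sketch, assuming the FIR form 1 - γ Σ βi z^-(T0+i) with i running from -M to M; the function names and the numeric values (T0 = 40 samples, a single tap β0 = 0.8, γ = 0.4) are illustrative assumptions and are not taken from the patent.

```python
import cmath
import math

def pitch_weight_taps(t0, betas, gamma):
    """FIR taps of W_p(z) = 1 - gamma * sum_i beta_i z^-(t0+i), i = -M..M.

    `betas` must have odd length 2*M + 1 (assumed layout: index 0 is i = -M).
    """
    m = (len(betas) - 1) // 2
    taps = [0.0] * (t0 + m + 1)
    taps[0] = 1.0
    for k, beta in enumerate(betas):
        taps[t0 + (k - m)] -= gamma * beta
    return taps

def magnitude(taps, omega):
    """|W_p(e^{j omega})| at a normalized frequency omega in rad/sample."""
    return abs(sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(taps)))

# Hypothetical values: 40-sample pitch period, single-tap predictor (M = 0).
T0 = 40
taps = pitch_weight_taps(T0, betas=[0.8], gamma=0.4)
at_harmonic = magnitude(taps, 2 * math.pi * 3 / T0)    # on the 3rd harmonic
between = magnitude(taps, 2 * math.pi * 3.5 / T0)      # midway to the 4th
print(f"{at_harmonic:.2f} {between:.2f}")              # prints "0.68 1.32"
```

The smaller value on the harmonic and the larger value between harmonics correspond to the valleys and peaks described for W(f) in FIG. 16.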
[0007]
FIG. 16 shows the frequency characteristic of the pitch weight filter obtained in this way. In FIG. 16, the frequency characteristic of the pitch weight filter is denoted W(f) and that of the speech S(f). As the figure shows, the pitch weight filter has valleys at the pitch harmonic frequencies and peaks between them. Therefore, by weighting the coding noise with the pitch weight filter, the noise is evaluated with a small weight at the pitch harmonic frequencies of the speech and, conversely, with a large weight between the harmonic frequencies.
[0008]
By selecting the excitation code using this relative weighting of each frequency within the frame, the spectrum of the coding noise produced by encoding can be given a harmonic structure with the same pitch period as the speech, as shown by E(f) in FIG. 16. The coding noise is then masked by the peaks and valleys of the speech spectrum and becomes hard to hear. Because the pitch weight filter is obtained by relatively simple analysis and enables speech coding with reduced subjective coding noise, it has been used in CELP.
[0009]
In the CELP method, a post filter is often applied after the speech is decoded in order to improve the subjective quality of the decoded speech. In general, the post filter used in CELP is a cascade connection of a formant emphasis filter and a pitch emphasis filter. Using the transfer function Hs(z) of the formant emphasis filter and the transfer function Hp(z) of the pitch emphasis filter, the transfer function Hpf(z) of the post filter is expressed as
[0010]
[Expression 3]
$$H_{pf}(z) = H_s(z)\,H_p(z)$$
[0011]
Here, the transfer function Hp(z) of the pitch emphasis filter is expressed, using the pitch period T0 and the pitch prediction coefficient λ, as
[0012]
[Expression 4]

[0013]
Here, λ is a constant that controls the degree of pitch emphasis.
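Expression 4 is not reproduced in this text. As an illustration only, the sketch below assumes a common all-pole form of pitch emphasis filter, Hp(z) = 1 / (1 - λ z^-T0), which may differ from the patent's actual expression. The function names and the values T0 = 40 and λ = 0.3 are hypothetical.

```python
import cmath
import math

def pitch_emphasize(x, t0, lam):
    """Apply y[n] = x[n] + lam * y[n - t0], i.e. the assumed H_p(z) = 1/(1 - lam z^-t0)."""
    y = []
    for n, sample in enumerate(x):
        feedback = lam * y[n - t0] if n >= t0 else 0.0
        y.append(sample + feedback)
    return y

def gain(t0, lam, omega):
    """Magnitude response of the assumed H_p(z) at omega rad/sample."""
    return 1.0 / abs(1.0 - lam * cmath.exp(-1j * omega * t0))

T0, LAM = 40, 0.3                                # hypothetical values
peak = gain(T0, LAM, 2 * math.pi * 2 / T0)       # on the 2nd pitch harmonic
valley = gain(T0, LAM, 2 * math.pi * 2.5 / T0)   # midway between harmonics
impulse_response = pitch_emphasize([1.0] + [0.0] * 99, T0, LAM)
```

Under this assumption the gain is above 1 on the harmonics and below 1 between them, and a larger λ deepens the difference, which is the sense in which λ controls the degree of pitch emphasis.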
[0014]
[Problems to be solved by the invention]
However, the strength of the harmonic structure of actual speech differs from band to band, and there may be bands in which the harmonic structure is weak, as in S(f) of FIG. 17. Conventional pitch weighting uses a pitch weight filter whose shaping strength is the same over all bands, as in W(f) of FIG. 17. As a result, the harmonic structure of the coding noise shown by E(f) differs from that of the input speech, and the sound quality of the decoded speech deteriorates.
[0015]
The same applies to pitch emphasis in post-filter processing. In conventional pitch emphasis using a filter with the transfer function shown in Equation 5, the strength of pitch emphasis is the same over all bands, so pitch emphasis is also applied to bands that do not need it, and the sound quality of the decoded speech deteriorates.
[0016]
An object of the present invention is to eliminate these problems and to provide a speech encoding and decoding method, and a speech encoding and decoding apparatus, that improve the sound quality of decoded speech by bringing the harmonic structure of the coding noise close to that of the input speech, as shown in FIG. 18.
[0017]
[Means for Solving the Problems]
A first aspect of the present invention provides a speech encoding method characterized by generating an error signal representing the difference between an input speech information signal and a synthesized speech information signal corresponding to the input speech information signal, generating a weighted signal by varying the degree of pitch weighting applied to the error signal according to frequency, and generating index information based on the weighted signal.
[0018]
By varying the degree of pitch weighting with frequency in this way, pitch weighting suited to each frequency can be performed and the harmonic structure of the coding noise can be controlled at each frequency, so that the sound quality of the decoded speech can be improved.
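One way to make the degree of pitch weighting depend on frequency is to blend the pitch-weighted path and the unweighted path through a control filter H(z), giving a modified weighting W'(z) = H(z) Wp(z) + (1 - H(z)). The sketch below evaluates this response for a single-tap Wp(z); the two-tap low-pass control filter, the function names, and all numeric values are assumptions chosen for illustration.

```python
import cmath
import math

def lowpass(z_inv):
    """Hypothetical 2-tap control filter H(z) = 0.5 + 0.5 z^-1 (gain 1 at DC, 0 at Nyquist)."""
    return 0.5 + 0.5 * z_inv

def modified_weight(omega, t0, beta, gamma, control):
    """W'(e^{jw}) = H * W_p + (1 - H), with W_p(z) = 1 - gamma * beta * z^-t0."""
    z_inv = cmath.exp(-1j * omega)
    h = control(z_inv)
    w_p = 1.0 - gamma * beta * z_inv ** t0
    return h * w_p + (1.0 - h)

def ripple(first_harmonic, t0=40, beta=0.8, gamma=0.4):
    """Peak-to-valley spread of |W'| across one harmonic spacing."""
    mags = [
        abs(modified_weight(2 * math.pi * (first_harmonic + k / 50) / t0,
                            t0, beta, gamma, lowpass))
        for k in range(50)
    ]
    return max(mags) - min(mags)

low_band_ripple = ripple(2)     # around the 2nd harmonic
high_band_ripple = ripple(18)   # around the 18th harmonic, near Nyquist
```

With a low-pass control filter the peak-to-valley ripple is large at low frequencies and small near Nyquist, so the pitch weighting weakens smoothly as frequency increases.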
[0019]
A second aspect provides the speech encoding method according to the first aspect, characterized in that the degree of pitch weighting at each frequency is varied according to the characteristics of the input speech.
[0020]
By varying the degree of pitch weighting at each frequency according to the characteristics of the input signal in this way, the harmonic structure of the coding noise can be varied in correspondence with that of the input speech, and the sound quality of the decoded speech can be improved.
[0021]
A third aspect provides the speech encoding method according to the second aspect, characterized in that the input speech is analyzed to obtain the degree of voicing at each frequency, and the degree of pitch weighting at each frequency is varied according to the degree of voicing.
[0022]
By varying the degree of pitch weighting at each frequency according to the degree of voicing at that frequency of the input signal in this way, the harmonic structure of the coding noise can be varied in correspondence with that of the input speech, and the sound quality of the decoded speech can be improved.
[0023]
A fourth aspect provides the speech encoding method according to the third aspect, characterized in that the degree of pitch weighting is made stronger at frequencies where the degree of voicing is high and weaker at frequencies where the degree of voicing is low.
[0024]
With such weighting, the harmonic structure of the coding noise can be brought close to that of the input speech, and the sound quality of the decoded speech can be improved.
[0025]
A fifth aspect provides a speech encoding method characterized by generating an error signal representing the difference between an input speech information signal and a synthesized speech information signal corresponding to the input speech information signal, dividing the input speech information signal into at least two frequency bands, generating a weighted signal by varying the degree of pitch weighting applied to the error signal for each frequency band, and generating index information based on the weighted signal.
[0026]
By varying the degree of pitch weighting for each band in this way, pitch weighting suited to each band can be performed, the harmonic structure of the coding noise can be controlled band by band, and the sound quality of the decoded speech can be improved.
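Band-wise pitch weighting can be sketched as: split the error signal into bands, apply a pitch weight filter with its own coefficients to each band, and sum the results. The example below is a minimal illustration under several assumptions: a complementary two-tap low-pass/high-pass pair stands in for the band-splitting filters, each band uses a single-tap pitch weight filter with the same pitch period, and every name and value is hypothetical.

```python
def split_bands(e):
    """Split into low and high bands with a complementary 2-tap pair (low + high == e)."""
    low, high, prev = [], [], 0.0
    for s in e:
        low.append(0.5 * (s + prev))
        high.append(0.5 * (s - prev))
        prev = s
    return low, high

def pitch_weight(x, t0, beta, gamma):
    """Single-tap pitch weighting: y[n] = x[n] - gamma * beta * x[n - t0]."""
    return [s - gamma * beta * (x[n - t0] if n >= t0 else 0.0)
            for n, s in enumerate(x)]

def bandwise_weight(e, t0, low_params, high_params):
    """Weight each band with its own (beta, gamma) pair, then sum the bands."""
    low, high = split_bands(e)
    weighted_low = pitch_weight(low, t0, *low_params)
    weighted_high = pitch_weight(high, t0, *high_params)
    return [a + b for a, b in zip(weighted_low, weighted_high)]

# Hypothetical case: strong pitch prediction in the low band, weak in the high band.
error = [1.0] + [0.0] * 9
weighted = bandwise_weight(error, t0=4,
                           low_params=(0.8, 0.4), high_params=(0.2, 0.4))
unweighted = bandwise_weight(error, t0=4,
                             low_params=(0.0, 0.4), high_params=(0.0, 0.4))
```

Because the two bands sum back to the input, setting both prediction coefficients to zero returns the error signal unchanged, which makes the per-band contribution easy to inspect.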
[0027]
A sixth aspect provides the speech encoding method according to the fifth aspect, characterized in that the input speech is analyzed to obtain the degree of voicing of each band, and the degree of pitch weighting of each band is varied according to the degree of voicing.
[0028]
By varying the degree of pitch weighting of each band according to the degree of voicing of that band of the input signal in this way, the harmonic structure of the coding noise can be varied in correspondence with that of the input speech, and the sound quality of the decoded speech can be improved.
[0029]
A seventh aspect provides the speech encoding method according to the sixth aspect, characterized in that the degree of pitch weighting is made stronger in bands where the degree of voicing is high and weaker in bands where the degree of voicing is low.
[0030]
With such weighting, the harmonic structure of the coding noise can be brought close to that of the input speech, and the sound quality of the decoded speech can be improved.
[0031]
An eighth aspect provides the speech encoding method according to the fifth aspect, characterized in that the input speech is analyzed to make a voiced/unvoiced decision for each band, pitch weighting is applied to bands judged to be voiced, and pitch weighting is not applied to bands judged to be unvoiced.
[0032]
By varying the degree of pitch weighting from band to band in this way, the harmonic structure of the coding noise can be brought close to that of the input speech, and the quality of the decoded speech can be improved.
[0033]
Here, the degree of pitch weighting refers to the strength of the pitch shaping of the noise, which can be controlled by, for example, the filter coefficients of the pitch weight filter.
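The voicing-dependent control described in the sixth to eighth aspects can be illustrated with a simple stand-in for a voicing analysis. The sketch below uses the normalized autocorrelation at the pitch lag as the degree of voicing of a band, which is an assumption made here for brevity and not the analysis method specified by the patent. The threshold, the maximum γ, and the function names are likewise hypothetical.

```python
import math
import random

def voicing(x, t0):
    """Normalized autocorrelation at lag t0, clipped to [0, 1] (assumed voicing measure)."""
    num = sum(x[n] * x[n - t0] for n in range(t0, len(x)))
    den = math.sqrt(sum(v * v for v in x[t0:]) * sum(v * v for v in x[:-t0]))
    return max(0.0, num / den) if den > 0 else 0.0

def band_gamma(x, t0, gamma_max=0.4, threshold=0.5):
    """Return 0 for a band judged unvoiced; otherwise scale gamma by the voicing degree."""
    v = voicing(x, t0)
    return gamma_max * v if v >= threshold else 0.0

rng = random.Random(0)
periodic_band = [math.sin(2 * math.pi * n / 40) for n in range(400)]  # strongly voiced
noisy_band = [rng.uniform(-1.0, 1.0) for _ in range(400)]             # noise-like

gamma_voiced = band_gamma(periodic_band, t0=40)
gamma_unvoiced = band_gamma(noisy_band, t0=40)
```

A band with a clear pitch period receives nearly the full γ, while a noise-like band receives γ = 0, which is equivalent to switching pitch weighting off for that band.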
[0034]
A ninth aspect provides a speech decoding method characterized by extracting index information from encoded speech information, generating a decoded speech signal based on the index information, and performing pitch emphasis processing on the decoded speech signal while varying the degree of pitch emphasis according to frequency.
[0035]
By varying the degree of pitch emphasis of the post filter with frequency in this way, pitch emphasis suited to each frequency can be performed, and the quality of the decoded speech can be improved.
[0036]
A tenth aspect provides the speech decoding method according to the ninth aspect, characterized in that the degree of pitch emphasis at each frequency is varied according to the characteristics of the decoded speech.
[0037]
By varying the degree of pitch emphasis at each frequency according to the characteristics of the decoded speech in this way, pitch emphasis suited to the decoded speech can be performed.
[0038]
An eleventh aspect provides the speech decoding method according to the tenth aspect, characterized in that the degree of pitch emphasis at each frequency is varied according to the degree of voicing at each frequency of the decoded speech.
[0039]
A twelfth aspect provides the speech decoding method according to the eleventh aspect, characterized in that the degree of pitch emphasis is made stronger at frequencies where the degree of voicing is high and weaker at frequencies where the degree of voicing is low.
[0040]
A thirteenth aspect provides a speech decoding method characterized by extracting index information from encoded speech information, generating a decoded speech signal based on the index information, dividing the decoded speech signal into at least two frequency bands, and performing pitch emphasis processing on the decoded speech signal while varying the degree of pitch emphasis for each frequency band.
[0041]
A fourteenth aspect provides the speech decoding method according to the thirteenth aspect, characterized in that the degree of pitch emphasis of each band is varied according to the degree of voicing of each band of the decoded speech.
[0042]
A fifteenth aspect provides the speech decoding method according to the fourteenth aspect, characterized in that the degree of pitch emphasis is made stronger in bands where the degree of voicing is high and weaker in bands where the degree of voicing is low.
[0043]
A sixteenth aspect provides the speech decoding method according to the thirteenth aspect, characterized in that a voiced/unvoiced decision is made for each band of the decoded speech, pitch emphasis is applied to bands judged to be voiced, and pitch emphasis is not applied to bands judged to be unvoiced.
[0044]
According to the sixteenth aspect, pitch emphasis can be applied only to the bands that need it, so that the quality of the decoded speech can be improved.
[0045]
Here, the degree of pitch emphasis refers to the strength of the pitch shaping of the decoded speech, which can be controlled by, for example, the filter coefficients of the pitch emphasis filter.
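On the decoder side, frequency-dependent pitch emphasis can be sketched by passing only the change introduced by the pitch emphasis filter through a control filter H(z) and adding it back to the decoded speech: y = x + H(z)[Hp(z)x - x]. This arrangement, the all-pole form assumed for Hp(z), the two-tap control filter, and all names and values are illustrative assumptions rather than the patent's exact structure.

```python
def pitch_emphasize(x, t0, lam):
    """Assumed pitch emphasis filter: y[n] = x[n] + lam * y[n - t0]."""
    y = []
    for n, sample in enumerate(x):
        y.append(sample + (lam * y[n - t0] if n >= t0 else 0.0))
    return y

def fir(x, taps):
    """Causal FIR filtering with zero initial state."""
    return [sum(c * x[n - k] for k, c in enumerate(taps) if n - k >= 0)
            for n in range(len(x))]

def modified_pitch_emphasize(x, t0, lam, control_taps):
    """y = x + H * (Hp x - x): emphasis is kept only where the control filter passes."""
    emphasized = pitch_emphasize(x, t0, lam)
    delta = [e - s for e, s in zip(emphasized, x)]
    shaped = fir(delta, control_taps)
    return [s + d for s, d in zip(x, shaped)]

decoded = [1.0] + [0.0] * 11                    # hypothetical decoded frame
out = modified_pitch_emphasize(decoded, t0=4, lam=0.5, control_taps=[0.5, 0.5])
full = modified_pitch_emphasize(decoded, t0=4, lam=0.5, control_taps=[1.0])
none = modified_pitch_emphasize(decoded, t0=4, lam=0.5, control_taps=[0.0])
```

With an all-pass control filter the result equals ordinary pitch emphasis, with a zero control filter the decoded speech is returned unchanged, and a low-pass control filter keeps strong emphasis at low frequencies while leaving the high band almost untouched.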
[0046]
According to a seventeenth aspect of the present invention, there is provided a speech coding apparatus characterized by comprising synthesis filter means for generating an error signal representing a difference between an input voice information signal and a synthesized voice information signal corresponding to the input voice information signal, weighting filter means for generating a weighted signal by changing the degree of pitch weighting for the error signal according to frequency, and index information generating means for generating index information based on the weighted signal.
[0047]
According to an eighteenth aspect of the present invention, there is provided a speech encoding device characterized by comprising synthesis filter means for generating an error signal representing a difference between an input voice information signal and a synthesized voice information signal corresponding to the input voice information signal, band dividing means for dividing the input voice information signal into at least two frequency bands, weighting filter means for generating a weighted signal by changing the degree of pitch weighting for the error signal for each frequency band, and index information generating means for generating index information based on the weighted signal.
[0048]
According to a nineteenth aspect of the present invention, there is provided a speech decoding apparatus comprising separation means for extracting index information from encoded speech information, synthesis filter means for generating a decoded speech signal based on the index information, and post-filter means for performing pitch emphasis processing on the decoded speech signal while changing the degree of pitch emphasis according to frequency.
[0049]
According to a twentieth aspect of the present invention, there is provided a speech decoding apparatus comprising means for extracting index information from encoded speech information, synthesis filter means for generating a decoded speech signal based on the index information, and post-filter means for dividing the decoded speech signal into at least two frequency bands and performing pitch enhancement processing on the decoded speech signal while changing the degree of pitch enhancement for each band.
[0050]
DETAILED DESCRIPTION OF THE INVENTION
(First embodiment)
A first embodiment in which the speech coding method of the present invention is applied to the CELP system will be described. CELP coding can be broadly divided into coding of speech spectral envelope information and coding of the excitation signal. The auditory weighting filter is used in coding the excitation signal. In the CELP method, speech analysis and encoding are performed in units of frames. Some methods divide the frame into smaller subframes and encode the excitation signal for each subframe, but here, for simplicity, the excitation signal is also encoded in units of frames.
[0051]
FIG. 1 shows the configuration of a speech encoding system to which the speech encoding method according to this embodiment is applied. In this speech coding system, the input speech 100 is input to the linear prediction analysis unit 10, which calculates the linear prediction coefficient 101, and to the high-pass filter 20 and the low-pass filter 21 of the band dividing unit. The outputs of the high-pass filter 20 and the low-pass filter 21 are connected to pitch weight filter coefficient calculation units 22 and 23, which obtain the pitch weight filter coefficients 112 and 113 of the respective bands. The outputs of the pitch weight filter coefficient calculation units 22 and 23 are connected to pitch weight filters 29 and 30 of the auditory weighting filter 33, respectively.
[0052]
The output of the linear prediction analysis unit 10 is connected to a linear prediction coefficient encoding unit 17 that encodes the linear prediction coefficient 101 and to a formant weight filter 25 that performs formant weighting on the difference signal 108 between the input speech 100 and the decoded speech 107. The output of the linear prediction coefficient encoding unit 17 is connected to the multiplexer 34 and to the synthesis filter 18, which generates the decoded speech 107 from the driving excitation 105. The output of the formant weight filter 25 is connected to pitch weight filters 29 and 30 via a high-pass filter 26 and a low-pass filter 27, respectively. The pitch weight filters 29 and 30 perform pitch weighting on the band-divided, formant-weighted difference signals 115 and 116; their outputs are input to an adder 31, and the output of the adder 31 is connected to a distortion calculation unit 32. The output of the distortion calculation unit 32 is connected to the adaptive codebook 11 for encoding the pitch period component of speech, the noise codebook 12 for encoding components other than the pitch period component, the gain codebook 13 for encoding the gains of the adaptive code vector 102 output from the adaptive codebook 11 and of the noise code vector 103 output from the noise codebook 12, and the multiplexer 34.
[0053]
The outputs of the adaptive codebook 11 and the noise codebook 12 are connected to the gain multipliers 14 and 15 together with the output of the gain codebook 13. The outputs of the gain multipliers 14 and 15 are connected to the adder 16, and the output of the adder 16 is connected to the synthesis filter 18 together with the output of the linear prediction coefficient encoding unit 17. The output of the synthesis filter 18 is input to the adder 19 together with the input speech. The output of the adder 19 is connected to the formant weight filter 25.
[0054]
That is, in this embodiment, high-pass filters 20 and 26 for obtaining a high-frequency component and low-pass filters 21 and 27 for obtaining a low-frequency component are added to the conventional speech coding system shown in FIG. This configuration differs greatly from the conventional one in that pitch weighting is performed using the pitch weight coefficients 112 and 113 calculated for each band.
[0055]
In this speech coding system, the input speech 100 is first divided into frames of a fixed length of about 5 to 20 ms and input frame by frame. The input speech of each frame is input to the linear prediction analysis unit 10, and a linear prediction coefficient 101 representing the envelope shape of the frequency spectrum is calculated. The linear prediction coefficient 101 is encoded by the linear prediction coefficient encoding unit 17 and then given to the synthesis filter 18 as a filter coefficient 106. The linear prediction coefficient 101 is also supplied to the formant weight filter 25 for formant weighting.
[0056]
After the linear prediction coefficient 101 is encoded, the excitation signal is encoded. In the excitation signal encoding, the adaptive code vector 102 selected from the adaptive codebook 11 and the noise code vector 103 selected from the noise codebook 12 are each multiplied by the gain 104 selected from the gain codebook 13 and then added together to generate the driving excitation 105. The driving excitation 105 generated in this way is input to the synthesis filter 18, which is characterized by the output of the linear prediction coefficient encoding unit 17, and the decoded speech 107 is generated.
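The excitation path described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the per-vector scalar gains, and the sign convention A(z) = 1 + sum of a_k z^-k for the synthesis filter 1/A(z) are all assumptions made for the example.

```python
def build_excitation(adaptive_vec, noise_vec, gain_adaptive, gain_noise):
    """Scale the two code vectors by their gains and add them (units 14, 15, 16)."""
    return [gain_adaptive * a + gain_noise * c
            for a, c in zip(adaptive_vec, noise_vec)]


def synthesis_filter(excitation, lpc):
    """All-pole filter 1/A(z), assuming A(z) = 1 + sum_k lpc[k-1] * z^-k."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]  # feedback from past output samples
        out.append(acc)
    return out


# Hypothetical code vectors and gains, chosen only to make the arithmetic visible.
excitation = build_excitation([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], 0.5, 0.25)
decoded = synthesis_filter(excitation, lpc=[-0.5])
```

In an analysis-by-synthesis search, this pair of steps would be repeated for each candidate combination of code vectors and gains, and the candidate whose weighted error is smallest would be kept.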
[0057]
A difference signal 108 between the input speech 100 and the decoded speech 107 is calculated. The difference signal 108 is first input to the formant weight filter 25, where formant weighting is performed. The formant weight filter 25 is characterized by formant weight filter coefficients calculated from the linear prediction coefficient 101 obtained by the linear prediction analysis unit 10. For example, the transfer function Ws (z) of the formant weight filter is expressed, using the transfer function A (z) of the prediction filter configured from the LPC coefficients obtained by the linear prediction analysis unit 10, as
[0058]
[Equation 5]

[0059]
As the values of the constants γ1 and γ2, for example, γ1 = 0.9 and γ2 = 0.4 can be used. Note that γ1 and γ2 are not limited to these values, and different values may be used.
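Equation 5 itself is not reproduced in this text. For orientation only, a form of formant weighting filter widely used in CELP coders, and consistent with a prediction filter A(z) and two constants γ1 and γ2, is shown below. Whether this is exactly the patent's Equation 5 is an assumption, as are the prediction order P and the coefficient symbols a_k.

```latex
W_s(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)},
\qquad
A(z) = 1 + \sum_{k=1}^{P} a_k z^{-k},
\qquad
A(z/\gamma) = 1 + \sum_{k=1}^{P} a_k \gamma^{k} z^{-k}
```

In this form, replacing z by z/γ scales the k-th coefficient by γ^k, which widens the formant bandwidths of the corresponding filter.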
[0060]
Next, the formant-weighted difference signal 114 is input to the high-pass filter 26 and the low-pass filter 27, divided into two bands, and then input to the pitch weight filters 29 and 30 of the respective bands. Meanwhile, the input speech 100 is also input to the high-pass filter 20 and the low-pass filter 21 and divided into two bands, and the resulting band components 110 and 111 are input to the pitch weight filter coefficient calculation units 22 and 23, respectively. The pitch weight filter coefficient calculation units 22 and 23 perform pitch prediction on their input signals and calculate the pitch prediction coefficients 112 and 113. The calculated pitch prediction coefficients 112 and 113 are supplied to the pitch weight filters 29 and 30.
[0061]
In the pitch weight filters, different pitch weights are applied to the respective band components. Each pitch weight filter is characterized by the pitch weight filter coefficient obtained by the corresponding pitch weight filter coefficient calculation unit. For example, the transfer function W_Hp (z) of the high-band pitch weight filter and the transfer function W_Lp (z) of the low-band pitch weight filter are expressed, using the pitch period and the pitch prediction coefficients β_Hi and β_Li, as
[0062]
[Equation 6]

[0063]
Here, M is a constant that controls the pitch prediction order, and γ is a constant that controls the degree of noise shaping. For the constants γ_H and γ_L, for example, γ_H = γ_L = 0.4 can be used. γ_H and γ_L may also be set to different values, and they can be controlled for each band using the pitch strengths S_H and S_L of the respective bands. For example, they can be defined as
[0064]
[Equation 7]

[0065]
where ζ_H and ζ_L are constants. The pitch strengths S_H and S_L can be defined, using the prediction coefficients β_Hi and β_Li, as
[0066]
[Equation 8]

[0067]
However, the pitch strengths S_H and S_L are not limited to the above equation, and any parameter indicating the strength of the pitch periodicity of the signal may be used.
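Equations 6 to 8 are not reproduced in this text, so the following is only a sketch of how a band's pitch weighting could be computed under stated assumptions: a pitch weight filter of the form 1 - γ Σ β_i z^-(T0+i) for i = -M..M, a control rule γ = ζ·S, and a pitch strength S taken as the clipped sum of the prediction coefficients. All three forms, and every function name, are assumptions chosen for illustration.

```python
def pitch_strength(betas):
    """Assumed strength measure: sum of prediction coefficients, clipped to [0, 1]."""
    return max(0.0, min(1.0, sum(betas)))


def band_gamma(zeta, strength):
    """Assumed control rule: gamma = zeta * S for the band."""
    return zeta * strength


def pitch_weight(signal, pitch_period, betas, gamma):
    """Apply 1 - gamma * sum_i betas[i + M] * z^-(T0 + i), i = -M..M (len(betas) odd)."""
    m = len(betas) // 2
    out = []
    for n in range(len(signal)):
        acc = signal[n]
        for i in range(-m, m + 1):
            past = n - (pitch_period + i)
            if past >= 0:
                acc -= gamma * betas[i + m] * signal[past]
        out.append(acc)
    return out


# Hypothetical low-band example: period-4 pulse train, one-tap predictor (M = 0).
signal = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
betas_low = [0.8]
gamma_low = band_gamma(zeta=0.5, strength=pitch_strength(betas_low))
weighted = pitch_weight(signal, pitch_period=4, betas=betas_low, gamma=gamma_low)
```

With these values the second pulse is reduced from 1.0 to about 0.68, while a band with a smaller pitch strength would receive a smaller γ and therefore weaker pitch weighting.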
[0068]
Next, the pitch-weighted high frequency component 117 and low frequency component 118 are added by the adder 31 and input to the distortion calculator 32. The distortion calculation unit 32 selects an adaptive code vector, a noise code vector, and a gain vector that minimize the distortion, and indexes representing these vectors are input to the multiplexer 34. The multiplexer 34 receives the index obtained by encoding the linear prediction coefficient from the linear prediction coefficient encoding unit 17 as well as the index input from the distortion calculation unit 32. In the multiplexer 34, an encoded bit stream 122 is generated from the input index, and the encoded bit stream 122 is transmitted to the decoding side via a transmission path or a storage medium.
[0069]
As described above, the degree of pitch weighting can be controlled for each band in this embodiment. Therefore, even when the input speech has the frequency characteristics shown in S (f) of FIG. 2, by increasing the degree of pitch weighting in the low range and reducing it in the high range, the frequency characteristics of the coding noise can be made as shown in E (f) of FIG. In this way, the harmonic structure of the coding noise can be brought close to the harmonic structure of the input speech, and the sound quality of the decoded speech can be improved.
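The band-wise structure of the first embodiment (split the error signal, weight each band separately, then add) can be sketched as below. The complementary moving-average split is a stand-in chosen for brevity and is an assumption; the patent's high-pass and low-pass filters are not specified here, and the function names are hypothetical.

```python
def split_bands(signal, taps=3):
    """Complementary split: causal moving average as low band, remainder as high band."""
    low = []
    for n in range(len(signal)):
        window = signal[max(0, n - taps + 1):n + 1]
        low.append(sum(window) / taps)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


def weight_by_band(error, weight_low, weight_high):
    """Weight each band with its own filter, then add the bands (adder 31)."""
    low, high = split_bands(error)
    return [a + b for a, b in zip(weight_low(low), weight_high(high))]


def identity(band):
    """No pitch weighting: pass the band through unchanged."""
    return list(band)


error = [1.0, -2.0, 4.0, 0.5]
recombined = weight_by_band(error, identity, identity)
```

Passing a strong pitch weighting function as `weight_low` and a weak one (or `identity`) as `weight_high` corresponds to the low-range/high-range control described in the text.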
[0070]
(Second Embodiment)
A second embodiment in which the speech coding method of the present invention is applied to the CELP system will be described. FIG. 3 shows the configuration of a speech coding system to which the speech coding method according to this embodiment is applied. The speech coding system of the present embodiment shown in FIG. 3 has a configuration in which voiced / unvoiced determination units 40 and 41 and switching units 44 and 45 are added to the speech coding system of the first embodiment shown in FIG. In FIG. 3, components with the same reference numerals as in FIG. 1 perform the same operations, and the characteristic portions of the present embodiment will mainly be described here.
[0071]
In the present embodiment, the input speech divided into the high band and the low band is input to the voiced / unvoiced determination units 40 and 41 and the pitch weight filter coefficient calculation units 22 and 23 of the respective bands. The voiced / unvoiced determination units 40 and 41 analyze the input band-limited signals 110 and 111 to determine whether the signal in each band is voiced or unvoiced. The voiced / unvoiced determination can be realized, for example, by using the algorithm used in IMBE (Improved Multi-Band Excitation vocoder). Details of IMBE are described in, for example, D. W. Griffin and J. S. Lim, “Multiband Excitation Vocoder”, IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-36, pp. 1223-1235, Aug. 1988 (Reference 2). The voiced / unvoiced determination results are sent to the pitch weight filter coefficient calculation units 22 and 23 and the switching units 44 and 45.
[0072]
When the voiced / unvoiced determination results 140 and 141 are voiced, the pitch weight filter coefficient calculation units 22 and 23 analyze the input signals to calculate the pitch weight filter coefficients 112 and 113, and the pitch weight filter coefficients are input to the pitch weight filters. On the other hand, when the voiced / unvoiced determination results 140 and 141 are unvoiced, the pitch weight filter coefficient calculation units 22 and 23 do not calculate the pitch weight filter coefficients 112 and 113.
[0073]
Meanwhile, the switching units 44 and 45 switch their outputs according to the voiced / unvoiced determination results 142 and 143. When the determination result is voiced, the output of the switching unit is input to the pitch weight filters 29 and 30. Conversely, when the determination result is unvoiced, the output of the switching unit is input directly to the adding units 46 and 47, bypassing the pitch weight filters. In this way, the presence / absence of pitch weighting is controlled in each band.
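The per-band switching can be sketched as below. The voicing test used here, a normalized autocorrelation at the pitch lag compared with a threshold, is a simple stand-in and an assumption; the text refers to the IMBE algorithm for the actual decision. The function names and the threshold value are likewise hypothetical.

```python
def is_voiced(band, pitch_period, threshold=0.5):
    """Assumed voicing test: normalized autocorrelation at the pitch lag."""
    num = sum(band[n] * band[n - pitch_period]
              for n in range(pitch_period, len(band)))
    den = sum(x * x for x in band)
    return den > 0.0 and num / den > threshold


def weight_band(error_band, input_band, pitch_period, pitch_filter):
    """Apply pitch weighting only when the input band is judged voiced."""
    if is_voiced(input_band, pitch_period):
        return pitch_filter(error_band)
    return list(error_band)  # unvoiced: bypass the pitch weight filter


periodic = [1.0, 0.0, 0.0, 0.0] * 4                        # strongly periodic band
aperiodic = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0]   # no period-4 structure
```

The decision is made on the band-limited input speech, while the filtering (or bypass) is applied to the corresponding band of the error signal, mirroring the roles of the determination units and switching units in the text.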
[0074]
The pitch-weighted high frequency component and low frequency component are added by the adder 31 and input to the distortion calculator 32. The distortion calculation unit 32 selects an adaptive code vector, a noise code vector, and a gain vector that minimize the distortion, and indexes representing these vectors are input to the multiplexer 34.
[0075]
The multiplexer 34 receives the index obtained by encoding the linear prediction coefficient from the linear prediction coefficient encoding unit 17 as well as the index input from the distortion calculation unit 32. In the multiplexer 34, an encoded bit stream 122 is generated from the input indexes, and the encoded bit stream 122 is transmitted to the decoding side via a transmission path or a storage medium.
[0076]
As described above, in this embodiment, the presence / absence of pitch weighting can be controlled for each band. Therefore, even when the input speech has the frequency characteristics shown in S (f) of FIG., by not performing pitch weighting in the high frequency range, the frequency characteristics of the coding noise can be made as shown in E (f) of FIG. In this way, the harmonic structure of the coding noise can be brought close to the harmonic structure of the input speech, and the sound quality of the decoded speech can be improved.
[0077]
Note that the second embodiment of the present invention is characterized by the portion that controls the presence / absence of pitch weighting for each band; any configuration that allows such control may be used, and the configuration is not limited to that shown in FIG. For example, as shown in FIG. 5, in a configuration in which the switching units 44 and 45 are removed from FIG. 3, the way the pitch weight filter coefficient calculation units 22 and 23 obtain the pitch weight filter coefficients can be changed based on the voiced / unvoiced determination result.
[0078]
Here, in the unvoiced case, outputting pitch weight filter coefficients that result in no pitch weighting achieves an operation equivalent to switching pitch weighting on and off.
[0079]
(Third embodiment)
A third embodiment in which the speech coding method of the present invention is applied to the CELP system will be described. FIG. 6 shows the configuration of a speech encoding system to which the speech encoding method according to this embodiment is applied. Unlike the conventional CELP system shown in FIG. 15, this speech coding system has a configuration in which pitch weight control filters 60 and 61, an adder 62 and a subtractor 63 are added to the auditory weighting part. Here, the characteristic part of this embodiment will be mainly described.
[0080]
The formant weighted difference signal 114 is input to the pitch weight filter 50, the pitch weight control filter 61, and the subtractor 63. The pitch weight filter 50 performs pitch weighting on the formant-weighted difference signal 114, and the processed signal 151 is input to the pitch weight control filter 60. The pitch weight control filter 60 filters the input signal 151 and then supplies it to the adder 62 as a signal 152.
[0081]
On the other hand, the subtractor 63 obtains a difference signal 154 between the formant-weighted difference signal 114 and the signal 153 obtained by filtering the formant-weighted difference signal 114 with the pitch weight control filter 61, and this signal 154 is input to the adder 62. The adder 62 adds the two input signals, and the sum signal 155 is input to the distortion calculation unit 32. The distortion calculation unit 32 selects an adaptive code vector, a noise code vector, and a gain vector that minimize the distortion, and indexes representing these vectors are input to the multiplexer 34. The multiplexer 34 receives the index obtained by encoding the linear prediction coefficient from the linear prediction coefficient encoding unit 17 as well as the index input from the distortion calculation unit 32. In the multiplexer 34, an encoded bit stream 122 is generated from the input indexes, and the encoded bit stream 122 is transmitted to the decoding side via a transmission path or a storage medium.
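The signal flow of the third embodiment can be collected into a single transfer function. As a sketch, assume that both pitch weight control filters 60 and 61 share one transfer function H(z) (the text does not state this explicitly), and write E(z) for the formant-weighted difference signal 114 and X_n(z) for signal n:

```latex
\begin{aligned}
X_{152}(z) &= H(z)\,W_p(z)\,E(z) \\
X_{154}(z) &= E(z) - H(z)\,E(z) \\
X_{155}(z) &= X_{152}(z) + X_{154}(z)
            = \bigl[\,1 + H(z)\,\bigl(W_p(z) - 1\bigr)\bigr]\,E(z)
\end{aligned}
```

Under this assumption the overall pitch weighting is W(z) = 1 + H(z)(Wp(z) - 1). At frequencies where H is close to 1 it behaves like Wp(z), and at frequencies where H is close to 0 it applies no pitch weighting, so the shape of H(z) sets how strongly each frequency is pitch weighted.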
[0082]
In the third embodiment, the pitch weight control filters 60 and 61 serve to change the degree of pitch weighting smoothly with respect to frequency. For example, when the frequency characteristic of the pitch weight filter is represented by Wp (f) in FIG. 7 and the frequency characteristic of the pitch weight control filter is a low-pass characteristic as represented by H (f) in FIG., the frequency characteristic of the modified pitch weighting filter is such that the degree of pitch weighting decreases as the frequency increases, as shown by W (f) in FIG. When such weighting is performed, the harmonic structure of the coding noise spectrum produced by encoding becomes weaker as the frequency increases, as shown in E (f) of FIG. Likewise, when the frequency characteristic of the pitch weight filter is represented by Wp (f) in FIG. 7 and the frequency characteristic of the pitch weight control filter is the characteristic represented by H (f) in FIG., the frequency characteristic of the modified pitch weighting filter is such that the degree of pitch weighting is weak at mid-range frequencies, as shown by W (f) in FIG. When such weighting is performed, the harmonic structure of the coding noise spectrum becomes weak at mid-range frequencies, as shown in E (f) of FIG.
[0083]
Thus, by using the pitch weight control filter, the degree of pitch weighting of the modified pitch weighting filter can be changed smoothly with frequency. The characteristics of the pitch weight control filter can also be changed according to the characteristics of the input speech. For example, the input speech is analyzed to determine the strength of the harmonic structure at each frequency, and the characteristics of the pitch weight control filter are determined based on that strength. By giving the pitch weight control filter a characteristic that attenuates frequencies where the harmonic structure is weak, the harmonic structure of the coding noise can be brought closer to that of the input speech, and the sound quality of the decoded speech can be further improved.
[0084]
(Fourth embodiment)
An embodiment in which the speech decoding method of the present invention is applied to the CELP system will be described. FIG. 12 shows the configuration of a speech decoding system to which the speech decoding method according to the fourth embodiment is applied. In this speech decoding system, the output of the demultiplexer 70 is connected to the adaptive codebook 11, the noise codebook 12, the gain codebook 13, and the linear prediction coefficient decoding unit 71.
[0085]
The outputs of the adaptive codebook 11 and the noise codebook 12 are connected to the gain multipliers 14 and 15 together with the output of the gain codebook 13. The outputs of the gain multipliers 14 and 15 are connected to the adder 16. The output of the adder 16 is fed back to the adaptive codebook 11 and further connected to the synthesis filter 18 together with the output of the linear prediction coefficient decoding unit 71. The output of the linear prediction coefficient decoding unit 71 is also connected to the post filter 78.
[0086]
The post filter 78 includes a formant emphasis filter 72 and a modified pitch emphasis filter 77, and the modified pitch emphasis filter 77 includes a pitch emphasis filter 73, pitch emphasis control filters 74 and 75, and an adder 76.
[0087]
In this speech decoding system, first, the bit stream 170 obtained from the transmission path or the storage medium is input to the demultiplexer 70. The demultiplexer 70 separates the input bit stream 170 into a linear prediction coefficient index 171 representing the linear prediction coefficients, an adaptive code vector index 172 representing an adaptive code vector, a noise code vector index 173 representing a noise code vector, and a gain index 174 representing a gain vector. Of these indexes, the linear prediction coefficient index 171 is input to the linear prediction coefficient decoding unit 71, the adaptive code vector index 172 to the adaptive codebook 11, the noise code vector index 173 to the noise codebook 12, and the gain index 174 to the gain codebook 13.
[0088]
The linear prediction coefficient decoding unit 71 decodes a linear prediction coefficient from the input linear prediction coefficient index 171, and provides this to the synthesis filter 18 as a filter coefficient. Also, the adaptive code vector 102 is selected from the adaptive codebook 11 according to the adaptive code vector index 172 and output. Also, the noise code vector 103 is selected from the noise codebook 12 according to the noise code vector index 173 and output.
[0089]
Further, the gain 104 to be multiplied by the adaptive code vector and the noise code vector is selected from the gain codebook 13 according to the gain index 174 and output. After the adaptive code vector 102 and the noise code vector 103 are multiplied by the gains in the multipliers 14 and 15, the two vectors are added by the adder 16 to generate a decoded residual waveform signal 105, which is input to the synthesis filter 18 and the adaptive codebook 11 as the driving excitation signal.
[0090]
The synthesis filter 18, determined by the linear prediction coefficients decoded by the linear prediction coefficient decoding unit 71, is driven by the driving excitation signal, and a decoded speech signal 107 is generated. Thereafter, post-filter processing is performed on the decoded speech 107 in order to improve its subjective quality. A conventional post filter is configured by a cascade connection of a formant emphasis filter and a pitch emphasis filter, whereas the post filter 78 in the present embodiment is configured by a cascade connection of a formant emphasis filter 72 and a modified pitch emphasis filter 77. As shown in FIG. 12, the modified pitch emphasis filter 77 includes a pitch emphasis filter 73, pitch emphasis control filters 74 and 75, and an adder 76 so that the degree of pitch emphasis can be controlled for each frequency. In this case, the transfer function H′p (z) of the modified pitch emphasis filter 77 is expressed, using the transfer function Hp (z) of the pitch emphasis filter 73 and the transfer function H (z) of the pitch emphasis control filters 74 and 75, as
[0091]
[Equation 9]

[0092]
The formant emphasis filter 72 can be configured using a known technique.
[0093]
Here, the transfer function of the pitch emphasis filter 73 is expressed by Equation 5, its characteristics are shown in FIG. 13, and the characteristics of the pitch emphasis control filters 74 and 75 are low-pass characteristics as shown in FIG. In this case, in the frequency characteristic of the modified pitch emphasis filter 77, the degree of pitch emphasis becomes smaller as the frequency becomes higher, as shown by H′p (z) in FIG. By using such a modified pitch emphasis filter, pitch emphasis can be made strong at low frequencies and weak at high frequencies; even if strong pitch emphasis is performed, the high-frequency spectrum is not easily deformed, so pitch emphasis can be performed with less degradation of high-frequency quality.
[0094]
Returning to FIG. 12, the operation of the post filter 78 will be described. The decoded speech 107 output from the synthesis filter 18 is input to the formant enhancement filter 72, and the decoded speech 175 that has been formant-enhanced by the formant enhancement filter 72 is input to the adder 76, the pitch enhancement filter 73, and the pitch enhancement control filter 74. The formant-enhanced decoded speech 175 input to the pitch enhancement filter 73 is subjected to pitch enhancement by the pitch enhancement filter 73, processed by the pitch enhancement control filter 75, and input to the adder 76.
[0095]
The formant-enhanced decoded speech 175 input to the pitch emphasis control filter 74 is subjected to pitch emphasis control filter processing and input to the adder 76. The adder 76 adds the three supplied signals 175, 176, and 178, and outputs the result as the final decoded speech 179.
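A minimal sketch of the three-path structure around the adder 76 follows. It assumes that the path through the pitch emphasis control filter 74 enters the sum with a negative sign, so that the result reduces to plain pitch emphasis where the control filter passes the signal and to no pitch emphasis where it blocks the signal. That sign, the feed-forward comb `comb_emphasis`, and the one-pole `one_pole_lowpass` are illustrative assumptions, not the filters of Equation 9.

```python
def one_pole_lowpass(signal, alpha=0.5):
    """Assumed control filter H: simple one-pole low-pass."""
    out, prev = [], 0.0
    for x in signal:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


def comb_emphasis(signal, pitch_period, gain=0.5):
    """Assumed pitch emphasis Hp: add a scaled copy delayed by one pitch period."""
    return [x + (gain * signal[n - pitch_period] if n >= pitch_period else 0.0)
            for n, x in enumerate(signal)]


def modified_pitch_emphasis(signal, pitch_emphasis, control):
    """Sum of three paths: direct, control(pitch_emphasis(x)), and minus control(x)."""
    emphasized = control(pitch_emphasis(signal))
    bypass = control(signal)
    return [x + e - b for x, e, b in zip(signal, emphasized, bypass)]


# Hypothetical formant-enhanced decoded speech with a pitch period of 4 samples.
decoded = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
output = modified_pitch_emphasis(decoded,
                                 lambda s: comb_emphasis(s, 4),
                                 one_pole_lowpass)
```

With a low-pass control filter, the pitch-emphasized component dominates at low frequencies and the unmodified signal dominates at high frequencies, which matches the behavior described for the modified pitch emphasis filter 77.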
[0096]
As described above, the post filter 78 according to this embodiment is configured by adding the pitch emphasis control filter 74 to the conventional post filter so that the degree of pitch emphasis can be controlled for each frequency. The degree of pitch emphasis can be changed freely by changing the characteristics of the pitch emphasis control filter 74. If the characteristics of the pitch emphasis control filter are changed in accordance with the characteristics of the decoded speech, pitch emphasis of a strength suited to each frequency of the decoded speech can be performed, and the quality of the decoded speech can be further improved.
[0097]
The characteristic part of the present invention is a part related to pitch enhancement of the post filter, and the speech decoding method is not necessarily limited to the CELP method, and other decoding methods may be used.
[0098]
It is also possible to apply the pitch emphasis method described here to a portion that generates a driving excitation signal for speech encoding.
[0099]
Although several embodiments of the present invention have been described above, the present invention is not limited to the above-described embodiments, and various modifications can be made.
[0100]
For example, although the first and second embodiments described above divide the signal into two bands, a high band and a low band, for simplicity, the number of bands need not be limited to two and may be any number of two or more. Further, the band dividing unit is not limited to the configuration shown in FIGS. As a method of band division, a method in which the signal is transformed by FFT, divided in the frequency domain, and then inverse-transformed, a method of band division using QMF filters, or the like may be used.
[0101]
Furthermore, in the present embodiment, the perceptual weighting filter processing is performed on the difference signal between the input sound and the reproduced sound to obtain the perceptual weighting distortion, but the difference signal is obtained after performing the perceptual weighting on the input sound and the reproduced sound, It is also possible to modify the configuration so as to obtain the auditory weighting distortion.
[0102]
[Effects of the Invention]
As described above in detail, according to the present invention, the harmonic structure of the coding noise can be made similar to the input voice, and the quality of the reproduced voice can be improved.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a speech coding system using a speech coding method according to a first embodiment of the present invention.
FIG. 2 is a diagram showing frequency characteristics of coding noise in the first embodiment of the present invention.
FIG. 3 is a diagram showing a configuration of a speech encoding system using a speech encoding method according to a second embodiment of the present invention.
FIG. 4 is a diagram showing frequency characteristics of coding noise in the second embodiment of the present invention.
FIG. 5 is a diagram showing the configuration of another speech coding system using the speech coding method according to the second embodiment of the present invention.
FIG. 6 is a diagram showing a configuration of a speech encoding system using a speech encoding method according to a third embodiment of the present invention.
FIG. 7 is a diagram illustrating frequency characteristics of a pitch weight filter according to a third embodiment of the present invention.
FIG. 8 is a diagram showing frequency characteristics of a pitch weight control filter according to a third embodiment of the present invention.
FIG. 9 is a diagram showing frequency characteristics of coding noise in the third embodiment of the present invention.
FIG. 10 is a diagram illustrating frequency characteristics of a pitch weight control filter according to a third embodiment of the present invention.
FIG. 11 is a diagram showing the frequency characteristics of coding noise in the third embodiment of the present invention.
FIG. 12 is a diagram showing a configuration of a speech decoding system using a speech decoding method according to a fourth embodiment of the present invention.
FIG. 13 is a diagram showing frequency characteristics of a pitch enhancement filter according to a fourth embodiment of the present invention.
FIG. 14 is a diagram showing frequency characteristics of a modified pitch enhancement filter according to a fourth embodiment of the present invention.
FIG. 15 is a diagram illustrating a configuration of conventional speech encoding.
FIG. 16 is a first diagram showing frequency characteristics of coding noise in conventional speech coding.
FIG. 17 is a diagram showing another frequency characteristic of coding noise in conventional speech coding.
FIG. 18 is a diagram showing frequency characteristics of coding noise in speech coding according to the present invention.
[Explanation of symbols]
10 ... Linear prediction analysis section
11 ... Adaptive codebook
12 ... Noise codebook
13 ... Gain codebook
14, 15 ... Gain multiplier
16 ... Adder
17: Linear prediction coefficient encoding unit
18 ... Synthesis filter
19 ... Adder
20 ... High-pass filter
21 ... Low-pass filter
22, 23 ... Pitch weight filter coefficient calculation unit
24 ... Band division unit
25 ... Formant weight filter
26 ... High-pass filter
27 ... Low-pass filter
28 ... Band division unit
29, 30 ... Pitch weight filter
31 ... Adder
32 ... Distortion calculator
33 ... Auditory weighting filter
34 ... Multiplexer
40, 41 ... Voiced/unvoiced determination unit
44, 45 ... Switching unit
71 ... Linear prediction coefficient decoding unit
72 ... Formant enhancement filter
73 ... Pitch enhancement filter
74 ... Pitch enhancement control filter
75 ... Pitch enhancement control filter
76 ... Adder
77 ... Modified pitch enhancement filter
78 ... Post filter

Claims

1. A speech encoding method comprising: generating an error signal representing a difference between an input speech information signal and a synthesized speech information signal corresponding to the input speech information signal; generating a weighted signal by changing, for each frequency, the degree of pitch weighting applied to the error signal according to the characteristics of the input speech information signal; and generating index information based on the weighted signal.

2. The speech encoding method according to claim 1, wherein the input speech information signal is analyzed to obtain a degree of voicing at each frequency, and the degree of pitch weighting for the error signal is changed for each frequency according to the degree of voicing.

3. The speech encoding method according to claim 2, wherein the degree of pitch weighting is increased at a frequency where the degree of voicing is high, and the degree of pitch weighting is reduced at a frequency where the degree of voicing is low.

4. A speech encoding method comprising: generating an error signal representing a difference between an input speech information signal and a synthesized speech information signal corresponding to the input speech information signal; dividing the input speech information signal into at least two frequency bands; generating a weighted signal by changing the degree of pitch weighting applied to the error signal for each frequency band; and generating index information based on the weighted signal.

5. The speech encoding method according to claim 4, wherein the input speech information signal is analyzed to obtain a degree of voicing for each band, and the degree of pitch weighting is changed for each band according to the degree of voicing.

6. The speech encoding method according to claim 5, wherein the degree of pitch weighting is increased in a band where the degree of voicing is high, and the degree of pitch weighting is reduced in a band where the degree of voicing is low.

7. The speech encoding method according to claim 4, wherein the input speech information signal is analyzed to make a voiced/unvoiced determination for each band, pitch weighting is performed on a band determined to be voiced, and pitch weighting is not performed on a band determined to be unvoiced.

8. A speech decoding method comprising: extracting index information from encoded speech information; generating a decoded speech signal based on the index information; and performing pitch enhancement processing on the decoded speech signal by changing the degree of pitch enhancement for each frequency according to the characteristics of the decoded speech signal.

9. The speech decoding method according to claim 8, wherein the degree of pitch enhancement at each frequency is changed according to the degree of voicing at each frequency of the decoded speech signal.

10. The speech decoding method according to claim 9, wherein the degree of pitch enhancement is increased at a frequency where the degree of voicing is high, and the degree of pitch enhancement is reduced at a frequency where the degree of voicing is low.

11. A speech decoding method comprising: extracting index information from encoded speech information; generating a decoded speech signal based on the index information; dividing the decoded speech signal into at least two frequency bands; and performing pitch enhancement processing on the decoded speech signal by changing the degree of pitch enhancement for each frequency band.

12. The speech decoding method according to claim 11, wherein the degree of pitch enhancement in each band is changed according to the degree of voicing in each band of the decoded speech signal.

13. The speech decoding method according to claim 12, wherein the degree of pitch enhancement is increased in a band where the degree of voicing is high, and the degree of pitch enhancement is reduced in a band where the degree of voicing is low.

14. The speech decoding method according to claim 11, wherein a voiced/unvoiced determination is made for each band of the decoded speech signal, pitch enhancement is performed on a band determined to be voiced, and pitch enhancement is not performed on a band determined to be unvoiced.

15. A speech encoding apparatus comprising: synthesis filter means for generating an error signal representing a difference between an input speech information signal and a synthesized speech information signal corresponding to the input speech information signal; weighting filter means for generating a weighted signal by changing, for each frequency, the degree of pitch weighting applied to the error signal according to the characteristics of the input speech information signal; and index information generating means for generating index information based on the weighted signal.

16. A speech encoding apparatus comprising: synthesis filter means for generating an error signal representing a difference between an input speech information signal and a synthesized speech information signal corresponding to the input speech information signal; band division means for dividing the input speech information signal into at least two frequency bands; weighting filter means for generating a weighted signal by changing the degree of pitch weighting applied to the error signal for each frequency band; and index information generating means for generating index information based on the weighted signal.

17. A speech decoding apparatus comprising: separating means for extracting index information from encoded speech information; synthesis filter means for generating a decoded speech signal based on the index information; and post-filter means for performing pitch enhancement processing on the decoded speech signal by changing the degree of pitch enhancement for each frequency according to the characteristics of the decoded speech signal.

18. A speech decoding apparatus comprising: synthesis filter means for extracting index information from encoded speech information and generating a decoded speech signal based on the index information; and post-filter means for dividing the decoded speech signal into at least two frequency bands and performing pitch enhancement processing on the decoded speech signal by changing the degree of pitch enhancement for each frequency band.
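
For illustration only, the band-wise pitch weighting recited in the encoding claims (dividing the signal into bands, obtaining a degree of voicing per band, and weighting the error signal more strongly in more voiced bands) can be sketched as follows. This is a minimal sketch under stated assumptions and not the claimed implementation: the moving-average band split, the normalized-autocorrelation voicing measure, the one-tap weighting filter, and all function and parameter names (`split_bands`, `voicing_degree`, `pitch_weight`, `weighted_error`, `gamma`) are hypothetical choices that do not appear in the specification.

```python
import math

def split_bands(x, taps=5):
    """Split x into a low band and a high band (assumed split: moving average
    for the low band, residual for the high band, so low + high equals x)."""
    half = taps // 2
    low = []
    for n in range(len(x)):
        window = x[max(0, n - half):n + half + 1]
        low.append(sum(window) / len(window))
    high = [a - b for a, b in zip(x, low)]
    return low, high

def voicing_degree(band, pitch_lag):
    """Assumed voicing measure: normalized autocorrelation at the pitch lag,
    clipped to the range [0, 1]."""
    idx = range(pitch_lag, len(band))
    num = sum(band[n] * band[n - pitch_lag] for n in idx)
    e0 = sum(band[n] ** 2 for n in idx)
    e1 = sum(band[n - pitch_lag] ** 2 for n in idx)
    den = math.sqrt(e0 * e1)
    if den == 0.0:
        return 0.0
    return min(1.0, max(0.0, num / den))

def pitch_weight(err, pitch_lag, degree):
    """Assumed one-tap pitch weighting: y[n] = e[n] - degree * e[n - T0]."""
    return [e - degree * (err[n - pitch_lag] if n >= pitch_lag else 0.0)
            for n, e in enumerate(err)]

def weighted_error(speech, err, pitch_lag, gamma=0.5):
    """Weight the error signal band by band; the degree of pitch weighting in
    each band grows with the voicing degree of the same band of the input."""
    s_bands = split_bands(speech)
    e_bands = split_bands(err)
    degrees, outputs = [], []
    for s_band, e_band in zip(s_bands, e_bands):
        degree = gamma * voicing_degree(s_band, pitch_lag)
        degrees.append(degree)
        outputs.append(pitch_weight(e_band, pitch_lag, degree))
    weighted = [a + b for a, b in zip(*outputs)]
    return weighted, degrees
```

With a degree of zero in every band (for example, a silent input frame), the sketch leaves the error signal unchanged, which corresponds to the case in which no pitch weighting is performed on bands determined to be unvoiced. The decoder-side pitch enhancement could be sketched in the same way by replacing the weighting filter with an enhancement filter applied to the decoded speech signal.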