JP3785363B2

JP3785363B2 - Audio signal encoding apparatus, audio signal decoding apparatus, and audio signal encoding method

Info

Publication number: JP3785363B2
Application number: JP2001396474A
Authority: JP
Inventors: 幸司吉田; 正米崎; 拓也河嶋; 茂明佐々木; 一則間野; 章俊片岡
Original assignee: Panasonic Corp; Nippon Telegraph and Telephone Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Nippon Telegraph and Telephone Corp; Panasonic Holdings Corp
Priority date: 2001-12-27
Filing date: 2001-12-27
Publication date: 2006-06-14
Anticipated expiration: 2021-12-27
Also published as: JP2003195900A

Description

【０００１】
【発明の属する技術分野】
本発明は音声信号符号化装置、音声信号復号装置及び音声信号符号化方法に関し、特に音声信号を符号励振線形予測（ＣＥＬＰ：Code-Excited Linear Prediction）符号化する場合に適用して好適なものである。
【０００２】
【従来の技術】
従来、有線通信、移動体通信、ディジタル記録メモ等に用いられる音声の圧縮展開を行う音声符号化のうち、中低ビットレートの音声符号化では、音声の生成過程をモデル化し、入力信号情報を声帯を模した音源情報と声道を模した合成フィルタ情報に分離して符号化する手法が一般的に用いられている。特に、音源情報として音源ベクトルを符号帳から選択し、合成フィルタ情報を線形予測分析により抽出するＣＥＬＰ型音声符号化が広く使用されている。
【０００３】
ＣＥＬＰ型音声符号化方式は、音声をある一定のフレーム長（５ｍｓ〜５０ｍｓ程度）に区切り、各フレーム毎に音声の線形予測分析を行い、フレーム毎の線形予測分析による予測残差（励振信号）を既知の波形からなる適応符号ベクトルと雑音符号ベクトルを用いて符号化するものである。適応符号ベクトルは過去に生成した駆動音源ベクトルを格納している適応符号帳から、雑音符号ベクトルは予め用意され定められた形状を有するベクトルを格納している雑音符号帳から選択されて使用される。
【０００４】
従来のこの種の音声信号符号化装置の構成を、図８に示す。先ず、音声信号符号化装置３０では、入力音声信号が線形予測分析器１に入力され、当該線形予測分析器１により入力音声が線形予測分析される。聴覚重みフィルタ生成器２は、分析された線形予測係数を用いて聴覚重みフィルタ１１を生成する。聴覚重みフィルタＨ_p（Ｚ）は、分析された線形予測係数｛αｉ｜ｉ＝１，………，Ｐ｝（Ｐは分析次数）で、次式を用いて表される。
【０００５】
【数１】

ここで、γ_Z、γ_Pはフォルマント強調係数で０＜γ_P＜γ_Z＜１を満たす定数である。
【０００６】
適応符号帳５は過去に生成した音源信号を蓄えている。乗算器６は適応符号帳５から選択された適応音源ベクトルにゲイン定数を乗じて適応音源信号を求める。雑音符号帳７には予め定められた音源ベクトルが蓄えられている。乗算器８は雑音符号帳７から選択された雑音音源ベクトルにゲイン定数を乗じて雑音音源信号を求める。加算器９は適応音源信号と雑音音源信号を加算し音源信号を得る。
【０００７】
合成フィルタ１０は、加算器９により得られた音源信号を、線形予測係数で構成されるフィルタでフィルタリングすることで合成音声を得る。聴覚重みフィルタ１１は、合成音声に対する聴覚的な信号強調を、聴覚重みフィルタ生成器２で生成されたフィルタを用いて行う。聴覚重みフィルタ１２は入力信号に対して聴覚的な信号強調を行う。
【０００８】
減算器１３は聴覚的に重み付けされた入力音声と合成音声の誤差信号を求める。二乗誤差最小化器１４は誤差信号のエネルギーを算出し、エネルギーが最小となる音源ベクトルとゲイン定数の組み合わせを求める。そして二乗誤差最小化器１４により求められた音源ベクトル、ゲイン定数とそのときの線形予測係数が多重化器１５により多重化されて音声符号化データが形成される。
【０００９】
このようにして形成された音声符号化データを復号する音声信号復号装置４０の構成を、図９に示す。音声符号化データは分離器１７に入力され、当該分離器１７により音声符号化データが線形予測係数、音源ベクトル及びゲイン定数に分離される。適応符号帳１８は過去に生成した音源信号を蓄えている。乗算器１９は適応符号帳１８から選択された適応音源ベクトルにゲイン定数を乗じて適応音源信号を求める。雑音符号帳２０は予め定められた音源ベクトルが蓄えられている。
【００１０】
乗算器２１は雑音符号帳２０から選択された雑音音源ベクトルにゲイン定数を乗じて雑音音源信号を求める。加算器２２は適応音源信号と雑音音源信号を加算し音源信号を生成する。合成フィルタ２３は、生成された音源ベクトルを、入力した線形予測係数で構成する合成フィルタ２３でフィルタリングし復号音声を合成する。
【００１１】
このように符号励振線形予測符号化技術を用いれば、聴感上量子化誤差の影響を受け易い周波数帯域に重みを付けて算出した誤差を最小化する音源を選択することができるため、主観的な量子化歪が少ない復号音声を得ることができる音声信号符号化装置３０及び音声信号復号装置４０を実現することができる。
【００１２】
【発明が解決しようとする課題】
しかしながら、特に入力音声として背景雑音が重畳された広周波数帯域信号が与えられたとき主観的な音声品質が劣化する問題がある。
【００１３】
この問題点について図１０を用いて説明する。人の聴覚は４〜５ｋHzを境に信号に対する感度が大きく変化する。図１０ではこの聴覚特性を破線で示しており、ここでの説明では破線以下の振幅値の信号に対して感度が０、つまり、その信号を検知できないと仮定する。
【００１４】
また図中の一点鎖線は聴覚重みフィルタ処理前の雑音振幅周波数特性を示し、図中細線は聴覚重みフィルタ処理後の雑音振幅周波数特性を示す。このとき、図中、一点鎖線と細線で示す２つの雑音振幅周波数特性の主観品質を比較したとき、聴覚特性を示す破線を超える振幅値の周波数帯域のみが主観的な品質に影響するため、一点鎖線（聴覚重みフィルタ処理前の雑音振幅周波数特性）の方がＳＮ比が高く、主観的な品質が良いと考えられる。すなわちこの例では聴覚重みフィルタを用いることで主観的な品質が劣化してしまう。
【００１５】
このことは音声信号のみであれば、低周波数帯域にエネルギが偏っているため大きな問題とはならないが、音声信号に高周波数帯域にもエネルギを有するような背景雑音が重畳されたとき、聴感上さほど重要でない高周波数帯域の信号の量子化誤差を少なくする音源が選択されることとなり、復号音声の品質が劣化する要因となる。
【００１６】
本発明はかかる点に鑑みてなされたものであり、背景雑音が重畳された音声信号が入力された場合でも、主観的な品質の劣化を抑制し得る音声信号符号化装置、音声信号復号装置及び音声信号符号化方法を提供することを目的とする。
【００１７】
【課題を解決するための手段】
かかる課題を解決するため本発明は、以下の構成を採る。
【００１８】
（１）本発明の音声信号符号化装置は、入力音声信号に対して処理フレーム単位で線形予測分析処理を施すことにより線形予測係数を算出する線形予測分析手段と、適応符号帳及び雑音符号帳に格納された適応音源及び雑音音源に対して前記線形予測係数を用いたフィルタリング処理を施すことにより符号化合成音を得る合成フィルタと、適応音源及び雑音音源のゲインを求め、さらにゲインを用いて得られる合成音と入力音声信号との間の符号化歪みが最小となる適応音源及び雑音音源の符号及びゲインを探索する演算手段と、入力音声信号及び合成音に対して聴覚重み付け処理を施す聴覚重みフィルタと、入力音声信号の処理フレームが、音声信号が支配的である有音区間か、又は非音声信号が支配的である無音区間かを判定する有音無音判定手段と、有音無音判定手段により無音区間であると判定された処理フレームが入力音声信号として入力された際、聴覚重みフィルタのフィルタ特性を、高周波数帯域を抑圧するように変換するフィルタ特性変換手段と、を具備する構成を採る。
【００１９】
この構成によれば、聴感上重要な低域をより正確に表現できる音源候補を選択し易くすることができるので、聴感上の劣化の少ない音声符号化データを形成することができる。
【００２０】
（２）本発明の音声信号符号化装置は、（１）に加えて、さらに、有音無音判定手段により無音区間であると判定された処理フレームが入力音声信号として入力された際、音源情報を生成するために設けられた適応符号帳及び雑音符号帳のうち、雑音符号帳から出力される雑音音源ベクトルの高周波数帯域を抑圧する周波数特性変換手段を具備する構成を採る。
【００２１】
（３）本発明の音声信号復号装置は、（２）の音声信号符号化装置から得られる情報を用いて音声信号を復号する符号励振線形予測型の音声信号復号装置であって、（２）の音声信号符号化装置から得られる情報を用いて、処理フレームが音声信号が支配的である有音区間か、又は非音声信号が支配的である無音区間かを判定する有音無音判定手段と、無音区間において雑音符号帳から得られる雑音音源ベクトルの高周波数帯域を抑圧する特性変換手段と、無音区間において雑音音源ベクトルに対して抑圧した周波数帯域の信号エネルギを補完する特性を有する雑音ベクトルを生成する雑音生成手段と、（２）の音声信号符号化装置から得られる情報を用いて生成した雑音ベクトルのゲインを推定し、雑音ベクトルに乗ずる乗算手段と、生成した雑音を音源信号に加算する加算手段と、を具備する構成を採る。
【００２２】
（２）及び（３）の構成によれば、音声信号復号装置側で、雑音が重畳された入力信号に対して量子化雑音が顕著となる高周波成分信号を定常雑音で置換することができるので、量子化雑音に起因する耳障りな雑音感を抑え、安定した背景雑音を復号して出力することができる。
【００２３】
（４）本発明の音声信号符号化方法は、入力音声信号に対して処理フレーム単位で線形予測分析処理を施すことにより線形予測係数を算出するステップと、適応符号帳及び雑音符号帳に格納された適応音源及び雑音音源に対して前記線形予測係数を用いたフィルタリング処理を施すことにより符号化合成音を得るステップと、適応音源及び雑音音源のゲインを求め、このゲインを用いて得られる合成音と入力音声信号との間の符号化歪みが最小となる適応音源及び雑音音源の符号及びゲインを探索するステップと、入力音声信号及び合成音に対して聴覚重み付け処理を施すステップと、入力音声信号の処理フレームが、音声信号が支配的である有音区間か、又は非音声信号が支配的である無音区間かを判定するステップと、無音区間であると判定された処理フレームが入力音声信号として入力された際、聴覚重みフィルタのフィルタ特性を、高周波数帯域を抑圧するように変換するステップと、を有するようにする。
【００２４】
この方法によれば、聴感上重要な低域をより正確に表現できる音源候補を選択し易くすることができるので、聴感上の劣化の少ない音声符号化データを形成することができる。
【００２５】
（５）本発明のプログラムは、コンピュータに、入力音声信号に対して処理フレーム単位で線形予測分析処理を施すことにより線形予測係数を算出する手順と、適応符号帳及び雑音符号帳に格納された適応音源及び雑音音源に対して前記線形予測係数を用いたフィルタリング処理を施すことにより符号化合成音を得る手順と、適応音源及び前記雑音音源のゲインを求め、このゲインを用いて得られる合成音と入力音声信号との間の符号化歪みが最小となる適応音源及び雑音音源の符号及びゲインを探索する手順と、入力音声信号及び合成音に対して聴覚重み付け処理を施す手順と、入力音声信号の処理フレームが、音声信号が支配的である有音区間か、又は非音声信号が支配的である無音区間かを判定する手順と、無音区間であると判定された処理フレームが入力音声信号として入力された際、聴覚重みフィルタのフィルタ特性を、高周波数帯域を抑圧するように変換する手順と、を実行させる構成を採る。
【００２６】
この構成によれば、コンピュータが、聴感上重要な低域をより正確に表現できる音源候補を選択して、聴感上の劣化の少ない音声符号化データを形成することができる。
【００２７】
【発明の実施の形態】
本発明者らは、入力音声として背景雑音が重畳された広周波数帯域信号が与えられたときに主観的な音声品質が劣化するのは、聴覚重みフィルタの生成において静的な聴覚特性を考慮していないためであると考えることで本発明に至った。
【００２８】
本発明の骨子は、入力信号を、音声信号が支配的である有音区間と、背景雑音が支配的である無音区間とに分類し、当該有音区間と無音区間とでそれぞれ異なる聴覚重みフィルタ特性を使って符号励振線形予測符号化処理を行うようにしたことである。
【００２９】
以下、本発明の実施の形態について図面を参照して詳細に説明する。
【００３０】
（実施の形態１）
図１において、１００は全体として、本発明による実施の形態１に係る音声信号符号化装置の構成を示す。音声信号符号化装置１００は、入力音声信号を線形予測分析器１０１に入力する。線形予測分析器１０１は入力音声を線形予測分析する。これにより線形予測分析器１０１は線形予測係数｛αｉ｜ｉ＝１，………，Ｐ｝を得る。ここでＰは分析次数である。
【００３１】
聴覚重みフィルタ生成器１０２は分析された線形予測係数を用いて聴覚重みフィルタを生成する。具体的には、聴覚重みフィルタ生成器１０２は抽出された線形予測係数から、次式を用いてフィルタの振幅周波数特性の谷部を強調した聴覚重みフィルタＨ_P（Ｚ）を生成する。
【００３２】
【数２】

ここで、γ_Z、γ_Pはフォルマント強調係数で０＜γ_P＜γ_Z＜１を満たす定数である。図２に、このように構成された聴覚重みフィルタの振幅周波数特性の一例を示す。
【００３３】
また音声信号符号化装置１００は有音無音判定器１０４を有し、当該有音無音判定器１０４に入力音声信号を入力させる。有音無音判定器１０４は入力音声信号から処理フレームが有音区間であるか無音区間であるかを判定する。ここで有音区間とは音声信号が支配的な区間であり、無音区間とは背景雑音が支配的な区間である。
【００３４】
このような有音区間と無音区間の判定は、音声信号と背景雑音それぞれにより異なる周波数特性や規則性を基に容易に行うことができる。有音無音判定器１０４は判定結果をフィルタ変換器１０５に送出する。
【００３５】
フィルタ変換器１０５は有音無音判定器１０４から入力音声信号が有音区間であることを示す判定結果が入力された場合には、聴覚重みフィルタ生成器１０２から出力される聴覚重みフィルタ特性をそのまま聴覚重みフィルタ１０６に送出する。これに対して有音無音判定器１０４から入力音声信号が無音区間であることを示す判定結果が入力された場合には、聴覚重みフィルタ生成器１０２から出力される聴覚重みフィルタ特性の高周波数帯域を抑圧するように周波数特性を変換した後、聴覚重みフィルタ１０６に送出する。
【００３６】
適応符号帳１０７は過去に生成した音源信号を蓄えている。つまり適応符号帳１０７は過去に生成された音源信号によって更新される動的な符号帳である。乗算器１０８は適応符号帳１０７により選択された適応音源ベクトルにゲイン定数を乗じて適応音源信号を求める。雑音符号帳１０９は予め定められた音源ベクトルが蓄えられている。乗算器１１０は雑音符号帳１０９により選択された雑音音源ベクトルにゲイン定数を乗じて雑音音源信号を求める。加算器１１１は適応音源信号と雑音音源信号を加算し音源信号を得る。
【００３７】
このように音声信号符号化装置１００においては、適応符号帳１０７、雑音符号帳１０９、乗算器１０８、１１０及び加算器１１１により音源が形成され、適応符号帳１０７及び雑音符号帳１０９により選択された音源ベクトルが、それぞれ乗算器１０８と乗算器１１０により定数倍され、加算器１１１で加算されることで音源信号が生成される。
【００３８】
合成フィルタ１１２は加算器１１１から出力される音源信号に対して線形予測係数で構成されるフィルタでフィルタリングすることで合成音声を得る。具体的には、合成フィルタ１１２では、線形予測係数｛αｉ｜ｉ＝１，………，Ｐ｝で構成され、次式で表されるフィルタＨ（Ｚ）を用いて音源信号をフィルタリングして合成音声を得る。
【００３９】
【数３】

聴覚重みフィルタ１０６は、合成音声に対する聴覚的な信号強調を、フィルタ変換器１０５で生成されたフィルタを用いて行う。聴覚重みフィルタ１１３は入力音声信号に対して聴覚的な信号強調を行う。この際、聴覚重みフィルタ１１３はフィルタ変換器１０５で生成されたフィルタを用いて聴覚的な信号強調を行う。
【００４０】
ここで上述したようにフィルタ変換器１０５は、無音区間において聴覚重みフィルタの高周波数帯域を抑圧する。この実施の形態の場合、この処理を実現するため、聴覚重みフィルタＨ_P（Ｚ）のインパルス応答をｈ_p（ｔ）とすると、このｈ_p（ｔ）に対して低域通過フィルタのインパルス応答ｈ_lpf（ｔ）を畳み込むことで周波数特性を変換した合成フィルタＦ_P（Ｚ）を得るようになされている。このときフィルタ変換器１０５は出力する合成フィルタＦ_P（Ｚ）のインパルス応答ｆ_p（ｔ）を、次式に従って決定する。
【００４１】
【数４】
$$f_p(t) = \sum_{\tau} h_p(\tau)\, h_{lpf}(t - \tau)$$

聴覚重みフィルタ１０６及び聴覚重みフィルタ１１３では、このように決定されたフィルタを用いて、合成音声及び入力音声をフィルタリングすることで聴覚的に重み付けされた入力音声及び合成音声とを得る。減算器１１４は聴覚的に重み付けされた入力音声と合成音声の誤差信号を求める。
【００４２】
二乗誤差最小化器１１５は誤差信号のエネルギを算出し、エネルギが最小となる音源ベクトルとゲイン定数の組み合わせを求める。そして二乗誤差最小化器１１５により求められた音源ベクトル、ゲイン定数とそのときの線形予測係数が多重化器１１６により多重化されて音声符号化データが形成される。
【００４３】
次に図２〜図５を用いて、この実施の形態の音声信号符号化装置１００の動作について説明する。図４は、入力音声信号が音声信号が支配的である場合、すなわち処理フレームが有音区間である場合について着目したものである。一方、図５は、入力音声信号が非音声信号（背景雑音）が支配的である場合、つまり処理フレームが無音区間である場合について着目したものである。
【００４４】
まず有音区間について説明する。有音無音判定器１０４により現在の処理フレームが有音区間であることを示す判定結果が得られ、聴覚重みフィルタ１０６、１１３は、図２の点線で示すような聴覚重みフィルタ生成器１０２により生成された聴覚重みフィルタ特性とされる。このようなフィルタ特性の聴覚重みフィルタ１０６、１１３を用いることにより、小さな振幅の周波数帯域を強調した信号間で誤差を最小化することができる。例えば図３の一点鎖線のような振幅周波数特性をもつ量子化雑音を、図４に示す振幅周波数特性とすることができる。
【００４５】
ここで図３及び図４の斜線部の面積は量子化雑音エネルギを示しており、この量子化雑音エネルギはコーデックの構成とビットレートにより決定される。ところで、聴感に対応する雑音の客観尺度としてＳＮ比があり、ＳＮ比が同じならば主観的に感じる雑音の大きさは等しい。このことは雑音エネルギが同じならば、信号の振幅周波数特性に合わせて、大きな振幅の周波数帯域の振幅が大きくなるように雑音の振幅周波数特性をシェービングすることで主観的な品質を向上させることができることを意味している。
【００４６】
このことから図４に示す量子化雑音の振幅周波数特性は、図３に示す量子化雑音よりも主観的に感じる雑音が小さいということができ、聴覚重みフィルタ１０６、１１３により主観的な品質が向上することを示している。
【００４７】
これに対して、処理フレームとして、背景雑音が支配的である無音区間が入力された場合、有音無音判定器１０４により現在の処理フレームが無音区間であることを示す判定結果が得られ、聴覚重みフィルタ１０６、１１３はフィルタ変換器１０５により、図５の破線で示すような高周波数帯域を抑圧するような聴覚重みフィルタ特性とされる。
【００４８】
この結果、高周波数帯域にもエネルギを有するような背景雑音が重畳された入力音声が符号化対象となった場合でも、聴覚重みフィルタ１０６、１１３により高周波数帯域成分が抑制されるので、二乗誤差最小化器１１５では例えば４〜５［ｋＨｚ］までの聴覚上重要となる低周波数帯域の量子化誤差を少なくする音源が選択される。つまり聴感上重要な低周波数帯域に重み付けした誤差尺度に基づいて音源探索することになり、主観的な音声品質を向上させることができる。
【００４９】
具体的には、上述したように、二乗誤差最小化器１１５では、減算器１１４から出力される差分信号のエネルギが最小となる音源を選択することで、量子化誤差を小さくしている。このため差分信号において高周波数帯域のエネルギを抑制することで、二乗誤差最小化器１１５は、低周波数帯域の量子化誤差を小さくするような音源を選択するように動作するので、実際上重要となる低周波数帯域の量子化誤差の低減効果が生じる。
【００５０】
因みに、音声信号が支配的である有音区間では、音声信号が低周波数帯域に大きなエネルギをもっているので、高周波数帯域をそれほど抑制しなくても、二乗誤差最小化器１１５により低周波数帯域の量子化誤差を重点的に小さくするような音源が選択されるので、この実施の形態では、有音区間の処理フレームが入力された場合には、聴覚重みフィルタ１０６、１１３に高周波数帯域を抑制する特性を持たせないようにしている。
【００５１】
以上の構成によれば、入力音声信号を声帯を模した音源情報と声道を模した合成フィルタ情報に分離して符号化する場合に、入力音声信号が有音区間か無音区間かを判定し、音声信号が殆ど含まれず背景雑音が支配的である無音区間であった場合に聴覚重みフィルタ１０６、１１３のフィルタ特性を高周波数帯域成分を抑制するように設定したことにより、背景雑音における低周波数帯域での量子化歪みを低減することができる。この結果、聴感上重要な低域をより正確に表現できる音源候補を選択し易くできるので、聴感上の劣化が少ない音声符号化装置１００を実現できる。
【００５２】
（実施の形態２）
図１との対応部分に同一符号を付して示す図６において、２００は全体として本発明の実施の形態２に係る音声信号符号化装置の構成を示す。この実施の形態の音声信号符号化装置２００は、雑音符号帳１０９から出力された雑音音源ベクトルの特性を有音無音判定器１０４の判定結果に応じて変換する特性変換器２０１を有すること、及び有音無音判定器１０４の判定結果の情報を多重化器１１６に出力することを除いて、実施の形態１の音声信号符号化装置１００と同様の構成でなる。
【００５３】
特性変換器２０１はフィルタ変換器１０５と同様の特性を有する。特性変換器２０１は有音無音判定器１０４からの判定結果に応じてフィルタ変換器１０５に同期して雑音音源ベクトルの周波数特性を変更する。すなわち特性変換器２０１は、有音無音判定器１０４から入力処理フレームが無音区間である判定結果が入力されると、雑音符号帳１０９から入力される雑音音源ベクトルν_S（ｔ）に対して、フィルタ変換器１０５と同じ低域通過フィルタ特性ｈ_lpf（ｔ）を用いて、次式で示すフィルタリング処理を施すことにより、特性変換器出力ψ_S（ｔ）を得る。
【００５４】
【数５】
$$\psi_S(t) = \sum_{\tau} \nu_S(\tau)\, h_{lpf}(t - \tau)$$

ここで雑音符号帳１０９には一般に周波数特性が平坦な雑音音源ベクトルが蓄積されているので、特性変換器２０１は、フィルタ変換器１０５に同期して無音区間においてこの平坦な周波数特性の雑音音源ベクトルの高周波数帯域を抑制するように周波数特性を変更する。
【００５５】
図７に、音声信号符号化装置２００により得られた符号化データを復号する音声信号復号装置３００の構成を示す。音声信号復号装置３００は、分離器３０１で音声信号符号化装置２００から受信した符号化データを線形予測係数、有音無音判定情報、音源ベクトル情報（適応音源ベクトル情報、雑音音源ベクトル情報）及びゲイン定数（適応音源ゲイン定数、雑音音源ゲイン定数）に分離する。そして線形予測係数を合成フィルタ３１２に、適応音源ベクトル情報を適応符号帳３０２に、雑音音源ベクトル情報を雑音符号帳３０４に、適応音源ゲイン定数、雑音音源ゲイン定数をそれぞれ適応符号帳３０２、雑音符号帳３０４に対応する乗算器３０３、３０７に、有音無音判定情報を有音無音判定器３０５に送出する。適応符号帳３０２は過去に生成した音源信号を蓄えている。乗算器３０３は適応符号帳３０２から出力された適応音源ベクトルに適応音源ゲイン定数を乗じて適応音源信号を求める。雑音符号帳３０４は予め定められた音源ベクトルが蓄えられている。
【００５６】
有音無音判定器３０５は音声信号符号化装置２００からの有音無音判定情報を用いて処理フレームが有音区間であるか無音区間であるか判定する。特性変換器３０６は雑音符号帳３０４から選択された雑音音源ベクトルの特性を有音無音情報に応じて変換する。実際上、特性変換器３０６は、音声信号符号化装置２００の特性変換器２０１と同様のフィルタ特性を有し、（５）式に基づいて雑音符号帳３０４から入力された雑音音源ベクトルをフィルタリングする。
【００５７】
乗算器３０７は特性変換器３０６から出力されたベクトルに雑音音源ゲイン定数を乗じて雑音音源信号を求める。加算器３０８は適応音源信号と雑音音源信号を加算し音源信号を生成する。
【００５８】
雑音生成器３０９は特性変換器３０６で変換した雑音音源の周波数領域でのエネルギ分布特性を補完する自励雑音ベクトルを生成する。実際上、雑音生成器３０９は有音無音情報に基づいて次式により自励雑音ベクトルν_r（ｔ）を生成する。
【００５９】
【数６】

ここでｒ（ｔ）は例えば発振器を用いて音声信号符号化装置２００と独立して音声信号復号装置３００で生成する白色雑音であり、ｈ_hpfは特性変換器３０６で用いたｈ_lpfと相補して全域通過型フィルタを構成する高域通過型フィルタである。
【００６０】
乗算器３１０は有音無音情報及び雑音音源ゲイン定数を利用して決定したゲインを自励雑音ベクトルに乗じ自励雑音信号を出力する。加算器３１１は音源信号と自励雑音信号を加算し、補正音源信号を生成する。合成フィルタ３１２は生成された補正音源信号をフィルタリングし復号音声を合成する。
【００６１】
以上の構成によれば、音声信号符号化装置２００により、誤差算出対象外の周波数帯域で生じる量子化雑音を抑制し、音声信号復号装置３００により、定常的な雑音に置換するようにしたことにより、主観的な音声品質を向上させることができる。
【００６２】
なお、上述の実施の形態では、有音無音判定情報は符号化装置から送信される構成として説明したが、符号化装置から送信する構成とせず、復号装置で受信した他の情報を用いて判定する構成としても良い。
【００６３】
【発明の効果】
以上説明したように、本発明によれば、入力音声信号の処理フレームが、音声信号が支配的である有音区間か、又は非音声信号が支配的である無音区間かを判定し、無音区間である処理フレームが入力音声信号として入力された際、聴覚重みフィルタのフィルタ特性を、高周波数帯域を抑圧するようにしたことにより、背景雑音が重畳された音声信号が入力された場合でも、主観的な品質の劣化を抑制し得る音声信号符号化装置を実現できる。
【図面の簡単な説明】
【図１】本発明の実施の形態１に係る音声信号符号化装置の構成を示すブロック図
【図２】実施の形態での有音区間における聴覚重みフィルタ特性の説明に供する特性曲線図
【図３】聴覚重みフィルタ特性の説明に供する特性曲線図
【図４】実施の形態での有音区間における量子化雑音特性を示す特性曲線図
【図５】実施の形態での無音区間における聴覚重みフィルタ特性の説明に供する特性曲線図
【図６】実施の形態２の音声信号符号化装置の構成を示すブロック図
【図７】実施の形態２の音声信号復号装置の構成を示すブロック図
【図８】符号励振線形予測符号化を行う従来の音声信号符号化装置の構成を示すブロック図
【図９】従来の音声信号復号装置の構成を示すブロック図
【図１０】量子化雑音と主観特性の関係を示す特性曲線図
【符号の説明】
１００、２００音声信号符号化装置
１０１線形予測分析器
１０２聴覚重みフィルタ生成器
１０４、３０５有音無音判定器
１０５フィルタ変換器
１０６、１１３聴覚重みフィルタ
１０７、３０２適応符号帳
１０９、３０４雑音符号帳
２０１特性変換器
３００音声信号復号装置

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech signal encoding device, a speech signal decoding device, and a speech signal encoding method, and is particularly suitable for application to code-excited linear prediction (CELP) encoding of a speech signal.
[0002]
[Prior art]
Conventionally, among speech coding schemes that compress and expand speech for wired communication, mobile communication, digital voice memo recording, and the like, medium and low bit rate speech coding generally uses a method that models the speech production process and encodes the input signal by separating it into sound source information imitating the vocal cords and synthesis filter information imitating the vocal tract. In particular, CELP speech coding, in which a sound source vector is selected from a codebook as the sound source information and the synthesis filter information is extracted by linear prediction analysis, is widely used.
[0003]
The CELP speech coding method divides speech into frames of a fixed length (about 5 ms to 50 ms), performs linear prediction analysis of the speech for each frame, and encodes the prediction residual (excitation signal) of that analysis for each frame using an adaptive code vector and a noise code vector, each consisting of a known waveform. The adaptive code vector is selected from an adaptive codebook that stores drive excitation vectors generated in the past, and the noise code vector is selected from a noise codebook that stores vectors having predetermined shapes prepared in advance.
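The framing step described above can be illustrated with a minimal sketch. The function name `split_into_frames`, the 16 kHz sampling rate, and the 20 ms frame length are assumptions chosen for illustration; the text only states that frames are roughly 5 ms to 50 ms long.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Split a sample sequence into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop the trailing partial frame so that every frame has the same length.
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


# Hypothetical usage: 1000 samples at 16 kHz cut into 20 ms frames.
frames = split_into_frames(list(range(1000)), sample_rate_hz=16000, frame_ms=20)
```

Each frame would then be passed to the linear prediction analysis and the codebook search independently.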
[0004]
FIG. 8 shows the configuration of a conventional speech signal encoding device of this type. First, in the speech signal encoding device 30, an input speech signal is input to the linear prediction analyzer 1, which performs linear prediction analysis on the input speech. The auditory weight filter generator 2 generates the auditory weight filter 11 using the analyzed linear prediction coefficients. The auditory weight filter H_p(Z) is expressed by the following equation using the analyzed linear prediction coefficients {α_i | i = 1, ..., P} (P is the analysis order).
[0005]
[Expression 1]

Here, γ_Z and γ_P are formant emphasis coefficients, which are constants satisfying 0 < γ_P < γ_Z < 1.
[0006]
The adaptive codebook 5 stores sound source signals generated in the past. A multiplier 6 multiplies the adaptive excitation vector selected from the adaptive codebook 5 by a gain constant to obtain an adaptive excitation signal. The noise codebook 7 stores a predetermined excitation vector. The multiplier 8 multiplies the noise source vector selected from the noise codebook 7 by a gain constant to obtain a noise source signal. The adder 9 adds the adaptive sound source signal and the noise sound source signal to obtain a sound source signal.
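The way multipliers 6 and 8 and adder 9 combine the two codebook outputs can be sketched as follows. The function name `build_excitation` and the example values are illustrative assumptions, not taken from the specification.

```python
def build_excitation(adaptive_vec, adaptive_gain, noise_vec, noise_gain):
    """Scale each codebook vector by its gain and add them sample by sample."""
    if len(adaptive_vec) != len(noise_vec):
        raise ValueError("both vectors must cover the same number of samples")
    return [adaptive_gain * a + noise_gain * n
            for a, n in zip(adaptive_vec, noise_vec)]


# Hypothetical three-sample vectors and gains.
excitation = build_excitation([1.0, 0.0, -1.0], 0.5, [0.0, 2.0, 2.0], 0.25)
```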
[0007]
The synthesis filter 10 obtains synthesized speech by filtering the sound source signal obtained by the adder 9 with a filter composed of linear prediction coefficients. The auditory weight filter 11 performs auditory signal enhancement on the synthesized speech using the filter generated by the auditory weight filter generator 2. The auditory weight filter 12 performs auditory signal enhancement on the input signal.
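As a rough illustration of the synthesis filtering described above, the sketch below implements an all-pole filter. It assumes the sign convention H(Z) = 1 / (1 + Σ α_i Z^-i); the equation itself is not reproduced in this text, so the sign is an assumption, and the name `synthesis_filter` is hypothetical.

```python
def synthesis_filter(excitation, lpc, state=None):
    """All-pole filtering: s[n] = e[n] - sum_i lpc[i-1] * s[n-i].

    `lpc` holds the coefficients alpha_1..alpha_P (assumed sign convention).
    `state` optionally carries the last P output samples, newest first,
    so that filtering can continue across frame boundaries.
    """
    order = len(lpc)
    history = list(state) if state is not None else [0.0] * order
    out = []
    for e in excitation:
        s = e - sum(a * h for a, h in zip(lpc, history))
        out.append(s)
        history = ([s] + history)[:order]  # keep only the newest P outputs
    return out, history


# Hypothetical first-order example driven by a unit impulse.
speech, final_state = synthesis_filter([1.0, 0.0, 0.0, 0.0], [-0.5])
```

Returning the filter state lets a caller process a signal frame by frame without resetting the filter memory.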
[0008]
The subtractor 13 obtains an error signal between the input sound and the synthesized sound that are weighted auditorily. The square error minimizer 14 calculates the energy of the error signal and obtains a combination of a sound source vector and a gain constant that minimizes the energy. Then, the sound source vector and gain constant obtained by the square error minimizer 14 and the linear prediction coefficient at that time are multiplexed by the multiplexer 15 to form speech encoded data.
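The error minimization performed by the square error minimizer 14 can be sketched for a single codebook as below. This is a simplified illustration under stated assumptions: `target` stands for the auditorily weighted input speech, each entry of `filtered_candidates` stands for a candidate vector that has already passed through the synthesis filter and the auditory weight filter, and the gain is computed in closed form. The specification describes a search over combinations of vectors and gains from both codebooks and does not prescribe this particular procedure.

```python
def search_codebook(target, filtered_candidates):
    """Return (index, gain, error_energy) minimising ||target - gain * y||^2."""
    best = None
    for index, y in enumerate(filtered_candidates):
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero candidate cannot reduce the error
        corr = sum(t * v for t, v in zip(target, y))
        gain = corr / energy  # least-squares optimal gain for this candidate
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Hypothetical target and three filtered candidates.
best = search_codebook([2.0, 0.0, -2.0],
                       [[0.0, 1.0, 0.0], [1.0, 0.0, -1.0], [1.0, 1.0, 1.0]])
```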
[0009]
FIG. 9 shows the configuration of a speech signal decoding apparatus 40 that decodes the speech encoded data formed in this way. The speech encoded data is input to the separator 17, and the separator 17 separates the speech encoded data into linear prediction coefficients, excitation vectors, and gain constants. The adaptive codebook 18 stores sound source signals generated in the past. The multiplier 19 multiplies the adaptive excitation vector selected from the adaptive codebook 18 by a gain constant to obtain an adaptive excitation signal. The noise codebook 20 stores predetermined excitation vectors.
[0010]
The multiplier 21 multiplies the noise source vector selected from the noise codebook 20 by a gain constant to obtain a noise source signal. The adder 22 adds the adaptive sound source signal and the noise sound source signal to generate a sound source signal. The synthesis filter 23, which is configured with the input linear prediction coefficients, filters the generated excitation vector to synthesize the decoded speech.
[0011]
By using the code-excited linear predictive coding technique in this way, it is possible to select a sound source that minimizes an error calculated with weighting on the frequency bands in which quantization error is perceptually most noticeable. This makes it possible to realize a speech signal encoding device 30 and a speech signal decoding device 40 that obtain decoded speech with little subjective quantization distortion.
[0012]
[Problems to be solved by the invention]
However, there is a problem that subjective voice quality deteriorates particularly when a wide frequency band signal on which background noise is superimposed is given as input voice.
[0013]
This problem will be described with reference to FIG. 10. The sensitivity of human hearing to a signal changes greatly at around 4 to 5 kHz. In FIG. 10, this auditory characteristic is indicated by a broken line, and in the description here it is assumed that the sensitivity is 0 for a signal whose amplitude is below the broken line, that is, such a signal cannot be detected.
[0014]
The one-dot chain line in the figure indicates the noise amplitude frequency characteristic before auditory weight filter processing, and the thin line indicates the noise amplitude frequency characteristic after auditory weight filter processing. When the subjective quality of these two noise amplitude frequency characteristics is compared, only the frequency bands whose amplitude exceeds the broken line indicating the auditory characteristic affect subjective quality, so the one-dot chain line (the noise amplitude frequency characteristic before auditory weight filter processing) has the higher SN ratio and is considered to give better subjective quality. That is, in this example, using the auditory weight filter degrades the subjective quality.
[0015]
With a speech signal alone this is not a major problem, because its energy is concentrated in the low frequency band. However, when background noise that also has energy in the high frequency band is superimposed on the speech signal, a sound source is selected that reduces the quantization error of high-frequency signal components that matter little perceptually, which degrades the quality of the decoded speech.
[0016]
The present invention has been made in view of these points, and an object thereof is to provide a speech signal encoding device, a speech signal decoding device, and a speech signal encoding method that can suppress degradation of subjective quality even when a speech signal with superimposed background noise is input.
[0017]
[Means for Solving the Problems]
In order to solve this problem, the present invention adopts the following configuration.
[0018]
(1) A speech signal encoding device according to the present invention includes: linear prediction analysis means for calculating linear prediction coefficients by performing linear prediction analysis on an input speech signal in units of processing frames; a synthesis filter for obtaining encoded synthesized sound by filtering, using the linear prediction coefficients, the adaptive sound source and the noise sound source stored in an adaptive codebook and a noise codebook; arithmetic means for obtaining gains of the adaptive sound source and the noise sound source and searching for the codes and gains of the adaptive sound source and the noise sound source that minimize the coding distortion between the synthesized sound obtained using the gains and the input speech signal; an auditory weight filter for applying auditory weighting to the input speech signal and the synthesized sound; sound/silence determination means for determining whether a processing frame of the input speech signal is a voiced section in which a speech signal is dominant or a silent section in which a non-speech signal is dominant; and filter characteristic conversion means for converting the filter characteristic of the auditory weight filter so as to suppress the high frequency band when a processing frame determined by the sound/silence determination means to be a silent section is input as the input speech signal.
[0019]
According to this configuration, it is possible to easily select a sound source candidate that can accurately express a low frequency range that is important for auditory perception, and thus it is possible to form speech encoded data with little perceptual degradation.
[0020]
(2) In addition to (1), the speech signal encoding device of the present invention further includes frequency characteristic conversion means for suppressing, when a processing frame determined by the sound/silence determination means to be a silent section is input as the input speech signal, the high frequency band of the noise source vector output from the noise codebook, which is one of the adaptive codebook and the noise codebook provided to generate sound source information.
[0021]
(3) A speech signal decoding device of the present invention is a code-excited linear prediction speech signal decoding device that decodes a speech signal using information obtained from the speech signal encoding device of (2), and includes: sound/silence determination means for determining, using the information obtained from the speech signal encoding device of (2), whether a processing frame is a voiced section in which a speech signal is dominant or a silent section in which a non-speech signal is dominant; characteristic conversion means for suppressing, in a silent section, the high frequency band of the noise source vector obtained from the noise codebook; noise generation means for generating, in a silent section, a noise vector having a characteristic that complements the signal energy of the frequency band suppressed in the noise source vector; multiplication means for estimating the gain of the generated noise vector using the information obtained from the speech signal encoding device of (2) and multiplying the noise vector by that gain; and addition means for adding the generated noise to the sound source signal.
[0022]
According to the configurations of (2) and (3), the speech signal decoding device can replace, with stationary noise, the high-frequency component signal in which quantization noise becomes conspicuous for an input signal on which noise is superimposed. This suppresses the harsh noisiness caused by quantization noise and allows stable background noise to be decoded and output.
[0023]
(4) A speech signal encoding method of the present invention includes the steps of: calculating linear prediction coefficients by performing linear prediction analysis on an input speech signal in units of processing frames; obtaining encoded synthesized sound by filtering, using the linear prediction coefficients, the adaptive sound source and the noise sound source stored in an adaptive codebook and a noise codebook; obtaining gains of the adaptive sound source and the noise sound source and searching for the codes and gains of the adaptive sound source and the noise sound source that minimize the coding distortion between the synthesized sound obtained using the gains and the input speech signal; applying auditory weighting to the input speech signal and the synthesized sound; determining whether a processing frame of the input speech signal is a voiced section in which a speech signal is dominant or a silent section in which a non-speech signal is dominant; and converting the filter characteristic of the auditory weight filter so as to suppress the high frequency band when a processing frame determined to be a silent section is input as the input speech signal.
[0024]
According to this method, it is possible to easily select a sound source candidate that can accurately express a low frequency range that is important for auditory perception, and thus speech encoded data with little perceptual degradation can be formed.
[0025]
(5) A program of the present invention causes a computer to execute: a procedure for calculating linear prediction coefficients by performing linear prediction analysis on an input speech signal in units of processing frames; a procedure for obtaining encoded synthesized sound by filtering, using the linear prediction coefficients, the adaptive sound source and the noise sound source stored in an adaptive codebook and a noise codebook; a procedure for obtaining gains of the adaptive sound source and the noise sound source and searching for the codes and gains of the adaptive sound source and the noise sound source that minimize the coding distortion between the synthesized sound obtained using the gains and the input speech signal; a procedure for applying auditory weighting to the input speech signal and the synthesized sound; a procedure for determining whether a processing frame of the input speech signal is a voiced section in which a speech signal is dominant or a silent section in which a non-speech signal is dominant; and a procedure for converting the filter characteristic of the auditory weight filter so as to suppress the high frequency band when a processing frame determined to be a silent section is input as the input speech signal.
[0026]
According to this configuration, the computer can select sound source candidates that can more accurately represent a low frequency range that is important for auditory sense, and can form speech encoded data with little auditory degradation.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
The present inventors arrived at the present invention by considering that subjective speech quality deteriorates when a wide frequency band signal with superimposed background noise is given as input speech because static auditory characteristics are not taken into account in generating the auditory weight filter.
[0028]
The essence of the present invention is that the input signal is classified into voiced sections, in which the speech signal is dominant, and silent sections, in which background noise is dominant, and that code-excited linear prediction encoding is performed using different auditory weight filter characteristics for the voiced sections and the silent sections.
[0029]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0030]
(Embodiment 1)
In FIG. 1, reference numeral 100 generally indicates the configuration of a speech signal encoding apparatus according to Embodiment 1 of the present invention. The speech signal encoding apparatus 100 inputs an input speech signal to the linear prediction analyzer 101. The linear prediction analyzer 101 performs linear prediction analysis on the input speech. Thereby, the linear prediction analyzer 101 obtains linear prediction coefficients {αi | i = 1,..., P}. Here, P is the analysis order.
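The specification does not state how the linear prediction analyzer 101 computes the coefficients. One common choice, shown here purely as an assumed example, is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch uses the sign convention A(Z) = 1 + Σ α_i Z^-i, which is also an assumption, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return ([alpha_1..alpha_P], residual_energy) for A(Z) = 1 + sum alpha_i Z^-i."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input (e.g. an all-zero frame); stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


# Hypothetical usage with autocorrelation values shaped like a first-order process.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

In practice a window is usually applied to the frame before the autocorrelation is computed; that step is omitted here for brevity.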
[0031]
The auditory weight filter generator 102 generates an auditory weight filter using the analyzed linear prediction coefficients. Specifically, from the extracted linear prediction coefficients and using the following equation, the auditory weight filter generator 102 generates an auditory weight filter H_P(Z) that emphasizes the valleys of the amplitude frequency characteristic of the filter.
[0032]
[Expression 2]

Here, γ_Z and γ_P are formant emphasis coefficients, which are constants satisfying 0 < γ_P < γ_Z < 1. FIG. 2 shows an example of the amplitude frequency characteristic of the auditory weight filter configured as described above.
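The equation for H_P(Z) is not reproduced in this text. A form commonly used in CELP coders that is consistent with the definitions above is H_P(Z) = A(Z/γ_Z) / A(Z/γ_P) with A(Z) = 1 + Σ α_i Z^-i, that is, a numerator with coefficients α_i γ_Z^i and a denominator with coefficients α_i γ_P^i. The sketch below assumes this form and this sign convention; both are assumptions rather than a reading of the original equation, and the function names and example values are hypothetical.

```python
def bandwidth_expand(lpc, gamma):
    """Scale alpha_i by gamma**i for i = 1..P."""
    return [a * gamma ** (i + 1) for i, a in enumerate(lpc)]


def weighting_filter_coeffs(lpc, gamma_z, gamma_p):
    """Return (numerator, denominator) coefficients, excluding the leading 1."""
    if not 0.0 < gamma_p < gamma_z < 1.0:
        raise ValueError("expected 0 < gamma_p < gamma_z < 1")
    return bandwidth_expand(lpc, gamma_z), bandwidth_expand(lpc, gamma_p)


def pole_zero_filter(signal, zeros, poles):
    """y[n] = x[n] + sum_i zeros[i] * x[n-1-i] - sum_i poles[i] * y[n-1-i]."""
    out = []
    for n, x in enumerate(signal):
        y = x
        for i, b in enumerate(zeros):
            if n - 1 - i >= 0:
                y += b * signal[n - 1 - i]
        for i, a in enumerate(poles):
            if n - 1 - i >= 0:
                y -= a * out[n - 1 - i]
        out.append(y)
    return out


# Hypothetical second-order coefficients; the gamma values are illustrative only.
zeros, poles = weighting_filter_coeffs([-0.5, 0.25], gamma_z=0.5, gamma_p=0.25)
impulse_response = pole_zero_filter([1.0, 0.0, 0.0], zeros, poles)
```

Filtering a unit impulse in this way yields the leading samples of the impulse response h_p(t) of the weighting filter.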
[0033]
The speech signal encoding apparatus 100 also includes a sound/silence determination unit 104, to which the input speech signal is input. The sound/silence determination unit 104 determines from the input speech signal whether the processing frame is a voiced section or a silent section. Here, a voiced section is a section in which the speech signal is dominant, and a silent section is a section in which background noise is dominant.
[0034]
Such determination of voiced and silent sections can be made easily on the basis of the frequency characteristics and regularity, which differ between speech signals and background noise. The sound/silence determination unit 104 sends the determination result to the filter converter 105.
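The specification does not describe the decision rule used by the sound/silence determination unit 104. As one possible illustration of a decision based on regularity, the sketch below measures periodicity with a normalized autocorrelation and compares it with a threshold. The function names, the lag range, and the threshold of 0.6 are all assumptions made for this example.

```python
import math
import random


def periodicity(frame, min_lag, max_lag):
    """Largest normalised autocorrelation over the lag range (near 1 if periodic)."""
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        x = frame[lag:]
        y = frame[:len(frame) - lag]
        ex = sum(v * v for v in x)
        ey = sum(v * v for v in y)
        if ex == 0.0 or ey == 0.0:
            continue  # no energy in one segment; this lag carries no information
        corr = sum(a * b for a, b in zip(x, y)) / math.sqrt(ex * ey)
        best = max(best, corr)
    return best


def is_voiced_frame(frame, min_lag=20, max_lag=80, threshold=0.6):
    """Classify a frame as voiced when it is sufficiently periodic."""
    return periodicity(frame, min_lag, max_lag) > threshold


# Hypothetical test signals: a periodic tone and seeded pseudo-random noise.
tone = [math.sin(2.0 * math.pi * n / 40.0) for n in range(320)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(320)]
```

A periodicity measure alone would misclassify non-periodic speech such as fricatives, so a practical detector would likely combine it with spectral features, in line with the frequency characteristics mentioned above.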
[0035]
When a determination result indicating that the input speech signal is a voiced section is input from the sound/silence determination unit 104, the filter converter 105 sends the auditory weight filter characteristic output from the auditory weight filter generator 102 to the auditory weight filter 106 unchanged. On the other hand, when a determination result indicating that the input speech signal is a silent section is input from the sound/silence determination unit 104, the filter converter 105 converts the frequency characteristic so as to suppress the high frequency band of the auditory weight filter characteristic output from the auditory weight filter generator 102, and then sends the result to the auditory weight filter 106.
[0036]
The adaptive codebook 107 stores sound source signals generated in the past. That is, the adaptive codebook 107 is a dynamic codebook that is updated by a sound source signal generated in the past. Multiplier 108 obtains an adaptive excitation signal by multiplying the adaptive excitation vector selected by adaptive codebook 107 by a gain constant. The noise codebook 109 stores predetermined excitation vectors. Multiplier 110 multiplies the noise source vector selected by noise codebook 109 by a gain constant to obtain a noise source signal. The adder 111 adds the adaptive sound source signal and the noise sound source signal to obtain a sound source signal.
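The dynamic behavior of the adaptive codebook 107 can be sketched as a buffer of past sound source samples. Addressing the buffer by a lag and repeating the segment when the lag is shorter than the requested vector are common CELP practices assumed here for illustration; the specification only states that past sound source signals are stored and used for updating. The class name and methods are hypothetical.

```python
class AdaptiveCodebook:
    """History of past sound source samples, addressed by lag."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        self.history = [0.0] * size

    def vector(self, lag, length):
        """Return `length` samples starting `lag` samples back.

        When lag < length, the available segment is repeated periodically.
        """
        if not 1 <= lag <= len(self.history):
            raise ValueError("lag out of range")
        segment = self.history[-lag:]
        return [segment[n % lag] for n in range(length)]

    def update(self, excitation):
        """Append the newly generated sound source and drop the oldest samples."""
        size = len(self.history)
        self.history = (self.history + list(excitation))[-size:]


# Hypothetical usage with a four-sample history.
codebook = AdaptiveCodebook(4)
codebook.update([1.0, 2.0, 3.0])
candidate = codebook.vector(lag=2, length=5)
```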
[0037]
As described above, in the speech signal encoding apparatus 100, the sound source is formed by the adaptive codebook 107, the noise codebook 109, the multipliers 108 and 110, and the adder 111. The sound source vectors selected by the adaptive codebook 107 and the noise codebook 109 are multiplied by constants in the multiplier 108 and the multiplier 110, respectively, and added by the adder 111 to generate the sound source signal.
[0038]
The synthesis filter 112 obtains synthesized speech by filtering the sound source signal output from the adder 111 with a filter composed of the linear prediction coefficients. Specifically, the synthesis filter 112 filters the sound source signal using the filter H(Z), which is composed of the linear prediction coefficients {α_i | i = 1, ..., P} and is represented by the following equation, to obtain the synthesized speech.
[0039]
[Equation 3]
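As an illustration of all-pole synthesis filtering, the following Python sketch assumes the common form H(Z) = 1 / (1 + Σ αi Z^-i). The sign convention, the zero initial filter state, and the function name `synthesis_filter` are assumptions made for this sketch, not statements taken from the specification.

```python
def synthesis_filter(excitation, lpc):
    """All-pole filtering under the assumed convention
    s[n] = e[n] - sum_{i=1..P} lpc[i-1] * s[n-i], with zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(lpc, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out


# First-order example: a unit impulse decays geometrically.
print(synthesis_filter([1.0, 0.0, 0.0, 0.0], [-0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

A practical codec would carry the filter state from one processing frame to the next; the sketch resets it for simplicity.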

The auditory weight filter 106 performs auditory signal enhancement on the synthesized speech using the filter generated by the filter converter 105. The auditory weight filter 113 performs auditory signal enhancement on the input voice signal. At this time, the auditory weight filter 113 performs auditory signal enhancement using the filter generated by the filter converter 105.
[0040]
Here, as described above, the filter converter 105 suppresses the high frequency band of the auditory weight filter in the silent period. In this embodiment, to realize this process, letting h_p(t) be the impulse response of the auditory weight filter H_p(Z), a synthesis filter F_p(Z) whose frequency characteristic is converted is obtained by convolving h_p(t) with the impulse response h_lpf(t) of a low-pass filter. At this time, the filter converter 105 determines the impulse response f_p(t) of the synthesis filter F_p(Z) to be output according to the following equation.
[0041]
[Equation 4]
$$f_p(t) = h_p(t) \ast h_{lpf}(t)$$
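The switching behavior of the filter converter 105 can be sketched as follows. The sketch assumes that both impulse responses are available as short, truncated FIR sequences; the names `convolve` and `convert_weight_filter` and the coefficient values are hypothetical.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def convert_weight_filter(h_p, h_lpf, is_silent):
    """Return the impulse response handed to the auditory weight filters.

    Sound section: h_p is passed through unchanged.
    Silent section: h_p is convolved with the low-pass response h_lpf.
    """
    if is_silent:
        return convolve(h_p, h_lpf)
    return list(h_p)


h_p = [1.0, -0.5]    # hypothetical truncated weight-filter response
h_lpf = [0.5, 0.5]   # hypothetical two-tap low-pass (moving average)
print(convert_weight_filter(h_p, h_lpf, is_silent=False))  # [1.0, -0.5]
print(convert_weight_filter(h_p, h_lpf, is_silent=True))   # [0.5, 0.25, -0.25]
```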

In the auditory weight filter 106 and the auditory weight filter 113, the synthesized speech and the input speech are filtered using the filter determined in this manner, thereby obtaining perceptually weighted synthesized speech and input speech. The subtracter 114 obtains an error signal between the perceptually weighted input speech and synthesized speech.
[0042]
The square error minimizer 115 calculates the energy of the error signal and obtains a combination of a sound source vector and a gain constant that minimizes the energy. Then, the sound source vector and gain constant obtained by the square error minimizer 115 and the linear prediction coefficient at that time are multiplexed by the multiplexer 116 to form speech encoded data.
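The search performed by the square error minimizer 115 can be illustrated with a deliberately simplified sketch. It assumes a single codebook whose candidates have already been passed through the synthesis filter and the auditory weight filter, and a small set of quantized gains searched exhaustively. The function names and data are hypothetical, and the joint adaptive and noise codebook search of a real CELP coder is not reproduced.

```python
def error_energy(target, candidate, gain):
    """Energy of the difference between the weighted target and gain * candidate."""
    return sum((t - gain * c) ** 2 for t, c in zip(target, candidate))


def search_codebook(target, codebook, gains):
    """Exhaustively pick the (index, gain) pair that minimizes the error energy."""
    best = None
    for idx, vec in enumerate(codebook):
        for g in gains:
            e = error_energy(target, vec, g)
            if best is None or e < best[0]:
                best = (e, idx, g)
    return best[1], best[2]


target = [2.0, 0.0, -2.0]                       # weighted input speech (hypothetical)
codebook = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]  # weighted synthesized candidates
print(search_codebook(target, codebook, [0.5, 1.0, 2.0]))  # (1, 2.0)
```

Because the error is measured after weighting, any band that the weight filter suppresses contributes less to the selection, which is the mechanism the following paragraphs rely on.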
[0043]
Next, the operation of the speech signal encoding apparatus 100 according to this embodiment will be described with reference to FIGS. FIG. 4 focuses on the case where the speech signal is dominant in the input speech signal, that is, the case where the processing frame is a sound section. On the other hand, FIG. 5 focuses on the case where the non-speech signal (background noise) is dominant in the input speech signal, that is, the case where the processing frame is a silent section.
[0044]
First, the sound section will be described. The sound / silence determination unit 104 obtains a determination result indicating that the current processing frame is a sound section, and the auditory weight filters 106 and 113 are set to the auditory weight filter characteristic generated by the auditory weight filter generator 102, as indicated by a dotted line in FIG. By using the auditory weight filters 106 and 113 having such filter characteristics, the error is minimized between signals in which the frequency bands with small amplitude are emphasized. For example, quantization noise having an amplitude frequency characteristic such as the one-dot chain line in FIG. 3 can be reshaped into the amplitude frequency characteristic shown in FIG.
[0045]
Here, the hatched areas in FIGS. 3 and 4 indicate the quantization noise energy, which is determined by the codec configuration and the bit rate. Incidentally, the SN ratio is an objective measure of noise that corresponds to hearing. This means that, for the same noise energy, the subjective quality can be improved by shaping the amplitude frequency characteristic of the noise to follow that of the signal, so that the noise amplitude is larger in the frequency bands where the signal amplitude is large.
[0046]
Therefore, the quantization noise with the amplitude frequency characteristic shown in FIG. 4 is subjectively less noticeable than the quantization noise shown in FIG. 3, which shows that the auditory weight filters 106 and 113 improve the subjective quality.
[0047]
On the other hand, when a silent section in which background noise is dominant is input as a processing frame, the sound / silence determination unit 104 obtains a determination result indicating that the current processing frame is a silent section. The auditory weight filters 106 and 113 are then set to auditory weight filter characteristics that suppress the high frequency band, as indicated by broken lines in FIG.
[0048]
As a result, even when input speech on which background noise with energy also in the high frequency band is superimposed is encoded, the high frequency band components are suppressed by the auditory weight filters 106 and 113, so that the square error minimizer 115 selects a sound source that reduces the quantization error in the perceptually important low frequency band, for example below 4 to 5 [kHz]. That is, the sound source search is performed based on an error measure weighted toward the low frequency band that is important for hearing, and the subjective voice quality can be improved.
[0049]
Specifically, as described above, the square error minimizer 115 reduces the quantization error by selecting a sound source that minimizes the energy of the difference signal output from the subtractor 114. For this reason, by suppressing the energy in the high frequency band in the difference signal, the square error minimizer 115 operates to select a sound source that reduces the quantization error in the low frequency band. This produces an effect of reducing the quantization error in the low frequency band.
[0050]
Incidentally, in a sound section where the speech signal is dominant, the speech signal has large energy in the low frequency band. Therefore, even if the high frequency band is not strongly suppressed, the square error minimizer 115 can reduce the quantization error in the low frequency band. For this reason, in this embodiment, when a processing frame in a sound section is input, the auditory weight filters 106 and 113 are not given a characteristic that suppresses the high frequency band.
[0051]
According to the above configuration, when the input speech signal is encoded by separating it into sound source information simulating the vocal cords and synthesis filter information simulating the vocal tract, it is determined whether the input speech signal is a sound section or a silent section. In a silent section, in which the speech signal is hardly included and background noise is dominant, the filter characteristics of the auditory weight filters 106 and 113 are set so as to suppress the high frequency band component, so that the quantization distortion in the low frequency band of the background noise can be reduced. As a result, a sound source candidate that more accurately represents the perceptually important low frequency range can be selected easily, and thus the speech signal encoding apparatus 100 with little auditory degradation can be realized.
[0052]
(Embodiment 2)
In FIG. 6, in which parts corresponding to those in FIG. 1 are assigned the same reference numerals, 200 indicates the overall configuration of the speech signal encoding apparatus according to Embodiment 2 of the present invention. The speech signal encoding apparatus 200 according to this embodiment has the same configuration as the speech signal encoding apparatus 100 of Embodiment 1, except that it includes a characteristic converter 201 that converts the characteristic of the noise source vector output from the noise codebook 109 according to the determination result of the sound / silence determination unit 104, and that the determination result of the sound / silence determination unit 104 is output to the multiplexer 116.
[0053]
The characteristic converter 201 has the same characteristics as the filter converter 105. The characteristic converter 201 changes the frequency characteristic of the noise source vector in synchronization with the filter converter 105 according to the determination result from the sound / silence determination unit 104. That is, when a determination result indicating that the input processing frame is a silent section is input from the sound / silence determination unit 104, the characteristic converter 201 filters the noise source vector ν_s(t) input from the noise codebook 109 with the same low-pass filter characteristic h_lpf(t) as the filter converter 105, as represented by the following equation, to obtain the characteristic converter output ψ_s(t).
[0054]
[Equation 5]
$$\psi_s(t) = \nu_s(t) \ast h_{lpf}(t)$$

Here, since the noise codebook 109 generally stores noise source vectors having a flat frequency characteristic, the characteristic converter 201, in synchronization with the filter converter 105, changes the frequency characteristic of these noise source vectors in the silent section so as to suppress the high frequency band.
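The effect of the characteristic converter on a noise source vector can be sketched as follows. The sketch assumes a causal FIR low-pass filter applied within one frame with zero initial state and output truncated to the frame length; the function name `lowpass_noise_vector` and the two-tap filter are hypothetical.

```python
def lowpass_noise_vector(nu, h_lpf, is_silent):
    """Filter the noise source vector with h_lpf in silent sections only."""
    if not is_silent:
        return list(nu)
    out = []
    for t in range(len(nu)):
        acc = 0.0
        for k, h in enumerate(h_lpf):
            if t - k >= 0:
                acc += h * nu[t - k]
        out.append(acc)
    return out


# An alternating-sign vector contains only the highest representable frequency.
nu = [1.0, -1.0, 1.0, -1.0]
print(lowpass_noise_vector(nu, [0.5, 0.5], is_silent=True))   # [0.5, 0.0, 0.0, 0.0]
print(lowpass_noise_vector(nu, [0.5, 0.5], is_silent=False))  # [1.0, -1.0, 1.0, -1.0]
```

Apart from the start-up sample, the two-tap average cancels the alternating component entirely, which illustrates the high-band suppression applied in silent sections.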
[0055]
FIG. 7 shows a configuration of audio signal decoding apparatus 300 that decodes encoded data obtained by audio signal encoding apparatus 200. In the audio signal decoding apparatus 300, the separator 301 separates the encoded data received from the audio signal encoding apparatus 200 into linear prediction coefficients, sound / silence determination information, excitation vector information (adaptive excitation vector information, noise excitation vector information), and gain constants (adaptive sound source gain constant, noise sound source gain constant). Then, the linear prediction coefficients are sent to the synthesis filter 312, the adaptive excitation vector information to the adaptive codebook 302, the noise excitation vector information to the noise codebook 304, the adaptive excitation gain constant and the noise excitation gain constant to the multipliers 303 and 307 corresponding to the adaptive codebook 302 and the noise codebook 304, respectively, and the sound / silence determination information to the sound / silence determination unit 305. Adaptive codebook 302 stores sound source signals generated in the past. Multiplier 303 multiplies the adaptive excitation vector output from adaptive codebook 302 by an adaptive excitation gain constant to obtain an adaptive excitation signal. The noise codebook 304 stores predetermined excitation vectors.
[0056]
The sound / silence determination unit 305 determines whether the processing frame is a sound section or a silent section using the sound / silence determination information from the speech signal encoding apparatus 200. The characteristic converter 306 converts the characteristic of the noise source vector selected from the noise codebook 304 according to the sound / silence determination result. In practice, the characteristic converter 306 has the same filter characteristics as the characteristic converter 201 of the speech signal encoding apparatus 200, and filters the noise source vector input from the noise codebook 304 based on Equation (5).
[0057]
Multiplier 307 multiplies the vector output from characteristic converter 306 by a noise source gain constant to obtain a noise source signal. The adder 308 adds the adaptive sound source signal and the noise sound source signal to generate a sound source signal.
[0058]
The noise generator 309 generates a self-excited noise vector that complements the energy distribution characteristic in the frequency domain of the noise source converted by the characteristic converter 306. In practice, the noise generator 309 generates the self-excited noise vector v_r(t) according to the following equation.
[0059]
[Equation 6]
$$v_r(t) = r(t) \ast h_{hpf}(t)$$

Here, r(t) is white noise that the audio signal decoding apparatus 300 generates independently of the audio signal encoding apparatus 200, for example using an oscillator, and h_hpf is a high-pass filter that, together with h_lpf used in the characteristic converter 306, constitutes an all-pass filter.
[0060]
The multiplier 310 multiplies the self-excited noise vector by the gain determined using the sound / silent information and the noise source gain constant, and outputs a self-excited noise signal. The adder 311 adds the sound source signal and the self-excited noise signal to generate a corrected sound source signal. The synthesis filter 312 filters the generated corrected excitation signal and synthesizes decoded speech.
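The decoder-side noise complement can be sketched as follows. Several points are assumptions made only for this sketch: the high-pass response is built as a unit impulse minus h_lpf (one simple way to make the pair sum to an all-pass response), the white noise is Gaussian, the self-excited noise gain is the noise source gain constant times a hypothetical `scale` factor in silent sections, and the gain is zero in sound sections. The specification states only that the gain is determined from the sound / silence information and the noise source gain constant.

```python
import random


def fir(x, h):
    """Causal FIR filtering with zero initial state, output truncated to len(x)."""
    return [sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
            for t in range(len(x))]


def complementary_highpass(h_lpf):
    """Assumed construction: h_hpf = delta - h_lpf, so h_lpf + h_hpf is a unit impulse."""
    h = [-c for c in h_lpf]
    h[0] += 1.0
    return h


def corrected_excitation(excitation, h_lpf, noise_gain, is_silent, rng, scale=1.0):
    """Add high-pass-filtered white noise to the decoded sound source signal."""
    gain = scale * noise_gain if is_silent else 0.0  # assumed gain rule
    r = [rng.gauss(0.0, 1.0) for _ in excitation]    # locally generated white noise
    v_r = fir(r, complementary_highpass(h_lpf))      # self-excited noise vector
    return [e + gain * v for e, v in zip(excitation, v_r)]


rng = random.Random(0)
print(complementary_highpass([0.5, 0.5]))  # [0.5, -0.5]
print(corrected_excitation([1.0, 2.0], [0.5, 0.5], 0.1, False, rng))  # [1.0, 2.0]
```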
[0061]
According to the above configuration, the speech signal encoding apparatus 200 suppresses quantization noise generated in the frequency band that is not subject to error calculation, and the speech signal decoding apparatus 300 replaces that noise with stationary noise, so that the subjective voice quality can be improved.
[0062]
In the above-described embodiment, the sound / silence determination information is transmitted from the encoding apparatus. Alternatively, it may not be transmitted from the encoding apparatus and may instead be determined using other information received by the decoding apparatus.
[0063]
[Effect of the invention]
As described above, according to the present invention, it is determined whether the processing frame of the input speech signal is a sound section in which the speech signal is dominant or a silent section in which the non-speech signal is dominant, and when a processing frame determined to be a silent section is input as the input speech signal, the filter characteristic of the auditory weight filter is converted so as to suppress the high frequency band. Therefore, even if a speech signal with background noise superimposed is input, it is possible to realize a speech signal encoding apparatus that can suppress the deterioration of quality.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a speech signal encoding apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a characteristic curve diagram for explaining auditory weight filter characteristics in a voiced section in the embodiment.
FIG. 3 is a characteristic curve diagram for explaining auditory weight filter characteristics.
FIG. 4 is a characteristic curve diagram showing quantization noise characteristics in a sound section in the embodiment.
FIG. 5 is a characteristic curve diagram for explaining auditory weight filter characteristics in a silent section in the embodiment.
FIG. 6 is a block diagram showing a configuration of a speech signal encoding apparatus according to Embodiment 2.
FIG. 7 is a block diagram showing a configuration of an audio signal decoding apparatus according to Embodiment 2.
FIG. 8 is a block diagram showing a configuration of a conventional speech signal encoding apparatus that performs code excitation linear prediction encoding.
FIG. 9 is a block diagram showing a configuration of a conventional audio signal decoding apparatus.
FIG. 10 is a characteristic curve diagram showing the relationship between quantization noise and subjective characteristics.
[Explanation of symbols]
100, 200 Audio signal encoding apparatus
101 Linear prediction analyzer
102 Auditory weight filter generator
104, 305 Sound / silence determination device
105 Filter converter
106,113 Auditory weight filter
107, 302 Adaptive codebook
109, 304 Noise codebook
201 Characteristic converter
300 Audio signal decoding device

Claims

Linear prediction analysis means for calculating a linear prediction coefficient by applying linear prediction analysis processing to the input speech signal in units of processing frames;
A synthesis filter that obtains a coded synthesized sound by performing a filtering process using the linear prediction coefficient on the adaptive sound source and the noise sound source stored in the adaptive code book and the noise code book;
Obtaining the gains of the adaptive sound source and the noise sound source, and further, the adaptive sound source and the noise sound source code that minimize the coding distortion between the synthesized sound and the input speech signal obtained using the gain, and Computing means for searching for gain;
An auditory weighting filter that applies auditory weighting processing to the input voice signal and the synthesized sound;
Voiced / silent determination means for determining whether the processing frame of the input voice signal is a voiced section in which the voice signal is dominant or a silent section in which the non-voice signal is dominant;
Filter characteristic conversion means for converting a filter characteristic of the auditory weight filter so as to suppress a high frequency band when a processing frame determined to be a silent section by the voiced / silent determination means is input as the input voice signal;
An audio signal encoding apparatus comprising: frequency characteristic conversion means for suppressing, when a processing frame determined to be a silent section by the voiced / silent determination means is input as the input voice signal, a high frequency band of a noise source vector output from the noise codebook, among the adaptive codebook and the noise codebook provided to generate sound source information.

A code-excited linear prediction type speech signal decoding device that decodes a speech signal using information obtained from the speech signal encoding device according to claim 1,
Using the information obtained from the speech signal encoding apparatus according to claim 1, the sound and silence for determining whether the processing frame is a sound segment in which the speech signal is dominant or a silence segment in which the non-speech signal is dominant A determination means;
Characteristic conversion means for suppressing a high frequency band of a noise source vector obtained from a noise codebook in a silent section;
Noise generating means for generating a noise vector having a characteristic of complementing signal energy in a frequency band suppressed with respect to a noise source vector in a silent section;
Multiplication means for estimating a gain of a noise vector generated using information obtained from the speech signal encoding device of claim 1 and multiplying the noise vector;
An audio signal decoding apparatus comprising: addition means for adding generated noise to a sound source signal.

An audio signal encoding method comprising: a step of calculating a linear prediction coefficient by performing a linear prediction analysis process on the input speech signal in units of processing frames; a step of obtaining a coded synthesized sound by performing a filtering process using the linear prediction coefficient on the adaptive sound source and the noise sound source stored in the adaptive code book and the noise code book; a step of obtaining gains of the adaptive sound source and the noise sound source, and searching for the codes and gains of the adaptive sound source and the noise sound source that minimize the coding distortion between the synthesized sound obtained using the gains and the input speech signal; a step of applying an auditory weighting process to the input speech signal and the synthesized sound; a step of determining whether the processing frame of the input speech signal is a sound section in which the speech signal is dominant or a silent section in which the non-speech signal is dominant; a step of converting the filter characteristic of the auditory weighting filter so as to suppress the high frequency band when a processing frame determined to be a silent section is input as the input speech signal; and a step of suppressing a high frequency band of a noise excitation vector output from the noise codebook, among the adaptive codebook and the noise codebook provided to generate excitation information, when a processing frame determined to be a silent section is input as the input speech signal.

A program that causes the following procedures to be executed: a procedure for calculating a linear prediction coefficient by performing linear prediction analysis processing on an input speech signal in units of processing frames; a procedure for obtaining a coded synthesized sound by performing a filtering process using the linear prediction coefficient on the adaptive sound source and the noise sound source stored in the adaptive code book and the noise code book; a procedure for obtaining gains of the adaptive sound source and the noise sound source, and searching for the codes and gains of the adaptive sound source and the noise sound source that minimize the coding distortion between the synthesized sound obtained using the gains and the input speech signal; a procedure for performing auditory weighting processing on the input speech signal and the synthesized sound; a procedure for determining whether the processing frame of the input speech signal is a sound section in which the speech signal is dominant or a silent section in which the non-speech signal is dominant; a procedure for converting the filter characteristic of the auditory weighting filter so as to suppress the high frequency band when a processing frame determined to be a silent section is input as the input speech signal; and a procedure for suppressing a high frequency band of a noise excitation vector output from the noise codebook, among the adaptive codebook and the noise codebook provided to generate excitation information, when a processing frame determined to be a silent section is input as the input speech signal.