JP3607774B2

JP3607774B2 - Speech encoding device

Info

Publication number: JP3607774B2
Application number: JP09117796A
Authority: JP
Inventors: 秀享 ▲高▼橋
Original assignee: Olympus Corp
Current assignee: Olympus Corp
Priority date: 1996-04-12
Filing date: 1996-04-12
Publication date: 2005-01-05
Anticipated expiration: 2016-04-12
Also published as: JPH09281999A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声符号化装置、より詳しくは、音声信号をディジタル情報圧縮して記録または伝送する音声符号化装置に関する。
【０００２】
【従来の技術】
音声信号を効率良く圧縮するために広く用いられている手段として、音声信号を、スペクトル包絡を表す線形予測パラメータと、線形予測残差信号に対応する音源パラメータとを用いて符号化する方式がある。このような線形予測の手段を用いた音声符号化方式は、少ない伝送容量で比較的高品質な合成音声を得られることから、最近のハードウェア技術の進歩と相まって様々な応用方式が盛んに研究され、開発されている。
【０００３】
その中でも良い音質が得られる方式として、Ｋｌｅｉｊｉｎ等による ”ＩｍｐｒｏｖｅｄｓｐｅｅｃｈｑｕａｌｉｔｙａｎｄｅｆｆｉｃｉｅｎｔｖｅｃｔｏｒｑｕａｎｔｉｚａｔｉｏｎｉｎＳＥＬＰ”（ＩＣＡＳＰ’８８ｓ４．４，ｐｐ．１５５−１５８，１９８８）と題した論文に記載されている、過去の音源信号を繰り返して得られる適応コードブックを用いるＣＥＬＰ（ＣｏｄｅＥｘｃｉｔｅｄＬｉｎｅａｒＰｒｅｄｉｃｔｉｖｅＣｏｄｉｎｇ）方式がよく知られている。
【０００４】
上述したような線形予測分析を基礎とした音声符号化装置は、比較的低いビットレートで高品質な符号化性能を得ることができるという利点を有している。このような線形予測分析を基礎とした音声符号化装置は、人間が発する概周期的な有声音を前提として構成されており、１フレームの分析長は２０ｍｓ前後が適当であるとされている。
【０００５】
しかしながら、上述したような従来の音声符号化装置は、音声信号区間以外の非音声信号区間については良好に符号化することができず、特に背景雑音等が混入すると急激に音質が劣化してしまうという問題点があった。
【０００６】
上述したような音声符号化装置の適用分野としては、移動体電話や音声録音装置などが考えられており、これらは背景雑音が混入する場合を含む様々な環境下で使用されるものと想定されるために、上記音質劣化の問題点は、魅力的な製品を実現する上でどうしても解決しなければならない必須の課題である。
【０００７】
このような問題点に鑑みて本出願人は、特願平７−２６８７５６号において、予め定められたフレーム間隔に分割された入力信号が音声信号か非音声信号かを判別する音声判別手段と、上記入力信号のスペクトルパラメータを出力する線形予測分析手段と、上記音声判別手段による判別結果が非音声信号であることが所定フレーム数にわたって連続した場合に上記入力信号のスペクトルパラメータとして上記線形予測分析手段に所定の先行フレームにおけるスペクトルパラメータを継続して出力させる制御手段と、線形予測残差信号に相当する駆動音源信号を生成する駆動音源信号生成手段と、上記スペクトルパラメータに基づいて上記駆動音源信号から音声を合成する合成フィルタとを備えた音声符号化装置を提案しており、これにより、非音声信号が入力しても良好に符号化することができる音質の良い音声符号化装置としている。
【０００８】
【発明が解決しようとする課題】
しかしながら、上記特願平７−２６８７５６号に記載のものでは、非音声区間におけるスペクトルパラメータの切り替え時における音質劣化を抑制することはできるが、長時間にわたって非音声区間が続いたときは、その非音声区間に対する音質の向上には寄与するものではなく、音質劣化は大きいままである。
【０００９】
本発明は上記事情に鑑みてなされたものであり、良好に音声信号を符号化することができる高品質な音声符号化装置を提供することを目的としている。
【００１０】
【課題を解決するための手段】
上記の目的を達成するために、本発明の第１の音声符号化装置は、所定のフレーム間隔に分割された入力信号が音声信号または非音声信号の何れであるかを判別する音声判別手段と、上記入力信号のスペクトルパラメータを出力する線形予測分析手段と、上記音声判別手段による判別結果が非音声信号であることが所定フレーム数にわたって連続した場合に入力信号のスペクトルパラメータとして先行フレームにおけるスペクトルパラメータを継続して使用させるように制御する制御手段と、非音声信号であると判別されたフレームが上記所定フレーム数を越えて連続した場合には予め用意した非音声用スペクトルパラメータと現在のフレームのスペクトルパラメータとの平滑化を行いその平滑化されたスペクトルパラメータを出力する非音声フレーム用スペクトル平滑化手段と、線形予測残差信号に相当する駆動音源信号を生成する駆動音源信号生成手段と、上記スペクトルパラメータに基づいて上記駆動音源信号から音声を合成する合成フィルタとを備えたものである。
【００１１】
また、本発明の第２の音声符号化装置は、上記第１の音声符号化装置において、上記非音声用スペクトルパラメータの初期値として所定の背景ノイズに基づくスペクトルパラメータの値を用いるものである。
【００１２】
さらに、本発明の第３の音声符号化装置は、上記第１または第２の音声符号化装置において、上記非音声用スペクトルパラメータの重み付けを現在のフレームのスペクトルパラメータに対する重み付けよりも大きくして平滑化するものである。
【００１３】
そして、本発明の第４の音声符号化装置は、上記第１，第２または第３の音声符号化装置において、上記非音声フレーム用スペクトル平滑化手段から出力されたスペクトルパラメータを、次のフレームの平滑化に用いるために非音声スペクトルパラメータとして記憶するパラメータ記憶手段を備えたものである。
【００１４】
従って、本発明の第１の音声符号化装置は、音声判別手段が所定のフレーム間隔に分割された入力信号が音声信号または非音声信号の何れであるかを判別し、線形予測分析手段が上記入力信号のスペクトルパラメータを出力し、上記音声判別手段による判別結果が非音声信号であることが所定フレーム数にわたって連続した場合に制御手段が入力信号のスペクトルパラメータとして先行フレームにおけるスペクトルパラメータを継続して使用させるように制御し、非音声信号であると判別されたフレームが上記所定フレーム数を越えて連続した場合には非音声フレーム用スペクトル平滑化手段が予め用意した非音声用スペクトルパラメータと現在のフレームのスペクトルパラメータとの平滑化を行いその平滑化されたスペクトルパラメータを出力し、駆動音源信号生成手段が線形予測残差信号に相当する駆動音源信号を生成し、合成フィルタが上記スペクトルパラメータに基づいて上記駆動音源信号から音声を合成する。
【００１５】
また、本発明の第２の音声符号化装置は、上記非音声用スペクトルパラメータの初期値として所定の背景ノイズに基づくスペクトルパラメータの値を用いる。
【００１６】
さらに、本発明の第３の音声符号化装置は、上記非音声用スペクトルパラメータの重み付けを現在のフレームのスペクトルパラメータに対する重み付けよりも大きくして平滑化する。
【００１７】
そして、本発明の第４の音声符号化装置は、パラメータ記憶手段が上記非音声フレーム用スペクトル平滑化手段から出力されたスペクトルパラメータを、次のフレームの平滑化に用いるために非音声スペクトルパラメータとして記憶する。
【００１８】
【発明の実施の形態】
以下、図面を参照して本発明の実施の形態を説明する。
図１から図４は本発明の一実施形態を示したものであり、図１は音声符号化装置の構成を示すブロック図である。
【００１９】
この音声符号化装置は、コード駆動線形予測符号化によるものであり、図１に示すように、入力端子に接続されたバッファメモリ１の出力端は３つに分岐されていて、第１の出力端はサブフレーム分割器７を介して減算器８に接続され、第２の出力端は線形予測分析手段たるＬＰＣ分析器５の入力端に接続され、第３の出力端は音声判別手段たる音声判別器２を介して後述する切替スイッチ４Ａ，４Ｂの制御を行う制御手段たるスイッチ制御回路３に接続されている。
【００２０】
上記ＬＰＣ分析器５は、切替スイッチ４Ａの入力端子ｂに接続されているとともに、パラメータメモリ５ａにその出力を記憶させるようになっていて、このパラメータメモリ５ａは切替スイッチ４Ａの入力端子ａに接続されている。
【００２１】
上記切換スイッチ４Ａの出力端子は引き続いて配設されている切換スイッチ４Ｂの入力端子に接続されていて、この切換スイッチ４Ｂの出力端子ａは合成フィルタ６に、出力端子ｂは非音声フレーム用スペクトル平滑化手段たるパラメータ平滑化器１９に接続されている。
【００２２】
上記パラメータ平滑化器１９にはその出力を記憶させて必要に応じて読み出すためのパラメータ記憶手段たるパラメータメモリ１９ａが接続されていて、該パラメータ平滑化器１９の出力端は上記合成フィルタ６に接続されている。
【００２３】
この合成フィルタ６には、適応コードブック１２と確率コードブック１４を用いて生成される信号が入力されるようになっている。
【００２４】
すなわち、上記適応コードブック１２は、乗算器１３を介して加算器１７の第１入力端子に接続されており、また、確率コードブック１４は、乗算器１５とスイッチ１６とを介して上記加算器１７の第２入力端子に接続されている。
【００２５】
この加算器１７の出力端子は、合成フィルタ６を介して上記減算器８の入力端子に接続されている一方で、遅延回路１１を介して上記適応コードブック１２に接続されている。
【００２６】
上記合成フィルタ６の出力端は、サブフレーム分割器７が接続された減算器８および聴感重み付けフィルタ９を介して誤差評価器１０の入力端子に接続されている。この誤差評価器１０による評価結果は、上述した適応コードブック１２と、確率コードブック１４と、さらに乗算器１３，１５とにフィードバックされて、最適なコードの選択やゲインの調整に用いられるようになっているとともに、上記誤差評価器１０は、マルチプレクサ１８にも接続されている。
【００２７】
上述のような音声符号化装置において、線形予測残差信号に相当する駆動音源信号を生成する駆動音源信号生成手段は、上記遅延回路１１、適応コードブック１２、確率コードブック１４、乗算器１３，１５、スイッチ１６、加算器１７等を含んで構成されている。
【００２８】
次に、図２は上記音声判別器２のより詳細な構成を示すブロック図である。
【００２９】
この音声判別器２に入力された上記バッファメモリ１の出力信号は、２つに分岐されて一方がフレームエネルギー分析回路２ａに、他方が初期フレームエネルギー分析回路２ｂに入力されるようになっている。
【００３０】
上記フレームエネルギー分析回路２ａは加算器２ｃの＋端子となっている第１入力端子に、上記初期フレームエネルギー分析回路２ｂは該加算器２ｃの−端子となっている第２入力端子にそれぞれ接続されているとともに、さらに、初期フレームエネルギー分析回路２ｂは、閾値決定回路２ｄにも接続されている。
【００３１】
そして、上記加算器２ｃの出力端子と上記閾値決定回路２ｄの出力端子は、共に判別回路２ｅに接続されていて、この判別回路２ｅの出力が上記スイッチ制御回路３に出力されるようになっている。
【００３２】
次に、上記図１および図２に示したような構成における信号の流れを説明する。
【００３３】
入力端子から例えば８ｋＨｚ（すなわち、１サンプル当たり１／８ｍｓ）でサンプリングされた原音声信号を入力して、予め定められたフレーム間隔（例えば２０ｍｓ、すなわち１６０サンプル）の音声信号をバッファメモリ１に格納する。
【００３４】
バッファメモリ１は、入力信号をフレーム単位でサブフレーム分割器７とＬＰＣ分析器５と音声判別器２とに送出する。
【００３５】
この音声判別器２は、フレームの入力信号が音声か非音声かを、例えば以下に説明するような方法で判別する。
【００３６】
上記図２に示したような構成の音声判別器２において、フレームエネルギー分析回路２ａは、入力されたフレーム入力信号のフレームエネルギーＥｆを次に示すような数式により算出する。
【００３７】
【数１】

ここに、ｓ（ｎ）はサンプルｎにおける入力信号、Ｎはフレーム長をそれぞれ示している。
【００３８】
また、上記初期フレームエネルギー分析回路２ｂは、符号化を開始したときのフレームエネルギーＥｂを上記数式１と同様の数式を用いて算出する。
【００３９】
上記閾値決定回路２ｄは、背景雑音エネルギーの大きさに応じて閾値を決定する。例えば、図３に示すように、背景雑音エネルギーがｄＢ単位で増加するに従って、閾値をｄＢ単位で減少させる関係により、閾値を決定する。そして、その結果を判別回路２ｅに送出する。
【００４０】
加算器２ｃでは、フレームエネルギーＥｆを正として入力するとともに、初期フレームエネルギーＥｂを負として入力してこれらを加算することにより、フレームエネルギーＥｆから初期フレームエネルギーＥｂを減算し、その減算結果を判別回路２ｅに送出する。
【００４１】
そして、判別回路２ｅは、入力された減算結果と閾値を比較して、減算結果が閾値より大きければフレーム入力信号は音声区間であると判別し、そうでなければ非音声区間であると判別する。
【００４２】
図１に戻って、サブフレーム分割器７は、フレームの入力信号を予め定められたサブフレーム間隔（例えば５ｍｓ、つまり４０サンプル）に分割する。すなわち、１フレームの入力信号から、第１サブフレームから第４サブフレームまでの４つのサブフレーム信号が作成される。
【００４３】
ＬＰＣ分析器５は、入力信号に対して線形予測分析（ＬＰＣ分析）を行って、スペクトル特性を表すスペクトルパラメータたる線形予測パラメータαを抽出し、パラメータメモリ５ａに送出するとともに、切替スイッチ４Ａ，４Ｂを介して合成フィルタ６あるいはパラメータ平滑化器１９に送出する。
【００４４】
次に、スイッチ制御回路３の動作について、図４のフローチャートを参照して説明する。
【００４５】
まず、符号化を開始すると（ステップＳ１）、非音声フレーム連続数を示すｉを０にセットする（ステップＳ２）。
【００４６】
次に、音声判別器２における判別結果が音声（ｖ）であるか非音声（ｕｖ）であるかを判定して（ステップＳ３）、もし判別結果が非音声である場合には、ｉを１だけ増加させる（ステップＳ４）。そして、このｉが所定数Ｒ（例えば５）よりも大きいか否かを判定し（ステップＳ５）、大きい場合には切替スイッチ４Ａの端子および切替スイッチ４Ｂの端子を両方ともａ側に閉じ（ステップＳ６）、パラメータメモリ５ａから出力される先行フレームのスペクトルパラメータを継続して使用する（ステップＳ７）。
【００４７】
続いて、ｉがＲ＋１よりも大きいか否かを判定し（ステップＳ８）、大きい場合には切替スイッチ４Ａの端子および切替スイッチ４Ｂの端子を両方ともｂ側に閉じて（ステップＳ９）、ＬＰＣ分析器５によりＬＰＣ分析を行い（ステップＳ１０）、その結果がパラメータ平滑化器１９に入力されるようにする。
【００４８】
そして、このパラメータ平滑化器１９において、パラメータの平滑化を以下に説明するように行う（ステップＳ１１）。
【００４９】
背景ノイズ用初期ｋパラメータＮｏｉｓｅ＿αを予め用意してパラメータメモリ１９ａに記憶しておき、平滑化を数式２に示すような重み付けにより行う。
【００５０】
【数２】

なお、この背景ノイズ用初期ｋパラメータＮｏｉｓｅ＿αは、例えばオフィス環境における背景ノイズを入力したときの、あるスペクトルパラメータを代表させた値である。
【００５１】
このように背景ノイズ用初期ｋパラメータＮｏｉｓｅ＿αの重み付けを現在のフレームのスペクトルパラメータα［ｉ］の重み付けよりも重くすることによって、パラメータα［ｉ］の値が揺らいでも、その揺らぎの影響を小さく抑制するようになっている。
【００５２】
続いて、直後に背景ノイズ用初期ｋパラメータＮｏｉｓｅ＿α［ｉ］を数式３に示すように更新する（ステップＳ１２）。
【００５３】
【数３】

その後、次のフレームの処理を待つ（ステップＳ１３）。
【００５４】
また、上記ステップＳ８においてｉがＲ＋１以下である場合にも、上記ステップＳ１３に進む。
【００５５】
このようにして、非音声部での音源信号は、ＬＰＣ分析の結果を反映しつつも、揺らぎを抑えることができる。
【００５６】
一方、上記ステップＳ３における判別結果が音声である場合には、非音声フレーム連続数を示すｉを０にリセットした後に（ステップＳ１４）、切替スイッチ４Ａの端子をｂ側に閉じるとともに切替スイッチ４Ｂの端子をａ側に閉じて（ステップＳ１５）、ＬＰＣ分析を行ってスペクトルパラメータを更新する（ステップＳ１６）。その後、上記ステップＳ１３に進み、次のフレームの処理を待つ。
【００５７】
また、上記ステップＳ５においてｉが所定数Ｒ以下である場合にも、上記ステップＳ１４に進む。
【００５８】
図１の説明に再び戻って、適応コードブックの遅れＬ、ゲインβ、確率コードブックのインデックスｉ、ゲインγは、次に説明するような手段により決定される。
【００５９】
まず、適応コードブックの遅延Ｌとゲインβは、以下の処理によって決定される。
【００６０】
遅延回路１１において、先行サブフレームにおける合成フィルタ６の入力信号すなわち駆動音源信号に、ピッチ周期に相当する遅延を与えて適応コードベクトルとして作成する。
【００６１】
例えば、想定するピッチ周期を４０〜１６７サンプルとすると、４０〜１６７サンプル遅れの１２８種類の信号が適応コードベクトルとして作成され、適応コードブック１２に格納される。
【００６２】
このときスイッチ１６は開いた状態となっていて、各適応コードベクトルは乗算器１３でゲイン値を可変して乗じた後に、加算器１７を通過してそのまま合成フィルタ６に入力される。
【００６３】
この合成フィルタ６は、線形予測パラメータα’を用いて合成処理を行い、合成ベクトルを減算器８に送出する。この減算器８は、原音声ベクトルと合成ベクトルとの減算を行うことにより誤差ベクトルを生成し、得られた誤差ベクトルを聴感重み付けフィルタ９に送出する。
【００６４】
この聴感重み付けフィルタ９は、誤差ベクトルに対して聴感特性を考慮した重み付け処理を行い、誤差評価器１０に送出する。
【００６５】
誤差評価器１０は、誤差ベクトルの２乗平均を計算し、その２乗平均値が最小となる適応コードベクトルを検索して、その遅れＬとゲインβをマルチプレクサ１８に送出する。このようにして、適応コードブック１２の遅延Ｌとゲインβが決定される。
【００６６】
続いて、確率コードブックのインデックスｉとゲインγは、以下の処理によって決定される。
【００６７】
確率コードブック１４は、サブフレーム長に対応する次元数（すなわち、上述の例では４０次元）の確率コードベクトルが、例えば５１２種類予め格納されており、各々にインデックスが付与されている。なお、このときにはスイッチ１６は閉じた状態となっている。
【００６８】
まず、上記処理によって決定された最適な適応コードベクトルを、乗算器１３で最適ゲインβを乗じた後に、加算器１７に送出する。
【００６９】
次に、各確率コードベクトルを乗算器１５でゲイン値を可変して乗じた後に、加算器１７に入力する。加算器１７は上記最適ゲインβを乗じた最適な適応コードベクトルと各確率コードベクトルの加算を行い、その結果が合成フィルタ６に入力される。
【００７０】
この後の処理は、上記適応コードブックパラメータの決定処理と同様に行われる。すなわち、合成フィルタ６は線形予測パラメータα’を用いて合成処理を行い、合成ベクトルを減算器８に送出する。
【００７１】
減算器８は原音声ベクトルと合成ベクトルとの減算を行うことにより誤差ベクトルを生成し、得られた誤差ベクトルを聴感重み付けフィルタ９に送出する。
【００７２】
聴感重み付けフィルタ９は、誤差ベクトルに対して聴感特性を考慮した重み付け処理を行い、誤差評価器１０に送出する。
【００７３】
誤差評価器１０は、誤差ベクトルの２乗平均を計算して、その２乗平均値が最小となる確率コードベクトルを検索して、そのインデックスｉとゲインγをマルチプレクサ１８に送出する。このようにして、確率コードブック１４のインデックスｉとゲインγが決定される。
【００７４】
上記マルチプレクサ１８は、量子化された線形予測パラメータα’、適応コードブックの遅れＬ、ゲインβ、確率コードブックのインデックスｉ、ゲインγの各々をマルチプレクスして伝送するものである。
【００７５】
なお、上述したような音声符号化装置に対応する音声復号化装置の復号化動作は、上記特願平７−２６８７５６号に記載した従来例におけるものと同様である。
【００７６】
また、音声判別器における音声判別方法は、上述した手段に限るものではないことはいうまでもない。
【００７７】
さらに、上記実施形態においては、コード駆動線形予測符号化による音声符号化装置を一例として取り上げて説明したが、線形予測パラメータと、線形予測残差信号に相当する駆動音源信号のパラメータとで表現する音声符号化装置であれば、当然にして、何れのものにも適用することが可能である。
【００７８】
そして、符号化パラメータに音声／非音声の情報も伝送するようにして、復号化装置に符号化装置と同様のスイッチ制御回路および切替スイッチを設け、音声／非音声の情報に基づいて切替スイッチの制御を行うことにより可変ビットレート符号化装置／復号化装置を構成し、より高い圧縮効率で符号化することも可能である。
【００７９】
このような実施形態の音声符号化装置によれば、非音声フレームが所定数以上連続する場合におけるフレーム毎のパラメータの揺らぎを抑制することができるために、非音声区間における音質が安定して、良好に音声信号を符号化することができる高品質な音声符号化装置となる。
【００８０】
【発明の効果】
以上説明したように請求項１に記載の発明によれば、符号化された非音声信号のフレーム毎の揺らぎが小さくなり、良好に音声信号を符号化することができる高品質な音声符号化装置を得ることができる。
【００８１】
また、請求項２に記載の発明によれば、請求項１に記載の発明と同様の効果を奏するとともに、所定の背景ノイズに基づくスペクトルパラメータの値を非音声用スペクトルパラメータの初期値として用いるために、符号化された非音声信号に違和感が生じることはない。
【００８２】
さらに、請求項３に記載の発明によれば、請求項１または請求項２に記載の発明と同様の効果を奏するとともに、符号化された非音声信号のフレーム毎の揺らぎを一層抑制することができて、より円滑に平滑化することができる。
【００８３】
そして、請求項４に記載の発明によれば、請求項１、請求項２または請求項３に記載の発明と同様の効果を奏するとともに、非音声スペクトルパラメータを記憶するパラメータ記憶手段を備えたために、次の平滑化の処理が容易になる。
【図面の簡単な説明】
【図１】本発明の一実施形態の音声符号化装置の構成を示すブロック図。
【図２】上記実施形態の音声判別器のより詳細な構成を示すブロック図。
【図３】上記実施形態において、音声判別器の閾値決定回路により決定される閾値と背景雑音エネルギーとの関係の一例を示す線図。
【図４】上記実施形態の音声符号化装置の動作を示すフローチャート。
【符号の説明】
２…音声判別器（音声判別手段）
３…スイッチ制御回路（制御手段）
５…ＬＰＣ分析器（線形予測分析手段）
６…合成フィルタ
１１…遅延回路（駆動音源信号生成手段の一部）
１２…適応コードブック（駆動音源信号生成手段の一部）
１４…確率コードブック（駆動音源信号生成手段の一部）
１９…パラメータ平滑化器（非音声フレーム用スペクトル平滑化手段）
１９ａ…パラメータメモリ（パラメータ記憶手段）

[0001]
[Technical Field of the Invention]
The present invention relates to a speech coding apparatus, and more particularly to a speech coding apparatus that compresses a speech signal as digital information and records or transmits it.
[0002]
[Prior art]
A widely used means for efficiently compressing a speech signal is to encode it using linear prediction parameters representing the spectral envelope and excitation parameters corresponding to the linear prediction residual signal. Because speech coding systems using such linear prediction can obtain relatively high-quality synthesized speech with a small transmission capacity, various applied systems have been actively studied and developed, helped by recent advances in hardware technology.
[0003]
Among them, a well-known system that obtains good sound quality is the CELP (Code Excited Linear Predictive Coding) system, which uses an adaptive codebook obtained by repeating past excitation signals and is described in the paper entitled "Improved speech quality and efficient vector quantization in SELP" (ICASP'88 s4.4, pp. 155-158, 1988) by Kleijin et al.
[0004]
A speech coding apparatus based on linear prediction analysis as described above has the advantage that high-quality coding performance can be obtained at a relatively low bit rate. Such an apparatus is designed on the assumption of the approximately periodic voiced sounds produced by humans, and an analysis length of about 20 ms per frame is considered appropriate.
[0005]
However, the conventional speech coding apparatus described above cannot satisfactorily encode non-speech signal sections, that is, sections other than speech signal sections, and it has the problem that the sound quality deteriorates sharply, particularly when background noise or the like is mixed in.
[0006]
As a field of application of the speech encoding apparatus as described above, mobile telephones and speech recording apparatuses are considered, and these are assumed to be used in various environments including cases where background noise is mixed. Therefore, the above problem of sound quality degradation is an indispensable problem that must be solved in order to realize an attractive product.
[0007]
In view of this problem, the present applicant proposed, in Japanese Patent Application No. 7-268756, a speech coding apparatus comprising: speech discrimination means for discriminating whether an input signal divided into predetermined frame intervals is a speech signal or a non-speech signal; linear prediction analysis means for outputting a spectral parameter of the input signal; control means for causing the linear prediction analysis means to continue outputting the spectral parameter of a predetermined preceding frame as the spectral parameter of the input signal when the discrimination result of the speech discrimination means has been a non-speech signal for a predetermined number of consecutive frames; driving excitation signal generating means for generating a driving excitation signal corresponding to a linear prediction residual signal; and a synthesis filter for synthesizing speech from the driving excitation signal based on the spectral parameter. The result is a speech coding apparatus with good sound quality that can encode satisfactorily even when a non-speech signal is input.
[0008]
[Problems to be solved by the invention]
However, although the apparatus described in the above-mentioned Japanese Patent Application No. 7-268756 can suppress the deterioration in sound quality that occurs when the spectral parameter is switched in a non-speech section, it does not contribute to improving the sound quality of a non-speech section that continues for a long time, and the sound quality degradation in that case remains large.
[0009]
The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a high-quality speech coding apparatus that can satisfactorily encode speech signals.
[0010]
[Means for Solving the Problems]
In order to achieve the above object, a first speech coding apparatus according to the present invention comprises: speech discrimination means for discriminating whether an input signal divided into predetermined frame intervals is a speech signal or a non-speech signal; linear prediction analysis means for outputting a spectral parameter of the input signal; control means for controlling so that the spectral parameter of a preceding frame continues to be used as the spectral parameter of the input signal when the discrimination result of the speech discrimination means has been a non-speech signal for a predetermined number of consecutive frames; non-speech frame spectrum smoothing means for, when frames determined to be non-speech signals continue beyond the predetermined number of frames, smoothing between a non-speech spectral parameter prepared in advance and the spectral parameter of the current frame and outputting the smoothed spectral parameter; driving excitation signal generating means for generating a driving excitation signal corresponding to a linear prediction residual signal; and a synthesis filter for synthesizing speech from the driving excitation signal based on the spectral parameter.
[0011]
In the second speech encoding apparatus of the present invention, in the first speech encoding apparatus, a spectral parameter value based on a predetermined background noise is used as an initial value of the non-speech spectral parameter.
[0012]
Further, the third speech coding apparatus according to the present invention is the first or second speech coding apparatus in which the smoothing is performed with the weight of the non-speech spectral parameter made larger than the weight of the spectral parameter of the current frame.
[0013]
The fourth speech coding apparatus according to the present invention is the first, second, or third speech coding apparatus further comprising parameter storage means for storing the spectral parameter output from the non-speech frame spectrum smoothing means as the non-speech spectral parameter, for use in smoothing the next frame.
[0014]
Therefore, in the first speech coding apparatus of the present invention, the speech discrimination means discriminates whether the input signal divided into predetermined frame intervals is a speech signal or a non-speech signal, and the linear prediction analysis means outputs the spectral parameter of the input signal. When the discrimination result of the speech discrimination means has been a non-speech signal for a predetermined number of consecutive frames, the control means controls so that the spectral parameter of the preceding frame continues to be used as the spectral parameter of the input signal. When frames determined to be non-speech signals continue beyond the predetermined number of frames, the non-speech frame spectrum smoothing means smooths between the non-speech spectral parameter prepared in advance and the spectral parameter of the current frame and outputs the smoothed spectral parameter. The driving excitation signal generating means generates a driving excitation signal corresponding to the linear prediction residual signal, and the synthesis filter synthesizes speech from the driving excitation signal based on the spectral parameter.
[0015]
The second speech encoding apparatus of the present invention uses a spectral parameter value based on a predetermined background noise as an initial value of the non-speech spectral parameter.
[0016]
Furthermore, the third speech coding apparatus according to the present invention smoothes the non-speech spectrum parameter by making the weight of the non-speech spectrum parameter larger than the weight of the spectrum parameter of the current frame.
[0017]
In the fourth speech coding apparatus of the present invention, the parameter storage means stores the spectral parameter output from the non-speech frame spectrum smoothing means as the non-speech spectral parameter, for use in smoothing the next frame.
[0018]
[Embodiments of the Invention]
Embodiments of the present invention will be described below with reference to the drawings.
FIGS. 1 to 4 show an embodiment of the present invention, and FIG. 1 is a block diagram showing the configuration of a speech encoding apparatus.
[0019]
This speech encoding apparatus is based on code-excited linear predictive coding. As shown in FIG. 1, the output end of the buffer memory 1 connected to the input terminal is branched into three. The first output end is connected to the subtracter 8 via the subframe divider 7, the second output end is connected to the input end of the LPC analyzer 5, which serves as the linear prediction analysis means, and the third output end is connected, via the voice discriminator 2 serving as the speech discrimination means, to the switch control circuit 3, which serves as the control means for controlling the changeover switches 4A and 4B described later.
[0020]
The LPC analyzer 5 is connected to the input terminal b of the changeover switch 4A and also stores its output in the parameter memory 5a. The parameter memory 5a is connected to the input terminal a of the changeover switch 4A.
[0021]
The output terminal of the changeover switch 4A is connected to the input terminal of the changeover switch 4B arranged after it. The output terminal a of the changeover switch 4B is connected to the synthesis filter 6, and its output terminal b is connected to the parameter smoother 19, which serves as the non-speech frame spectrum smoothing means.
[0022]
The parameter smoother 19 is connected to a parameter memory 19a, which serves as the parameter storage means for storing its output and reading it out as necessary. The output end of the parameter smoother 19 is connected to the synthesis filter 6.
[0023]
A signal generated using the adaptive code book 12 and the probability (stochastic) code book 14 is input to the synthesis filter 6.
[0024]
That is, the adaptive code book 12 is connected to the first input terminal of the adder 17 via the multiplier 13, and the probability code book 14 is connected to the second input terminal of the adder 17 via the multiplier 15 and the switch 16.
[0025]
The output terminal of the adder 17 is connected to the input terminal of the subtracter 8 through the synthesis filter 6, and is connected to the adaptive codebook 12 through the delay circuit 11.
[0026]
The output terminal of the synthesis filter 6 is connected to the input terminal of the error evaluator 10 through the subtracter 8, to which the subframe divider 7 is connected, and the perceptual weighting filter 9. The evaluation result of the error evaluator 10 is fed back to the adaptive code book 12, the probability code book 14, and the multipliers 13 and 15, where it is used for selecting the optimum code and adjusting the gains. The error evaluator 10 is also connected to the multiplexer 18.
[0027]
In the speech encoding apparatus as described above, the driving excitation signal generating means for generating the driving excitation signal corresponding to the linear prediction residual signal includes the delay circuit 11, the adaptive code book 12, the probability code book 14, the multipliers 13 and 15, the switch 16, the adder 17, and the like.
[0028]
Next, FIG. 2 is a block diagram showing a more detailed configuration of the voice discriminator 2.
[0029]
The output signal of the buffer memory 1 input to the voice discriminator 2 is branched into two, one branch being input to the frame energy analysis circuit 2a and the other to the initial frame energy analysis circuit 2b.
[0030]
The frame energy analysis circuit 2a is connected to the first input terminal of the adder 2c, which is its + terminal, and the initial frame energy analysis circuit 2b is connected to the second input terminal of the adder 2c, which is its - terminal. In addition, the initial frame energy analysis circuit 2b is also connected to a threshold value determination circuit 2d.
[0031]
The output terminal of the adder 2c and the output terminal of the threshold value determination circuit 2d are both connected to the discrimination circuit 2e, and the output of the discrimination circuit 2e is sent to the switch control circuit 3.
[0032]
Next, the signal flow in the configuration shown in FIGS. 1 and 2 will be described.
[0033]
An original speech signal sampled at, for example, 8 kHz (that is, 1/8 ms per sample) is input from the input terminal, and the speech signal for a predetermined frame interval (for example, 20 ms, that is, 160 samples) is stored in the buffer memory 1.
[0034]
The buffer memory 1 sends the input signal to the subframe divider 7, LPC analyzer 5, and speech discriminator 2 in units of frames.
[0035]
The voice discriminator 2 discriminates whether the input signal of the frame is voice or non-voice by, for example, a method described below.
[0036]
In the voice discriminator 2 configured as shown in FIG. 2, the frame energy analysis circuit 2a calculates the frame energy Ef of the input frame signal using the following expression.
[0037]
[Expression 1]

Here, s (n) indicates an input signal in the sample n, and N indicates a frame length.
[0038]
The initial frame energy analysis circuit 2b calculates the frame energy Eb at the start of encoding using an expression of the same form as Expression 1.
[0039]
The threshold value determination circuit 2d determines a threshold value according to the magnitude of the background noise energy. For example, as shown in FIG. 3, the threshold is determined from a relationship in which the threshold, in dB, decreases as the background noise energy, in dB, increases. The result is then sent to the discrimination circuit 2e.
[0040]
The adder 2c receives the frame energy Ef as a positive input and the initial frame energy Eb as a negative input and adds them, thereby subtracting the initial frame energy Eb from the frame energy Ef, and sends the subtraction result to the discrimination circuit 2e.
[0041]
The discrimination circuit 2e then compares the input subtraction result with the threshold value. If the subtraction result is larger than the threshold value, it determines that the frame input signal is a voice interval; otherwise it determines that it is a non-voice interval.
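The discrimination steps above can be sketched in Python as follows. This is only an illustrative sketch: Expression 1 and the curve of FIG. 3 are not reproduced in this text, so the dB energy formula in `frame_energy_db` and the linear mapping in `threshold_db` (including its `base`, `slope`, and `floor` values) are assumptions, and all function names are hypothetical.

```python
import math


def frame_energy_db(frame):
    """Frame energy in dB. Assumed form: 10*log10 of the sum of squared samples."""
    energy = sum(s * s for s in frame)
    return 10.0 * math.log10(energy + 1e-12)  # tiny floor avoids log10(0)


def threshold_db(background_db, base=20.0, slope=0.25, floor=3.0):
    """Hypothetical decreasing map: higher background energy gives a lower threshold."""
    return max(floor, base - slope * background_db)


def is_speech(frame, initial_frame):
    """Mimic adder 2c and discrimination circuit 2e: compare (Ef - Eb) with the threshold."""
    ef = frame_energy_db(frame)          # frame energy analysis circuit 2a
    eb = frame_energy_db(initial_frame)  # initial frame energy analysis circuit 2b
    return (ef - eb) > threshold_db(eb)  # threshold value determination circuit 2d
```

The energy measured at the start of encoding stands in for the background noise level, so a frame is classified as voice only when it rises sufficiently above that level.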
[0042]
Returning to FIG. 1, the sub-frame divider 7 divides the input signal of the frame into predetermined sub-frame intervals (for example, 5 ms, that is, 40 samples). That is, four subframe signals from the first subframe to the fourth subframe are generated from one frame of the input signal.
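As a small illustration of the frame arithmetic, the following Python sketch splits one 160-sample frame (20 ms at 8 kHz) into four 40-sample subframes. The function name `split_subframes` is hypothetical.

```python
def split_subframes(frame, subframe_len=40):
    """Split one frame into consecutive, equal-length subframes."""
    if len(frame) % subframe_len != 0:
        raise ValueError("frame length must be a multiple of the subframe length")
    return [frame[k:k + subframe_len] for k in range(0, len(frame), subframe_len)]


# 8000 samples/s * 0.020 s = 160 samples per frame, giving 4 subframes of 40 samples.
frame = list(range(160))
subframes = split_subframes(frame)
```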
[0043]
The LPC analyzer 5 performs linear prediction analysis (LPC analysis) on the input signal, extracts a linear prediction parameter α, which is a spectral parameter representing the spectral characteristics, and sends it to the parameter memory 5a and also, via the changeover switches 4A and 4B, to the synthesis filter 6 or the parameter smoother 19.
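The text does not state how the LPC analysis is computed. One common approach, shown here purely as an assumption, is the autocorrelation method solved with the Levinson-Durbin recursion. The predictor order of 10, the absence of windowing, and the sign convention s(n) ≈ Σ a[k]·s(n−k) are illustrative choices, and the function names are hypothetical.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame (no window applied)."""
    n = len(frame)
    return [sum(frame[t] * frame[t - k] for t in range(k, n)) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Return predictor coefficients a[1..order] and the final prediction error energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / err  # reflection coefficient for stage m
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        err *= 1.0 - k_m * k_m
    return a[1:], err


def lpc_analysis(frame, order=10):
    """Assumed stand-in for LPC analyzer 5: frame in, linear prediction parameters out."""
    return levinson_durbin(autocorrelation(frame, order), order)
```

For a decaying exponential such as `[0.9 ** n for n in range(200)]`, a second-order analysis recovers a first coefficient close to 0.9 and a second coefficient close to zero.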
[0044]
Next, the operation of the switch control circuit 3 will be described with reference to the flowchart of FIG.
[0045]
First, when encoding is started (step S1), i indicating the number of consecutive non-speech frames is set to 0 (step S2).
[0046]
Next, it is determined whether the discrimination result of the voice discriminator 2 is voice (v) or non-voice (uv) (step S3). If the result is non-voice, i is incremented by 1 (step S4). It is then determined whether i is larger than a predetermined number R (for example, 5) (step S5). If it is larger, the terminals of the changeover switch 4A and the changeover switch 4B are both closed to the a side (step S6), and the spectrum parameter of the preceding frame output from the parameter memory 5a continues to be used (step S7).
[0047]
Subsequently, it is determined whether i is larger than R + 1 (step S8). If it is larger, the terminals of the changeover switch 4A and the changeover switch 4B are both closed to the b side (step S9), the LPC analyzer 5 performs LPC analysis (step S10), and the result is input to the parameter smoother 19.
[0048]
The parameter smoother 19 performs parameter smoothing as described below (step S11).
[0049]
An initial k parameter Noise_α for background noise is prepared in advance and stored in the parameter memory 19a, and smoothing is performed by weighting as shown in Formula 2.
[0050]
[Expression 2]

The background noise initial k parameter Noise_α is a value representative of a certain spectrum parameter when background noise in an office environment, for example, is input.
[0051]
Thus, by making the weight of the background noise initial k parameter Noise_α heavier than the weight of the spectrum parameter α [i] of the current frame, the influence of any fluctuation in the value of the parameter α [i] is kept small.
[0052]
Immediately afterwards, the background noise initial k parameter Noise_α [i] is updated as shown in Equation 3 (step S12).
[0053]
[Equation 3]

Thereafter, processing for the next frame is awaited (step S13).
[0054]
Further, when i is equal to or less than R + 1 in step S8, the process proceeds to step S13.
[0055]
In this way, fluctuation of the sound source signal in the non-speech part can be suppressed while the result of the LPC analysis is still reflected.
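Steps S11 and S12 can be sketched in Python as follows. Expressions 2 and 3 are not reproduced in this text, so the forms used here are assumptions inferred from the surrounding description: a weighted average in which the stored noise parameter has the larger weight, followed by storing the smoothed result for the next frame. The weight value 0.9 and the names `smooth_parameters` and `ParameterSmoother` are hypothetical.

```python
def smooth_parameters(noise_alpha, alpha, noise_weight=0.9):
    """Assumed form of Expression 2: element-wise weighted average favoring Noise_alpha."""
    if not 0.5 < noise_weight < 1.0:
        raise ValueError("the noise parameter must carry the larger weight")
    return [noise_weight * n + (1.0 - noise_weight) * a
            for n, a in zip(noise_alpha, alpha)]


class ParameterSmoother:
    """Illustrative stand-in for parameter smoother 19 with parameter memory 19a."""

    def __init__(self, initial_noise_alpha, noise_weight=0.9):
        self.noise_alpha = list(initial_noise_alpha)  # contents of memory 19a
        self.noise_weight = noise_weight

    def process(self, alpha):
        smoothed = smooth_parameters(self.noise_alpha, alpha, self.noise_weight)
        self.noise_alpha = smoothed  # assumed form of Equation 3 (step S12)
        return smoothed
```

With these assumptions the smoother behaves like a first-order recursive filter, so a single outlying frame moves the output by only a small fraction of its deviation.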
[0056]
On the other hand, if the discrimination result in step S3 is voice, i, which indicates the number of consecutive non-speech frames, is reset to 0 (step S14), the terminal of the changeover switch 4A is closed to the b side and the terminal of the changeover switch 4B is closed to the a side (step S15), and LPC analysis is performed to update the spectrum parameter (step S16). Thereafter, the process proceeds to step S13 to wait for processing of the next frame.
[0057]
Further, when i is equal to or less than the predetermined number R in step S5, the process proceeds to step S14.
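The flow of FIG. 4 can be sketched as a small state machine in Python. One point is an assumption rather than something the text states: the description sends the case i ≤ R to step S14, which resets i, yet i could then never exceed R. This sketch therefore assumes that in this branch only the switch setting and LPC analysis of steps S15 and S16 are performed and the counter is kept. The class name and the action labels are hypothetical.

```python
R = 5  # example value of the predetermined number given in the text


class SwitchController:
    """Illustrative model of switch control circuit 3."""

    def __init__(self, r=R):
        self.r = r
        self.i = 0  # number of consecutive non-speech frames (step S2)

    def step(self, is_speech):
        """Return (switch 4A side, switch 4B side, action) for one frame."""
        if is_speech:
            self.i = 0                    # step S14
            return ("b", "a", "analyze")  # steps S15 and S16
        self.i += 1                       # step S4
        if self.i <= self.r:              # step S5, "no" branch
            # Assumption: the counter is NOT reset here (see the note above).
            return ("b", "a", "analyze")
        if self.i > self.r + 1:           # step S8
            return ("b", "b", "smooth")   # steps S9 to S12
        return ("a", "a", "hold")         # steps S6 and S7: reuse memory 5a
```

With R = 5, eight consecutive non-speech frames give the actions analyze ×5, hold, smooth, smooth under these assumptions.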
[0058]
Returning to the description of FIG. 1, the adaptive codebook delay L, gain β, probability codebook index i, and gain γ are determined by the following means.
[0059]
First, the delay L and the gain β of the adaptive codebook are determined by the following processing.
[0060]
The delay circuit 11 creates an adaptive code vector by giving a delay corresponding to the pitch period to the input signal of the synthesis filter 6 in the preceding subframe, that is, the driving sound source signal.
[0061]
For example, if the assumed pitch period is 40 to 167 samples, 128 signals with delays of 40 to 167 samples are created as adaptive code vectors and stored in the adaptive code book 12.
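A minimal Python sketch of how the 128 adaptive code vectors might be cut from the past excitation is given below. It assumes a buffer holding at least 167 past excitation samples and a subframe length of 40, so every lag is at least one subframe long and no repetition of the segment is needed. The function name and the dictionary layout are hypothetical.

```python
def build_adaptive_codebook(past_excitation, min_lag=40, max_lag=167, subframe_len=40):
    """Map each lag L to the subframe-length segment that starts L samples in the past."""
    if len(past_excitation) < max_lag:
        raise ValueError("need at least max_lag past excitation samples")
    if min_lag < subframe_len:
        raise ValueError("this sketch assumes every lag covers a full subframe")
    n = len(past_excitation)
    codebook = {}
    for lag in range(min_lag, max_lag + 1):
        start = n - lag
        codebook[lag] = past_excitation[start:start + subframe_len]
    return codebook


past = list(range(200))  # stand-in for the stored driving excitation signal
codebook = build_adaptive_codebook(past)
```

Because 167 − 40 + 1 = 128, the lag can be represented with 7 bits.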
[0062]
At this time, the switch 16 is in an open state, and each adaptive code vector is multiplied by the multiplier 13 while changing the gain value, and then passes through the adder 17 and is input to the synthesis filter 6 as it is.
[0063]
The synthesis filter 6 performs a synthesis process using the linear prediction parameter α ′ and sends a synthesis vector to the subtracter 8. The subtracter 8 generates an error vector by subtracting the original speech vector and the synthesized vector, and sends the obtained error vector to the perceptual weighting filter 9.
[0064]
This perceptual weighting filter 9 performs weighting processing on the error vector in consideration of perceptual characteristics and sends it to the error evaluator 10.
[0065]
The error evaluator 10 calculates the mean square of the error vector, searches for an adaptive code vector that minimizes the mean square value, and sends the delay L and the gain β to the multiplexer 18. In this way, the delay L and gain β of the adaptive codebook 12 are determined.
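The analysis-by-synthesis search described above can be sketched in Python as follows. Several simplifications are assumptions: the synthesis filter starts from a zero state, the perceptual weighting filter 9 is omitted, and the gain is obtained in closed form (correlation divided by energy) instead of by stepping through gain values as the text describes. The filter convention y(n) = x(n) + Σ α[k]·y(n−k) and all function names are also hypothetical.

```python
def synthesize(excitation, alpha):
    """All-pole synthesis filter, zero initial state (assumed sign convention)."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(alpha, start=1):
            if n - k >= 0:
                y += a * out[n - k]
        out.append(y)
    return out


def search_codebook(target, candidates, alpha):
    """Return (key, gain, mean squared error) of the best candidate vector."""
    best = None
    for key, vector in candidates.items():
        y = synthesize(vector, alpha)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero vector cannot reduce the error
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y)) / len(target)
        if best is None or error < best[2]:
            best = (key, gain, error)
    return best
```

Passing the dictionary of adaptive code vectors keyed by lag as `candidates` yields the delay L as the returned key and the gain β as the returned gain.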
[0066]
Subsequently, the index i and the gain γ of the probability codebook are determined by the following processing.
[0067]
The probability code book 14 stores in advance, for example, 512 probability code vectors whose number of dimensions corresponds to the subframe length (that is, 40 dimensions in the above example), and an index is assigned to each. At this time, the switch 16 is closed.
[0068]
First, the optimum adaptive code vector determined by the above processing is multiplied by the optimum gain β by the multiplier 13 and then sent to the adder 17.
[0069]
Next, each probability code vector is multiplied by a multiplier 15 with a variable gain value and then input to an adder 17. The adder 17 adds the optimum adaptive code vector multiplied by the optimum gain β and each probability code vector, and the result is input to the synthesis filter 6.
[0070]
The subsequent processing is performed in the same manner as the adaptive code book parameter determination processing. That is, the synthesis filter 6 performs a synthesis process using the linear prediction parameter α ′ and sends the synthesis vector to the subtracter 8.
[0071]
The subtracter 8 generates an error vector by subtracting the original speech vector and the synthesized vector, and sends the obtained error vector to the perceptual weighting filter 9.
[0072]
The perceptual weighting filter 9 performs a weighting process in consideration of perceptual characteristics on the error vector and sends it to the error evaluator 10.
[0073]
The error evaluator 10 calculates the mean square of the error vector, searches for a probability code vector that minimizes the mean square value, and sends the index i and the gain γ to the multiplexer 18. In this way, the index i and the gain γ of the probability codebook 14 are determined.
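The second search stage can be sketched in Python as follows. It relies on the synthesis filter being linear: with a zero initial state, the synthesized sum of the scaled adaptive vector and a scaled probability code vector equals the sum of their separately synthesized versions, so the adaptive contribution can be subtracted from the target once. Perceptual weighting is omitted, the gain is computed in closed form, and the function name and argument layout are hypothetical.

```python
def search_stochastic(target, adaptive_contribution, filtered_codebook):
    """Pick the index and gain that best match what the adaptive stage left over.

    target                -- original speech vector for the subframe
    adaptive_contribution -- beta times the synthesized optimum adaptive code vector
    filtered_codebook     -- list of synthesized probability code vectors
    """
    residual = [t - a for t, a in zip(target, adaptive_contribution)]
    best_index, best_gain, best_error = None, 0.0, float("inf")
    for index, y in enumerate(filtered_codebook):
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(r * v for r, v in zip(residual, y)) / energy
        error = sum((r - gain * v) ** 2 for r, v in zip(residual, y)) / len(residual)
        if error < best_error:
            best_index, best_gain, best_error = index, gain, error
    return best_index, best_gain, best_error
```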
[0074]
The multiplexer 18 multiplexes and transmits each of the quantized linear prediction parameter α ′, adaptive codebook delay L, gain β, probability codebook index i, and gain γ.
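As a rough sketch of what the multiplexer 18 does, the following Python fragment packs one set of quantized parameters into a single bit field and unpacks it again. The field names and bit widths are hypothetical; only the 9-bit width of the index i follows from the 512-entry codebook mentioned above. The sketch also handles a single subframe, whereas an actual frame would carry several sets of subframe parameters.

```python
# Hypothetical field layout as (name, bit width). Only the 9-bit index is
# implied by the text (512 code vectors); every other width is an assumption.
FIELDS = [
    ("lpc_code", 24),   # code for the quantized linear prediction parameters
    ("delay", 7),       # adaptive codebook delay L
    ("beta_code", 4),   # quantized adaptive codebook gain
    ("index", 9),       # probability codebook index i (0..511)
    ("gamma_code", 4),  # quantized probability codebook gain
]


def multiplex(params):
    """Pack the integer parameter codes into one integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = params[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def demultiplex(word):
    """Inverse of multiplex: recover the parameter codes from the packed integer."""
    params = {}
    for name, width in reversed(FIELDS):
        params[name] = word & ((1 << width) - 1)
        word >>= width
    return params
```

A decoder would call `demultiplex` on each received word and then map the codes back to parameter values with the same quantization tables as the encoder.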
[0075]
The decoding operation of the speech decoding apparatus corresponding to the speech encoding apparatus as described above is the same as that in the conventional example described in the above Japanese Patent Application No. 7-268756.
[0076]
Needless to say, the speech discrimination method used in the speech discriminator is not limited to the method described above.
[0077]
Furthermore, in the above-described embodiment, a speech encoding apparatus based on code-excited linear prediction coding has been described as an example. However, the present invention can of course be applied to any speech encoding apparatus that represents a speech signal by a linear prediction parameter and a parameter of a driving sound source signal corresponding to the linear prediction residual signal.
[0078]
In addition, the speech / non-speech information may be transmitted as an encoding parameter, and the decoding apparatus may be provided with a switch control circuit and a changeover switch similar to those of the encoding apparatus so that the same control is performed. In this way it is also possible to configure a variable bit rate encoding apparatus / decoding apparatus and to perform encoding with higher compression efficiency.
[0079]
According to the speech encoding apparatus of this embodiment, the fluctuation of the parameters from frame to frame is suppressed when non-speech frames continue for the predetermined number of frames or more. The sound quality in non-speech sections is therefore stable, and a high-quality speech encoding apparatus that can encode speech signals satisfactorily is obtained.
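The parameter control that produces this effect can be sketched in Python as follows. This is one possible reading of the claims rather than the circuit of the embodiment: the class and method names, the representation of the spectrum parameters as a plain list of numbers, the weight value, and the exact frame at which holding gives way to smoothing are assumptions. The flowchart of FIG. 4 remains the authoritative description.

```python
class SpectrumController:
    """Hypothetical model of switch control circuit 3 with parameter smoother 19."""

    def __init__(self, hold_frames, noise_params, weight=0.75):
        if not 0.5 < weight < 1.0:
            # The stored non-speech parameter is weighted more heavily than
            # the current frame, so its weight must exceed one half.
            raise ValueError("weight must lie between 0.5 and 1.0")
        self.hold_frames = hold_frames    # predetermined number of frames (assumed value)
        self.memory = list(noise_params)  # parameter memory 19a, initial value from background noise
        self.weight = weight              # weight on the stored non-speech parameter (assumed value)
        self.run = 0                      # consecutive non-speech frames seen so far
        self.previous = None              # parameters output for the preceding frame

    def process(self, params, is_speech):
        """Return the spectrum parameters to use for the current frame."""
        if is_speech:
            self.run = 0
            out = list(params)
        else:
            self.run += 1
            if self.run <= self.hold_frames:
                # Keep using the preceding frame's parameters when available.
                out = list(self.previous) if self.previous is not None else list(params)
            else:
                # Non-speech has continued beyond the predetermined count:
                # smooth the stored non-speech parameter with the current frame.
                w = self.weight
                out = [w * m + (1.0 - w) * p for m, p in zip(self.memory, params)]
                self.memory = out         # stored for smoothing the next frame
        self.previous = out
        return out
```

Feeding the smoothed output back into the memory makes the smoothing recursive, so a long run of non-speech frames drifts slowly toward the current noise spectrum instead of jumping from frame to frame.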
[0080]
[Effects of the invention]
As described above, according to the first aspect of the present invention, the fluctuation of the encoded non-speech signal from frame to frame is reduced, and a high-quality speech encoding apparatus capable of encoding a speech signal satisfactorily can be obtained.
[0081]
According to the second aspect of the invention, the same effect as that of the first aspect of the invention can be obtained. In addition, since the value of a spectral parameter based on predetermined background noise is used as the initial value of the non-speech spectral parameter, the encoded non-speech signal does not sound unnatural.
[0082]
Furthermore, according to the third aspect of the present invention, the same effects as the first or second aspect of the present invention can be obtained. In addition, the fluctuation of the encoded non-speech signal from frame to frame can be further suppressed, so that the smoothing yields a smoother result.
[0083]
According to the invention described in claim 4, the same effect as that of the invention described in claim 1, claim 2, or claim 3 is achieved. In addition, since the parameter storage means for storing the non-speech spectrum parameter is provided, the next smoothing process becomes easy.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a speech encoding apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a more detailed configuration of the speech discriminator of the embodiment.
FIG. 3 is a diagram illustrating an example of a relationship between a threshold value determined by a threshold value determination circuit of a speech discriminator and background noise energy in the embodiment.
FIG. 4 is a flowchart showing the operation of the speech encoding apparatus according to the embodiment.
[Explanation of symbols]
2 ... Speech discriminator (speech discrimination means)
3 ... Switch control circuit (control means)
5 ... LPC analyzer (linear prediction analysis means)
6 ... Synthesis filter
11 ... Delay circuit (part of driving sound source signal generating means)
12 ... Adaptive codebook (part of driving sound source signal generation means)
14 ... Probability codebook (part of driving sound source signal generation means)
19 ... Parameter smoother (spectrum smoothing means for non-speech frames)
19a ... Parameter memory (parameter storage means)

Claims

1. A speech encoding apparatus comprising:
speech discrimination means for discriminating whether an input signal divided into predetermined frame intervals is a speech signal or a non-speech signal;
linear prediction analysis means for outputting a spectrum parameter of the input signal;
control means for performing control so that, when the discrimination result of the speech discrimination means has been a non-speech signal continuously over a predetermined number of frames, the spectrum parameter of a preceding frame continues to be used as the spectrum parameter of the input signal;
non-speech frame spectrum smoothing means for, when frames discriminated as non-speech signals continue beyond the predetermined number of frames, smoothing a non-speech spectrum parameter prepared in advance with the spectrum parameter of the current frame and outputting the smoothed spectrum parameter;
driving sound source signal generating means for generating a driving sound source signal corresponding to a linear prediction residual signal; and
a synthesis filter that synthesizes speech from the driving sound source signal based on the spectrum parameter.

2. The speech encoding apparatus according to claim 1, wherein a spectral parameter value based on predetermined background noise is used as an initial value of the non-speech spectral parameter.

3. The speech encoding apparatus according to claim 1 or claim 2, wherein the smoothing is performed with the non-speech spectrum parameter weighted more heavily than the spectrum parameter of the current frame.

4. The speech encoding apparatus according to claim 1, claim 2, or claim 3, further comprising parameter storage means for storing the spectrum parameter output from the spectrum smoothing means for non-speech frames as the non-speech spectrum parameter to be used in smoothing the next frame.