JP3803306B2 - Acoustic signal encoding method, encoder and program thereof

Acoustic signal encoding method, encoder and program thereof

Info

Publication number
JP3803306B2
Authority
JP (Japan)
Prior art keywords
bit rate, paragraph, probability, state, speech
Legal status
Expired - Fee Related
Application number
JP2002124540A
Other languages
Japanese (ja)
Other versions
JP2003316398A (en)
Inventor
Kota Hidaka
Shinya Nakajima
Osamu Mizuno
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2002124540A
Publication of JP2003316398A
Application granted
Publication of JP3803306B2
Anticipated expiration

Description

[0001]
[Technical Field of the Invention]
The present invention relates to compression coding of acoustic signals such as speech signals and music signals, and in particular to an encoding method that changes the compression ratio, that is, the encoding bit rate, according to the state of the acoustic signal, and to an encoder and a program therefor.
[0002]
[Prior Art]
Conventionally, for efficient compression coding of speech signals, Japanese Patent Application Laid-Open No. 7-225599, for example, discloses a CELP coding method that distinguishes the cases in which the input speech signal is non-speech, unvoiced sound, transient voiced sound, and steady voiced sound, and changes the bit rate adaptively according to these cases. This coding method is, in the end, an attempt to reduce the amount of information as far as possible while maintaining perceptual quality in every part of the speech signal.
[0003]
Also, an ADPCM (differential PCM) coding method in which a plurality of users share a line using low-bit-rate coding, with the bit rate switched adaptively according to the number of users and their quality requirements, is described, for example, in Takehiro Moriya, "Speech Coding," Institute of Electronics, Information and Communication Engineers, October 20, 1998, pp. 124-128.
Furthermore, Japanese Patent Application Laid-Open No. 8-263098 discloses detecting whether the input acoustic signal is a speech signal or a music signal and switching the coding method accordingly, CELP coding in the former case and TwinVQ coding in the latter, so that coding suited to each type of input signal is performed.
[0004]
[Problems to Be Solved by the Invention]
Conventionally, the encoding bit rate has been changed adaptively according to the required quality, or according to the local state of the signal within a range that does not affect quality.
However, if a high bit rate, and hence particularly high quality, is given to the important parts of an utterance, or to the passages of a piece of music that call for special emphasis, so that, for example, even subtle emotional changes can be heard, or the music is reproduced with especially high fidelity, then the coding becomes more efficient, or the same amount of information becomes more meaningful than before. It is an object of the present invention to provide an encoding method that makes such compression coding possible.
[0005]
[Means for Solving the Problems]
(1) According to the method of the present invention for compression-coding an acoustic signal, the acoustic signal is analyzed frame by frame to extract acoustic features containing as parameters at least the fundamental frequency, the power, and the temporal variation characteristic of a dynamic feature, or the inter-frame differences (Δ components) of these. By reference to a probability codebook storing at least two sets of data, each set containing an acoustic feature and its appearance probability in the emphasized state, the appearance probabilities of the extracted acoustic features in the emphasized state are obtained. Based on these appearance probabilities, the degree of emphasis of the acoustic signal is determined for each frame, or for each paragraph made up of frames, and a code obtained by compression-coding the acoustic signal at a bit rate that is higher the larger the determined degree of emphasis is output together with a code indicating the bit rate or a change in the bit rate.
[0006]
(2) Preferably, in the method of item (1), a linear prediction coefficient is calculated in the analysis of the acoustic signal, the dynamic feature is calculated from the linear prediction coefficient, and
the fundamental frequency, the power, and the linear prediction coefficient extracted as the acoustic features are used in determining the compression-coded code.
(3) Preferably, in the method of item (1), a plurality of codebook groups are prepared, each storing sets of acoustic features and codes in a number corresponding to one bit rate; the single codebook group of the bit rate corresponding to the degree of emphasis is selected and used to determine the code corresponding to each extracted acoustic feature, which serves as the compression-coded code; and a code indicating the selected codebook group is used as the code indicating the bit rate or the change in bit rate.
[0007]
(4) In the method of item (1), preferably an encoding method corresponding to the degree of emphasis is selected, the acoustic features are encoded with the selected encoding method to obtain the compression-coded code, and a code indicating the selected encoding method is used as the code indicating the bit rate or the change in bit rate.
(5) Preferably, in the methods of items (1) to (4), the degree of emphasis is determined for each acoustic paragraph of the acoustic signal from the appearance probabilities in the emphasized state within that acoustic paragraph.
(6) Preferably, in the method of item (5), the probability codebook also contains, for each acoustic feature, its appearance probability in the calm state; the appearance probabilities are obtained for each sub-paragraph contained in the acoustic paragraph; each sub-paragraph is judged to be in the emphasized state or the calm state; and the degree of emphasis is obtained on the basis of the state judgments of the sub-paragraphs within the acoustic paragraph.
[0008]
(7) In the method of item (6), the degree of emphasis is preferably obtained from the number of sub-paragraphs within the acoustic paragraph judged to be in the emphasized state.
(8) In the method of item (6), preferably the probability codebook contains, for each acoustic feature, its appearance probability in the calm state; a weight is applied to at least one of the appearance probability in the emphasized state and the appearance probability in the calm state; the weight is varied so as to keep the number of sub-paragraphs judged emphasized at or below a predetermined value; and the degree of emphasis of the acoustic paragraph is obtained from the magnitude of the weight at that point.
[0009]
(9) In the method of item (6), preferably the acoustic signal is judged frame by frame to be a silent section or a voiced section; a voiced section surrounded by silent sections of at least a predetermined number of frames is detected as a sub-paragraph; a sub-paragraph in which the average power of one or more voiced sections contained in it is smaller than a constant multiple of the average power within the sub-paragraph is taken as a final sub-paragraph; and the group of sub-paragraphs between adjacent final sub-paragraphs is detected as an acoustic paragraph.
(10) Preferably, in any of the methods of items (1) to (6), the probability codebook contains, for each acoustic feature, its appearance probability in the calm state, and the degree of emphasis is determined from the ratio between the appearance probability in the emphasized state and the appearance probability in the calm state.
[0010]
The encoder of the present invention comprises: an encoding unit capable of compression-coding an input acoustic signal at different bit rates according to a bit rate control signal; a probability codebook storing at least two sets of data, each set containing an acoustic feature that includes as parameters at least the fundamental frequency, the power, and the temporal variation characteristic of a dynamic feature or the inter-frame differences (Δ components) of these, together with its appearance probability in the emphasized state; a feature extraction unit that analyzes the input acoustic signal frame by frame and extracts the acoustic features; and an emphasis degree determination unit that obtains the appearance probabilities of the extracted acoustic features in the emphasized state by reference to the probability codebook, determines from them a degree of emphasis for each frame of the acoustic signal or for each paragraph made up of frames, and supplies to the encoding unit a bit rate control signal specifying a bit rate that is higher the larger the degree of emphasis.
The program of the present invention is an encoding program for causing a computer to execute the steps of the acoustic signal encoding method described in any of items (1) to (10) above.
[0011]
[Embodiments of the Invention]
Overview
FIG. 1A shows an embodiment of the present invention. The acoustic signal from the input terminal 11 is analyzed frame by frame in the feature extraction unit 12, and acoustic features are extracted that contain as parameters at least the fundamental frequency, the power, and the temporal variation characteristic of a dynamic feature, or the inter-frame differences (also written Δ components) of these.
The probability codebook 13 stores at least two sets, each consisting of an acoustic feature, its appearance probability in the emphasized state, and its appearance probability in the calm state. The emphasis degree determination unit 14 obtains the appearance probabilities, in each state, of the acoustic features extracted by the feature extraction unit 12 by reference to the probability codebook 13, uses these probabilities to determine the degree of emphasis of each acoustic paragraph of the input acoustic signal, and outputs a control signal specifying a bit rate that is higher the larger the determined degree of emphasis. The encoding unit 15 compression-codes the input acoustic signal from the input terminal 11 at the bit rate set by the bit rate control signal from the emphasis degree determination unit 14, and outputs the compressed code, together with a code indicating the bit rate or a change in the bit rate, to the output terminal 16.
Features and the probability codebook
Speech features include the fundamental frequency (f0), the power (p), the temporal variation characteristic (d) of a dynamic speech feature, and pause length (silent interval) (ps). The temporal variation characteristic (d) of the dynamic feature is a parameter that serves as a measure of the speaking rate: the temporal variation of the LPC spectrum coefficients, which reflect the spectral envelope, is taken as the dynamic variation, and a speaking-rate coefficient is obtained from that variation. More specifically, the LPC cepstrum coefficients C1(t), ..., Ck(t) are extracted for each frame, and the dynamic feature d (the dynamic measure) is obtained as
d(t) = Σ_{i=1}^{k} [ Σ_{f=-f0}^{f0} f · Ci(t+f) / Σ_{f=-f0}^{f0} f² ]²
Here f0 is the number of speech frames before and after the current frame (a fixed time interval may be used instead of an integral number of frames), k is the order of the LPC cepstrum, and i = 1, 2, ..., k. As the speaking-rate coefficient, the number of local maxima of the change of the dynamic feature per unit time, or its rate of change per unit time, is used.
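As a concrete illustration of the dynamic measure just defined, the regression computation can be sketched as follows. This is a minimal Python sketch, not part of the patent disclosure; the function name and the toy cepstrum data are ours.

```python
import numpy as np

def dynamic_measure(cepstra: np.ndarray, f0: int) -> np.ndarray:
    """Dynamic measure d(t) from per-frame LPC cepstra.

    cepstra: shape (T, k), row t holding C1(t)..Ck(t).
    f0: half-width of the regression window, in frames.
    Frames whose window would run off the ends are left at 0.
    """
    T, k = cepstra.shape
    offsets = np.arange(-f0, f0 + 1)           # the index f in the formula
    denom = float(np.sum(offsets ** 2))        # sum of f^2
    d = np.zeros(T)
    for t in range(f0, T - f0):
        window = cepstra[t - f0 : t + f0 + 1]  # Ci(t+f), f = -f0..f0
        slopes = offsets @ window / denom      # regression slope per order i
        d[t] = float(np.sum(slopes ** 2))      # sum of squared slopes over i
    return d

# toy example: 100 frames of a 10th-order cepstrum
rng = np.random.default_rng(0)
print(dynamic_measure(rng.standard_normal((100, 10)), f0=4)[:6])
```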
[0012]
For example, let one frame be 100 ms, with a shift of 50 ms. The average fundamental frequency of each frame is obtained (f0'). Similarly, the average power of each frame (p') is obtained. Next, the differences between f0' of the current frame and f0' of the frames ±i frames away are taken as ±Δf0'i (Δ components). Likewise for power, the differences ±Δp'i (Δ components) between p' of the current frame and p' of the frames ±i frames away are obtained. Then f0', ±Δf0'i, p', and ±Δp'i are normalized: for example, f0' and ±Δf0'i are each divided by the average fundamental frequency of the whole speech waveform, and p' and ±Δp'i are similarly divided by the average power of the whole speech waveform subject to the utterance-state judgment. The normalization may instead divide by the average power of each sub-paragraph or speech paragraph described later. The value of i is, for example, i = 4. The number of peaks of the dynamic measure, that is, of local maxima of the change of the dynamic feature, within ±T1 ms of the current frame is counted (dp). The difference (−Δdp) between this count and the dp of the frame whose interval contains the time T2 ms before the start of the current frame is obtained, and so is the difference (+Δdp) between the dp count for ±T1 ms and the dp of the frame whose interval contains the time T3 ms after the end of the current frame. The values of T1, T2, and T3 are, for example, T1 = T2 = T3 = 450 ms. The silent intervals before and after the frame are denoted ±ps. In step 1, the values of these parameters are extracted frame by frame as the speech features. The parameters include at least the fundamental frequency (or pitch period), the power, and the temporal variation characteristic of the dynamic feature, or their inter-frame differences (Δ components).
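The frame bookkeeping of this paragraph (per-frame means, ±i-frame differences, normalization by whole-waveform averages) might look like the sketch below; the names, the wrap-around handling at the edges, and the toy tracks are our assumptions.

```python
import numpy as np

def frame_features(f0_track, power_track, i=4):
    """Per-frame f0', p' and their +/- i-frame differences, normalized.

    f0_track, power_track: one mean value per frame (100 ms frames,
    50 ms shift in the text). np.roll wraps at the edges; a real
    implementation would mask the first and last i frames.
    """
    f0p = np.asarray(f0_track, dtype=float)
    pp = np.asarray(power_track, dtype=float)
    d_f0_plus, d_f0_minus = np.roll(f0p, -i) - f0p, f0p - np.roll(f0p, i)
    d_p_plus, d_p_minus = np.roll(pp, -i) - pp, pp - np.roll(pp, i)
    f0_mean, p_mean = f0p.mean(), pp.mean()   # whole-waveform averages
    return {
        "f0": f0p / f0_mean,
        "+df0": d_f0_plus / f0_mean, "-df0": d_f0_minus / f0_mean,
        "p": pp / p_mean,
        "+dp": d_p_plus / p_mean, "-dp": d_p_minus / p_mean,
    }

feats = frame_features([120, 125, 130, 128, 135, 140, 150, 145, 138, 132],
                       [1.0, 1.2, 1.1, 0.9, 1.3, 1.5, 1.6, 1.4, 1.2, 1.0])
print(feats["+df0"][:3])
```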
[0013]
Next, how the probability codebook 13 is created is described briefly.
Subjects listen to a large number of training speech samples and label the utterance state of each portion as either calm or emphasized.
For example, the subjects gave the following reasons for labeling a portion as emphasized:
(a) the voice is loud, and nouns and conjunctions are drawn out in the utterance;
(b) the start of the utterance is drawn out to assert a change of topic, and the voice is raised to pull opinions together;
(c) the voice is made louder and higher to emphasize an important noun or the like;
(d) the pitch is high but the voice is not especially loud;
(e) while laughing wryly, as though glossing over one's true feelings out of impatience;
(f) the end of the phrase rises in pitch, as though seeking agreement from, or posing a question to, the listeners;
(g) the voice at the end of the phrase grows louder, slowly and forcefully, as if to drive the point home;
(h) the voice is loud and high, asserting the intent to interrupt and speaking louder than the other party;
(i) the voice is quiet, murmuring or whispering, stating true feelings or secrets that would be awkward to say aloud, as when a normally loud speaker says something important.
In this example, the calm state was defined as an utterance that fell under none of (a) to (i) above and that the subject felt was calm.
[0014]
For each labeled calm-state or emphasized-state section, the above speech features are extracted from the training speech, parameters are selected, and a codebook is created with the LBG algorithm using those parameters from the calm-state and emphasized-state label sections. For each quantized speech feature (code) obtained in this way, the pair of its appearance probability in the emphasized state and its appearance probability in the calm state is stored in the codebook 13. The appearance probability in the emphasized state is either the probability that the speech feature appears in the emphasized state independently of the speech features of past frames (a unigram, hereafter called the independent appearance probability) alone, or that probability combined with conditional probabilities that the feature appears in the emphasized state given each frame-wise sequence of speech features leading from past frames to the current frame. The appearance probability in the calm state is likewise either the independent appearance probability of the feature in the calm state alone, or that probability combined with the corresponding conditional probabilities for the calm state.
For example, as shown in FIG. 2, the probability codebook 13 stores, for each code C1, C2, ..., one set consisting of its speech feature, its independent appearance probabilities for the emphasized and calm states, and its conditional probabilities for the emphasized and calm states.
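FIG. 2 suggests a storage layout along the following lines. The class layout and the nearest-neighbour lookup are our assumptions; a real system would train the centroids with the LBG algorithm mentioned above.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class CodeEntry:
    centroid: np.ndarray   # quantized speech-feature vector
    p_emp: float           # independent appearance probability, emphasized
    p_nrm: float           # independent appearance probability, calm
    # conditional probabilities keyed by preceding code indices, e.g. (j,) or (j, k)
    cond_emp: dict = field(default_factory=dict)
    cond_nrm: dict = field(default_factory=dict)

def quantize(feature: np.ndarray, codebook: list) -> int:
    """Map a frame's feature vector to the index of the nearest code."""
    dists = [np.linalg.norm(feature - e.centroid) for e in codebook]
    return int(np.argmin(dists))

codebook = [
    CodeEntry(np.array([0.9, 0.1]), p_emp=0.2, p_nrm=0.8),
    CodeEntry(np.array([1.4, 0.9]), p_emp=0.7, p_nrm=0.3),
]
print(quantize(np.array([1.3, 0.8]), codebook))   # -> 1
```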
[0015]
Judging the degree of emphasis
Next, a concrete example of the emphasis degree determination unit 14 is described with reference to FIG. 1B. In this example, each sub-paragraph of the speech signal is judged to be in the emphasized state or the calm state from its appearance probabilities, and the degree of emphasis of each speech paragraph is determined on that basis.
To detect sub-paragraphs, the voiced/silence determination unit 21 classifies the input speech signal into voiced sections and silent sections. A frame of the input speech signal is judged silent if its power is at or below a predetermined value, and voiced if, for example, its frame-wise correlation function is at or above a predetermined value.
If the silent sections on both sides of a voiced section each last at least t seconds, for example 400 ms, the sub-paragraph determination unit 22 judges the enclosed voiced section to be a sub-paragraph. Suppose the final sub-paragraph detection unit 23 is given an input speech signal judged into sub-paragraphs j−1, j, and j+1, as shown in FIG. 3. Speech sub-paragraph j consists of n voiced sections and has average power Pj. When the average of the average powers pi of the voiced sections in the rear part of sub-paragraph j, that is, of the (n−α)-th through n-th voiced sections, is smaller than the average power Pj of sub-paragraph j, in other words when
Σpi/(α+1) < βPj, where Σ is the sum over i = n−α to n, and α and β are constants,
is satisfied, sub-paragraph j is detected as the final sub-paragraph of speech paragraph k, and the speech paragraph determination unit 24 judges the group of sub-paragraphs from this final sub-paragraph back to the final sub-paragraph immediately before it to be speech paragraph k. For example, α is 3 and β is 0.8. In this way, with final sub-paragraphs as delimiters, each group of speech sub-paragraphs between adjacent final sub-paragraphs is judged to be a speech paragraph.
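Assuming power-threshold silence detection and the final-sub-paragraph test Σpi/(α+1) < βPj described above, the segmentation might be sketched as follows (the frame-index bookkeeping and thresholds are ours):

```python
def subparagraphs(frame_power, silence_thresh, min_silence):
    """Voiced runs separated by silences of at least min_silence frames.
    Runs separated by shorter silences are merged into one sub-paragraph."""
    voiced_runs, start = [], None
    for t, p in enumerate(frame_power):
        if p > silence_thresh and start is None:
            start = t
        elif p <= silence_thresh and start is not None:
            voiced_runs.append([start, t])
            start = None
    if start is not None:
        voiced_runs.append([start, len(frame_power)])
    merged = []
    for run in voiced_runs:
        if merged and run[0] - merged[-1][1] < min_silence:
            merged[-1][1] = run[1]   # silence too short: same sub-paragraph
        else:
            merged.append(run)
    return [tuple(r) for r in merged]

def is_final_subparagraph(voiced_powers, alpha=3, beta=0.8):
    """Final-sub-paragraph test: the mean power of the last alpha+1 voiced
    sections is below beta times the sub-paragraph's mean power Pj."""
    Pj = sum(voiced_powers) / len(voiced_powers)
    tail = voiced_powers[-(alpha + 1):]
    return sum(tail) / len(tail) < beta * Pj

print(subparagraphs([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0], 0.5, 8))
print(is_final_subparagraph([1.0, 0.9, 0.5, 0.4, 0.3, 0.2]))
```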
[0016]
For the speech features of each frame of each sub-paragraph, the nearest quantized speech feature (code) is looked up in the probability codebook 13, and the appearance probability P(e) in the emphasized state and the appearance probability P(n) in the calm state stored with each code are retrieved by the emphasized-state probability calculation unit 25 and the calm-state probability calculation unit 26, respectively.
Using the probabilities retrieved for each frame, the emphasized-state probability calculation unit 25 and the calm-state probability calculation unit 26 then calculate, respectively, the probability that the sub-paragraph is in the emphasized state and the probability that it is in the calm state. For example, suppose speech sub-paragraph s has Ns frames and that the codes corresponding to its per-frame speech features are, in frame order, Ci1, Ci2, ..., CiNs. The probability Ps(e) that sub-paragraph s is in the emphasized state and the probability Ps(n) that it is in the calm state are computed by the following expressions.
[0017]
Ps(e) = Pemp(Ci3 | Ci1 Ci2) ... Pemp(CiNs | Ci(Ns−2) Ci(Ns−1))
Ps(n) = Pnrm(Ci3 | Ci1 Ci2) ... Pnrm(CiNs | Ci(Ns−2) Ci(Ns−1))
Here, Pemp(Ci3 | Ci1 Ci2), for example, denotes the probability that Ci3 appears in the emphasized state following Ci1 and Ci2. To obtain Pemp(Ci3 | Ci1 Ci2), it is preferable to take from the codebook 13 the independent appearance probability of Ci3 in the emphasized state, the conditional probability that Ci3 appears in the emphasized state next after Ci2, and the conditional probability that Ci3 appears in the emphasized state next after Ci1 and Ci2, and to combine these by linear interpolation. The other appearance probabilities in the emphasized state, and the appearance probabilities in the calm state, are preferably obtained by linear interpolation in the same way. Linear interpolation is described, for example, in "Spoken Language Processing" (Kenji Kita et al., Morikita Shuppan, 1996, p. 29).
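A hedged sketch of the interpolated probability computation follows; the log domain avoids underflow over long sub-paragraphs, and the interpolation weights and table layout are our choices, since the patent only states that the independent and conditional probabilities are combined by linear interpolation.

```python
import math

def interp_prob(code, prev2, prev1, uni, bi, tri, lam=(0.2, 0.3, 0.5)):
    """P(code | prev2 prev1) by linear interpolation of the unigram,
    bigram, and trigram estimates (missing n-grams fall back to 0)."""
    p1 = uni.get(code, 0.0)
    p2 = bi.get((prev1, code), 0.0)
    p3 = tri.get((prev2, prev1, code), 0.0)
    return lam[0] * p1 + lam[1] * p2 + lam[2] * p3

def subparagraph_logprob(codes, uni, bi, tri):
    """log Ps = sum of log P(Ci | Ci-2 Ci-1) over i = 3..Ns."""
    logp = 0.0
    for i in range(2, len(codes)):
        p = interp_prob(codes[i], codes[i - 2], codes[i - 1], uni, bi, tri)
        logp += math.log(max(p, 1e-12))   # floor keeps the log finite
    return logp

# toy emphasized-state and calm-state tables
uni_e = {0: 0.3, 1: 0.7}; bi_e = {(0, 1): 0.8}; tri_e = {(0, 0, 1): 0.9}
uni_n = {0: 0.7, 1: 0.3}; bi_n = {(0, 1): 0.2}; tri_n = {(0, 0, 1): 0.1}
codes = [0, 0, 1, 1]
print(subparagraph_logprob(codes, uni_e, bi_e, tri_e) >
      subparagraph_logprob(codes, uni_n, bi_n, tri_n))   # True: emphasized
```

Working in the log domain also makes the Ps(e) > Ps(n) comparison of the next paragraph a simple comparison of summed logs.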
[0018]
In this way the appearance probability Ps(e) in the emphasized state and the appearance probability Ps(n) in the calm state of sub-paragraph s are calculated by the probability calculation units 25 and 26, and on the basis of these results the emphasized-state judgment unit 27 judges sub-paragraph s to be in the emphasized state if Ps(e) > Ps(n) and in the calm state if Ps(e) < Ps(n). The emphasized-state judgment unit 27 judges, in the same way, whether every sub-paragraph is emphasized or calm.
The emphasis degree determination unit 28 determines the degree of emphasis of each speech paragraph judged by the speech paragraph determination unit 24, based on the results the emphasized-state judgment unit 27 produced for all the sub-paragraphs composing that paragraph. For example, when the degree of emphasis is divided into three levels 0, 1, and 2 as shown in FIG. 4A, a speech paragraph containing no sub-paragraph judged emphasized is given degree 0, one containing one or two emphasized sub-paragraphs is given degree 1, and one containing three or more is given degree 2.
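The FIG. 4A rule is a simple threshold on the count of emphasized sub-paragraphs; a sketch:

```python
def emphasis_degree(n_emphasized: int) -> int:
    """FIG. 4A rule: 0 emphasized sub-paragraphs -> degree 0,
    1 or 2 -> degree 1, 3 or more -> degree 2."""
    if n_emphasized == 0:
        return 0
    return 1 if n_emphasized <= 2 else 2

assert [emphasis_degree(n) for n in (0, 1, 2, 3, 5)] == [0, 1, 1, 2, 2]
```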
[0019]
Alternatively, as shown in FIG. 4B, the judgment in the emphasized-state judgment unit 27 may be made with the appearance probability in the emphasized state multiplied by a weight w. If, with a weight 0 ≤ w < ε, one or more sub-paragraphs in the speech paragraph are judged emphasized, the degree of emphasis is 2; if, judged with a weight ε ≤ w < 1/ε, exactly one sub-paragraph in the speech paragraph is judged emphasized, the degree is 1; and if, judged with a weight 1/ε ≤ w, the number of sub-paragraphs judged emphasized is 0 or 1, the degree is 0. Here ε is a positive real number less than 1, preferably about 0.5 to 0.8.
[0020]
The multiplication of the appearance probability in the emphasized state by the weight w is performed by the multiplication unit 27a, shown by a broken line in FIG. 1B, and the value of the weight w is controlled by the emphasis degree determination unit 28.
The degree of emphasis determined by the emphasis degree determination unit 28 is input to the bit rate control signal generation unit 29, which, in the examples of FIGS. 4A and 4B, generates a low bit rate control signal when the degree of emphasis is 0, a medium bit rate control signal when it is 1, and a high bit rate control signal when it is 2, and supplies the signal to the encoding unit 15 in FIG. 1A.
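The FIG. 4B variant re-runs the sub-paragraph judgment with the emphasized-state probability scaled by a weight w and reads the degree off the resulting counts; the probe weights and interface below are our assumptions, as is the degree-to-rate mapping table standing in for generator 29.

```python
import math

def count_emphasized(log_pe, log_pn, w):
    """Sub-paragraphs judged emphasized when w * Ps(e) > Ps(n) (log domain)."""
    lw = math.log(w)
    return sum(1 for le, ln in zip(log_pe, log_pn) if lw + le > ln)

def weighted_degree(log_pe, log_pn, eps=0.7):
    """FIG. 4B read-off with one probe weight per band (probe values are ours)."""
    if count_emphasized(log_pe, log_pn, w=0.9 * eps) >= 1:
        return 2   # emphasized even under a suppressing weight w < eps
    if count_emphasized(log_pe, log_pn, w=1.0) == 1:
        return 1   # exactly one at a neutral weight in [eps, 1/eps)
    return 0       # otherwise: calm even without suppression

RATE_SIGNAL = {0: "low", 1: "medium", 2: "high"}   # degree -> control signal
print(RATE_SIGNAL[weighted_degree([-3.0, -1.0], [-2.0, -4.0])])   # "high"
```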
[0021]
According to the input bit rate control signal, the encoding unit 15 compression-codes the input speech signal in the low-rate unit 15L at a predetermined low bit rate for a low bit rate control signal; in the medium-rate unit 15M at a predetermined medium bit rate, higher than the low bit rate, for a medium bit rate control signal; and in the high-rate unit 15H at a predetermined high bit rate, higher still than the medium bit rate, for a high bit rate control signal.
Although three encoding units 15L, 15M, and 15H are shown here as the encoding unit 15, one per bit rate, this is only for convenience. They may be built as independent encoders, or, as in the known examples cited in the prior-art section, part of the functional configuration may be switched, added, or removed according to the bit rate; various configurations are possible, and the unit can be built in the same way as a conventional variable-bit-rate encoder. The input speech signal is also delayed in the buffer storage unit 17, by the time it takes to generate the bit rate control signal, before being supplied to the encoding unit 15.
[0022]
As shown for example in FIG. 4C, the encoding unit 15 outputs, ahead of the compressed code string 31 for the speech signal, a code 32 indicating the encoding bit rate of that code string. This bit rate code 32 is added at least every time the encoding bit rate changes.
With this encoder, the semantically important parts of a speech signal, which generally contain emphasized sub-paragraphs, are encoded at a high bit rate, so the decoded speech renders them more clearly, in some cases even restoring speech charged with the speaker's emotion, while speech paragraphs in the calm state are given a low bit rate, so that the coding efficiency can be kept high.
[0023]
FIG. 5 shows an example of a decoder that decodes the output of the encoder shown in FIG. 1A. The code string from the input terminal 41 is separated by the separation unit 42 into the coded string 31 and the bit rate code 32, and the coded string 31 is supplied to the decoding unit 43. The bit rate code 32 is decoded by the bit rate code decoding unit 44, and, according to the decoded bit rate control signal, the coded string 31 is decoded by one of the high bit rate decoding unit 43H, the medium bit rate decoding unit 43M, and the low bit rate decoding unit 43L in the decoding unit 43. The decoding unit 43 corresponds to the encoding unit 15 in FIG. 1A; the bit rate decoding units 43H, 43M, and 43L may be built independently, or shared with part of the functional configuration switched, removed, or added, and the unit can be configured like a conventional variable-bit-rate decoder. In other words, in this invention a higher degree of emphasis means a higher bit rate, that is, more precise encoding, so the decoded acoustic signal is of higher quality in sections with a higher degree of emphasis.
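The FIG. 4C format, a bit rate code 32 written ahead of the compressed string 31 whenever the rate changes, and the FIG. 5 dispatch can be sketched as a writer/reader pair. The marker byte, length prefix, and byte-oriented framing are our simplifications, not the patent's format.

```python
def write_stream(segments):
    """segments: list of (rate, payload_bytes) with rate in {0, 1, 2}.
    A rate record (marker 0xFF + rate byte) is emitted only when the
    rate changes; payloads are length-prefixed. Payload lengths must
    stay below 0xFF so the marker byte remains unambiguous."""
    out, last_rate = bytearray(), None
    for rate, payload in segments:
        if rate != last_rate:
            out += bytes([0xFF, rate])
            last_rate = rate
        out += bytes([len(payload)]) + payload
    return bytes(out)

def read_stream(data):
    """Yield (rate, payload), dispatching on the current rate code the
    way decoding unit 43 selects 43L/43M/43H."""
    i, rate = 0, None
    while i < len(data):
        if data[i] == 0xFF:
            rate = data[i + 1]
            i += 2
        n = data[i]
        yield rate, bytes(data[i + 1 : i + 1 + n])
        i += 1 + n

stream = write_stream([(0, b"aa"), (0, b"bb"), (2, b"cc")])
print(list(read_stream(stream)))  # [(0, b'aa'), (0, b'bb'), (2, b'cc')]
```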
[0024]
An example of the procedure of the encoding method of the present invention, corresponding to the encoder shown in FIG. 1A, is described with reference to FIG. 6.
When an acoustic signal, a speech signal in this example, is input, it is first stored in a storage unit (S1), and the speech features of the signal are extracted frame by frame (S2). Each portion is also classified as a silent or voiced section (S3); voiced sections flanked by silent sections of at least a predetermined length are extracted as speech sub-paragraphs (S4); the final sub-paragraphs among them are detected, and speech paragraphs, each consisting of the group of sub-paragraphs between adjacent final sub-paragraphs, are extracted (S5).
[0025]
The per-frame speech features extracted in step S2 are quantized by reference to the probability codebook 13, that is, matched to the nearest code in the probability codebook 13 (S6); the independent appearance probabilities of each code in the emphasized and calm states, or these together with the conditional appearance probabilities, are taken from the codebook 13, and the appearance probabilities Ps(e) and Ps(n) in the emphasized and calm states are obtained for each speech sub-paragraph (S7). From these appearance probabilities it is then judged whether each sub-paragraph is emphasized or calm, and the degree of emphasis of each speech paragraph is determined according to the number of its emphasized sub-paragraphs, or by weighting the appearance probability in the emphasized state during the emphasized/calm judgment, or by a combination of the two (S8).
[0026]
A bit rate is determined according to the determined degree of emphasis (S9), the speech signal of the corresponding speech paragraph is encoded at that bit rate (S10), and the coded string is output together with the code indicating its encoding bit rate (S11).
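Tying steps S1 to S11 together, the overall flow might read as below; every helper is a trivial stand-in for the corresponding sketch above, not the patent's implementation.

```python
# Illustrative stand-ins so the flow below actually runs; a real system
# would plug in the sketches from the preceding sections.
def extract_features(frame):     return sum(frame) / len(frame)
def detect_paragraphs(feats):    return [feats[:2], feats[2:]]
def judge_subparagraph(feat):    return feat > 0.5
def emphasis_degree(n):          return 0 if n == 0 else (1 if n <= 2 else 2)
def encode_segment(para, rate):  return bytes(len(para))

def encode(signal_frames):
    """S1-S11 flow: features (S2), segmentation (S3-S5), sub-paragraph
    judgment (S6-S7), paragraph emphasis degree (S8), rate choice (S9),
    encoding (S10), and output of rates with the coded strings (S11)."""
    feats = [extract_features(fr) for fr in signal_frames]
    out = []
    for para in detect_paragraphs(feats):
        n_emph = sum(judge_subparagraph(f) for f in para)
        rate = {0: "low", 1: "medium", 2: "high"}[emphasis_degree(n_emph)]
        out.append((rate, encode_segment(para, rate)))
    return out

print(encode([[0.1, 0.2], [0.9, 0.8], [0.7, 0.6], [0.2, 0.1]]))
```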
Modifications
In the encoding performed by the encoding unit 15, if the coding method uses the same parameters as the speech features extracted by the feature extraction unit 12, the values extracted by the feature extraction unit 12 can be reused. Conversely, some or all of the speech feature parameters that the encoding unit 15 extracts for encoding may be reused as the extraction parameters of the feature extraction unit 12. That is, as shown for example in FIG. 7, features are extracted from the input acoustic signal frame by frame in the feature extraction unit 12; using at least some of the extracted feature parameters, the emphasis degree determination unit 14 determines the degree of emphasis of each acoustic paragraph of the acoustic signal, and a corresponding bit rate control signal is supplied to the encoding unit 15. The encoding unit 15 then encodes that acoustic paragraph of the input acoustic signal, using at least some of the parameters extracted by the feature extraction unit 12, at the bit rate set by the input bit rate control signal.
[0027]
The present invention applies not only to the encoding of speech signals: music signals can likewise be encoded with the important musical passages at a high bit rate and the unimportant parts at a low bit rate, raising the coding efficiency. The reasons the subjects gave for feeling that a passage was emphasized when the probability codebook 13 was created for music are listed below.
(a) the voice is loud and high;
(b) the voice is powerful;
(c) the voice is high and the accent is strong;
(d) the voice is high and the voice quality changes;
(e) the notes are sustained and the voice is loud;
(f) the voice is loud and high and the accent is strong;
(g) the voice is loud and high, shouting;
(h) the voice is high and the accent changes;
(i) the notes are sustained, the voice is loud, and the phrase ends high;
(j) the voice is high and the notes are sustained;
(k) the notes are sustained, shouted, and the voice is high;
(l) the phrase endings rise and are powerful;
(m) slow and forceful;
(n) the melody is irregular;
(o) the melody is irregular and the voice is high.
Next, an example in which the present invention is applied to TwinVQ is described with reference to FIG. 8. TwinVQ is described, for example, in Nikkei Electronics, April 21, 1997 (No. 687), pp. 181-202, among other sources. For the acoustic signal from the input terminal 11, the signal within each frame is analyzed prior to the orthogonal transform, and the transform length, that is, the number of divisions of the frame, is determined by the division selection unit 51. For example, when the frame length is 2048 samples, it is decided whether to apply a 2048-point transform once, a 512-point transform four times, or a 128-point transform sixteen times to the frame data. Following this decision, the window-division MDCT unit 52 converts the acoustic signal into frequency-domain signals (coefficients) by the Modified Discrete Cosine Transform, one transform per determined transform length.
[0028]
Separately from this transform, the LPC analysis unit 53 estimates the spectral envelope of the acoustic signal by linear prediction analysis, and the division unit 54 normalizes the frequency-domain coefficients by this spectral envelope. The pitch analysis unit 55 extracts the periodic peak component (the pitch component) from the low-frequency part of the flattened frequency-domain coefficients, and the subtraction unit 56 subtracts this component from the output of the division unit 54. Further, the Bark-scale spectrum analysis unit 57 obtains an envelope by averaging the output of the division unit 54 over nonlinear divisions proportional to the Bark scale, and the division unit 58 divides the output of the subtraction unit by this envelope. In this way the fine spectral peaks in the flattened frequency-domain coefficients from the division unit 54 are flattened by the pitch component and flattened further in the division unit 58.
[0029]
The average power of the whole of the frequency-domain coefficients flattened in this way is computed by the power analysis unit 59, and the output of the division unit 58 is normalized by this average power in the normalization unit 61. All of the normalized frequency-domain coefficients are interleaved and divided into sub-vectors by the interleave division unit 62, and the divided normalized frequency-domain coefficients are vector-quantized by the conjugate-structure weighted vector quantization unit 63, which represents each of them as the sum of entries from two codebooks. At this point a psychoacoustic model is created in the model generation unit 64 from the spectral envelope characteristic from the LPC analysis unit 53, the pitch component from the pitch component analysis unit 55, the averaged spectral envelope from the Bark-scale spectrum analysis unit 57, and the per-subframe power from the power analysis unit 59; the vectors selected in the vector quantization unit 63 are weighted with this psychoacoustic model, and the vector quantization is performed under a distance measure such that the perceptual distortion of the decoded signal is minimized.
[0030]
In this embodiment of the invention, the analysis results of the LPC analysis unit 53, the pitch component analysis unit 55, and the power analysis unit 59 are input to the emphasis degree determination unit 14; at least part of the input analysis results serves as the acoustic features described above; on the basis of the input analysis results, sub-paragraphs and acoustic paragraphs are extracted; and, further referring to the probability codebook 13, the appearance probability in the emphasized state and the appearance probability in the calm state are calculated for each sub-paragraph to judge the emphasized and calm states and the degree of emphasis of each acoustic paragraph. As described earlier, the linear prediction coefficients from the LPC analysis unit 53 are used to obtain the LPC cepstrum and then its temporal variation characteristic; the dynamic feature is computed in the dynamic feature calculation unit 78 and input to the emphasis degree determination unit 14.
[0031]
In this example the degree of emphasis has two levels, 0 and 1, and correspondingly there are provided LPC codebooks 65L and 65H, pitch codebooks 66L and 66H, spectrum codebooks 67L and 67H, power codebooks 68L and 68H, and conjugate-structure codebooks 69L and 69H. The codebooks 66L, 67L, 68L, and 69L are smaller than the corresponding codebooks 66H, 67H, 68H, and 69H: each code vector has fewer elements and fewer code vectors are held, so a code (index) produced with the former codebooks has fewer bits than one produced with the latter.
[0032]
When the emphasis degree determination unit 14 judges the degree of emphasis to be 0, the low bit rate control signal from the unit 14 controls the codebook switching units 71 to 75 so that: the LSP parameters representing the spectral envelope analyzed in the LPC analysis unit 53 are encoded using the LPC codebook 65L; the pitch component extracted in the pitch component analysis unit 55 is encoded using the pitch codebook 66L; the averaged envelope analyzed in the Bark-scale spectrum analysis unit 57 is encoded using the spectrum codebook 67L; the average power detected in the power analysis unit 59 is encoded using the power codebook 68L; and the flattened frequency-domain coefficients are encoded in the weighted vector quantization unit 63 using the conjugate-structure codebook 69L. These coded outputs are combined in the synthesis unit 76 and output.
[0033]
When the emphasis degree determination unit 14 judges the degree of emphasis to be 1, the bit rate control signal from the unit 14 switches the codebook switching units 71 to 75 to the codebooks 65H to 69H; the analysis units 53, 55, 57, and 59 encode using the codebooks 65H, 66H, 67H, and 68H respectively, and the vector quantization unit 63 encodes using the codebook 69H; these coded outputs are combined in the synthesis unit 76 and output. At least whenever the bit rate control signal changes, the synthesis unit 76 attaches a bit rate code indicating this to the head of the combined code before outputting it. In this example the degree of emphasis is either 0 or 1, so one bit suffices for the bit rate code; and since a change in emphasis is either from 0 to 1 or from 1 to 0, one bit also suffices for the code indicating that the bit rate control signal has changed.
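In sketch form, the switch between the L and H codebook groups performed by the switching units 71 to 75 amounts to choosing which codebook a vector quantizer searches; the sizes and data below are illustrative, not the patent's codebooks.

```python
import numpy as np

rng = np.random.default_rng(1)
CODEBOOKS = {
    0: rng.standard_normal((8, 4)),   # degree 0: small "L" codebook, 3-bit index
    1: rng.standard_normal((32, 4)),  # degree 1: large "H" codebook, 5-bit index
}

def vq_encode(vector, degree):
    """Search only the codebook group selected by the emphasis degree;
    returns (index, bits_used)."""
    book = CODEBOOKS[degree]
    idx = int(np.argmin(np.linalg.norm(book - vector, axis=1)))
    return idx, int(np.log2(len(book)))

v = rng.standard_normal(4)
print(vq_encode(v, 0), vq_encode(v, 1))
```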
[0034]
The corresponding decoder is a conventional TwinVQ decoder provided with codebooks identical to 65L and 65H, 66L and 66H, 67L and 67H, 68L and 68H, and 69L and 69H; it decodes using one of these sets of codebooks according to the input bit rate code.
As is clear from the above description, in this embodiment the LPC analysis unit 53, the pitch component analysis unit 55, and the power analysis unit 59 together constitute a feature extraction unit 79 that extracts both the features used for judging the degree of emphasis and the features used for encoding the input acoustic signal.
[0035]
This embodiment is thus an example in which a plurality of codebook groups corresponding to a plurality of predetermined encoding bit rates are prepared (here two groups, one consisting of the codebooks 65L, 66L, 67L, 68L, and 69L, the other of the codebooks 65H, 66H, 67H, 68H, and 69H), and the input acoustic signal is encoded using the single codebook group corresponding to the bit rate determined by the emphasis degree determination unit 14. Changing the bit rate by selecting a codebook group does not require that every codebook used by the encoder be selected according to the bit rate; for example, as shown by the broken line in FIG. 8, encoding may use a single LPC codebook 65 regardless of changes in the bit rate.
[0036]
Next, an example in which the encoding bit rate is changed according to the degree of emphasis by changing the encoding method is described with reference to FIG. 9. In this embodiment the switching of the encoding method in the "acoustic signal encoding method" of Japanese Patent Application Laid-Open No. 8-263098 is carried out according to the encoding bit rate determined from the degree of emphasis; it is described briefly below.
Features are extracted frame by frame from the input acoustic signal by the feature extraction unit 79, the emphasis degree determination unit 14 judges the degree of emphasis from the extracted features, and a corresponding bit rate control signal is output. In this example the judged degree of emphasis is either 0 or 1. When the degree of emphasis is judged to be 0, the switching unit 81 is controlled by the bit rate control signal so that the input terminal is connected to the first encoding unit, in this example the CELP encoding unit 82. The input acoustic signal is fed to the inverse filter 83 in the encoding unit 82, the frequency dependence of its spectral envelope is suppressed, and a residual signal is obtained. This residual signal is input to the codebook selection unit 85, which computes the distortion between it and each adaptive codebook vector (a past residual signal, one pitch period back for each candidate period) from the adaptive codebook 86, and selects the pitch period corresponding to the adaptive codebook vector that minimizes the distortion. The code Cb corresponding to this pitch period, that is, to the selected adaptive codebook vector, becomes coded output. The adaptive codebook vectors may also be generated by repeating, at various pitch periods, the decoded residual signal obtained for the previously coded frame. The selected vector is subtracted from the residual signal from the inverse filter 83 in the difference circuit 87, and the resulting residual is vector-quantized in the time-domain vector quantization unit 88 with reference to the fixed codebook 89. The dequantized output and the vector selected by the codebook selection unit 85 are added in the adder circuit 91 to decode (synthesize) the residual signal, which is supplied to the adaptive codebook 86. The filter coefficients of the inverse filter 83 use the analysis results of the LPC analysis unit 84 in the feature extraction unit 79, and the extraction from the adaptive codebook 86 according to pitch period length is performed on the basis of the fundamental frequency extracted in the feature extraction unit 79, so the acoustic feature parameters are shared between encoding and emphasis degree determination. The analysis results of the LPC analysis unit 84 are encoded by the LPC quantization unit 92, and its code Ca, the code Cb indicating the vector selected by the adaptive codebook selection unit 85, the code Cc indicating the vector selected by the quantization unit 88, and a code indicating that the CELP encoding unit 82 was selected (the code indicating the bit rate) are output from the synthesis unit 76.
[0037]
On the other hand, when the degree of emphasis is judged to be 1, the switching unit 81 is controlled by the bit rate control signal so that the input terminal is connected to the second encoding unit, in this example the TwinVQ encoding unit 94. The input acoustic signal is converted into a frequency-domain signal by the MDCT unit 52 in the encoding unit 94; the frequency-domain coefficients are divided, in the division circuit 54, by the quantized spectral outline in the frequency domain and thereby normalized into frequency-domain residual coefficients; these residual coefficients are divided, in the division circuit 58, by the inter-frame prediction spectrum from the inter-frame prediction unit 95 and normalized into still flatter coefficients (the frequency-domain fine structure), which are vector-quantized by the frequency-domain quantization unit 96; this quantization unit 96 outputs the dequantized fine structure and the quantization code Ce of the fine structure. The dequantized fine structure, in turn, is multiplied by the inter-frame prediction spectrum from the inter-frame prediction unit 95 in the multiplier 97 to decode the residual coefficients, which are input to the inter-frame prediction unit 95, and the inter-frame prediction spectrum and the quantization code Cf of the inter-frame prediction coefficients are output. These codes Ce and Cf, the LPC quantization code Ca from the LPC quantization unit 92, and a code indicating that the TwinVQ encoding unit 94 was selected (the code indicating the bit rate) are combined in the synthesis unit 76 and output.
[0038]
In this case too, the LPC coefficients extracted by the feature extraction unit 79 are used both for emphasis degree judgment and for encoding, and, although not shown in FIG. 9, the power extracted by the feature extraction unit 79 can also be used for the power normalization performed within the frequency-domain quantization unit 96.
The number of emphasis degrees to be judged may be increased. For example, emphasis degree 0 may be split into two levels, with a fixed codebook 89' of a size different from the fixed codebook 89 (holding a different number of fixed vectors) provided, so that one of the fixed codebooks 89 and 89' is used for encoding at one of the two levels and the other at the other level. Similarly, several codebooks holding different numbers of vectors may be prepared for the frequency-domain quantization unit 96, emphasis degree 1 split into several levels, and the encoding in the frequency-domain quantization unit 96 performed with a codebook holding more vectors the higher the degree of emphasis.
[0039]
In the example above in which the degree of emphasis is judged with the appearance probability in the emphasized state multiplied by the weight w, the judgment may instead be made with the appearance probability in the calm state multiplied by the weight w. In that case, for example, the weight w applied in the multiplication unit 27b, shown by a broken-line box in FIG. 1B, is controlled, and a high degree of emphasis is assigned when sub-paragraphs judged emphasized remain even for a large w. As another technique for judging the degree of emphasis, as shown by the broken line in FIG. 1B, the ratio of the appearance probability P(e) in the emphasized state obtained by the emphasized-state probability calculation unit 25 to the appearance probability P(n) in the calm state obtained by the calm-state probability calculation unit 26 may be formed in the division unit 30, as P(e)/P(n) or P(n)/P(e), and the degree of emphasis determined according to the magnitude of this ratio. For example, when P(n)/P(e) is used, the smaller this ratio, the larger the degree of emphasis. In the example shown, the degree of emphasis is determined for each sub-paragraph, and the corresponding bit rate is decided accordingly. Alternatively, only the appearance probability P(e) in the emphasized state may be used, a sub-paragraph being judged to be in the emphasized state if this value is at or above a preset value.
[0040]
The degree of emphasis may be determined for various kinds of paragraph: for each paragraph made up of frames, that is, for each acoustic paragraph, each sub-paragraph, each voiced section (voiced paragraph), for each frame, or for each fixed interval such as 10 or 5 seconds.
When encoding a video signal closely tied to a speech signal, as in television, the speech signal may be encoded at a bit rate according to the degree of emphasis while the video signal is encoded with its bit rate varied in the same way.
The control of the encoding bit rate described above may also be made finer by dividing the degree of emphasis into four or more levels. The sub-paragraphs and acoustic paragraphs may each be of fixed length.
[0041]
The encoder of the present invention, for example the one shown in FIG. 1A, may be realized by having a computer execute a program. In that case, an encoding program for executing the steps of the encoding method shown in FIG. 6, for example, is installed into the program memory of the computer from a recording medium such as a CD-ROM or a flexible magnetic disk, or downloaded over a communication line, and the computer executes the program.
[0042]
[Effects of the Invention]
As described above, according to the present invention, efficient encoding is achieved by raising the bit rate of the parts of an acoustic signal whose content is important or which are to be emphasized, and lowering the bit rate of the remaining parts.
[Brief Description of the Drawings]
[FIG. 1] A shows the functional configuration of an example of an encoder according to the invention; B shows a concrete example of the functional configuration of its emphasis degree determination unit 14.
[FIG. 2] A diagram showing an example of the contents stored in the probability codebook 13 of FIG. 1.
[FIG. 3] A diagram for explaining speech sub-paragraphs, final speech sub-paragraphs, and speech paragraphs.
[FIG. 4] A and B show examples of emphasis degree decision tables; C shows an example of the output code format.
[FIG. 5] A diagram showing an example of the functional configuration of a decoder corresponding to the encoder of FIG. 1A.
[FIG. 6] A flowchart showing an example of the processing procedure of the encoding method of the invention.
[FIG. 7] A diagram showing an example of the functional configuration of another embodiment of the invention.
[FIG. 8] A diagram showing an example of the functional configuration of an encoder in which the invention is applied to TwinVQ.
[FIG. 9] A diagram showing an example of the functional configuration of still another embodiment of the invention.
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to compression coding of audio signals such as audio signals and music signals, and more particularly to an encoding method for changing a compression rate, that is, an encoding bit rate according to the state of the audio signal, an encoder thereof, and a program thereof.
[0002]
[Prior art]
Conventionally, in order to efficiently compress and encode an audio signal, for example, Japanese Patent Application Laid-Open No. 7-225599 discloses a case where an input audio signal is a non-voice, an unvoiced sound, a transient voiced sound, and a steady voiced sound. The CELP coding method is shown in which the bit rate is adaptively changed according to these cases. This encoding method is intended to reduce the amount of information as much as possible while maintaining auditory quality in each part of the audio signal.
[0003]
Also, for example, ADPCM (differential PCM) coding that allows a plurality of users to share a line using low bit rate coding and adaptively switches the bit rate according to the number of users and quality requirements. The method is shown, for example, in pages 124 to 128 of "Speech coding" by Takehiro Moriya, published by the Institute of Electronics, Information and Communication Engineers, October 20, 1998.
Furthermore, it is detected whether the input acoustic signal is an audio signal or a music signal. In the former case, the coding method is switched by the CELP coding method, and in the latter case by the TwinVQ coding method. JP-A-8-263098 discloses that encoding is performed.
[0004]
[Problems to be solved by the invention]
Conventionally, the coding bit rate is changed adaptively according to the required quality or within a range that does not affect the quality according to the partial state of the signal.
However, the important part of the speech utterance content, especially the part that you want to emphasize in one piece of music, should have a high bit rate, especially improve the quality, for example to be able to hear even minor emotional changes, or especially in music If the quality is high, more efficient coding, or even the same amount of information can be made more meaningful information than before. It is an object of the present invention to provide an encoding method that enables such compression encoding.
[0005]
[Means for Solving the Problems]
(1) According to the method for compressing and encoding an acoustic signal according to the present invention, the acoustic signal is analyzed for each frame, and at least the fundamental frequency, power, time-varying characteristics of the dynamic feature quantity, or the difference between these frames (Δ component) ) Is extracted as a parameter, and enhancement of the extracted acoustic feature quantity is performed with reference to a probability codebook storing at least two sets of data including the acoustic feature quantity and the appearance probability in the emphasized state. The appearance probability in the state is obtained, and the enhancement degree of the acoustic signal is determined for each frame or paragraph including the frame based on the obtained appearance probability in the enhancement state, and the higher the determined enhancement degree, the higher the bit rate. A code obtained by compression-coding the acoustic signal and a code indicating a bit rate or a bit rate change are output.
[0006]
(2) Preferably in the method of (1), a linear prediction coefficient is calculated in the analysis of the acoustic signal, and the dynamic feature value is calculated from the linear prediction coefficient.
The fundamental frequency, the power, and the linear prediction coefficient extracted as the acoustic feature amount are used for determining the compression-coded code.
(3) Preferably, in the method of (1), a plurality of codebook groups storing a set of acoustic feature quantities and codes corresponding to the respective bit rates are prepared, and the bit rate corresponding to the enhancement degree By selecting and using one codebook group, a code corresponding to each of the extracted acoustic feature values is determined, the code is the compression-coded code, and the code is a code indicating the bit rate or the bit rate change. A code indicating the selected codebook group is used.
[0007]
(4) In the method of (1), preferably, an encoding method corresponding to the degree of enhancement is selected, the acoustic feature value is encoded using the selected encoding method, and the compression encoded code is used. The code indicating the selected encoding method is used as the code indicating the bit rate or the bit rate change.
(5) Preferably in the method of said (1)-(4) item, an emphasis degree is determined for every acoustic paragraph of an acoustic signal from the appearance probability in the emphasis state in the acoustic paragraph.
(6) Preferably, in the method of (5), the probability codebook includes the appearance probability in a calm state for each acoustic feature, determines the appearance probability for each sub-paragraph included in the acoustic paragraph, It is determined whether the small paragraph is in an emphasized state or a calm state, and the degree of enhancement is obtained based on the state determination result of the small paragraph in the acoustic paragraph.
[0008]
(7) In the method of (6), the degree of enhancement is preferably determined based on the number of small paragraphs determined to be in an enhanced state in the acoustic paragraph.
(8) In the method of (6), the probability codebook preferably includes an appearance probability in a calm state for each acoustic feature, and includes an appearance probability in the emphasized state and an appearance probability in the calm state. At least one of them is given a weight, the weight is changed, the number of small paragraphs determined to be in an emphasized state is controlled to a predetermined value or less, and the above-described enhancement degree of the acoustic paragraph is obtained based on the magnitude of the weight at that time.
[0009]
(9) In the method of (6), preferably, the music signal is determined for each frame as a silent section or a voiced section, and a voiced section surrounded by a predetermined number or more of the silent sections is defined as the small sound section. Detect sub-paragraphs between adjacent sub-paragraphs, with sub-paragraphs that end with sub-paragraphs whose average power in one or more voiced sections is less than a constant multiple of the average power in the sub-paragraph. Detect as the acoustic paragraph.
(10) Preferably, in the method according to any one of (1) to (6), the probability codebook includes an appearance probability in a calm state for each acoustic feature quantity, and an appearance probability and a calm state in an emphasized state. The degree of emphasis is determined from the ratio to the appearance probability at.
[0010]
According to the encoder of the present invention, an encoding unit capable of compressing and encoding an input acoustic signal at a different bit rate according to a bit rate control signal, and at least a time of a fundamental frequency, power, and a dynamic feature amount A probability codebook in which at least two or more sets of data including an acoustic feature amount including a change characteristic or an inter-frame difference (Δ component) as a parameter and an appearance probability in the emphasized state are stored, and an input acoustic signal for each frame To determine the appearance probability in the emphasized state of the extracted acoustic feature amount with reference to the probability codebook, and based on the obtained appearance probability, An enhancement degree determination unit that obtains an enhancement degree for each frame of an acoustic signal or each paragraph including a frame, and supplies a bit rate control signal with a higher bit rate to the encoding unit as the enhancement degree increases. To Bei.
The program of the present invention is an encoding program for causing a computer to execute each procedure of the acoustic signal encoding method according to any one of (1) to (10).
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Overview
FIG. 1A shows an embodiment of the present invention. The acoustic signal from the input terminal 11 is analyzed for each frame by the feature quantity extraction unit 12, and at least the fundamental frequency, the power, the time change characteristic of the dynamic feature quantity, or the difference between these frames (also referred to as Δ component) is used as a parameter. Extract the acoustic features.
The probability codebook 13 stores at least two or more sets of acoustic features, appearance probabilities in the emphasized state, and appearance probabilities in the calm state. The appearance probability in each state of the acoustic feature amount extracted by the feature amount extraction unit 12 in the enhancement level determination unit 14 is obtained with reference to the probability codebook 13, and using these appearance probabilities, for each acoustic paragraph of the input acoustic signal The degree of enhancement is determined, and a control signal with a higher bit rate is output as the determined degree of enhancement is larger. The encoding unit 15 compresses and encodes the input audio signal from the input terminal 11 at a bit rate determined by the bit rate control signal from the enhancement degree determination unit 14, and the compression code and a code indicating a change in the bit rate or the bit rate Is output to the output terminal 16.
Features, probability codebook
The voice feature amount includes a fundamental frequency (f0), power (p), time change characteristic (d) of voice dynamic feature amount, pause time length (silent section) (ps), and the like. The time variation characteristic (d) is a parameter as a measure of the speech rate, and the time variation characteristic of the LPC spectrum coefficient reflecting the spectrum envelope is obtained as the dynamic variation amount, and the speech rate coefficient is obtained based on the time variation. More specifically, LPC cepstrum coefficients Cl (t),..., Ck (t) are extracted for each frame to obtain a dynamic feature amount d (dynamic measure) as shown in the following equation. d (t) = Σi = 1 kf = t-f0 t + f0[F × Ci (t)] / (Σf = t-f0 t + f0f2)]2Here, f0 is the number of frames in the preceding and following speech sections (not necessarily an integer number of frames but may be a fixed time section), and k is the order of the LPC cepstrum, i = 1, 2,. As the coefficient of speech rate, the number of maximum points of change in dynamic feature quantity per unit time or the rate of change per unit time is used.
[0012]
For example, 100 ms is one frame and the shift is 50 ms. An average fundamental frequency for each frame is obtained (f0 ′). Similarly, the average power (p ′) for each frame is obtained for the power. Further, the difference between f0 ′ of the current frame and f0 ′ before and after ± i frames is taken as ± Δf0′i (Δ component). Similarly for power, the difference ± Δp′i (Δ component) between p ′ of the current frame and p ′ before and after ± i frames is obtained. f0 ', ± Δf0'i, p', ± Δp'i are normalized. In this standard, for example, f0 ′ and ± Δf0′i are respectively divided by the average fundamental frequency of the entire speech waveform and standardized. Similarly, p ′ and ± Δp′i are divided by the average power of the entire speech waveform that is the target of speech state determination and normalized. In normalization, it may be divided by the average power for each small paragraph and audio paragraph described later. The value of i is, for example, i = 4. Count the number of dynamic measure peaks before and after the current frame ± T1 ms, that is, the number of local maximum changes in the dynamic feature (dp). The difference (−Δdp) between this and the dp of the frame including the time T2 ms before the start time of the current frame is obtained. A difference component (+ Δdp) between the number of dp of ± T1 ms and the dp of the frame including the time after T3 ms of the end time of the current frame is obtained. The values of T1, T2, and T3 are, for example, T1 = T2 = T3 = 450 ms. Let the silence interval before and after the frame be ± ps. In step 1, the values of these parameters are extracted for each frame as speech feature values. These parameters include at least the fundamental frequency (or pitch period), power, time-varying characteristics of dynamic features, or the difference between these frames (Δ component).
[0013]
Next, a method for creating the probability codebook 13 will be briefly described.
The subject listens to a large number of learning voices, and the speech state is labeled as calm and as emphasized.
For example, as a reason for the subject to be in an emphasized state,
(A) The voice is loud and utters so that the nouns and conjunctions are extended
(B) Extend the beginning of the conversation, insist on a topic change, and make a loud voice to gather opinions
(C) When emphasizing important nouns with a loud voice
(D) High pitched but not very loud
(E) When you are laughing and deceiving your true intentions
(F) When the ending of the ending sound is high, asking for consent or asking questions
(G) When the ending voice is loud enough to be strong
(H) The voice is loud and loud, claims to speak and speak louder than the other party
(I) When the voice is small and the feeling is tingling or tingling, the truth or secret that is spoken with a loud voice is spoken, and what is usually important for a loud voice is spoken
Mentioned. In this example, the calm state is not any of the above (a) to (i), and the subject felt that the utterance was calm.
[0014]
For each label section in the calm state and the emphasized state, the speech feature is extracted from the learning speech, a parameter is selected, and a code book is created by the LBG algorithm using the parameters in the label section in the calm state and the emphasized state. To do. A set of the appearance probability in the emphasized state and the appearance probability in the calm state is stored in the codebook 13 for each quantized speech feature (code) obtained in this way. The appearance probability in the emphasized state is only the probability that the voice feature amount appears in the emphasized state regardless of the voice feature amount in the past frame (unigram: described as a single appearance probability), or this and the past frame. Is a combination of a conditional probability that the speech feature amount appears in the emphasized state for each speech feature amount sequence in units of frames from the speech feature amount of the current frame to the speech feature amount of the current frame, and appears in a calm state Similarly, the probability is that only the probability that the speech feature amount appears in a calm state regardless of the speech feature amount in the past frame (unigram: expressed as a single appearance probability), or this, and the speech feature amount in the past frame. For each frame feature amount sequence from the current frame speech feature amount to the current frame speech feature amount, any one of the conditional probabilities and combinations in which the speech feature amount appears in a calm state.
For example, as shown in FIG. 2, in the probability codebook 13, for each code C1, C2,..., The voice feature amount and the single appearance probability are in the emphasized state and the calm state, and the conditional probability is the emphasized state and the calm state. Each is stored as a set.
[0015]
Emphasis judgment
Next, a specific example of the enhancement degree determination unit 14 will be described with reference to FIG. 1B. In this example, the emphasis state or the calm state is determined from the appearance probability for each small paragraph of the speech signal, and the enhancement degree of the speech paragraph is determined based on this.
In order to detect a small paragraph, the voice / silence determination unit 21 determines whether the input voice signal is voiced or unvoiced. If the power of each frame of the input speech signal is less than or equal to a predetermined value, it is determined as a silent interval.
If the time of the silent section sandwiching the voiced section by the small paragraph determination unit 22 is t seconds, for example, 400 ms or more, the sandwiched voiced section is determined to be a small paragraph. Assume that the tail small paragraph detection unit 23 determines that the input audio signal is a small paragraph j−1, j, j + 1 as shown in FIG. 3, for example. The voice sub-paragraph j is composed of n voiced sections, the average power is Pj, and the average of the average power pi of the voice sub-paragraph j, that is, i = n−αth to n-th voiced section is the voice. When it is smaller than the average power Pj of the small paragraph j, that is,
Σpi / (α + 1) <βPj, Σ is the sum of i = n−α to n, and α and β are constants
When this condition is satisfied, this audio sub-paragraph j is detected as the last audio sub-paragraph of the audio paragraph k, and the sub-paragraph group from the end audio sub-paragraph to the last audio sub-paragraph immediately before is detected as the audio paragraph k and the audio paragraph determination unit. 24 determines. For example, α is 3 and β is 0.8. In this way, a group of audio sub-paragraphs between adjacent end audio sub-paragraphs is determined as an audio paragraph with the end audio sub-paragraph as a delimiter.
[0016]
Quantized speech feature values (codes) close to the speech feature values for each frame of each small paragraph are searched from the probability codebook 13, and the appearance probability P (n) in the emphasized state that forms a pair with each code. The appearance probability P (e) in the calm state is taken out by the emphasized state probability calculation unit 25 and the calm state probability calculation unit 26, respectively.
The emphasized state probability calculating unit 25 and the calm state probability calculating unit 26 use the probabilities extracted for each frame, respectively, to calculate the probability that the small paragraph will be in the emphasized state and the probability that the sub-paragraph will be in the calm state. For example, it is assumed that the audio sub-paragraph s is the number of frames Ns, and the code corresponding to the audio feature amount for each frame is Ci1, Ci2,. The probability Ps (e) that the voice sub-paragraph s is in an emphasized state and the probability Ps (n) that it is in a calm state are calculated by the following equations, respectively.
[0017]
Ps (e) = Pemp (Ci3 | Ci1Ci2) ... Pemp (CiNs | Ci (Ns-1) Ci (Ns-2))
Ps (n) = Pnrm (Ci3 | Ci1Ci2) ... Pnrm (CiNs | Ci (Ns-1) Ci (Ns-2))
Here, for example, Pemp (Ci3 | Ci1Ci2) represents the probability that Ci3 appears in an emphasized state next to Ci1Ci2. In order to obtain this Pemp (Ci3 | Ci1Ci2), the individual appearance probability of the emphasized state of Ci3 from the codebook 13, the conditional probability that Ci3 appears in the emphasized state next to Ci2, and Ci3 is next to Ci1, Ci2. It is preferable that the conditional probabilities appearing in the emphasized state are obtained and obtained by linear interpolation. Similarly, the appearance probability in the other emphasized state and the appearance probability in the calm state are preferably obtained by the linear interpolation method. The linear interpolation method is described in, for example, “Spoken Language Processing” (Kenji Kenji et al., Morikita Publishing, p. 29, 1996).
[0018]
In this way, the appearance probability Ps (e) in the emphasized state of the small paragraph s and the appearance probability Ps (n) in the calm state are calculated by the probability calculating units 25 and 26, respectively, and based on these calculation results, the emphasized state determining unit 27, if Ps (e)> Ps (n), the small paragraph s is in an emphasized state, and if Ps (e) <Ps (n), the small paragraph s is determined to be in a calm state. Similarly, the emphasis state determination unit 27 determines whether each small paragraph is in an emphasis state or a calm state.
The enhancement level determination unit 28 determines the enhancement level of the speech paragraph determined by the speech paragraph determination unit 24 based on the result determined by the enhancement state determination unit 27 for all the small paragraphs constituting the speech paragraph. To do. For example, as shown in FIG. 4A, when the degree of emphasis is divided into three levels of 0, 1, and 2, when there is no sub-paragraph determined to be in the emphasized state in the audio paragraph, the degree of emphasis is set to 0, and the emphasis state is determined. The degree of emphasis is 1 when the number of sub-paragraphs is 1 or 2, and the degree of emphasis is 2 when the number is 3 or more.
[0019]
Alternatively, as shown in FIG. 4B, when the determination in the enhancement state determination unit 27 is performed, the appearance probability in the enhancement state is multiplied by the weight 0 ≦ w <ε. When the number of paragraphs is 1 or more, the degree of enhancement is 2, and the number of sub-paragraphs determined to be in the emphasized state in the audio paragraph is 1 when judged in a state multiplied by ε ≦ w <1 / ε Is 1 when the degree of enhancement is 1 and when the number of sub-paragraphs determined to be in the emphasized state in the speech paragraph is 0 or 1 when determined in a state multiplied by 1 / ε ≦ w. ε is a positive real number less than 1, preferably about 0.5 to 0.8.
[0020]
The multiplication of the weight w with respect to the appearance probability in the emphasized state is performed by the multiplier 27a as indicated by a broken line in FIG. 1B, and the value of the weight w is controlled by the emphasis degree determining unit 28.
The enhancement level determined by the enhancement level determination unit 28 is input to the bit rate control signal generation unit 29. The bit rate control signal generation unit 29 is a low bit when the enhancement level is 0 in the example of FIGS. 4A and 4B. A rate control signal, a medium bit rate control signal when the enhancement level is 1, and a high bit rate control signal when the enhancement level is 2, are generated and supplied to the encoding unit 15 in FIG. 1A.
[0021]
In response to the input bit rate control signal, the encoding unit 15 compresses and encodes the input speech signal with the low rate unit 15L at a predetermined low bit rate in the case of the low bit rate control signal, and the medium bit rate control signal In the case, the input audio signal is encoded by the medium rate unit 15M at a predetermined medium bit rate higher than the low bit rate, and in the case of a high bit rate control signal, it is higher than the medium bit rate. The input speech signal is encoded by the high rate unit 15H at the bit rate.
In this example, three encoding units 15L, 15M, and 15H are shown as the encoding unit 15 according to each bit rate. However, this is for convenience, and these may be configured by independent encoding units. However, as in the known example shown in the section of the prior art, coding may be performed by switching, adding, or removing part of the functional configuration according to the bit rate. And can be configured in the same manner as a conventional variable bit rate encoder. Further, the input speech signal is delayed by the buffer storage unit 17 according to the time delay until the bit rate control signal is generated, and supplied to the encoding unit 15.
[0022]
For example, as illustrated in FIG. 4C, the encoding unit 15 adds and outputs a code 32 indicating a coding bit rate of the compression code string 31 prior to the compression code string 31 for the audio signal. This bit rate code 32 is added at least every time the coding bit rate is changed.
According to this encoder, an important part of the semantic content in the audio signal generally contains a small paragraph in an emphasized state, so it is encoded at a high bit rate, and the decoded audio signal is more clearly and possibly Speech that is full of the speaker's emotion can also be restored, and the speech paragraph in a calm state has a low bit rate, so that the coding efficiency can be increased.
[0023]
FIG. 5 shows an example of a decoder that decodes the encoded output from the encoder shown in FIG. 1A. The code sequence from the input terminal 41 is separated into the encoded sequence 31 and the bit rate code 32 by the separation unit 42, and the encoded sequence 31 is supplied to the decoding unit 43. The bit rate code 32 is decoded by the bit rate code decoding unit 44, and in the decoding unit 43, the high bit rate decoding unit 43H, the medium bit rate decoding unit 43M, and the low bit rate decoding unit 43L according to the decoded bit rate control signal. The encoded sequence 31 is decoded by either of the above. This decoding unit 43 corresponds to the encoding unit 15 in FIG. 1A, and each bit rate decoding unit 43H, 43M, 43L is configured independently, or is shared and a part of functional configuration is switched. Added, removed, added. It can be configured in the same way as a conventional variable bit rate decoder. That is, according to the present invention, the higher the enhancement degree, the higher the bit rate, that is, the higher the accuracy of encoding, but the higher the quality of the decoded acoustic signal, the higher the enhancement degree.
[0024]
An example of the procedure of the encoding method of the present invention corresponding to the encoder shown in FIG. 1 will be described with reference to FIG.
When an acoustic signal, in this example, an audio signal is input, it is temporarily stored in the storage unit (S1), and the audio feature amount of the audio signal is extracted for each frame (S2). Also, it is determined whether it is a silent segment or a voiced segment (S3), and a voiced segment sandwiched between silent segments of a predetermined length or longer is extracted as a speech sub-paragraph (S4), and the last speech sub-paragraph in the speech sub-paragraph is detected. Then, an audio paragraph consisting of audio subgroups between adjacent end audio subparagraphs is extracted (S5).
[0025]
The speech feature value for each frame extracted in step S2 is quantized with reference to the probability codebook 13, that is, associated with the code in the probability codebook 13 having the closest distance (S6), the emphasis state of each code, and The independent appearance probabilities in the calm state or these and the conditional appearance probabilities are extracted from the codebook 13, and the appearance state probabilities Ps (e) and Ps (n) in the sound state and the calm state are obtained for each audio sub-paragraph. (S7). Furthermore, it is determined from these appearance probabilities whether each sub-paragraph is in an emphasized state or in a calm state, and for each audio paragraph, depending on the number of sub-paragraphs in the emphasized state or whether the sub-paragraph is in an emphasized state or in a calm state. The emphasis degree of the speech paragraph is determined by giving a weight to the appearance probability in the emphasized state when determining whether or not there is a combination thereof (S8).
[0026]
A bit rate is determined in accordance with the determined degree of enhancement (S9), the audio signal of the corresponding speech paragraph is encoded at the determined bit rate (S10), and the encoded sequence and the code indicating the bit rate of the encoding are encoded. Are output (S11).
Modified example
In the encoding in the encoding unit 15, depending on the encoding method, when the same parameter as the speech feature amount extracted by the feature amount extraction unit 12 is used, the one extracted by the feature amount extraction unit 12 can be used. . Conversely, some or all of the speech feature amount parameters extracted for encoding by the encoding unit 15 may be used as the extraction parameters of the feature amount extracting unit 12. That is, for example, as shown in FIG. 7, the feature quantity is extracted from the input acoustic signal for each frame by the feature quantity extraction unit 12, and the enhancement degree judgment unit 14 uses at least some parameters of the extracted feature quantity. The degree of enhancement for each acoustic paragraph of the acoustic signal is determined, and a bit rate control signal corresponding to this is supplied to the encoding unit 15. The encoding unit 15 encodes the portion of the acoustic paragraph of the input acoustic signal so that the bit rate corresponds to the bit rate control signal input for at least some of the parameters extracted by the feature amount extraction unit 12. The
[0027]
The present invention not only encodes audio signals, but also encodes music paragraphs with high bit rates as well as important music paragraphs, and encodes unimportant portions with low bit rates to increase encoding efficiency. You can also. The reason why the subject felt the emphasis state when creating the probability codebook 13 in the case of music is shown below.
(A) Loud voice and loud voice
(B) Strong voice
(C) High voice and strong accent
(D) Voice is high and voice quality changes
(E) Extend voice and loud
(F) Loud voice, loud voice, strong accent
(G) Loud voice, loud voice, screaming
(H) Voice is loud and accent changes
(I) Extend voice, loud voice, high ending
(J) The voice is high and the voice is extended
(K) Extend voice, scream, high voice
(L) Strong ending
(M) Slowly strengthen
(N) The tone is irregular
(O) The tone is irregular and the voice is high
Next, an example in which the present invention is applied to TwinVQ will be described with reference to FIG. TwinVQ is disclosed in, for example, the magazine Nikkei Electronics 1997, 4.21, (No. 687), pages 181 to 202, and others. The acoustic signal from the input terminal 11 analyzes the signal in the frame prior to the orthogonal transformation, and the transformation length, that is, the number of divisions of the frame is determined by the division selection unit 51. For example, when the frame length is 2048 samples, it is determined whether 2048-point conversion is applied to the frame data once, 512-point conversion is applied four times, or 128-point conversion is applied 16 times. . In accordance with this determination, the acoustic signal is converted into a frequency domain signal (coefficient) by a modified discrete cosine transform for each conversion length determined by the window division MDCT unit 52.
[0028]
Aside from this conversion, the spectral envelope of the acoustic signal is estimated by linear prediction analysis in the LPC analysis unit 53, and the frequency domain coefficients are normalized by the division unit 54 based on the spectral envelope. A periodic peak component (pitch component) is extracted from the low frequency part of the flattened frequency domain coefficient by the pitch analysis unit 55, and this component is subtracted from the output of the division unit 54 by the subtraction unit 56. An envelope obtained by averaging the output of the calculation unit 54 by a non-linear division proportional to the Bark scale by the Bark scale spectrum analysis unit 57 is obtained, and the output of the subtraction unit is divided by the division unit 58 based on this envelope. As a result, fine spectral peaks in the flattened frequency domain coefficients from the division unit 54 are flattened by the pitch component, and further flattened by the division unit 58.
[0029]
The average power of the entire frequency domain coefficient thus flattened is calculated by the power analysis unit 59, and the output of the division unit 58 is normalized by the normal unit 61 with this average power. All of the normalized frequency domain coefficients are interleaved by the interleave dividing unit 62 and divided into sub-vectors, and the weighted vector of the conjugate structure representing the divided normalized frequency domain coefficients as the sum from the two codebooks. Vector quantization is performed by the quantization unit 63. At this time, based on the spectral envelope characteristic from the LPC analysis unit 53, the pitch component from the pitch component analysis unit 55, the averaged spectral envelope from the Bark scale spectrum analysis unit 57, and the power for each subframe from the power analysis unit 59. An auditory psychological model is created by the model generation unit 54, the vector selected by the vector quantization unit 63 is weighted by the auditory psychological model, and the vector quantum is measured with a distance measure that minimizes the distortion of the decoded signal acoustically. To do.
[0030]
In this embodiment of the present invention, the analysis results of the LPC analysis unit 53, the pitch component analysis unit 55, and the power analysis unit 59 are input to the emphasis degree determination unit 14, and at least a part of the input analysis results are used to describe the analysis results. And, based on the input analysis result, extract a small paragraph and an acoustic paragraph, and further refer to the probability codebook 13 to show the appearance probability in the emphasized state and the appearance probability in the calm state for each small paragraph. Is calculated to determine the emphasis state, the calm state, and the enhancement degree of the acoustic paragraph. As described above, the linear prediction coefficient from the LPC analysis unit 53 is obtained from the LPC cepstrum, and the time change characteristic thereof is obtained. The dynamic feature quantity is calculated by the dynamic feature quantity calculation unit 78. This is input to the enhancement degree determination unit 76.
[0031]
In this example, the degree of emphasis is two steps, 0 and 1, and accordingly, the LPC codebooks 65L and 65H are the pitch codebooks 66L and 66H, the spectrum codebooks 67L and 67H are the power codebooks. 68L and 68H are provided, and conjugate structure codebooks 69L and 69H are provided, respectively. The code books 66L, 67L, 68L, and 69L each have a smaller number of code vector elements than the corresponding code books 66H, 67H, 68H, and 69H. The converted code (index) has a smaller number of bits than that of the latter.
[0032]
When the enhancement level determination unit 14 determines that the enhancement level is 0, the codebook switching units 71 to 75 are controlled by the low bit rate control signal from the enhancement level determination unit 14, and the LPC parameter representing the analysis spectrum envelope in the LPC analysis unit 55. Is encoded using the LPC codebook 65L, the pitch component analysis unit 55 encodes the extracted pitch component using the pitch codebook 66L, and the bark scale spectrum analysis unit 57 uses the analysis average envelope for the spectrum. It is encoded using the codebook 67L, the detected average power is encoded using the power codebook 68L in the power analysis unit 59, and is flattened using the conjugate structure codebook 69L in the weighted vector quantization unit 63. Frequency domain coefficients are encoded. These encoded outputs are integrated and output by the synthesis unit 76.
[0033]
When the enhancement level determination unit 14 determines that the enhancement level is 1, the codebook switching units 71 to 75 are switched to the codebooks 65H to 69H side by the bit rate control signal from the enhancement level determination unit 14, respectively. , 55, 57, and 59 respectively use codebooks 65H, 66H, 67H, and 68H, the vector quantization unit 63 performs encoding using the codebook 69H, and these encoded outputs are integrated by the synthesis unit 76. Is output. The synthesizing unit 76 outputs a bit rate code indicating that at the beginning of the integrated code at least every time the bit rate control signal changes. In this example, since the emphasis degree is either 0 or 1, the bit rate code may be 1 bit, and since the emphasis change is either 0 to 1, or 1 to 0, the bit rate control signal changes. The code indicating this may be 1 bit.
[0034]
In the corresponding decoder, in the conventional TwinVQ decoder, the codebooks 65L and 65H, 66L and 66H, 67L and 67H, 68L and 68H, 69L and 69H are provided as the codebook, and the input bits What is necessary is just to decode using either set of these code books according to a rate code.
In this embodiment, as is clear from the above description, the LPC analysis unit 53, the pitch component analysis unit 55, and the power analysis unit 69 extract feature amounts used for enhancement degree determination and encode the input acoustic signal. The feature amount extraction unit 79 is also configured to extract a feature amount for the purpose.
[0035]
In this embodiment, a plurality of codebook groups corresponding to a plurality of predetermined coding bit rates, in this example, a codebook group consisting of codebooks 65L, 66L, 67L, 68L, and 69L, and a codebook 65H, Two codebook groups including a codebook group consisting of 66H, 67H, 68H, and 69H are prepared, and an input acoustic signal is generated using the one codebook group according to the bit rate determined by the enhancement degree determination unit 14. Is an example of encoding. The change of the bit rate by the selection of the code book group is not limited to the case where all the code books used for the encoder are selected according to the bit rate. For example, as shown by the broken line in FIG. May be encoded using one codebook 65 regardless of the change in the bit rate.
[0036]
Next, an example in which the encoding bit rate is changed according to the enhancement degree and the encoding method is changed will be described with reference to FIG. This embodiment is a case in which the encoding method in the “acoustic signal encoding method” disclosed in Japanese Patent Laid-Open No. 8-263830 is changed according to the encoding bit rate determined by the enhancement degree. Briefly stated.
A feature amount is extracted from the input acoustic signal for each frame by the feature extraction unit 79, the enhancement degree determination unit 14 determines the enhancement degree according to the extracted feature amount, and a bit rate control signal corresponding to this is output. The In this example, the degree of enhancement to be determined is either 0 or 1. When the degree of enhancement is determined to be 0, the switching unit 81 is controlled by the bit rate control signal, the input terminal is the first encoding unit, In the example, the input acoustic signal is connected to the CELP encoding unit 82, and the input acoustic signal is input to the inverse filter 83 in the encoding unit 82. The frequency dependence of the spectral envelope of the input acoustic signal is suppressed, and a residual signal is obtained. The residual signal is input to the codebook selection unit 85, and distortion from the adaptive codebook vector, which is a residual signal in the past for each pitch period, is calculated from the adaptive codebook 86, and the adaptive codebook vector that minimizes the distortion is calculated. A pitch period corresponding to is selected. This pitch period, ie the code C corresponding to the selected adaptive codebook vectorbBecomes the encoded output. The decoded residual signal obtained in the previous encoded frame may be generated as an adaptive encoded vector by periodicizing based on various pitch periods. The selected vector is subtracted from the residual signal from the inverse filter 83 by the difference circuit 87, and the residual signal is vector quantized by the time domain vector quantization unit 88 with reference to the fixed codebook 89. The dequantized output and the vector selected by the codebook selection unit 85 are added by the adder circuit 91 to decode (synthesize) the residual signal, which is supplied to the adaptive codebook 86. The filter coefficient of the inverse filter 83 uses the analysis result of the LPC analysis unit 84 in the feature quantity extraction unit 79, and the cutout according to the pitch period length from the adaptive codebook 86 is extracted in the feature quantity extraction unit 79. By performing based on the frequency, the acoustic feature parameter can be commonly used for encoding and enhancement degree determination. The analysis result of the LPC analysis unit 84 is encoded by the LPC quantization unit 92, and the code Ca, C indicating a selection vector from the adaptive codebook selection unit 85b, A code C indicating a selection vector from the quantization unit 88cThe combining unit 76 outputs a code indicating that the CELP encoding unit 82 has been selected (a code indicating the bit rate).
[0037]
On the other hand, when the enhancement degree is determined to be 1, the switching unit 81 is controlled by the bit rate control signal so that the input terminal is connected to the second encoding unit, in this example the TwinVQ encoding unit 94. The input acoustic signal is converted into a frequency domain signal by the MDCT unit 52 in the encoding unit 94, and the frequency domain coefficients are divided, and thereby normalized, by the quantized frequency domain spectral envelope in the division circuit 54. The resulting frequency domain residual coefficients are divided, and thereby normalized, by the inter-frame prediction spectrum from the inter-frame prediction unit 95 in the division circuit 58, further flattened (yielding the frequency domain fine structure), and vector-quantized by the frequency domain quantization unit 96, which outputs the dequantized fine structure and the fine structure quantization code Ce. The dequantized fine structure is multiplied by the inter-frame prediction spectrum from the inter-frame prediction unit 95 in the multiplier 97 to decode the residual coefficients, which are input to the inter-frame prediction unit 95; this unit outputs the inter-frame prediction spectrum and the quantization code Cf of the inter-frame prediction coefficients. These codes Ce and Cf, the LPC quantization code Ca from the LPC quantization unit 92, and a code indicating that the TwinVQ encoding unit 94 has been selected (a code indicating the bit rate) are combined and output by the combining unit 76.
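The two-stage normalization in this TwinVQ branch (division by the quantized spectral envelope in circuit 54, then by the inter-frame prediction spectrum in circuit 58) can be sketched as below; the inputs and the epsilon guard are assumptions, and a real coder would also interleave and vector-quantize the flattened coefficients:

```python
import numpy as np

def flatten_spectrum(mdct_coeffs, envelope, predicted, eps=1e-12):
    """Divide MDCT coefficients by the quantized spectral envelope
    (circuit 54), then divide the residual coefficients by the
    inter-frame prediction spectrum (circuit 58), leaving the fine
    structure that unit 96 vector-quantizes."""
    residual = mdct_coeffs / np.maximum(envelope, eps)
    return residual / np.maximum(np.abs(predicted), eps)

def rebuild_residual(fine_structure_hat, predicted):
    """Decoder-side counterpart (multiplier 97): multiply the
    dequantized fine structure back into residual coefficients,
    which feed the inter-frame prediction unit 95."""
    return fine_structure_hat * np.abs(predicted)
```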
[0038]
Also in this case, the LPC coefficients extracted by the feature extraction unit 79 are used both for enhancement degree determination and for encoding. In addition, although not shown in FIG. 9, the power extracted by the feature extraction unit 79 can also be used for the normalization performed in the frequency domain quantization unit 96.
The number of enhancement levels to be determined may also be increased. For example, enhancement degree 0 may be divided into two stages, and a fixed codebook 89' of a size different from that of the fixed codebook 89 (that is, accommodating a different number of fixed vectors) may be provided, so that encoding uses one of the fixed codebooks 89 and 89' in one of the two stages and the other codebook in the other stage. Similarly, a plurality of codebooks accommodating different numbers of vectors may be prepared for the frequency domain quantization unit 96, enhancement degree 1 may be divided into a plurality of stages, and encoding by the frequency domain quantization unit 96 may use a codebook accommodating more vectors the higher the enhancement degree.
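How stages might map to fixed codebooks of different sizes is sketched below; the stage count, the codebook sizes, and the bits-per-code accounting are all invented for illustration:

```python
import numpy as np

# Hypothetical fixed codebooks: more accommodated vectors for higher
# stages, in the spirit of codebooks 89 and 89' (sizes illustrative).
FIXED_CODEBOOKS = {
    0: np.random.rand(128, 40),   # enhancement degree 0, lower stage
    1: np.random.rand(256, 40),   # enhancement degree 0, upper stage
    2: np.random.rand(512, 40),   # enhancement degree 1, lower stage
    3: np.random.rand(1024, 40),  # enhancement degree 1, upper stage
}

def pick_fixed_codebook(stage):
    """A larger codebook costs more bits per code (log2 of its size),
    so higher stages raise the bit rate."""
    cb = FIXED_CODEBOOKS[stage]
    return cb, int(np.log2(len(cb)))
```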
[0039]
In the examples described above, the degree of enhancement is determined by multiplying the appearance probability in the emphasized state by the weight w; instead, the determination may be made by multiplying the appearance probability in the calm state by the weight w. In that case, for example, as shown by the broken-line frame in FIG. 1A, the weight w applied by the multiplication unit 27b is controlled, and the degree of enhancement is determined by whether sub-paragraphs judged to be in the emphasized state remain even when the weight w is made large. As another technique for determining the degree of enhancement, for example as indicated by the broken line in FIG. 1, the ratio between the emphasized-state appearance probability P(n) obtained by the emphasized state probability calculation unit 25 and the calm-state appearance probability P(e) obtained by the calm state probability calculation unit 26 may be computed by the division unit 30 as P(e)/P(n) or P(n)/P(e), and the degree of enhancement determined according to the magnitude of this ratio; for example, when P(e)/P(n) is used, the degree of enhancement is made higher the smaller the ratio. In the illustrated example, the degree of enhancement is determined for each sub-paragraph and the corresponding bit rate is determined. Furthermore, only the emphasized-state appearance probability P(n) may be used: if this value is equal to or greater than a preset value, the sub-paragraph may be judged to be in the emphasized state.
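The three decision variants in this paragraph (the weighted comparison, the probability ratio, and the threshold on the emphasized-state probability alone) condense into one small sketch. The labels follow the text, with P(n) the emphasized-state probability from unit 25 and P(e) the calm-state probability from unit 26; the default weight and threshold values are assumptions:

```python
def decide_emphasis(p_n, p_e, w=1.0, mode="ratio", threshold=0.5):
    """Judge whether a sub-paragraph is emphasized. p_n is the
    appearance probability in the emphasized state, p_e in the
    calm state.

    mode "weight":    emphasized when p_n exceeds the weighted p_e
    mode "ratio":     emphasized when p_e / p_n is small enough
    mode "threshold": emphasized when p_n alone is large enough
    """
    if mode == "weight":
        return p_n > w * p_e
    if mode == "ratio":
        return p_n > 0 and (p_e / p_n) < threshold
    return p_n >= threshold
```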
[0040]
The degree of enhancement may be determined for any unit consisting of one or more frames, that is, for each acoustic paragraph, each sub-paragraph, each voiced section (voiced paragraph), or each frame, or for every fixed interval such as 10 or 5 seconds.
When encoding a video signal closely tied to an audio signal, as in television, the audio signal may be encoded at the bit rate corresponding to the degree of enhancement while the video signal is encoded with its bit rate varied in the same way.
In the above description, the coding bit rate may be controlled more finely by dividing the degree of enhancement into four or more levels, as sketched below. Furthermore, the sub-paragraphs and the acoustic paragraphs may each have a fixed length.
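Finer control is then just a larger table from enhancement level to bit rate; a sketch with invented rates:

```python
# Hypothetical mapping from enhancement degree to coding bit rate
# in kbit/s; finer control means more rows in this table.
BIT_RATE_TABLE = {0: 4.0, 1: 6.0, 2: 8.0, 3: 12.0}

def bit_rate_for(degree):
    """Clamp to the highest defined level so out-of-range degrees
    still map to a valid rate."""
    return BIT_RATE_TABLE[min(degree, max(BIT_RATE_TABLE))]
```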
[0041]
The encoder of the present invention, for example the one shown in FIG. 1, may be realized by causing a computer to execute a program. In that case, for example, an encoding program for executing each procedure of the encoding method shown in FIG. 6 on a computer is installed in the program memory of the computer from a recording medium such as a CD-ROM or a flexible magnetic disk, or downloaded through a communication line, and executed by the computer.
[0042]
【The invention's effect】
As described above, according to the present invention, efficient coding can be performed by raising the bit rate for the portions of the audio signal whose content is important or is to be emphasized, and lowering the bit rate for the other portions.
[Brief description of the drawings]
FIG. 1A is a diagram showing the functional configuration of an example of an encoder according to the present invention, and FIG. 1B is a diagram showing a specific functional configuration example of the enhancement degree determination unit 14;
FIG. 2 is a diagram showing a storage example of a probability codebook 13 in FIG. 1;
FIG. 3 is a diagram for explaining an audio sub-paragraph, a paragraph-ending audio sub-paragraph, and an audio paragraph;
FIGS. 4A and 4B are diagrams illustrating an example of an enhancement degree determination table, and FIG. 4C is a diagram illustrating an example of an output code format;
FIG. 5 is a diagram illustrating a functional configuration example of a decoder corresponding to the encoder of FIG. 1A.
FIG. 6 is a flowchart showing an example of a processing procedure of the encoding method of the present invention.
FIG. 7 is a diagram showing a functional configuration example of another embodiment of the present invention.
FIG. 8 is a diagram showing a functional configuration example of an encoder in which the present invention is applied to TwinVQ.
FIG. 9 is a diagram showing a functional configuration example of still another embodiment of the present invention.

Claims (9)

1. An acoustic signal encoding method comprising:
using a codebook that stores a speech feature vector, composed of a set of features including at least one of the following six: the fundamental frequency, the power, the temporal variation characteristic of a dynamic feature, the inter-frame difference of the fundamental frequency, the inter-frame difference of the power, and the inter-frame difference of the temporal variation characteristic of the dynamic feature, in association with the appearance probability of that speech feature vector in an emphasized state and its appearance probability in a calm state;
analyzing an acoustic signal for each frame to obtain the speech features;
determining for each frame whether the acoustic signal is in a silent section and whether it is in a voiced section;
determining as a speech sub-paragraph a portion that includes a voiced section and is bounded by silent sections each of at least a predetermined number of frames, and determining as a speech paragraph a group of speech sub-paragraphs ending with a speech sub-paragraph in which the average power of the included voiced sections is smaller than a predetermined constant multiple of the average power within that sub-paragraph;
quantizing the set of speech features of each frame of each speech sub-paragraph to obtain a code, and obtaining from the codebook the emphasized-state appearance probability and the calm-state appearance probability of the speech feature vector corresponding to that code;
calculating the probability that a speech sub-paragraph is in the emphasized state, using the emphasized-state appearance probabilities of the speech feature vectors of its frames;
calculating the probability that the speech sub-paragraph is in the calm state, using the calm-state appearance probabilities of the speech feature vectors of its frames;
judging as emphasized a speech sub-paragraph whose probability of being in the emphasized state is higher than its probability of being in the calm state;
compression-encoding each speech paragraph at a bit rate that is higher the larger the number of its speech sub-paragraphs judged to be emphasized; and
outputting the code obtained by the encoding and a code indicating the bit rate or a change in the bit rate.
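For readers tracing claim 1's segmentation steps, the sketch below implements the silence-bounded sub-paragraph rule and the paragraph-ending power test. Per-frame silence flags and power arrays are assumed to be given, and the thresholds are illustrative only:

```python
import numpy as np

def find_sub_paragraphs(is_silent, min_silence_frames=30):
    """Split frames into speech sub-paragraphs: runs of frames bounded
    by silent runs of at least min_silence_frames frames."""
    subs, start, silent_run = [], None, 0
    for i, silent in enumerate(is_silent):
        if silent:
            silent_run += 1
            if start is not None and silent_run >= min_silence_frames:
                subs.append((start, i - silent_run + 1))  # end is exclusive
                start = None
        else:
            if start is None:
                start = i
            silent_run = 0
    if start is not None:
        subs.append((start, len(is_silent)))
    return subs

def ends_speech_paragraph(voiced_power, sub_power, c=0.9):
    """Claim 1's ending test: the sub-paragraph closes a speech
    paragraph when the average power of its voiced sections falls
    below a constant multiple of its overall average power."""
    return float(np.mean(voiced_power)) < c * float(np.mean(sub_power))
```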
2. The acoustic signal encoding method according to claim 1, wherein the codebook includes at least the temporal variation characteristic of the dynamic feature or the inter-frame difference of the temporal variation characteristic of the dynamic feature,
the process of analyzing the acoustic signal includes calculating linear prediction coefficients from the acoustic signal and obtaining the dynamic feature from the linear prediction coefficients, and
the linear prediction coefficients are used to determine the code in the compression encoding process.
3. The acoustic signal encoding method according to claim 1, wherein a plurality of codebook groups, each storing a number of speech-feature/code pairs corresponding to each bit rate, are prepared as the codebooks used in the compression encoding process,
in the compression encoding process, one codebook group corresponding to the determined bit rate is selected and used to determine the code corresponding to each extracted speech feature as the compression-encoded code, and
a code indicating the selected codebook group is used as the code indicating the bit rate or the change in the bit rate.
4. The acoustic signal encoding method according to claim 1, wherein in the compression encoding process, an encoding method corresponding to the determined bit rate is selected and the acoustic signal is encoded using the selected encoding method to obtain the compression-encoded code, and
a code indicating the selected encoding method is used as the code indicating the bit rate or the change in the bit rate.
5. An acoustic signal encoder comprising:
a codebook that stores a speech feature vector, composed of a set of features including at least one of the following six: the fundamental frequency, the power, the temporal variation characteristic of a dynamic feature, the inter-frame difference of the fundamental frequency, the inter-frame difference of the power, and the inter-frame difference of the temporal variation characteristic of the dynamic feature, in association with the appearance probability of that speech feature vector in an emphasized state and its appearance probability in a calm state;
a feature extraction unit that analyzes an acoustic signal for each frame to obtain the speech features;
voiced/silence determination means that determines for each frame whether the acoustic signal is in a silent section and whether it is in a voiced section;
sub-paragraph determination means that determines as a speech sub-paragraph a portion that includes a voiced section and is bounded by silent sections each of at least a predetermined number of frames;
speech paragraph determination means that determines as a speech paragraph a group of speech sub-paragraphs ending with a speech sub-paragraph in which the average power of the included voiced sections is smaller than a predetermined constant multiple of the average power within that sub-paragraph;
probability calculation means that quantizes the set of speech features of each frame of each speech sub-paragraph to obtain a code, and obtains from the codebook the emphasized-state appearance probability and the calm-state appearance probability of the speech feature vector corresponding to that code;
an emphasized state determination unit that calculates the probability that a speech sub-paragraph is in the emphasized state, using the emphasized-state appearance probabilities of the speech feature vectors of the frames in that sub-paragraph, calculates the probability that the sub-paragraph is in the calm state, using the calm-state appearance probabilities of those vectors, and judges as emphasized a speech sub-paragraph whose probability of being in the emphasized state is higher than its probability of being in the calm state;
a bit rate control signal generation unit that generates, for each speech paragraph, a bit rate control signal specifying compression encoding at a bit rate that is higher the larger the number of its speech sub-paragraphs judged to be emphasized; and
an encoding unit that, in accordance with the bit rate control signal, encodes the acoustic signal at the predetermined bit rate corresponding to that signal and outputs a code indicating the bit rate or a change in the bit rate.
6. The acoustic signal encoder according to claim 5, wherein the codebook includes at least the temporal variation characteristic of the dynamic feature or the inter-frame difference of the temporal variation characteristic of the dynamic feature,
the feature extraction unit has means for calculating linear prediction coefficients from the acoustic signal and obtaining the dynamic feature from the linear prediction coefficients, and
the encoding unit uses the linear prediction coefficients to determine the code in the compression encoding process.
7. The acoustic signal encoder according to claim 5, wherein the encoding unit is provided with a plurality of codebook groups, each storing a number of speech-feature/code pairs corresponding to each bit rate, as the codebooks used in the compression encoding process, and
in the compression encoding process the encoding unit selects and uses one codebook group corresponding to the determined encoding bit rate, determines the code corresponding to each extracted speech feature as the compression-encoded code, and uses a code indicating the selected codebook group as the code indicating the bit rate or the change in the bit rate.
8. The acoustic signal encoder according to claim 5, wherein in the compression encoding process the encoding unit selects an encoding method corresponding to the determined bit rate, encodes the acoustic signal using the selected encoding method to obtain the compression-encoded code, and uses a code indicating the selected encoding method as the code indicating the bit rate or the change in the bit rate.
9. An encoding program for causing a computer to execute each procedure of the acoustic signal encoding method according to any one of claims 1 to 4.
JP2002124540A 2002-04-25 2002-04-25 Acoustic signal encoding method, encoder and program thereof Expired - Fee Related JP3803306B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2002124540A JP3803306B2 (en) 2002-04-25 2002-04-25 Acoustic signal encoding method, encoder and program thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2002124540A JP3803306B2 (en) 2002-04-25 2002-04-25 Acoustic signal encoding method, encoder and program thereof

Publications (2)

Publication Number Publication Date
JP2003316398A JP2003316398A (en) 2003-11-07
JP3803306B2 true JP3803306B2 (en) 2006-08-02

Family

ID=29539554

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2002124540A Expired - Fee Related JP3803306B2 (en) 2002-04-25 2002-04-25 Acoustic signal encoding method, encoder and program thereof

Country Status (1)

Country Link
JP (1) JP3803306B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4580190B2 (en) * 2004-05-31 2010-11-10 日本電信電話株式会社 Audio processing apparatus, audio processing method and program thereof
JP5086366B2 (en) * 2007-10-26 2012-11-28 パナソニック株式会社 Conference terminal device, relay device, and conference system
WO2010008173A2 (en) * 2008-07-14 2010-01-21 한국전자통신연구원 Apparatus for signal state decision of audio signal
KR101230183B1 (en) 2008-07-14 2013-02-15 광운대학교 산학협력단 Apparatus for signal state decision of audio signal

Also Published As

Publication number Publication date
JP2003316398A (en) 2003-11-07

Similar Documents

Publication Publication Date Title
JP3707116B2 (en) Speech decoding method and apparatus
KR100566713B1 (en) Speech parameter coding and decoding methods, coder and decoder, and programs, and speech coding and decoding methods, coder and decoder, and programs
US5749065A (en) Speech encoding method, speech decoding method and speech encoding/decoding method
JP3680380B2 (en) Speech coding method and apparatus
JPH1091194A (en) Method of voice decoding and device therefor
KR20020052191A (en) Variable bit-rate celp coding of speech with phonetic classification
JPH096397A (en) Voice signal reproducing method, reproducing device and transmission method
JP3803311B2 (en) Voice processing method, apparatus using the method, and program thereof
JP2006171751A (en) Speech coding apparatus and method therefor
CN101609681B (en) Coding method, coder, decoding method and decoder
CA2671068C (en) Multicodebook source-dependent coding and decoding
KR100480341B1 (en) Apparatus for coding wide-band low bit rate speech signal
JP3803306B2 (en) Acoustic signal encoding method, encoder and program thereof
JP4281131B2 (en) Signal encoding apparatus and method, and signal decoding apparatus and method
JP4256393B2 (en) Voice processing method and program thereof
JP3353852B2 (en) Audio encoding method
JP3237178B2 (en) Encoding method and decoding method
JP3353267B2 (en) Audio signal conversion encoding method and decoding method
JP3268750B2 (en) Speech synthesis method and system
Possemiers et al. Evaluating deep learned voice compression for use in video games
JP3916934B2 (en) Acoustic parameter encoding, decoding method, apparatus and program, acoustic signal encoding, decoding method, apparatus and program, acoustic signal transmitting apparatus, acoustic signal receiving apparatus
JP3348759B2 (en) Transform coding method and transform decoding method
JP4489371B2 (en) Method for optimizing synthesized speech, method for generating speech synthesis filter, speech optimization method, and speech optimization device
JP3006790B2 (en) Voice encoding / decoding method and apparatus
JP3024467B2 (en) Audio coding device

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20040227

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20050719

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20050816

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20051014

RD03 Notification of appointment of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7423

Effective date: 20051014

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A132

Effective date: 20051129

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20060127

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20060418

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20060502

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090512

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100512

Year of fee payment: 4


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110512

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120512

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130512

Year of fee payment: 7

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140512

Year of fee payment: 8

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

LAPS Cancellation because of no payment of annual fees