JP3803306B2 - Acoustic signal encoding method, encoder and program thereof

Acoustic signal encoding method, encoder and program thereof

Info

Publication number
JP3803306B2
Authority
JP (Japan)
Prior art keywords
bit rate, paragraph, probability, state, speech
Legal status
Expired - Fee Related
Application number
JP2002124540A
Other languages
Japanese (ja)
Other versions
JP2003316398A (en)
Inventor
Kota Hidaka
Shinya Nakajima
Osamu Mizuno
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2002124540A
Publication of JP2003316398A
Application granted
Publication of JP3803306B2
Anticipated expiration

Description

[0001]
[Technical Field of the Invention]
The present invention relates to compression coding of acoustic signals such as speech signals and music signals, and in particular to an encoding method that changes the compression ratio, that is, the encoding bit rate, according to the state of the acoustic signal, and to an encoder and a program therefor.
[0002]
[Prior Art]
Conventionally, for efficient compression coding of speech signals, Japanese Patent Application Laid-Open No. 7-225599, for example, discloses a CELP coding method that distinguishes the cases in which the input speech signal is non-speech, unvoiced sound, transient voiced sound, and steady voiced sound, and changes the bit rate adaptively according to these cases. This coding method is, in the end, an attempt to reduce the amount of information as far as possible while maintaining perceptual quality in every part of the speech signal.
[0003]
Also, an ADPCM (differential PCM) coding method in which a plurality of users share a line using low-bit-rate coding, with the bit rate switched adaptively according to the number of users and their quality requirements, is described, for example, in Takehiro Moriya, "Speech Coding," Institute of Electronics, Information and Communication Engineers, October 20, 1998, pp. 124-128.
Furthermore, Japanese Patent Application Laid-Open No. 8-263098 discloses detecting whether the input acoustic signal is a speech signal or a music signal and switching the coding method accordingly, CELP coding in the former case and TwinVQ coding in the latter, so that coding suited to each type of input signal is performed.
[0004]
[Problems to Be Solved by the Invention]
Conventionally, the encoding bit rate has been changed adaptively according to the required quality, or according to the local state of the signal within a range that does not affect quality.
However, if a high bit rate, and hence particularly high quality, is given to the important parts of an utterance, or to the passages of a piece of music that call for special emphasis, so that, for example, even subtle emotional changes can be heard, or the music is reproduced with especially high fidelity, then the coding becomes more efficient, or the same amount of information becomes more meaningful than before. It is an object of the present invention to provide an encoding method that makes such compression coding possible.
[0005]
[Means for Solving the Problems]
(1) According to the method of the present invention for compression-coding an acoustic signal, the acoustic signal is analyzed frame by frame to extract acoustic features containing as parameters at least the fundamental frequency, the power, and the temporal variation characteristic of a dynamic feature, or the inter-frame differences (Δ components) of these. By reference to a probability codebook storing at least two sets of data, each set containing an acoustic feature and its appearance probability in the emphasized state, the appearance probabilities of the extracted acoustic features in the emphasized state are obtained. Based on these appearance probabilities, the degree of emphasis of the acoustic signal is determined for each frame, or for each paragraph made up of frames, and a code obtained by compression-coding the acoustic signal at a bit rate that is higher the larger the determined degree of emphasis is output together with a code indicating the bit rate or a change in the bit rate.
[0006]
(2) Preferably, in the method of item (1), a linear prediction coefficient is calculated in the analysis of the acoustic signal, the dynamic feature is calculated from the linear prediction coefficient, and
the fundamental frequency, the power, and the linear prediction coefficient extracted as the acoustic features are used in determining the compression-coded code.
(3) Preferably, in the method of item (1), a plurality of codebook groups are prepared, each storing sets of acoustic features and codes in a number corresponding to one bit rate; the single codebook group of the bit rate corresponding to the degree of emphasis is selected and used to determine the code corresponding to each extracted acoustic feature, which serves as the compression-coded code; and a code indicating the selected codebook group is used as the code indicating the bit rate or the change in bit rate.
[0007]
(4) In the method of item (1), preferably an encoding method corresponding to the degree of emphasis is selected, the acoustic features are encoded with the selected encoding method to obtain the compression-coded code, and a code indicating the selected encoding method is used as the code indicating the bit rate or the change in bit rate.
(5) Preferably, in the methods of items (1) to (4), the degree of emphasis is determined for each acoustic paragraph of the acoustic signal from the appearance probabilities in the emphasized state within that acoustic paragraph.
(6) Preferably, in the method of item (5), the probability codebook also contains, for each acoustic feature, its appearance probability in the calm state; the appearance probabilities are obtained for each sub-paragraph contained in the acoustic paragraph; each sub-paragraph is judged to be in the emphasized state or the calm state; and the degree of emphasis is obtained on the basis of the state judgments of the sub-paragraphs within the acoustic paragraph.
[0008]
(7) In the method of item (6), the degree of emphasis is preferably obtained from the number of sub-paragraphs within the acoustic paragraph judged to be in the emphasized state.
(8) In the method of item (6), preferably the probability codebook contains, for each acoustic feature, its appearance probability in the calm state; a weight is applied to at least one of the appearance probability in the emphasized state and the appearance probability in the calm state; the weight is varied so as to keep the number of sub-paragraphs judged emphasized at or below a predetermined value; and the degree of emphasis of the acoustic paragraph is obtained from the magnitude of the weight at that point.
[0009]
(9) In the method of item (6), preferably the acoustic signal is judged frame by frame to be a silent section or a voiced section; a voiced section surrounded by silent sections of at least a predetermined number of frames is detected as a sub-paragraph; a sub-paragraph in which the average power of one or more voiced sections contained in it is smaller than a constant multiple of the average power within the sub-paragraph is taken as a final sub-paragraph; and the group of sub-paragraphs between adjacent final sub-paragraphs is detected as an acoustic paragraph.
(10) Preferably, in any of the methods of items (1) to (6), the probability codebook contains, for each acoustic feature, its appearance probability in the calm state, and the degree of emphasis is determined from the ratio between the appearance probability in the emphasized state and the appearance probability in the calm state.
[0010]
The encoder of the present invention comprises: an encoding unit capable of compression-coding an input acoustic signal at different bit rates according to a bit rate control signal; a probability codebook storing at least two sets of data, each set containing an acoustic feature that includes as parameters at least the fundamental frequency, the power, and the temporal variation characteristic of a dynamic feature or the inter-frame differences (Δ components) of these, together with its appearance probability in the emphasized state; a feature extraction unit that analyzes the input acoustic signal frame by frame and extracts the acoustic features; and an emphasis degree determination unit that obtains the appearance probabilities of the extracted acoustic features in the emphasized state by reference to the probability codebook, determines from them a degree of emphasis for each frame of the acoustic signal or for each paragraph made up of frames, and supplies to the encoding unit a bit rate control signal specifying a bit rate that is higher the larger the degree of emphasis.
The program of the present invention is an encoding program for causing a computer to execute the steps of the acoustic signal encoding method described in any of items (1) to (10) above.
[0011]
[Embodiments of the Invention]
Overview
FIG. 1A shows an embodiment of the present invention. The acoustic signal from the input terminal 11 is analyzed frame by frame in the feature extraction unit 12, and acoustic features are extracted that contain as parameters at least the fundamental frequency, the power, and the temporal variation characteristic of a dynamic feature, or the inter-frame differences (also written Δ components) of these.
The probability codebook 13 stores at least two sets, each consisting of an acoustic feature, its appearance probability in the emphasized state, and its appearance probability in the calm state. The emphasis degree determination unit 14 obtains the appearance probabilities, in each state, of the acoustic features extracted by the feature extraction unit 12 by reference to the probability codebook 13, uses these probabilities to determine the degree of emphasis of each acoustic paragraph of the input acoustic signal, and outputs a control signal specifying a bit rate that is higher the larger the determined degree of emphasis. The encoding unit 15 compression-codes the input acoustic signal from the input terminal 11 at the bit rate set by the bit rate control signal from the emphasis degree determination unit 14, and outputs the compressed code, together with a code indicating the bit rate or a change in the bit rate, to the output terminal 16.
Features and the probability codebook
Speech features include the fundamental frequency (f0), the power (p), the temporal variation characteristic (d) of a dynamic speech feature, and pause length (silent interval) (ps). The temporal variation characteristic (d) of the dynamic feature is a parameter that serves as a measure of the speaking rate: the temporal variation of the LPC spectrum coefficients, which reflect the spectral envelope, is taken as the dynamic variation, and a speaking-rate coefficient is obtained from that variation. More specifically, the LPC cepstrum coefficients C1(t), ..., Ck(t) are extracted for each frame, and the dynamic feature d (the dynamic measure) is obtained as
d(t) = Σ_{i=1}^{k} [ Σ_{f=-f0}^{f0} f · Ci(t+f) / Σ_{f=-f0}^{f0} f² ]²
Here f0 is the number of speech frames before and after the current frame (a fixed time interval may be used instead of an integral number of frames), k is the order of the LPC cepstrum, and i = 1, 2, ..., k. As the speaking-rate coefficient, the number of local maxima of the change of the dynamic feature per unit time, or its rate of change per unit time, is used.
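As a concrete illustration of the dynamic measure just defined, the regression computation can be sketched as follows. This is a minimal Python sketch, not part of the patent disclosure; the function name and the toy cepstrum data are ours.

```python
import numpy as np

def dynamic_measure(cepstra: np.ndarray, f0: int) -> np.ndarray:
    """Dynamic measure d(t) from per-frame LPC cepstra.

    cepstra: shape (T, k), row t holding C1(t)..Ck(t).
    f0: half-width of the regression window, in frames.
    Frames whose window would run off the ends are left at 0.
    """
    T, k = cepstra.shape
    offsets = np.arange(-f0, f0 + 1)           # the index f in the formula
    denom = float(np.sum(offsets ** 2))        # sum of f^2
    d = np.zeros(T)
    for t in range(f0, T - f0):
        window = cepstra[t - f0 : t + f0 + 1]  # Ci(t+f), f = -f0..f0
        slopes = offsets @ window / denom      # regression slope per order i
        d[t] = float(np.sum(slopes ** 2))      # sum of squared slopes over i
    return d

# toy example: 100 frames of a 10th-order cepstrum
rng = np.random.default_rng(0)
print(dynamic_measure(rng.standard_normal((100, 10)), f0=4)[:6])
```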
[0012]
For example, let one frame be 100 ms, with a shift of 50 ms. The average fundamental frequency of each frame is obtained (f0'). Similarly, the average power of each frame (p') is obtained. Next, the differences between f0' of the current frame and f0' of the frames ±i frames away are taken as ±Δf0'i (Δ components). Likewise for power, the differences ±Δp'i (Δ components) between p' of the current frame and p' of the frames ±i frames away are obtained. Then f0', ±Δf0'i, p', and ±Δp'i are normalized: for example, f0' and ±Δf0'i are each divided by the average fundamental frequency of the whole speech waveform, and p' and ±Δp'i are similarly divided by the average power of the whole speech waveform subject to the utterance-state judgment. The normalization may instead divide by the average power of each sub-paragraph or speech paragraph described later. The value of i is, for example, i = 4. The number of peaks of the dynamic measure, that is, of local maxima of the change of the dynamic feature, within ±T1 ms of the current frame is counted (dp). The difference (−Δdp) between this count and the dp of the frame whose interval contains the time T2 ms before the start of the current frame is obtained, and so is the difference (+Δdp) between the dp count for ±T1 ms and the dp of the frame whose interval contains the time T3 ms after the end of the current frame. The values of T1, T2, and T3 are, for example, T1 = T2 = T3 = 450 ms. The silent intervals before and after the frame are denoted ±ps. In step 1, the values of these parameters are extracted frame by frame as the speech features. The parameters include at least the fundamental frequency (or pitch period), the power, and the temporal variation characteristic of the dynamic feature, or their inter-frame differences (Δ components).
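The frame bookkeeping of this paragraph (per-frame means, ±i-frame differences, normalization by whole-waveform averages) might look like the sketch below; the names, the wrap-around handling at the edges, and the toy tracks are our assumptions.

```python
import numpy as np

def frame_features(f0_track, power_track, i=4):
    """Per-frame f0', p' and their +/- i-frame differences, normalized.

    f0_track, power_track: one mean value per frame (100 ms frames,
    50 ms shift in the text). np.roll wraps at the edges; a real
    implementation would mask the first and last i frames.
    """
    f0p = np.asarray(f0_track, dtype=float)
    pp = np.asarray(power_track, dtype=float)
    d_f0_plus, d_f0_minus = np.roll(f0p, -i) - f0p, f0p - np.roll(f0p, i)
    d_p_plus, d_p_minus = np.roll(pp, -i) - pp, pp - np.roll(pp, i)
    f0_mean, p_mean = f0p.mean(), pp.mean()   # whole-waveform averages
    return {
        "f0": f0p / f0_mean,
        "+df0": d_f0_plus / f0_mean, "-df0": d_f0_minus / f0_mean,
        "p": pp / p_mean,
        "+dp": d_p_plus / p_mean, "-dp": d_p_minus / p_mean,
    }

feats = frame_features([120, 125, 130, 128, 135, 140, 150, 145, 138, 132],
                       [1.0, 1.2, 1.1, 0.9, 1.3, 1.5, 1.6, 1.4, 1.2, 1.0])
print(feats["+df0"][:3])
```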
[0013]
Next, how the probability codebook 13 is created is described briefly.
Subjects listen to a large number of training speech samples and label the utterance state of each portion as either calm or emphasized.
For example, the subjects gave the following reasons for labeling a portion as emphasized:
(a) the voice is loud, and nouns and conjunctions are drawn out in the utterance;
(b) the start of the utterance is drawn out to assert a change of topic, and the voice is raised to pull opinions together;
(c) the voice is made louder and higher to emphasize an important noun or the like;
(d) the pitch is high but the voice is not especially loud;
(e) while laughing wryly, as though glossing over one's true feelings out of impatience;
(f) the end of the phrase rises in pitch, as though seeking agreement from, or posing a question to, the listeners;
(g) the voice at the end of the phrase grows louder, slowly and forcefully, as if to drive the point home;
(h) the voice is loud and high, asserting the intent to interrupt and speaking louder than the other party;
(i) the voice is quiet, murmuring or whispering, stating true feelings or secrets that would be awkward to say aloud, as when a normally loud speaker says something important.
In this example, the calm state was defined as an utterance that fell under none of (a) to (i) above and that the subject felt was calm.
[0014]
For each labeled calm-state or emphasized-state section, the above speech features are extracted from the training speech, parameters are selected, and a codebook is created with the LBG algorithm using those parameters from the calm-state and emphasized-state label sections. For each quantized speech feature (code) obtained in this way, the pair of its appearance probability in the emphasized state and its appearance probability in the calm state is stored in the codebook 13. The appearance probability in the emphasized state is either the probability that the speech feature appears in the emphasized state independently of the speech features of past frames (a unigram, hereafter called the independent appearance probability) alone, or that probability combined with conditional probabilities that the feature appears in the emphasized state given each frame-wise sequence of speech features leading from past frames to the current frame. The appearance probability in the calm state is likewise either the independent appearance probability of the feature in the calm state alone, or that probability combined with the corresponding conditional probabilities for the calm state.
For example, as shown in FIG. 2, the probability codebook 13 stores, for each code C1, C2, ..., one set consisting of its speech feature, its independent appearance probabilities for the emphasized and calm states, and its conditional probabilities for the emphasized and calm states.
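FIG. 2 suggests a storage layout along the following lines. The class layout and the nearest-neighbour lookup are our assumptions; a real system would train the centroids with the LBG algorithm mentioned above.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class CodeEntry:
    centroid: np.ndarray   # quantized speech-feature vector
    p_emp: float           # independent appearance probability, emphasized
    p_nrm: float           # independent appearance probability, calm
    # conditional probabilities keyed by preceding code indices, e.g. (j,) or (j, k)
    cond_emp: dict = field(default_factory=dict)
    cond_nrm: dict = field(default_factory=dict)

def quantize(feature: np.ndarray, codebook: list) -> int:
    """Map a frame's feature vector to the index of the nearest code."""
    dists = [np.linalg.norm(feature - e.centroid) for e in codebook]
    return int(np.argmin(dists))

codebook = [
    CodeEntry(np.array([0.9, 0.1]), p_emp=0.2, p_nrm=0.8),
    CodeEntry(np.array([1.4, 0.9]), p_emp=0.7, p_nrm=0.3),
]
print(quantize(np.array([1.3, 0.8]), codebook))   # -> 1
```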
[0015]
Judging the degree of emphasis
Next, a concrete example of the emphasis degree determination unit 14 is described with reference to FIG. 1B. In this example, each sub-paragraph of the speech signal is judged to be in the emphasized state or the calm state from its appearance probabilities, and the degree of emphasis of each speech paragraph is determined on that basis.
To detect sub-paragraphs, the voiced/silence determination unit 21 classifies the input speech signal into voiced sections and silent sections. A frame of the input speech signal is judged silent if its power is at or below a predetermined value, and voiced if, for example, its frame-wise correlation function is at or above a predetermined value.
If the silent sections on both sides of a voiced section each last at least t seconds, for example 400 ms, the sub-paragraph determination unit 22 judges the enclosed voiced section to be a sub-paragraph. Suppose the final sub-paragraph detection unit 23 is given an input speech signal judged into sub-paragraphs j−1, j, and j+1, as shown in FIG. 3. Speech sub-paragraph j consists of n voiced sections and has average power Pj. When the average of the average powers pi of the voiced sections in the rear part of sub-paragraph j, that is, of the (n−α)-th through n-th voiced sections, is smaller than the average power Pj of sub-paragraph j, in other words when
Σpi/(α+1) < βPj, where Σ is the sum over i = n−α to n, and α and β are constants,
is satisfied, sub-paragraph j is detected as the final sub-paragraph of speech paragraph k, and the speech paragraph determination unit 24 judges the group of sub-paragraphs from this final sub-paragraph back to the final sub-paragraph immediately before it to be speech paragraph k. For example, α is 3 and β is 0.8. In this way, with final sub-paragraphs as delimiters, each group of speech sub-paragraphs between adjacent final sub-paragraphs is judged to be a speech paragraph.
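Assuming power-threshold silence detection and the final-sub-paragraph test Σpi/(α+1) < βPj described above, the segmentation might be sketched as follows (the frame-index bookkeeping and thresholds are ours):

```python
def subparagraphs(frame_power, silence_thresh, min_silence):
    """Voiced runs separated by silences of at least min_silence frames.
    Runs separated by shorter silences are merged into one sub-paragraph."""
    voiced_runs, start = [], None
    for t, p in enumerate(frame_power):
        if p > silence_thresh and start is None:
            start = t
        elif p <= silence_thresh and start is not None:
            voiced_runs.append([start, t])
            start = None
    if start is not None:
        voiced_runs.append([start, len(frame_power)])
    merged = []
    for run in voiced_runs:
        if merged and run[0] - merged[-1][1] < min_silence:
            merged[-1][1] = run[1]   # silence too short: same sub-paragraph
        else:
            merged.append(run)
    return [tuple(r) for r in merged]

def is_final_subparagraph(voiced_powers, alpha=3, beta=0.8):
    """Final-sub-paragraph test: the mean power of the last alpha+1 voiced
    sections is below beta times the sub-paragraph's mean power Pj."""
    Pj = sum(voiced_powers) / len(voiced_powers)
    tail = voiced_powers[-(alpha + 1):]
    return sum(tail) / len(tail) < beta * Pj

print(subparagraphs([0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0], 0.5, 8))
print(is_final_subparagraph([1.0, 0.9, 0.5, 0.4, 0.3, 0.2]))
```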
[0016]
For the speech features of each frame of each sub-paragraph, the nearest quantized speech feature (code) is looked up in the probability codebook 13, and the appearance probability P(e) in the emphasized state and the appearance probability P(n) in the calm state stored with each code are retrieved by the emphasized-state probability calculation unit 25 and the calm-state probability calculation unit 26, respectively.
Using the probabilities retrieved for each frame, the emphasized-state probability calculation unit 25 and the calm-state probability calculation unit 26 then calculate, respectively, the probability that the sub-paragraph is in the emphasized state and the probability that it is in the calm state. For example, suppose speech sub-paragraph s has Ns frames and that the codes corresponding to its per-frame speech features are, in frame order, Ci1, Ci2, ..., CiNs. The probability Ps(e) that sub-paragraph s is in the emphasized state and the probability Ps(n) that it is in the calm state are computed by the following expressions.
[0017]
Ps(e) = Pemp(Ci3 | Ci1 Ci2) ... Pemp(CiNs | Ci(Ns−2) Ci(Ns−1))
Ps(n) = Pnrm(Ci3 | Ci1 Ci2) ... Pnrm(CiNs | Ci(Ns−2) Ci(Ns−1))
Here, Pemp(Ci3 | Ci1 Ci2), for example, denotes the probability that Ci3 appears in the emphasized state following Ci1 and Ci2. To obtain Pemp(Ci3 | Ci1 Ci2), it is preferable to take from the codebook 13 the independent appearance probability of Ci3 in the emphasized state, the conditional probability that Ci3 appears in the emphasized state next after Ci2, and the conditional probability that Ci3 appears in the emphasized state next after Ci1 and Ci2, and to combine these by linear interpolation. The other appearance probabilities in the emphasized state, and the appearance probabilities in the calm state, are preferably obtained by linear interpolation in the same way. Linear interpolation is described, for example, in "Spoken Language Processing" (Kenji Kita et al., Morikita Shuppan, 1996, p. 29).
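A hedged sketch of the interpolated probability computation follows; the log domain avoids underflow over long sub-paragraphs, and the interpolation weights and table layout are our choices, since the patent only states that the independent and conditional probabilities are combined by linear interpolation.

```python
import math

def interp_prob(code, prev2, prev1, uni, bi, tri, lam=(0.2, 0.3, 0.5)):
    """P(code | prev2 prev1) by linear interpolation of the unigram,
    bigram, and trigram estimates (missing n-grams fall back to 0)."""
    p1 = uni.get(code, 0.0)
    p2 = bi.get((prev1, code), 0.0)
    p3 = tri.get((prev2, prev1, code), 0.0)
    return lam[0] * p1 + lam[1] * p2 + lam[2] * p3

def subparagraph_logprob(codes, uni, bi, tri):
    """log Ps = sum of log P(Ci | Ci-2 Ci-1) over i = 3..Ns."""
    logp = 0.0
    for i in range(2, len(codes)):
        p = interp_prob(codes[i], codes[i - 2], codes[i - 1], uni, bi, tri)
        logp += math.log(max(p, 1e-12))   # floor keeps the log finite
    return logp

# toy emphasized-state and calm-state tables
uni_e = {0: 0.3, 1: 0.7}; bi_e = {(0, 1): 0.8}; tri_e = {(0, 0, 1): 0.9}
uni_n = {0: 0.7, 1: 0.3}; bi_n = {(0, 1): 0.2}; tri_n = {(0, 0, 1): 0.1}
codes = [0, 0, 1, 1]
print(subparagraph_logprob(codes, uni_e, bi_e, tri_e) >
      subparagraph_logprob(codes, uni_n, bi_n, tri_n))   # True: emphasized
```

Working in the log domain also makes the Ps(e) > Ps(n) comparison of the next paragraph a simple comparison of summed logs.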
[0018]
In this way the appearance probability Ps(e) in the emphasized state and the appearance probability Ps(n) in the calm state of sub-paragraph s are calculated by the probability calculation units 25 and 26, and on the basis of these results the emphasized-state judgment unit 27 judges sub-paragraph s to be in the emphasized state if Ps(e) > Ps(n) and in the calm state if Ps(e) < Ps(n). The emphasized-state judgment unit 27 judges, in the same way, whether every sub-paragraph is emphasized or calm.
The emphasis degree determination unit 28 determines the degree of emphasis of each speech paragraph judged by the speech paragraph determination unit 24, based on the results the emphasized-state judgment unit 27 produced for all the sub-paragraphs composing that paragraph. For example, when the degree of emphasis is divided into three levels 0, 1, and 2 as shown in FIG. 4A, a speech paragraph containing no sub-paragraph judged emphasized is given degree 0, one containing one or two emphasized sub-paragraphs is given degree 1, and one containing three or more is given degree 2.
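The FIG. 4A rule is a simple threshold on the count of emphasized sub-paragraphs; a sketch:

```python
def emphasis_degree(n_emphasized: int) -> int:
    """FIG. 4A rule: 0 emphasized sub-paragraphs -> degree 0,
    1 or 2 -> degree 1, 3 or more -> degree 2."""
    if n_emphasized == 0:
        return 0
    return 1 if n_emphasized <= 2 else 2

assert [emphasis_degree(n) for n in (0, 1, 2, 3, 5)] == [0, 1, 1, 2, 2]
```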
[0019]
Alternatively, as shown in FIG. 4B, the judgment in the emphasized-state judgment unit 27 may be made with the appearance probability in the emphasized state multiplied by a weight w. If, with a weight 0 ≤ w < ε, one or more sub-paragraphs in the speech paragraph are judged emphasized, the degree of emphasis is 2; if, judged with a weight ε ≤ w < 1/ε, exactly one sub-paragraph in the speech paragraph is judged emphasized, the degree is 1; and if, judged with a weight 1/ε ≤ w, the number of sub-paragraphs judged emphasized is 0 or 1, the degree is 0. Here ε is a positive real number less than 1, preferably about 0.5 to 0.8.
[0020]
The multiplication of the appearance probability in the emphasized state by the weight w is performed by the multiplication unit 27a, shown by a broken line in FIG. 1B, and the value of the weight w is controlled by the emphasis degree determination unit 28.
The degree of emphasis determined by the emphasis degree determination unit 28 is input to the bit rate control signal generation unit 29, which, in the examples of FIGS. 4A and 4B, generates a low bit rate control signal when the degree of emphasis is 0, a medium bit rate control signal when it is 1, and a high bit rate control signal when it is 2, and supplies the signal to the encoding unit 15 in FIG. 1A.
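The FIG. 4B variant re-runs the sub-paragraph judgment with the emphasized-state probability scaled by a weight w and reads the degree off the resulting counts; the probe weights and interface below are our assumptions, as is the degree-to-rate mapping table standing in for generator 29.

```python
import math

def count_emphasized(log_pe, log_pn, w):
    """Sub-paragraphs judged emphasized when w * Ps(e) > Ps(n) (log domain)."""
    lw = math.log(w)
    return sum(1 for le, ln in zip(log_pe, log_pn) if lw + le > ln)

def weighted_degree(log_pe, log_pn, eps=0.7):
    """FIG. 4B read-off with one probe weight per band (probe values are ours)."""
    if count_emphasized(log_pe, log_pn, w=0.9 * eps) >= 1:
        return 2   # emphasized even under a suppressing weight w < eps
    if count_emphasized(log_pe, log_pn, w=1.0) == 1:
        return 1   # exactly one at a neutral weight in [eps, 1/eps)
    return 0       # otherwise: calm even without suppression

RATE_SIGNAL = {0: "low", 1: "medium", 2: "high"}   # degree -> control signal
print(RATE_SIGNAL[weighted_degree([-3.0, -1.0], [-2.0, -4.0])])   # "high"
```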
[0021]
According to the input bit rate control signal, the encoding unit 15 compression-codes the input speech signal in the low-rate unit 15L at a predetermined low bit rate for a low bit rate control signal; in the medium-rate unit 15M at a predetermined medium bit rate, higher than the low bit rate, for a medium bit rate control signal; and in the high-rate unit 15H at a predetermined high bit rate, higher still than the medium bit rate, for a high bit rate control signal.
Although three encoding units 15L, 15M, and 15H are shown here as the encoding unit 15, one per bit rate, this is only for convenience. They may be built as independent encoders, or, as in the known examples cited in the prior-art section, part of the functional configuration may be switched, added, or removed according to the bit rate; various configurations are possible, and the unit can be built in the same way as a conventional variable-bit-rate encoder. The input speech signal is also delayed in the buffer storage unit 17, by the time it takes to generate the bit rate control signal, before being supplied to the encoding unit 15.
[0022]
As shown for example in FIG. 4C, the encoding unit 15 outputs, ahead of the compressed code string 31 for the speech signal, a code 32 indicating the encoding bit rate of that code string. This bit rate code 32 is added at least every time the encoding bit rate changes.
With this encoder, the semantically important parts of a speech signal, which generally contain emphasized sub-paragraphs, are encoded at a high bit rate, so the decoded speech renders them more clearly, in some cases even restoring speech charged with the speaker's emotion, while speech paragraphs in the calm state are given a low bit rate, so that the coding efficiency can be kept high.
[0023]
FIG. 5 shows an example of a decoder that decodes the output of the encoder shown in FIG. 1A. The code string from the input terminal 41 is separated by the separation unit 42 into the coded string 31 and the bit rate code 32, and the coded string 31 is supplied to the decoding unit 43. The bit rate code 32 is decoded by the bit rate code decoding unit 44, and, according to the decoded bit rate control signal, the coded string 31 is decoded by one of the high bit rate decoding unit 43H, the medium bit rate decoding unit 43M, and the low bit rate decoding unit 43L in the decoding unit 43. The decoding unit 43 corresponds to the encoding unit 15 in FIG. 1A; the bit rate decoding units 43H, 43M, and 43L may be built independently, or shared with part of the functional configuration switched, removed, or added, and the unit can be configured like a conventional variable-bit-rate decoder. In other words, in this invention a higher degree of emphasis means a higher bit rate, that is, more precise encoding, so the decoded acoustic signal is of higher quality in sections with a higher degree of emphasis.
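The FIG. 4C format, a bit rate code 32 written ahead of the compressed string 31 whenever the rate changes, and the FIG. 5 dispatch can be sketched as a writer/reader pair. The marker byte, length prefix, and byte-oriented framing are our simplifications, not the patent's format.

```python
def write_stream(segments):
    """segments: list of (rate, payload_bytes) with rate in {0, 1, 2}.
    A rate record (marker 0xFF + rate byte) is emitted only when the
    rate changes; payloads are length-prefixed. Payload lengths must
    stay below 0xFF so the marker byte remains unambiguous."""
    out, last_rate = bytearray(), None
    for rate, payload in segments:
        if rate != last_rate:
            out += bytes([0xFF, rate])
            last_rate = rate
        out += bytes([len(payload)]) + payload
    return bytes(out)

def read_stream(data):
    """Yield (rate, payload), dispatching on the current rate code the
    way decoding unit 43 selects 43L/43M/43H."""
    i, rate = 0, None
    while i < len(data):
        if data[i] == 0xFF:
            rate = data[i + 1]
            i += 2
        n = data[i]
        yield rate, bytes(data[i + 1 : i + 1 + n])
        i += 1 + n

stream = write_stream([(0, b"aa"), (0, b"bb"), (2, b"cc")])
print(list(read_stream(stream)))  # [(0, b'aa'), (0, b'bb'), (2, b'cc')]
```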
[0024]
An example of the procedure of the encoding method of the present invention, corresponding to the encoder shown in FIG. 1A, is described with reference to FIG. 6.
When an acoustic signal, a speech signal in this example, is input, it is first stored in a storage unit (S1), and the speech features of the signal are extracted frame by frame (S2). Each portion is also classified as a silent or voiced section (S3); voiced sections flanked by silent sections of at least a predetermined length are extracted as speech sub-paragraphs (S4); the final sub-paragraphs among them are detected, and speech paragraphs, each consisting of the group of sub-paragraphs between adjacent final sub-paragraphs, are extracted (S5).
[0025]
The per-frame speech features extracted in step S2 are quantized by reference to the probability codebook 13, that is, matched to the nearest code in the probability codebook 13 (S6); the independent appearance probabilities of each code in the emphasized and calm states, or these together with the conditional appearance probabilities, are taken from the codebook 13, and the appearance probabilities Ps(e) and Ps(n) in the emphasized and calm states are obtained for each speech sub-paragraph (S7). From these appearance probabilities it is then judged whether each sub-paragraph is emphasized or calm, and the degree of emphasis of each speech paragraph is determined according to the number of its emphasized sub-paragraphs, or by weighting the appearance probability in the emphasized state during the emphasized/calm judgment, or by a combination of the two (S8).
[0026]
A bit rate is determined according to the determined degree of emphasis (S9), the speech signal of the corresponding speech paragraph is encoded at that bit rate (S10), and the coded string is output together with the code indicating its encoding bit rate (S11).
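Tying steps S1 to S11 together, the overall flow might read as below; every helper is a trivial stand-in for the corresponding sketch above, not the patent's implementation.

```python
# Illustrative stand-ins so the flow below actually runs; a real system
# would plug in the sketches from the preceding sections.
def extract_features(frame):     return sum(frame) / len(frame)
def detect_paragraphs(feats):    return [feats[:2], feats[2:]]
def judge_subparagraph(feat):    return feat > 0.5
def emphasis_degree(n):          return 0 if n == 0 else (1 if n <= 2 else 2)
def encode_segment(para, rate):  return bytes(len(para))

def encode(signal_frames):
    """S1-S11 flow: features (S2), segmentation (S3-S5), sub-paragraph
    judgment (S6-S7), paragraph emphasis degree (S8), rate choice (S9),
    encoding (S10), and output of rates with the coded strings (S11)."""
    feats = [extract_features(fr) for fr in signal_frames]
    out = []
    for para in detect_paragraphs(feats):
        n_emph = sum(judge_subparagraph(f) for f in para)
        rate = {0: "low", 1: "medium", 2: "high"}[emphasis_degree(n_emph)]
        out.append((rate, encode_segment(para, rate)))
    return out

print(encode([[0.1, 0.2], [0.9, 0.8], [0.7, 0.6], [0.2, 0.1]]))
```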
Modifications
In the encoding performed by the encoding unit 15, if the coding method uses the same parameters as the speech features extracted by the feature extraction unit 12, the values extracted by the feature extraction unit 12 can be reused. Conversely, some or all of the speech feature parameters that the encoding unit 15 extracts for encoding may be reused as the extraction parameters of the feature extraction unit 12. That is, as shown for example in FIG. 7, features are extracted from the input acoustic signal frame by frame in the feature extraction unit 12; using at least some of the extracted feature parameters, the emphasis degree determination unit 14 determines the degree of emphasis of each acoustic paragraph of the acoustic signal, and a corresponding bit rate control signal is supplied to the encoding unit 15. The encoding unit 15 then encodes that acoustic paragraph of the input acoustic signal, using at least some of the parameters extracted by the feature extraction unit 12, at the bit rate set by the input bit rate control signal.
[0027]
The present invention applies not only to the encoding of speech signals: music signals can likewise be encoded with the important musical passages at a high bit rate and the unimportant parts at a low bit rate, raising the coding efficiency. The reasons the subjects gave for feeling that a passage was emphasized when the probability codebook 13 was created for music are listed below.
(a) the voice is loud and high;
(b) the voice is powerful;
(c) the voice is high and the accent is strong;
(d) the voice is high and the voice quality changes;
(e) the notes are sustained and the voice is loud;
(f) the voice is loud and high and the accent is strong;
(g) the voice is loud and high, shouting;
(h) the voice is high and the accent changes;
(i) the notes are sustained, the voice is loud, and the phrase ends high;
(j) the voice is high and the notes are sustained;
(k) the notes are sustained, shouted, and the voice is high;
(l) the phrase endings rise and are powerful;
(m) slow and forceful;
(n) the melody is irregular;
(o) the melody is irregular and the voice is high.
Next, an example in which the present invention is applied to TwinVQ is described with reference to FIG. 8. TwinVQ is described, for example, in Nikkei Electronics, April 21, 1997 (No. 687), pp. 181-202, among other sources. For the acoustic signal from the input terminal 11, the signal within each frame is analyzed prior to the orthogonal transform, and the transform length, that is, the number of divisions of the frame, is determined by the division selection unit 51. For example, when the frame length is 2048 samples, it is decided whether to apply a 2048-point transform once, a 512-point transform four times, or a 128-point transform sixteen times to the frame data. Following this decision, the window-division MDCT unit 52 converts the acoustic signal into frequency-domain signals (coefficients) by the Modified Discrete Cosine Transform, one transform per determined transform length.
[0028]
Separately from this transform, the LPC analysis unit 53 estimates the spectral envelope of the acoustic signal by linear prediction analysis, and the division unit 54 normalizes the frequency-domain coefficients by this spectral envelope. The pitch analysis unit 55 extracts the periodic peak component (the pitch component) from the low-frequency part of the flattened frequency-domain coefficients, and the subtraction unit 56 subtracts this component from the output of the division unit 54. Further, the Bark-scale spectrum analysis unit 57 obtains an envelope by averaging the output of the division unit 54 over nonlinear divisions proportional to the Bark scale, and the division unit 58 divides the output of the subtraction unit by this envelope. In this way the fine spectral peaks in the flattened frequency-domain coefficients from the division unit 54 are flattened by the pitch component and flattened further in the division unit 58.
[0029]
The average power of the whole of the frequency-domain coefficients flattened in this way is computed by the power analysis unit 59, and the output of the division unit 58 is normalized by this average power in the normalization unit 61. All of the normalized frequency-domain coefficients are interleaved and divided into sub-vectors by the interleave division unit 62, and the divided normalized frequency-domain coefficients are vector-quantized by the conjugate-structure weighted vector quantization unit 63, which represents each of them as the sum of entries from two codebooks. At this point a psychoacoustic model is created in the model generation unit 64 from the spectral envelope characteristic from the LPC analysis unit 53, the pitch component from the pitch component analysis unit 55, the averaged spectral envelope from the Bark-scale spectrum analysis unit 57, and the per-subframe power from the power analysis unit 59; the vectors selected in the vector quantization unit 63 are weighted with this psychoacoustic model, and the vector quantization is performed under a distance measure such that the perceptual distortion of the decoded signal is minimized.
[0030]
In this embodiment of the invention, the analysis results of the LPC analysis unit 53, the pitch component analysis unit 55, and the power analysis unit 59 are input to the emphasis degree determination unit 14; at least part of the input analysis results serves as the acoustic features described above; on the basis of the input analysis results, sub-paragraphs and acoustic paragraphs are extracted; and, further referring to the probability codebook 13, the appearance probability in the emphasized state and the appearance probability in the calm state are calculated for each sub-paragraph to judge the emphasized and calm states and the degree of emphasis of each acoustic paragraph. As described earlier, the linear prediction coefficients from the LPC analysis unit 53 are used to obtain the LPC cepstrum and then its temporal variation characteristic; the dynamic feature is computed in the dynamic feature calculation unit 78 and input to the emphasis degree determination unit 14.
[0031]
In this example the degree of emphasis has two levels, 0 and 1, and correspondingly there are provided LPC codebooks 65L and 65H, pitch codebooks 66L and 66H, spectrum codebooks 67L and 67H, power codebooks 68L and 68H, and conjugate-structure codebooks 69L and 69H. The codebooks 66L, 67L, 68L, and 69L are smaller than the corresponding codebooks 66H, 67H, 68H, and 69H: each code vector has fewer elements and fewer code vectors are held, so a code (index) produced with the former codebooks has fewer bits than one produced with the latter.
[0032]
When the emphasis degree determination unit 14 judges the degree of emphasis to be 0, the low bit rate control signal from the unit 14 controls the codebook switching units 71 to 75 so that: the LSP parameters representing the spectral envelope analyzed in the LPC analysis unit 53 are encoded using the LPC codebook 65L; the pitch component extracted in the pitch component analysis unit 55 is encoded using the pitch codebook 66L; the averaged envelope analyzed in the Bark-scale spectrum analysis unit 57 is encoded using the spectrum codebook 67L; the average power detected in the power analysis unit 59 is encoded using the power codebook 68L; and the flattened frequency-domain coefficients are encoded in the weighted vector quantization unit 63 using the conjugate-structure codebook 69L. These coded outputs are combined in the synthesis unit 76 and output.
[0033]
When the emphasis degree determination unit 14 judges the degree of emphasis to be 1, the bit rate control signal from the unit 14 switches the codebook switching units 71 to 75 to the codebooks 65H to 69H; the analysis units 53, 55, 57, and 59 encode using the codebooks 65H, 66H, 67H, and 68H respectively, and the vector quantization unit 63 encodes using the codebook 69H; these coded outputs are combined in the synthesis unit 76 and output. At least whenever the bit rate control signal changes, the synthesis unit 76 attaches a bit rate code indicating this to the head of the combined code before outputting it. In this example the degree of emphasis is either 0 or 1, so one bit suffices for the bit rate code; and since a change in emphasis is either from 0 to 1 or from 1 to 0, one bit also suffices for the code indicating that the bit rate control signal has changed.
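In sketch form, the switch between the L and H codebook groups performed by the switching units 71 to 75 amounts to choosing which codebook a vector quantizer searches; the sizes and data below are illustrative, not the patent's codebooks.

```python
import numpy as np

rng = np.random.default_rng(1)
CODEBOOKS = {
    0: rng.standard_normal((8, 4)),   # degree 0: small "L" codebook, 3-bit index
    1: rng.standard_normal((32, 4)),  # degree 1: large "H" codebook, 5-bit index
}

def vq_encode(vector, degree):
    """Search only the codebook group selected by the emphasis degree;
    returns (index, bits_used)."""
    book = CODEBOOKS[degree]
    idx = int(np.argmin(np.linalg.norm(book - vector, axis=1)))
    return idx, int(np.log2(len(book)))

v = rng.standard_normal(4)
print(vq_encode(v, 0), vq_encode(v, 1))
```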
[0034]
The corresponding decoder is a conventional TwinVQ decoder provided with codebooks identical to 65L and 65H, 66L and 66H, 67L and 67H, 68L and 68H, and 69L and 69H; it decodes using one of these sets of codebooks according to the input bit rate code.
As is clear from the above description, in this embodiment the LPC analysis unit 53, the pitch component analysis unit 55, and the power analysis unit 59 together constitute a feature extraction unit 79 that extracts both the features used for judging the degree of emphasis and the features used for encoding the input acoustic signal.
[0035]
This embodiment is thus an example in which a plurality of codebook groups corresponding to a plurality of predetermined encoding bit rates are prepared (here two groups, one consisting of the codebooks 65L, 66L, 67L, 68L, and 69L, the other of the codebooks 65H, 66H, 67H, 68H, and 69H), and the input acoustic signal is encoded using the single codebook group corresponding to the bit rate determined by the emphasis degree determination unit 14. Changing the bit rate by selecting a codebook group does not require that every codebook used by the encoder be selected according to the bit rate; for example, as shown by the broken line in FIG. 8, encoding may use a single LPC codebook 65 regardless of changes in the bit rate.
[0036]
Next, an example in which the encoding bit rate is changed according to the degree of emphasis by changing the encoding method is described with reference to FIG. 9. In this embodiment the switching of the encoding method in the "acoustic signal encoding method" of Japanese Patent Application Laid-Open No. 8-263098 is carried out according to the encoding bit rate determined from the degree of emphasis; it is described briefly below.
Features are extracted frame by frame from the input acoustic signal by the feature extraction unit 79, the emphasis degree determination unit 14 judges the degree of emphasis from the extracted features, and a corresponding bit rate control signal is output. In this example the judged degree of emphasis is either 0 or 1. When the degree of emphasis is judged to be 0, the switching unit 81 is controlled by the bit rate control signal so that the input terminal is connected to the first encoding unit, in this example the CELP encoding unit 82. The input acoustic signal is fed to the inverse filter 83 in the encoding unit 82, the frequency dependence of its spectral envelope is suppressed, and a residual signal is obtained. This residual signal is input to the codebook selection unit 85, which computes the distortion between it and each adaptive codebook vector (a past residual signal, one pitch period back for each candidate period) from the adaptive codebook 86, and selects the pitch period corresponding to the adaptive codebook vector that minimizes the distortion. The code Cb corresponding to this pitch period, that is, to the selected adaptive codebook vector, becomes coded output. The adaptive codebook vectors may also be generated by repeating, at various pitch periods, the decoded residual signal obtained for the previously coded frame. The selected vector is subtracted from the residual signal from the inverse filter 83 in the difference circuit 87, and the resulting residual is vector-quantized in the time-domain vector quantization unit 88 with reference to the fixed codebook 89. The dequantized output and the vector selected by the codebook selection unit 85 are added in the adder circuit 91 to decode (synthesize) the residual signal, which is supplied to the adaptive codebook 86. The filter coefficients of the inverse filter 83 use the analysis results of the LPC analysis unit 84 in the feature extraction unit 79, and the extraction from the adaptive codebook 86 according to pitch period length is performed on the basis of the fundamental frequency extracted in the feature extraction unit 79, so the acoustic feature parameters are shared between encoding and emphasis degree determination. The analysis results of the LPC analysis unit 84 are encoded by the LPC quantization unit 92, and its code Ca, the code Cb indicating the vector selected by the adaptive codebook selection unit 85, the code Cc indicating the vector selected by the quantization unit 88, and a code indicating that the CELP encoding unit 82 was selected (the code indicating the bit rate) are output from the synthesis unit 76.
[0037]
On the other hand, when the degree of emphasis is judged to be 1, the switching unit 81 is controlled by the bit rate control signal so that the input terminal is connected to the second encoding unit, in this example the TwinVQ encoding unit 94. The input acoustic signal is converted into a frequency-domain signal by the MDCT unit 52 in the encoding unit 94; the frequency-domain coefficients are divided, in the division circuit 54, by the quantized spectral outline in the frequency domain and thereby normalized into frequency-domain residual coefficients; these residual coefficients are divided, in the division circuit 58, by the inter-frame prediction spectrum from the inter-frame prediction unit 95 and normalized into still flatter coefficients (the frequency-domain fine structure), which are vector-quantized by the frequency-domain quantization unit 96; this quantization unit 96 outputs the dequantized fine structure and the quantization code Ce of the fine structure. The dequantized fine structure, in turn, is multiplied by the inter-frame prediction spectrum from the inter-frame prediction unit 95 in the multiplier 97 to decode the residual coefficients, which are input to the inter-frame prediction unit 95, and the inter-frame prediction spectrum and the quantization code Cf of the inter-frame prediction coefficients are output. These codes Ce and Cf, the LPC quantization code Ca from the LPC quantization unit 92, and a code indicating that the TwinVQ encoding unit 94 was selected (the code indicating the bit rate) are combined in the synthesis unit 76 and output.
[0038]
In this case too, the LPC coefficients extracted by the feature extraction unit 79 are used both for emphasis degree judgment and for encoding, and, although not shown in FIG. 9, the power extracted by the feature extraction unit 79 can also be used for the power normalization performed within the frequency-domain quantization unit 96.
The number of emphasis degrees to be judged may be increased. For example, emphasis degree 0 may be split into two levels, with a fixed codebook 89' of a size different from the fixed codebook 89 (holding a different number of fixed vectors) provided, so that one of the fixed codebooks 89 and 89' is used for encoding at one of the two levels and the other at the other level. Similarly, several codebooks holding different numbers of vectors may be prepared for the frequency-domain quantization unit 96, emphasis degree 1 split into several levels, and the encoding in the frequency-domain quantization unit 96 performed with a codebook holding more vectors the higher the degree of emphasis.
[0039]
In the example above in which the degree of emphasis is judged with the appearance probability in the emphasized state multiplied by the weight w, the judgment may instead be made with the appearance probability in the calm state multiplied by the weight w. In that case, for example, the weight w applied in the multiplication unit 27b, shown by a broken-line box in FIG. 1B, is controlled, and a high degree of emphasis is assigned when sub-paragraphs judged emphasized remain even for a large w. As another technique for judging the degree of emphasis, as shown by the broken line in FIG. 1B, the ratio of the appearance probability P(e) in the emphasized state obtained by the emphasized-state probability calculation unit 25 to the appearance probability P(n) in the calm state obtained by the calm-state probability calculation unit 26 may be formed in the division unit 30, as P(e)/P(n) or P(n)/P(e), and the degree of emphasis determined according to the magnitude of this ratio. For example, when P(n)/P(e) is used, the smaller this ratio, the larger the degree of emphasis. In the example shown, the degree of emphasis is determined for each sub-paragraph, and the corresponding bit rate is decided accordingly. Alternatively, only the appearance probability P(e) in the emphasized state may be used, a sub-paragraph being judged to be in the emphasized state if this value is at or above a preset value.
[0040]
The degree of emphasis may be determined for various kinds of paragraph: for each paragraph made up of frames, that is, for each acoustic paragraph, each sub-paragraph, each voiced section (voiced paragraph), for each frame, or for each fixed interval such as 10 or 5 seconds.
When encoding a video signal closely tied to a speech signal, as in television, the speech signal may be encoded at a bit rate according to the degree of emphasis while the video signal is encoded with its bit rate varied in the same way.
The control of the encoding bit rate described above may also be made finer by dividing the degree of emphasis into four or more levels. The sub-paragraphs and acoustic paragraphs may each be of fixed length.
[0041]
The encoder of the present invention, for example the one shown in FIG. 1A, may be realized by having a computer execute a program. In that case, an encoding program for executing the steps of the encoding method shown in FIG. 6, for example, is installed into the program memory of the computer from a recording medium such as a CD-ROM or a flexible magnetic disk, or downloaded over a communication line, and the computer executes the program.
[0042]
[Effects of the Invention]
As described above, according to the present invention, efficient encoding is achieved by raising the bit rate of the parts of an acoustic signal whose content is important or which are to be emphasized, and lowering the bit rate of the remaining parts.
[Brief Description of the Drawings]
[FIG. 1] A shows the functional configuration of an example of an encoder according to the invention; B shows a concrete example of the functional configuration of its emphasis degree determination unit 14.
[FIG. 2] A diagram showing an example of the contents stored in the probability codebook 13 of FIG. 1.
[FIG. 3] A diagram for explaining speech sub-paragraphs, final speech sub-paragraphs, and speech paragraphs.
[FIG. 4] A and B show examples of emphasis degree decision tables; C shows an example of the output code format.
[FIG. 5] A diagram showing an example of the functional configuration of a decoder corresponding to the encoder of FIG. 1A.
[FIG. 6] A flowchart showing an example of the processing procedure of the encoding method of the invention.
[FIG. 7] A diagram showing an example of the functional configuration of another embodiment of the invention.
[FIG. 8] A diagram showing an example of the functional configuration of an encoder in which the invention is applied to TwinVQ.
[FIG. 9] A diagram showing an example of the functional configuration of still another embodiment of the invention.
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to compression coding of audio signals such as audio signals and music signals, and more particularly to an encoding method for changing a compression rate, that is, an encoding bit rate according to the state of the audio signal, an encoder thereof, and a program thereof.
[0002]
[Prior art]
Conventionally, in order to efficiently compress and encode an audio signal, for example, Japanese Patent Application Laid-Open No. 7-225599 discloses a case where an input audio signal is a non-voice, an unvoiced sound, a transient voiced sound, and a steady voiced sound. The CELP coding method is shown in which the bit rate is adaptively changed according to these cases. This encoding method is intended to reduce the amount of information as much as possible while maintaining auditory quality in each part of the audio signal.
[0003]
Also, for example, ADPCM (differential PCM) coding that allows a plurality of users to share a line using low bit rate coding and adaptively switches the bit rate according to the number of users and quality requirements. The method is shown, for example, in pages 124 to 128 of "Speech coding" by Takehiro Moriya, published by the Institute of Electronics, Information and Communication Engineers, October 20, 1998.
Furthermore, it is detected whether the input acoustic signal is an audio signal or a music signal. In the former case, the coding method is switched by the CELP coding method, and in the latter case by the TwinVQ coding method. JP-A-8-263098 discloses that encoding is performed.
[0004]
[Problems to be solved by the invention]
Conventionally, the coding bit rate is changed adaptively according to the required quality or within a range that does not affect the quality according to the partial state of the signal.
However, the important part of the speech utterance content, especially the part that you want to emphasize in one piece of music, should have a high bit rate, especially improve the quality, for example to be able to hear even minor emotional changes, or especially in music If the quality is high, more efficient coding, or even the same amount of information can be made more meaningful information than before. It is an object of the present invention to provide an encoding method that enables such compression encoding.
[0005]
[Means for Solving the Problems]
(1) According to the method for compressing and encoding an acoustic signal according to the present invention, the acoustic signal is analyzed for each frame, and at least the fundamental frequency, power, time-varying characteristics of the dynamic feature quantity, or the difference between these frames (Δ component) ) Is extracted as a parameter, and enhancement of the extracted acoustic feature quantity is performed with reference to a probability codebook storing at least two sets of data including the acoustic feature quantity and the appearance probability in the emphasized state. The appearance probability in the state is obtained, and the enhancement degree of the acoustic signal is determined for each frame or paragraph including the frame based on the obtained appearance probability in the enhancement state, and the higher the determined enhancement degree, the higher the bit rate. A code obtained by compression-coding the acoustic signal and a code indicating a bit rate or a bit rate change are output.
[0006]
(2) Preferably in the method of (1), a linear prediction coefficient is calculated in the analysis of the acoustic signal, and the dynamic feature value is calculated from the linear prediction coefficient.
The fundamental frequency, the power, and the linear prediction coefficient extracted as the acoustic feature amount are used for determining the compression-coded code.
(3) Preferably, in the method of (1), a plurality of codebook groups storing a set of acoustic feature quantities and codes corresponding to the respective bit rates are prepared, and the bit rate corresponding to the enhancement degree By selecting and using one codebook group, a code corresponding to each of the extracted acoustic feature values is determined, the code is the compression-coded code, and the code is a code indicating the bit rate or the bit rate change. A code indicating the selected codebook group is used.
[0007]
(4) In the method of (1), preferably, an encoding method corresponding to the degree of enhancement is selected, the acoustic feature value is encoded using the selected encoding method, and the compression encoded code is used. The code indicating the selected encoding method is used as the code indicating the bit rate or the bit rate change.
(5) Preferably in the method of said (1)-(4) item, an emphasis degree is determined for every acoustic paragraph of an acoustic signal from the appearance probability in the emphasis state in the acoustic paragraph.
(6) Preferably, in the method of (5), the probability codebook includes the appearance probability in a calm state for each acoustic feature, determines the appearance probability for each sub-paragraph included in the acoustic paragraph, It is determined whether the small paragraph is in an emphasized state or a calm state, and the degree of enhancement is obtained based on the state determination result of the small paragraph in the acoustic paragraph.
[0008]
(7) In the method of (6), the degree of enhancement is preferably determined based on the number of small paragraphs determined to be in an enhanced state in the acoustic paragraph.
(8) In the method of (6), the probability codebook preferably includes an appearance probability in a calm state for each acoustic feature, and includes an appearance probability in the emphasized state and an appearance probability in the calm state. At least one of them is given a weight, the weight is changed, the number of small paragraphs determined to be in an emphasized state is controlled to a predetermined value or less, and the above-described enhancement degree of the acoustic paragraph is obtained based on the magnitude of the weight at that time.
[0009]
(9) In the method of (6), preferably, the music signal is determined for each frame as a silent section or a voiced section, and a voiced section surrounded by a predetermined number or more of the silent sections is defined as the small sound section. Detect sub-paragraphs between adjacent sub-paragraphs, with sub-paragraphs that end with sub-paragraphs whose average power in one or more voiced sections is less than a constant multiple of the average power in the sub-paragraph. Detect as the acoustic paragraph.
(10) Preferably, in the method according to any one of (1) to (6), the probability codebook includes an appearance probability in a calm state for each acoustic feature quantity, and an appearance probability and a calm state in an emphasized state. The degree of emphasis is determined from the ratio to the appearance probability at.
[0010]
According to the encoder of the present invention, an encoding unit capable of compressing and encoding an input acoustic signal at a different bit rate according to a bit rate control signal, and at least a time of a fundamental frequency, power, and a dynamic feature amount A probability codebook in which at least two or more sets of data including an acoustic feature amount including a change characteristic or an inter-frame difference (Δ component) as a parameter and an appearance probability in the emphasized state are stored, and an input acoustic signal for each frame To determine the appearance probability in the emphasized state of the extracted acoustic feature amount with reference to the probability codebook, and based on the obtained appearance probability, An enhancement degree determination unit that obtains an enhancement degree for each frame of an acoustic signal or each paragraph including a frame, and supplies a bit rate control signal with a higher bit rate to the encoding unit as the enhancement degree increases. To Bei.
The program of the present invention is an encoding program for causing a computer to execute each procedure of the acoustic signal encoding method according to any one of (1) to (10).
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Overview
FIG. 1A shows an embodiment of the present invention. The acoustic signal from the input terminal 11 is analyzed for each frame by the feature quantity extraction unit 12, and at least the fundamental frequency, the power, the time change characteristic of the dynamic feature quantity, or the difference between these frames (also referred to as Δ component) is used as a parameter. Extract the acoustic features.
The probability codebook 13 stores at least two or more sets of acoustic features, appearance probabilities in the emphasized state, and appearance probabilities in the calm state. The appearance probability in each state of the acoustic feature amount extracted by the feature amount extraction unit 12 in the enhancement level determination unit 14 is obtained with reference to the probability codebook 13, and using these appearance probabilities, for each acoustic paragraph of the input acoustic signal The degree of enhancement is determined, and a control signal with a higher bit rate is output as the determined degree of enhancement is larger. The encoding unit 15 compresses and encodes the input audio signal from the input terminal 11 at a bit rate determined by the bit rate control signal from the enhancement degree determination unit 14, and the compression code and a code indicating a change in the bit rate or the bit rate Is output to the output terminal 16.
Features, probability codebook
The voice feature amount includes a fundamental frequency (f0), power (p), time change characteristic (d) of voice dynamic feature amount, pause time length (silent section) (ps), and the like. The time variation characteristic (d) is a parameter as a measure of the speech rate, and the time variation characteristic of the LPC spectrum coefficient reflecting the spectrum envelope is obtained as the dynamic variation amount, and the speech rate coefficient is obtained based on the time variation. More specifically, LPC cepstrum coefficients Cl (t),..., Ck (t) are extracted for each frame to obtain a dynamic feature amount d (dynamic measure) as shown in the following equation. d (t) = Σi = 1 kf = t-f0 t + f0[F × Ci (t)] / (Σf = t-f0 t + f0f2)]2Here, f0 is the number of frames in the preceding and following speech sections (not necessarily an integer number of frames but may be a fixed time section), and k is the order of the LPC cepstrum, i = 1, 2,. As the coefficient of speech rate, the number of maximum points of change in dynamic feature quantity per unit time or the rate of change per unit time is used.
[0012]
For example, 100 ms is one frame and the shift is 50 ms. An average fundamental frequency for each frame is obtained (f0 ′). Similarly, the average power (p ′) for each frame is obtained for the power. Further, the difference between f0 ′ of the current frame and f0 ′ before and after ± i frames is taken as ± Δf0′i (Δ component). Similarly for power, the difference ± Δp′i (Δ component) between p ′ of the current frame and p ′ before and after ± i frames is obtained. f0 ', ± Δf0'i, p', ± Δp'i are normalized. In this standard, for example, f0 ′ and ± Δf0′i are respectively divided by the average fundamental frequency of the entire speech waveform and standardized. Similarly, p ′ and ± Δp′i are divided by the average power of the entire speech waveform that is the target of speech state determination and normalized. In normalization, it may be divided by the average power for each small paragraph and audio paragraph described later. The value of i is, for example, i = 4. Count the number of dynamic measure peaks before and after the current frame ± T1 ms, that is, the number of local maximum changes in the dynamic feature (dp). The difference (−Δdp) between this and the dp of the frame including the time T2 ms before the start time of the current frame is obtained. A difference component (+ Δdp) between the number of dp of ± T1 ms and the dp of the frame including the time after T3 ms of the end time of the current frame is obtained. The values of T1, T2, and T3 are, for example, T1 = T2 = T3 = 450 ms. Let the silence interval before and after the frame be ± ps. In step 1, the values of these parameters are extracted for each frame as speech feature values. These parameters include at least the fundamental frequency (or pitch period), power, time-varying characteristics of dynamic features, or the difference between these frames (Δ component).
[0013]
Next, a method for creating the probability codebook 13 will be briefly described.
The subject listens to a large number of learning voices, and the speech state is labeled as calm and as emphasized.
For example, as a reason for the subject to be in an emphasized state,
(A) The voice is loud and utters so that the nouns and conjunctions are extended
(B) Extend the beginning of the conversation, insist on a topic change, and make a loud voice to gather opinions
(C) When emphasizing important nouns with a loud voice
(D) High pitched but not very loud
(E) When you are laughing and deceiving your true intentions
(F) When the ending of the ending sound is high, asking for consent or asking questions
(G) When the ending voice is loud enough to be strong
(H) The voice is loud and loud, claims to speak and speak louder than the other party
(I) When the voice is small and the feeling is tingling or tingling, the truth or secret that is spoken with a loud voice is spoken, and what is usually important for a loud voice is spoken
Mentioned. In this example, the calm state is not any of the above (a) to (i), and the subject felt that the utterance was calm.
[0014]
For each label section in the calm state and the emphasized state, the speech feature is extracted from the learning speech, a parameter is selected, and a code book is created by the LBG algorithm using the parameters in the label section in the calm state and the emphasized state. To do. A set of the appearance probability in the emphasized state and the appearance probability in the calm state is stored in the codebook 13 for each quantized speech feature (code) obtained in this way. The appearance probability in the emphasized state is only the probability that the voice feature amount appears in the emphasized state regardless of the voice feature amount in the past frame (unigram: described as a single appearance probability), or this and the past frame. Is a combination of a conditional probability that the speech feature amount appears in the emphasized state for each speech feature amount sequence in units of frames from the speech feature amount of the current frame to the speech feature amount of the current frame, and appears in a calm state Similarly, the probability is that only the probability that the speech feature amount appears in a calm state regardless of the speech feature amount in the past frame (unigram: expressed as a single appearance probability), or this, and the speech feature amount in the past frame. For each frame feature amount sequence from the current frame speech feature amount to the current frame speech feature amount, any one of the conditional probabilities and combinations in which the speech feature amount appears in a calm state.
For example, as shown in FIG. 2, in the probability codebook 13, for each code C1, C2,..., The voice feature amount and the single appearance probability are in the emphasized state and the calm state, and the conditional probability is the emphasized state and the calm state. Each is stored as a set.
[0015]
Emphasis judgment
Next, a specific example of the enhancement degree determination unit 14 will be described with reference to FIG. 1B. In this example, the emphasis state or the calm state is determined from the appearance probability for each small paragraph of the speech signal, and the enhancement degree of the speech paragraph is determined based on this.
In order to detect a small paragraph, the voice / silence determination unit 21 determines whether the input voice signal is voiced or unvoiced. If the power of each frame of the input speech signal is less than or equal to a predetermined value, it is determined as a silent interval.
If the time of the silent section sandwiching the voiced section by the small paragraph determination unit 22 is t seconds, for example, 400 ms or more, the sandwiched voiced section is determined to be a small paragraph. Assume that the tail small paragraph detection unit 23 determines that the input audio signal is a small paragraph j−1, j, j + 1 as shown in FIG. 3, for example. The voice sub-paragraph j is composed of n voiced sections, the average power is Pj, and the average of the average power pi of the voice sub-paragraph j, that is, i = n−αth to n-th voiced section is the voice. When it is smaller than the average power Pj of the small paragraph j, that is,
Σpi / (α + 1) <βPj, Σ is the sum of i = n−α to n, and α and β are constants
When this condition is satisfied, this audio sub-paragraph j is detected as the last audio sub-paragraph of the audio paragraph k, and the sub-paragraph group from the end audio sub-paragraph to the last audio sub-paragraph immediately before is detected as the audio paragraph k and the audio paragraph determination unit. 24 determines. For example, α is 3 and β is 0.8. In this way, a group of audio sub-paragraphs between adjacent end audio sub-paragraphs is determined as an audio paragraph with the end audio sub-paragraph as a delimiter.
[0016]
Quantized speech feature values (codes) close to the speech feature values for each frame of each small paragraph are searched from the probability codebook 13, and the appearance probability P (n) in the emphasized state that forms a pair with each code. The appearance probability P (e) in the calm state is taken out by the emphasized state probability calculation unit 25 and the calm state probability calculation unit 26, respectively.
The emphasized state probability calculating unit 25 and the calm state probability calculating unit 26 use the probabilities extracted for each frame, respectively, to calculate the probability that the small paragraph will be in the emphasized state and the probability that the sub-paragraph will be in the calm state. For example, it is assumed that the audio sub-paragraph s is the number of frames Ns, and the code corresponding to the audio feature amount for each frame is Ci1, Ci2,. The probability Ps (e) that the voice sub-paragraph s is in an emphasized state and the probability Ps (n) that it is in a calm state are calculated by the following equations, respectively.
[0017]
Ps (e) = Pemp (Ci3 | Ci1Ci2) ... Pemp (CiNs | Ci (Ns-1) Ci (Ns-2))
Ps (n) = Pnrm (Ci3 | Ci1Ci2) ... Pnrm (CiNs | Ci (Ns-1) Ci (Ns-2))
Here, for example, Pemp (Ci3 | Ci1Ci2) represents the probability that Ci3 appears in an emphasized state next to Ci1Ci2. In order to obtain this Pemp (Ci3 | Ci1Ci2), the individual appearance probability of the emphasized state of Ci3 from the codebook 13, the conditional probability that Ci3 appears in the emphasized state next to Ci2, and Ci3 is next to Ci1, Ci2. It is preferable that the conditional probabilities appearing in the emphasized state are obtained and obtained by linear interpolation. Similarly, the appearance probability in the other emphasized state and the appearance probability in the calm state are preferably obtained by the linear interpolation method. The linear interpolation method is described in, for example, “Spoken Language Processing” (Kenji Kenji et al., Morikita Publishing, p. 29, 1996).
[0018]
In this way, the appearance probability Ps (e) in the emphasized state of the small paragraph s and the appearance probability Ps (n) in the calm state are calculated by the probability calculating units 25 and 26, respectively, and based on these calculation results, the emphasized state determining unit 27, if Ps (e)> Ps (n), the small paragraph s is in an emphasized state, and if Ps (e) <Ps (n), the small paragraph s is determined to be in a calm state. Similarly, the emphasis state determination unit 27 determines whether each small paragraph is in an emphasis state or a calm state.
The enhancement level determination unit 28 determines the enhancement level of the speech paragraph determined by the speech paragraph determination unit 24 based on the result determined by the enhancement state determination unit 27 for all the small paragraphs constituting the speech paragraph. To do. For example, as shown in FIG. 4A, when the degree of emphasis is divided into three levels of 0, 1, and 2, when there is no sub-paragraph determined to be in the emphasized state in the audio paragraph, the degree of emphasis is set to 0, and the emphasis state is determined. The degree of emphasis is 1 when the number of sub-paragraphs is 1 or 2, and the degree of emphasis is 2 when the number is 3 or more.
[0019]
Alternatively, as shown in FIG. 4B, when the determination in the enhancement state determination unit 27 is performed, the appearance probability in the enhancement state is multiplied by the weight 0 ≦ w <ε. When the number of paragraphs is 1 or more, the degree of enhancement is 2, and the number of sub-paragraphs determined to be in the emphasized state in the audio paragraph is 1 when judged in a state multiplied by ε ≦ w <1 / ε Is 1 when the degree of enhancement is 1 and when the number of sub-paragraphs determined to be in the emphasized state in the speech paragraph is 0 or 1 when determined in a state multiplied by 1 / ε ≦ w. ε is a positive real number less than 1, preferably about 0.5 to 0.8.
[0020]
The multiplication of the weight w with respect to the appearance probability in the emphasized state is performed by the multiplier 27a as indicated by a broken line in FIG. 1B, and the value of the weight w is controlled by the emphasis degree determining unit 28.
The enhancement level determined by the enhancement level determination unit 28 is input to the bit rate control signal generation unit 29. The bit rate control signal generation unit 29 is a low bit when the enhancement level is 0 in the example of FIGS. 4A and 4B. A rate control signal, a medium bit rate control signal when the enhancement level is 1, and a high bit rate control signal when the enhancement level is 2, are generated and supplied to the encoding unit 15 in FIG. 1A.
[0021]
In response to the input bit rate control signal, the encoding unit 15 compresses and encodes the input speech signal with the low rate unit 15L at a predetermined low bit rate in the case of the low bit rate control signal, and the medium bit rate control signal In the case, the input audio signal is encoded by the medium rate unit 15M at a predetermined medium bit rate higher than the low bit rate, and in the case of a high bit rate control signal, it is higher than the medium bit rate. The input speech signal is encoded by the high rate unit 15H at the bit rate.
In this example, three encoding units 15L, 15M, and 15H are shown as the encoding unit 15 according to each bit rate. However, this is for convenience, and these may be configured by independent encoding units. However, as in the known example shown in the section of the prior art, coding may be performed by switching, adding, or removing part of the functional configuration according to the bit rate. And can be configured in the same manner as a conventional variable bit rate encoder. Further, the input speech signal is delayed by the buffer storage unit 17 according to the time delay until the bit rate control signal is generated, and supplied to the encoding unit 15.
[0022]
For example, as illustrated in FIG. 4C, the encoding unit 15 adds and outputs a code 32 indicating a coding bit rate of the compression code string 31 prior to the compression code string 31 for the audio signal. This bit rate code 32 is added at least every time the coding bit rate is changed.
According to this encoder, an important part of the semantic content in the audio signal generally contains a small paragraph in an emphasized state, so it is encoded at a high bit rate, and the decoded audio signal is more clearly and possibly Speech that is full of the speaker's emotion can also be restored, and the speech paragraph in a calm state has a low bit rate, so that the coding efficiency can be increased.
[0023]
FIG. 5 shows an example of a decoder that decodes the encoded output from the encoder shown in FIG. 1A. The code sequence from the input terminal 41 is separated into the encoded sequence 31 and the bit rate code 32 by the separation unit 42, and the encoded sequence 31 is supplied to the decoding unit 43. The bit rate code 32 is decoded by the bit rate code decoding unit 44, and in the decoding unit 43, the high bit rate decoding unit 43H, the medium bit rate decoding unit 43M, and the low bit rate decoding unit 43L according to the decoded bit rate control signal. The encoded sequence 31 is decoded by either of the above. This decoding unit 43 corresponds to the encoding unit 15 in FIG. 1A, and each bit rate decoding unit 43H, 43M, 43L is configured independently, or is shared and a part of functional configuration is switched. Added, removed, added. It can be configured in the same way as a conventional variable bit rate decoder. That is, according to the present invention, the higher the enhancement degree, the higher the bit rate, that is, the higher the accuracy of encoding, but the higher the quality of the decoded acoustic signal, the higher the enhancement degree.
[0024]
An example of the procedure of the encoding method of the present invention corresponding to the encoder shown in FIG. 1 will be described with reference to FIG.
When an acoustic signal, in this example, an audio signal is input, it is temporarily stored in the storage unit (S1), and the audio feature amount of the audio signal is extracted for each frame (S2). Also, it is determined whether it is a silent segment or a voiced segment (S3), and a voiced segment sandwiched between silent segments of a predetermined length or longer is extracted as a speech sub-paragraph (S4), and the last speech sub-paragraph in the speech sub-paragraph is detected. Then, an audio paragraph consisting of audio subgroups between adjacent end audio subparagraphs is extracted (S5).
[0025]
The speech feature value for each frame extracted in step S2 is quantized with reference to the probability codebook 13, that is, associated with the code in the probability codebook 13 having the closest distance (S6), the emphasis state of each code, and The independent appearance probabilities in the calm state or these and the conditional appearance probabilities are extracted from the codebook 13, and the appearance state probabilities Ps (e) and Ps (n) in the sound state and the calm state are obtained for each audio sub-paragraph. (S7). Furthermore, it is determined from these appearance probabilities whether each sub-paragraph is in an emphasized state or in a calm state, and for each audio paragraph, depending on the number of sub-paragraphs in the emphasized state or whether the sub-paragraph is in an emphasized state or in a calm state. The emphasis degree of the speech paragraph is determined by giving a weight to the appearance probability in the emphasized state when determining whether or not there is a combination thereof (S8).
[0026]
A bit rate is determined in accordance with the determined degree of enhancement (S9), the audio signal of the corresponding speech paragraph is encoded at the determined bit rate (S10), and the encoded sequence and the code indicating the bit rate of the encoding are encoded. Are output (S11).
Modified example
In the encoding in the encoding unit 15, depending on the encoding method, when the same parameter as the speech feature amount extracted by the feature amount extraction unit 12 is used, the one extracted by the feature amount extraction unit 12 can be used. . Conversely, some or all of the speech feature amount parameters extracted for encoding by the encoding unit 15 may be used as the extraction parameters of the feature amount extracting unit 12. That is, for example, as shown in FIG. 7, the feature quantity is extracted from the input acoustic signal for each frame by the feature quantity extraction unit 12, and the enhancement degree judgment unit 14 uses at least some parameters of the extracted feature quantity. The degree of enhancement for each acoustic paragraph of the acoustic signal is determined, and a bit rate control signal corresponding to this is supplied to the encoding unit 15. The encoding unit 15 encodes the portion of the acoustic paragraph of the input acoustic signal so that the bit rate corresponds to the bit rate control signal input for at least some of the parameters extracted by the feature amount extraction unit 12. The
[0027]
The present invention not only encodes audio signals, but also encodes music paragraphs with high bit rates as well as important music paragraphs, and encodes unimportant portions with low bit rates to increase encoding efficiency. You can also. The reason why the subject felt the emphasis state when creating the probability codebook 13 in the case of music is shown below.
(A) Loud voice and loud voice
(B) Strong voice
(C) High voice and strong accent
(D) Voice is high and voice quality changes
(E) Extend voice and loud
(F) Loud voice, loud voice, strong accent
(G) Loud voice, loud voice, screaming
(H) Voice is loud and accent changes
(I) Extend voice, loud voice, high ending
(J) The voice is high and the voice is extended
(K) Extend voice, scream, high voice
(L) Strong ending
(M) Slowly strengthen
(N) The tone is irregular
(O) The tone is irregular and the voice is high
Next, an example in which the present invention is applied to TwinVQ will be described with reference to FIG. TwinVQ is disclosed in, for example, the magazine Nikkei Electronics 1997, 4.21, (No. 687), pages 181 to 202, and others. The acoustic signal from the input terminal 11 analyzes the signal in the frame prior to the orthogonal transformation, and the transformation length, that is, the number of divisions of the frame is determined by the division selection unit 51. For example, when the frame length is 2048 samples, it is determined whether 2048-point conversion is applied to the frame data once, 512-point conversion is applied four times, or 128-point conversion is applied 16 times. . In accordance with this determination, the acoustic signal is converted into a frequency domain signal (coefficient) by a modified discrete cosine transform for each conversion length determined by the window division MDCT unit 52.
[0028]
Aside from this conversion, the spectral envelope of the acoustic signal is estimated by linear prediction analysis in the LPC analysis unit 53, and the frequency domain coefficients are normalized by the division unit 54 based on the spectral envelope. A periodic peak component (pitch component) is extracted from the low frequency part of the flattened frequency domain coefficient by the pitch analysis unit 55, and this component is subtracted from the output of the division unit 54 by the subtraction unit 56. An envelope obtained by averaging the output of the calculation unit 54 by a non-linear division proportional to the Bark scale by the Bark scale spectrum analysis unit 57 is obtained, and the output of the subtraction unit is divided by the division unit 58 based on this envelope. As a result, fine spectral peaks in the flattened frequency domain coefficients from the division unit 54 are flattened by the pitch component, and further flattened by the division unit 58.
[0029]
The average power of the entire frequency domain coefficient thus flattened is calculated by the power analysis unit 59, and the output of the division unit 58 is normalized by the normal unit 61 with this average power. All of the normalized frequency domain coefficients are interleaved by the interleave dividing unit 62 and divided into sub-vectors, and the weighted vector of the conjugate structure representing the divided normalized frequency domain coefficients as the sum from the two codebooks. Vector quantization is performed by the quantization unit 63. At this time, based on the spectral envelope characteristic from the LPC analysis unit 53, the pitch component from the pitch component analysis unit 55, the averaged spectral envelope from the Bark scale spectrum analysis unit 57, and the power for each subframe from the power analysis unit 59. An auditory psychological model is created by the model generation unit 54, the vector selected by the vector quantization unit 63 is weighted by the auditory psychological model, and the vector quantum is measured with a distance measure that minimizes the distortion of the decoded signal acoustically. To do.
[0030]
In this embodiment of the present invention, the analysis results of the LPC analysis unit 53, the pitch component analysis unit 55, and the power analysis unit 59 are input to the emphasis degree determination unit 14, and at least a part of the input analysis results are used to describe the analysis results. And, based on the input analysis result, extract a small paragraph and an acoustic paragraph, and further refer to the probability codebook 13 to show the appearance probability in the emphasized state and the appearance probability in the calm state for each small paragraph. Is calculated to determine the emphasis state, the calm state, and the enhancement degree of the acoustic paragraph. As described above, the linear prediction coefficient from the LPC analysis unit 53 is obtained from the LPC cepstrum, and the time change characteristic thereof is obtained. The dynamic feature quantity is calculated by the dynamic feature quantity calculation unit 78. This is input to the enhancement degree determination unit 76.
[0031]
In this example, the degree of emphasis is two steps, 0 and 1, and accordingly, the LPC codebooks 65L and 65H are the pitch codebooks 66L and 66H, the spectrum codebooks 67L and 67H are the power codebooks. 68L and 68H are provided, and conjugate structure codebooks 69L and 69H are provided, respectively. The code books 66L, 67L, 68L, and 69L each have a smaller number of code vector elements than the corresponding code books 66H, 67H, 68H, and 69H. The converted code (index) has a smaller number of bits than that of the latter.
[0032]
When the enhancement level determination unit 14 determines that the enhancement level is 0, the codebook switching units 71 to 75 are controlled by the low bit rate control signal from the enhancement level determination unit 14, and the LPC parameter representing the analysis spectrum envelope in the LPC analysis unit 55. Is encoded using the LPC codebook 65L, the pitch component analysis unit 55 encodes the extracted pitch component using the pitch codebook 66L, and the bark scale spectrum analysis unit 57 uses the analysis average envelope for the spectrum. It is encoded using the codebook 67L, the detected average power is encoded using the power codebook 68L in the power analysis unit 59, and is flattened using the conjugate structure codebook 69L in the weighted vector quantization unit 63. Frequency domain coefficients are encoded. These encoded outputs are integrated and output by the synthesis unit 76.
[0033]
When the enhancement level determination unit 14 determines that the enhancement level is 1, the codebook switching units 71 to 75 are switched to the codebooks 65H to 69H side by the bit rate control signal from the enhancement level determination unit 14, respectively. , 55, 57, and 59 respectively use codebooks 65H, 66H, 67H, and 68H, the vector quantization unit 63 performs encoding using the codebook 69H, and these encoded outputs are integrated by the synthesis unit 76. Is output. The synthesizing unit 76 outputs a bit rate code indicating that at the beginning of the integrated code at least every time the bit rate control signal changes. In this example, since the emphasis degree is either 0 or 1, the bit rate code may be 1 bit, and since the emphasis change is either 0 to 1, or 1 to 0, the bit rate control signal changes. The code indicating this may be 1 bit.
[0034]
In the corresponding decoder, in the conventional TwinVQ decoder, the codebooks 65L and 65H, 66L and 66H, 67L and 67H, 68L and 68H, 69L and 69H are provided as the codebook, and the input bits What is necessary is just to decode using either set of these code books according to a rate code.
In this embodiment, as is clear from the above description, the LPC analysis unit 53, the pitch component analysis unit 55, and the power analysis unit 69 extract feature amounts used for enhancement degree determination and encode the input acoustic signal. The feature amount extraction unit 79 is also configured to extract a feature amount for the purpose.
[0035]
In this embodiment, a plurality of codebook groups corresponding to a plurality of predetermined coding bit rates, in this example, a codebook group consisting of codebooks 65L, 66L, 67L, 68L, and 69L, and a codebook 65H, Two codebook groups including a codebook group consisting of 66H, 67H, 68H, and 69H are prepared, and an input acoustic signal is generated using the one codebook group according to the bit rate determined by the enhancement degree determination unit 14. Is an example of encoding. The change of the bit rate by the selection of the code book group is not limited to the case where all the code books used for the encoder are selected according to the bit rate. For example, as shown by the broken line in FIG. May be encoded using one codebook 65 regardless of the change in the bit rate.
[0036]
Next, an example in which the encoding bit rate is changed according to the enhancement degree and the encoding method is changed will be described with reference to FIG. This embodiment is a case in which the encoding method in the “acoustic signal encoding method” disclosed in Japanese Patent Laid-Open No. 8-263830 is changed according to the encoding bit rate determined by the enhancement degree. Briefly stated.
A feature amount is extracted from the input acoustic signal for each frame by the feature extraction unit 79, the enhancement degree determination unit 14 determines the enhancement degree according to the extracted feature amount, and a bit rate control signal corresponding to this is output. The In this example, the degree of enhancement to be determined is either 0 or 1. When the degree of enhancement is determined to be 0, the switching unit 81 is controlled by the bit rate control signal, the input terminal is the first encoding unit, In the example, the input acoustic signal is connected to the CELP encoding unit 82, and the input acoustic signal is input to the inverse filter 83 in the encoding unit 82. The frequency dependence of the spectral envelope of the input acoustic signal is suppressed, and a residual signal is obtained. The residual signal is input to the codebook selection unit 85, and distortion from the adaptive codebook vector, which is a residual signal in the past for each pitch period, is calculated from the adaptive codebook 86, and the adaptive codebook vector that minimizes the distortion is calculated. A pitch period corresponding to is selected. This pitch period, ie the code C corresponding to the selected adaptive codebook vectorbBecomes the encoded output. The decoded residual signal obtained in the previous encoded frame may be generated as an adaptive encoded vector by periodicizing based on various pitch periods. The selected vector is subtracted from the residual signal from the inverse filter 83 by the difference circuit 87, and the residual signal is vector quantized by the time domain vector quantization unit 88 with reference to the fixed codebook 89. The dequantized output and the vector selected by the codebook selection unit 85 are added by the adder circuit 91 to decode (synthesize) the residual signal, which is supplied to the adaptive codebook 86. The filter coefficient of the inverse filter 83 uses the analysis result of the LPC analysis unit 84 in the feature quantity extraction unit 79, and the cutout according to the pitch period length from the adaptive codebook 86 is extracted in the feature quantity extraction unit 79. By performing based on the frequency, the acoustic feature parameter can be commonly used for encoding and enhancement degree determination. The analysis result of the LPC analysis unit 84 is encoded by the LPC quantization unit 92, and the code Ca, C indicating a selection vector from the adaptive codebook selection unit 85b, A code C indicating a selection vector from the quantization unit 88cThe combining unit 76 outputs a code indicating that the CELP encoding unit 82 has been selected (a code indicating the bit rate).
[0037]
On the other hand, when the enhancement degree is determined to be 1, the switching unit 81 is controlled by the bit rate control signal so that the input terminal is connected to the second encoding unit, in this example the TwinVQ encoding unit 94. The input acoustic signal is converted into a frequency domain signal by the MDCT unit 52 in the encoding unit 94, and the frequency domain coefficients are divided, and thereby normalized, by the quantized frequency domain spectral envelope in the division circuit 54. The resulting frequency domain residual coefficients are divided, and thereby normalized, by the inter-frame prediction spectrum from the inter-frame prediction unit 95 in the division circuit 58, further flattened (yielding the frequency domain fine structure), and vector-quantized by the frequency domain quantization unit 96, which outputs the dequantized fine structure and the fine structure quantization code Ce. The dequantized fine structure is multiplied by the inter-frame prediction spectrum from the inter-frame prediction unit 95 in the multiplier 97 to decode the residual coefficients, which are input to the inter-frame prediction unit 95; this unit outputs the inter-frame prediction spectrum and the quantization code Cf of the inter-frame prediction coefficients. These codes Ce and Cf, the LPC quantization code Ca from the LPC quantization unit 92, and a code indicating that the TwinVQ encoding unit 94 has been selected (a code indicating the bit rate) are combined and output by the combining unit 76.
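The two-stage normalization in this TwinVQ branch (division by the quantized spectral envelope in circuit 54, then by the inter-frame prediction spectrum in circuit 58) can be sketched as below; the inputs and the epsilon guard are assumptions, and a real coder would also interleave and vector-quantize the flattened coefficients:

```python
import numpy as np

def flatten_spectrum(mdct_coeffs, envelope, predicted, eps=1e-12):
    """Divide MDCT coefficients by the quantized spectral envelope
    (circuit 54), then divide the residual coefficients by the
    inter-frame prediction spectrum (circuit 58), leaving the fine
    structure that unit 96 vector-quantizes."""
    residual = mdct_coeffs / np.maximum(envelope, eps)
    return residual / np.maximum(np.abs(predicted), eps)

def rebuild_residual(fine_structure_hat, predicted):
    """Decoder-side counterpart (multiplier 97): multiply the
    dequantized fine structure back into residual coefficients,
    which feed the inter-frame prediction unit 95."""
    return fine_structure_hat * np.abs(predicted)
```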
[0038]
Also in this case, the LPC coefficients extracted by the feature extraction unit 79 are used both for enhancement degree determination and for encoding. In addition, although not shown in FIG. 9, the power extracted by the feature extraction unit 79 can also be used for the normalization performed in the frequency domain quantization unit 96.
The number of enhancement levels to be determined may also be increased. For example, enhancement degree 0 may be divided into two stages, and a fixed codebook 89' of a size different from that of the fixed codebook 89 (that is, accommodating a different number of fixed vectors) may be provided, so that encoding uses one of the fixed codebooks 89 and 89' in one of the two stages and the other codebook in the other stage. Similarly, a plurality of codebooks accommodating different numbers of vectors may be prepared for the frequency domain quantization unit 96, enhancement degree 1 may be divided into a plurality of stages, and encoding by the frequency domain quantization unit 96 may use a codebook accommodating more vectors the higher the enhancement degree.
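How stages might map to fixed codebooks of different sizes is sketched below; the stage count, the codebook sizes, and the bits-per-code accounting are all invented for illustration:

```python
import numpy as np

# Hypothetical fixed codebooks: more accommodated vectors for higher
# stages, in the spirit of codebooks 89 and 89' (sizes illustrative).
FIXED_CODEBOOKS = {
    0: np.random.rand(128, 40),   # enhancement degree 0, lower stage
    1: np.random.rand(256, 40),   # enhancement degree 0, upper stage
    2: np.random.rand(512, 40),   # enhancement degree 1, lower stage
    3: np.random.rand(1024, 40),  # enhancement degree 1, upper stage
}

def pick_fixed_codebook(stage):
    """A larger codebook costs more bits per code (log2 of its size),
    so higher stages raise the bit rate."""
    cb = FIXED_CODEBOOKS[stage]
    return cb, int(np.log2(len(cb)))
```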
[0039]
In the examples described above, the degree of enhancement is determined by multiplying the appearance probability in the emphasized state by the weight w; instead, the determination may be made by multiplying the appearance probability in the calm state by the weight w. In that case, for example, as shown by the broken-line frame in FIG. 1A, the weight w applied by the multiplication unit 27b is controlled, and the degree of enhancement is determined by whether sub-paragraphs judged to be in the emphasized state remain even when the weight w is made large. As another technique for determining the degree of enhancement, for example as indicated by the broken line in FIG. 1, the ratio between the emphasized-state appearance probability P(n) obtained by the emphasized state probability calculation unit 25 and the calm-state appearance probability P(e) obtained by the calm state probability calculation unit 26 may be computed by the division unit 30 as P(e)/P(n) or P(n)/P(e), and the degree of enhancement determined according to the magnitude of this ratio; for example, when P(e)/P(n) is used, the degree of enhancement is made higher the smaller the ratio. In the illustrated example, the degree of enhancement is determined for each sub-paragraph and the corresponding bit rate is determined. Furthermore, only the emphasized-state appearance probability P(n) may be used: if this value is equal to or greater than a preset value, the sub-paragraph may be judged to be in the emphasized state.
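The three decision variants in this paragraph (the weighted comparison, the probability ratio, and the threshold on the emphasized-state probability alone) condense into one small sketch. The labels follow the text, with P(n) the emphasized-state probability from unit 25 and P(e) the calm-state probability from unit 26; the default weight and threshold values are assumptions:

```python
def decide_emphasis(p_n, p_e, w=1.0, mode="ratio", threshold=0.5):
    """Judge whether a sub-paragraph is emphasized. p_n is the
    appearance probability in the emphasized state, p_e in the
    calm state.

    mode "weight":    emphasized when p_n exceeds the weighted p_e
    mode "ratio":     emphasized when p_e / p_n is small enough
    mode "threshold": emphasized when p_n alone is large enough
    """
    if mode == "weight":
        return p_n > w * p_e
    if mode == "ratio":
        return p_n > 0 and (p_e / p_n) < threshold
    return p_n >= threshold
```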
[0040]
The degree of enhancement may be determined for any unit consisting of one or more frames, that is, for each acoustic paragraph, each sub-paragraph, each voiced section (voiced paragraph), or each frame, or for every fixed interval such as 10 or 5 seconds.
When encoding a video signal closely tied to an audio signal, as in television, the audio signal may be encoded at the bit rate corresponding to the degree of enhancement while the video signal is encoded with its bit rate varied in the same way.
In the above description, the coding bit rate may be controlled more finely by dividing the degree of enhancement into four or more levels, as sketched below. Furthermore, the sub-paragraphs and the acoustic paragraphs may each have a fixed length.
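Finer control is then just a larger table from enhancement level to bit rate; a sketch with invented rates:

```python
# Hypothetical mapping from enhancement degree to coding bit rate
# in kbit/s; finer control means more rows in this table.
BIT_RATE_TABLE = {0: 4.0, 1: 6.0, 2: 8.0, 3: 12.0}

def bit_rate_for(degree):
    """Clamp to the highest defined level so out-of-range degrees
    still map to a valid rate."""
    return BIT_RATE_TABLE[min(degree, max(BIT_RATE_TABLE))]
```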
[0041]
The encoder of the present invention, for example the one shown in FIG. 1, may be realized by causing a computer to execute a program. In that case, for example, an encoding program for executing each procedure of the encoding method shown in FIG. 6 on a computer is installed in the program memory of the computer from a recording medium such as a CD-ROM or a flexible magnetic disk, or downloaded through a communication line, and executed by the computer.
[0042]
【The invention's effect】
As described above, according to the present invention, efficient coding can be performed by raising the bit rate for the portions of the audio signal whose content is important or is to be emphasized, and lowering the bit rate for the other portions.
[Brief description of the drawings]
FIG. 1A is a diagram showing the functional configuration of an example of an encoder according to the present invention, and FIG. 1B is a diagram showing a specific functional configuration example of the enhancement degree determination unit 14;
FIG. 2 is a diagram showing a storage example of a probability codebook 13 in FIG. 1;
FIG. 3 is a diagram for explaining an audio sub-paragraph, a paragraph-ending audio sub-paragraph, and an audio paragraph;
FIGS. 4A and 4B are diagrams illustrating an example of an enhancement degree determination table, and FIG. 4C is a diagram illustrating an example of an output code format;
FIG. 5 is a diagram illustrating a functional configuration example of a decoder corresponding to the encoder of FIG. 1A.
FIG. 6 is a flowchart showing an example of a processing procedure of the encoding method of the present invention.
FIG. 7 is a diagram showing a functional configuration example of another embodiment of the present invention.
FIG. 8 is a diagram showing a functional configuration example of an encoder in which the present invention is applied to TwinVQ.
FIG. 9 is a diagram showing a functional configuration example of still another embodiment of the present invention.

Claims (9)

1. An acoustic signal encoding method comprising:
using a codebook that stores a speech feature vector, composed of a set of features including at least one of the following six: the fundamental frequency, the power, the temporal variation characteristic of a dynamic feature, the inter-frame difference of the fundamental frequency, the inter-frame difference of the power, and the inter-frame difference of the temporal variation characteristic of the dynamic feature, in association with the appearance probability of that speech feature vector in an emphasized state and its appearance probability in a calm state;
analyzing an acoustic signal for each frame to obtain the speech features;
determining for each frame whether the acoustic signal is in a silent section and whether it is in a voiced section;
determining as a speech sub-paragraph a portion that includes a voiced section and is bounded by silent sections each of at least a predetermined number of frames, and determining as a speech paragraph a group of speech sub-paragraphs ending with a speech sub-paragraph in which the average power of the included voiced sections is smaller than a predetermined constant multiple of the average power within that sub-paragraph;
quantizing the set of speech features of each frame of each speech sub-paragraph to obtain a code, and obtaining from the codebook the emphasized-state appearance probability and the calm-state appearance probability of the speech feature vector corresponding to that code;
calculating the probability that a speech sub-paragraph is in the emphasized state, using the emphasized-state appearance probabilities of the speech feature vectors of its frames;
calculating the probability that the speech sub-paragraph is in the calm state, using the calm-state appearance probabilities of the speech feature vectors of its frames;
judging as emphasized a speech sub-paragraph whose probability of being in the emphasized state is higher than its probability of being in the calm state;
compression-encoding each speech paragraph at a bit rate that is higher the larger the number of its speech sub-paragraphs judged to be emphasized; and
outputting the code obtained by the encoding and a code indicating the bit rate or a change in the bit rate.
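For readers tracing claim 1's segmentation steps, the sketch below implements the silence-bounded sub-paragraph rule and the paragraph-ending power test. Per-frame silence flags and power arrays are assumed to be given, and the thresholds are illustrative only:

```python
import numpy as np

def find_sub_paragraphs(is_silent, min_silence_frames=30):
    """Split frames into speech sub-paragraphs: runs of frames bounded
    by silent runs of at least min_silence_frames frames."""
    subs, start, silent_run = [], None, 0
    for i, silent in enumerate(is_silent):
        if silent:
            silent_run += 1
            if start is not None and silent_run >= min_silence_frames:
                subs.append((start, i - silent_run + 1))  # end is exclusive
                start = None
        else:
            if start is None:
                start = i
            silent_run = 0
    if start is not None:
        subs.append((start, len(is_silent)))
    return subs

def ends_speech_paragraph(voiced_power, sub_power, c=0.9):
    """Claim 1's ending test: the sub-paragraph closes a speech
    paragraph when the average power of its voiced sections falls
    below a constant multiple of its overall average power."""
    return float(np.mean(voiced_power)) < c * float(np.mean(sub_power))
```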
2. The acoustic signal encoding method according to claim 1, wherein the codebook includes at least the temporal variation characteristic of the dynamic feature or the inter-frame difference of the temporal variation characteristic of the dynamic feature,
the process of analyzing the acoustic signal includes calculating linear prediction coefficients from the acoustic signal and obtaining the dynamic feature from the linear prediction coefficients, and
the linear prediction coefficients are used to determine the code in the compression encoding process.
3. The acoustic signal encoding method according to claim 1, wherein a plurality of codebook groups, each storing a number of speech-feature/code pairs corresponding to each bit rate, are prepared as the codebooks used in the compression encoding process,
in the compression encoding process, one codebook group corresponding to the determined bit rate is selected and used to determine the code corresponding to each extracted speech feature as the compression-encoded code, and
a code indicating the selected codebook group is used as the code indicating the bit rate or the change in the bit rate.
4. The acoustic signal encoding method according to claim 1, wherein in the compression encoding process, an encoding method corresponding to the determined bit rate is selected and the acoustic signal is encoded using the selected encoding method to obtain the compression-encoded code, and
a code indicating the selected encoding method is used as the code indicating the bit rate or the change in the bit rate.
5. An acoustic signal encoder comprising:
a codebook that stores a speech feature vector, composed of a set of features including at least one of the following six: the fundamental frequency, the power, the temporal variation characteristic of a dynamic feature, the inter-frame difference of the fundamental frequency, the inter-frame difference of the power, and the inter-frame difference of the temporal variation characteristic of the dynamic feature, in association with the appearance probability of that speech feature vector in an emphasized state and its appearance probability in a calm state;
a feature extraction unit that analyzes an acoustic signal for each frame to obtain the speech features;
voiced/silence determination means that determines for each frame whether the acoustic signal is in a silent section and whether it is in a voiced section;
sub-paragraph determination means that determines as a speech sub-paragraph a portion that includes a voiced section and is bounded by silent sections each of at least a predetermined number of frames;
speech paragraph determination means that determines as a speech paragraph a group of speech sub-paragraphs ending with a speech sub-paragraph in which the average power of the included voiced sections is smaller than a predetermined constant multiple of the average power within that sub-paragraph;
probability calculation means that quantizes the set of speech features of each frame of each speech sub-paragraph to obtain a code, and obtains from the codebook the emphasized-state appearance probability and the calm-state appearance probability of the speech feature vector corresponding to that code;
an emphasized state determination unit that calculates the probability that a speech sub-paragraph is in the emphasized state, using the emphasized-state appearance probabilities of the speech feature vectors of the frames in that sub-paragraph, calculates the probability that the sub-paragraph is in the calm state, using the calm-state appearance probabilities of those vectors, and judges as emphasized a speech sub-paragraph whose probability of being in the emphasized state is higher than its probability of being in the calm state;
a bit rate control signal generation unit that generates, for each speech paragraph, a bit rate control signal specifying compression encoding at a bit rate that is higher the larger the number of its speech sub-paragraphs judged to be emphasized; and
an encoding unit that, in accordance with the bit rate control signal, encodes the acoustic signal at the predetermined bit rate corresponding to that signal and outputs a code indicating the bit rate or a change in the bit rate.
6. The acoustic signal encoder according to claim 5, wherein the codebook includes at least the temporal variation characteristic of the dynamic feature or the inter-frame difference of the temporal variation characteristic of the dynamic feature,
the feature extraction unit has means for calculating linear prediction coefficients from the acoustic signal and obtaining the dynamic feature from the linear prediction coefficients, and
the encoding unit uses the linear prediction coefficients to determine the code in the compression encoding process.
7. The acoustic signal encoder according to claim 5, wherein the encoding unit is provided with a plurality of codebook groups, each storing a number of speech-feature/code pairs corresponding to each bit rate, as the codebooks used in the compression encoding process, and
in the compression encoding process the encoding unit selects and uses one codebook group corresponding to the determined encoding bit rate, determines the code corresponding to each extracted speech feature as the compression-encoded code, and uses a code indicating the selected codebook group as the code indicating the bit rate or the change in the bit rate.
8. The acoustic signal encoder according to claim 5, wherein in the compression encoding process the encoding unit selects an encoding method corresponding to the determined bit rate, encodes the acoustic signal using the selected encoding method to obtain the compression-encoded code, and uses a code indicating the selected encoding method as the code indicating the bit rate or the change in the bit rate.
9. An encoding program for causing a computer to execute each procedure of the acoustic signal encoding method according to any one of claims 1 to 4.
JP2002124540A 2002-04-25 2002-04-25 Acoustic signal encoding method, encoder and program thereof Expired - Fee Related JP3803306B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2002124540A JP3803306B2 (en) 2002-04-25 2002-04-25 Acoustic signal encoding method, encoder and program thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2002124540A JP3803306B2 (en) 2002-04-25 2002-04-25 Acoustic signal encoding method, encoder and program thereof

Publications (2)

Publication Number Publication Date
JP2003316398A JP2003316398A (en) 2003-11-07
JP3803306B2 true JP3803306B2 (en) 2006-08-02

Family

ID=29539554

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2002124540A Expired - Fee Related JP3803306B2 (en) 2002-04-25 2002-04-25 Acoustic signal encoding method, encoder and program thereof

Country Status (1)

Country Link
JP (1) JP3803306B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4580190B2 (en) * 2004-05-31 2010-11-10 日本電信電話株式会社 Audio processing apparatus, audio processing method and program thereof
JP5086366B2 (en) * 2007-10-26 2012-11-28 パナソニック株式会社 Conference terminal device, relay device, and conference system
WO2010008173A2 (en) * 2008-07-14 2010-01-21 한국전자통신연구원 Apparatus for signal state decision of audio signal
KR101230183B1 (en) 2008-07-14 2013-02-15 광운대학교 산학협력단 Apparatus for signal state decision of audio signal

Also Published As

Publication number Publication date
JP2003316398A (en) 2003-11-07

Similar Documents

Publication Publication Date Title
JP3707116B2 (en) Speech decoding method and apparatus
KR100566713B1 (en) Speech parameter coding and decoding methods, coder and decoder, and programs, and speech coding and decoding methods, coder and decoder, and programs
US5749065A (en) Speech encoding method, speech decoding method and speech encoding/decoding method
JP3680380B2 (en) Speech coding method and apparatus
JPH1091194A (en) Method of voice decoding and device therefor
KR20020052191A (en) Variable bit-rate celp coding of speech with phonetic classification
JPH096397A (en) Voice signal reproducing method, reproducing device and transmission method
JP3803311B2 (en) Voice processing method, apparatus using the method, and program thereof
JP2006171751A (en) Speech coding apparatus and method therefor
CN101609681B (en) Coding method, coder, decoding method and decoder
CA2671068C (en) Multicodebook source-dependent coding and decoding
KR100480341B1 (en) Apparatus for coding wide-band low bit rate speech signal
JP3803306B2 (en) Acoustic signal encoding method, encoder and program thereof
JP4281131B2 (en) Signal encoding apparatus and method, and signal decoding apparatus and method
JP4256393B2 (en) Voice processing method and program thereof
JP3353852B2 (en) Audio encoding method
JP3237178B2 (en) Encoding method and decoding method
JP3353267B2 (en) Audio signal conversion encoding method and decoding method
JP3268750B2 (en) Speech synthesis method and system
Possemiers et al. Evaluating deep learned voice compression for use in video games
JP3916934B2 (en) Acoustic parameter encoding, decoding method, apparatus and program, acoustic signal encoding, decoding method, apparatus and program, acoustic signal transmitting apparatus, acoustic signal receiving apparatus
JP3348759B2 (en) Transform coding method and transform decoding method
JP4489371B2 (en) Method for optimizing synthesized speech, method for generating speech synthesis filter, speech optimization method, and speech optimization device
JP3006790B2 (en) Voice encoding / decoding method and apparatus
JP3024467B2 (en) Audio coding device

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20040227

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20050719

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20050816

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20051014

RD03 Notification of appointment of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7423

Effective date: 20051014

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A132

Effective date: 20051129

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20060127

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20060418

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20060502

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090512

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100512

Year of fee payment: 4


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110512

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120512

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130512

Year of fee payment: 7

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140512

Year of fee payment: 8

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

LAPS Cancellation because of no payment of annual fees