JP3907906B2

JP3907906B2 - Speech coding apparatus and speech decoding apparatus

Info

Publication number: JP3907906B2
Application number: JP2000049867A
Authority: JP
Inventors: 正山浦; 裕久田崎
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2000-02-25
Filing date: 2000-02-25
Publication date: 2007-04-18
Anticipated expiration: 2020-02-25
Also published as: JP2001242898A

Abstract

PROBLEM TO BE SOLVED: When the position candidates for each excitation number are distributed uniformly within a frame, the bit rate can be lowered only by reducing the number of pulses or by thinning out the position candidates for each excitation number at uniform intervals. A driving excitation signal in which pulses are locally concentrated therefore cannot be generated, and the coding characteristics deteriorate. SOLUTION: A combination of excitation positions is selected from excitation position combination tables 41, 51, which describe information indicating those combinations, among all combinations of a plurality of excitation positions, whose usage frequency is higher than a reference frequency. The excitation information of the input speech is then encoded or decoded using the selected combination of excitation positions.

Description

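【０００１】
【Technical field of the invention】
This invention relates to a speech coding apparatus that compresses a digital speech signal into a small amount of information, and to a speech decoding apparatus that decodes the speech code generated by such a speech coding apparatus or the like to reproduce the digital speech signal.
【０００２】
【Prior art】
In many conventional speech coding and decoding apparatuses, the speech coding apparatus separates the input speech into spectral envelope information and excitation information and encodes each in frame units of a predetermined length to generate a speech code. The speech decoding apparatus decodes this speech code and combines the spectral envelope information and the excitation information with a synthesis filter to generate decoded speech. The most representative apparatuses of this kind use Code-Excited Linear Prediction (CELP).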
【０００３】
図１７は従来のＣＥＬＰ方式を用いる音声符号化装置を示す構成図であり、図において、１は入力音声を分析して、その入力音声のスペクトル包絡情報である線形予測係数を抽出する線形予測分析手段、２は線形予測分析手段１により抽出された線形予測係数を符号化する線形予測係数符号化手段、３は線形予測係数符号化手段２により量子化された線形予測係数を用いて仮の合成音を生成し、仮の合成音と入力音声の距離が最小になる適応音源符号を選択して多重化手段６に出力するとともに、その適応音源符号に対応する適応音源信号（過去の所定長の音源信号が周期的に繰り返された時系列ベクトル）をゲイン符号化手段５に出力する適応音源符号化手段、４は線形予測係数符号化手段２により量子化された線形予測係数を用いて仮の合成音を生成し、仮の合成音と符号化対象信号（入力音声から適応音源信号による合成音を差し引いた信号）の距離が最小になる駆動音源符号を選択して多重化手段６に出力するとともに、その駆動音源符号に対応する時系列ベクトルである駆動音源信号をゲイン符号化手段５に出力する駆動音源符号化手段である。
【０００４】
５は適応音源符号化手段３から出力された適応音源信号と駆動音源符号化手段４から出力された駆動音源信号にゲインベクトルの各要素を乗算し、各乗算結果を相互に加算して音源信号を生成する一方、線形予測係数符号化手段２により量子化された線形予測係数を用いて、その音源信号から仮の合成音を生成し、仮の合成音と入力音声の距離が最小になるゲイン符号を選択して多重化手段６に出力するゲイン符号化手段、６は線形予測係数符号化手段２により符号化された線形予測係数の符号と、適応音源符号化手段３から出力された適応音源符号と、駆動音源符号化手段４から出力された駆動音源符号と、ゲイン符号化手段５から出力されたゲイン符号とを多重化して、音声符号を出力する多重化手段である。
【０００５】
図１８は従来のＣＥＬＰ方式を用いる音声復号化装置を示す構成図であり、図において、１１は音声符号化装置から出力された音声符号を分離して、線形予測係数の符号を線形予測係数復号化手段１２に出力し、適応音源符号を適応音源復号化手段１３に出力し、駆動音源符号を駆動音源復号化手段１４に出力し、ゲイン符号をゲイン復号化手段１５に出力する分離手段、１２は分離手段１１から出力された線形予測係数を復号化し、その復号結果を合成フィルタ１９のフィルタ係数に変換して、そのフィルタ係数を合成フィルタ１９に出力する線形予測係数復号化手段である。
【０００６】
１３は分離手段１１から出力された適応音源符号に対応する適応音源信号（過去の音源信号が周期的に繰り返された時系列ベクトル）を出力する適応音源復号化手段、１４は分離手段１１から出力された駆動音源符号に対応する時系列ベクトルである駆動音源信号を出力する駆動音源復号化手段、１５は分離手段１１から出力されたゲイン符号に対応するゲインベクトルを出力するゲイン復号化手段である。
【０００７】
１６はゲイン復号化手段１５から出力されたゲインベクトルの要素を適応音源復号化手段１３から出力された適応音源信号に乗算する乗算器、１７はゲイン復号化手段１５から出力されたゲインベクトルの要素を駆動音源復号化手段１４から出力された駆動音源信号に乗算する乗算器、１８は乗算器１６の乗算結果と乗算器１７の乗算結果を加算して音源信号を生成する加算器、１９は加算器１８により生成された音源信号に対する合成フィルタリング処理を実行して出力音声を生成する合成フィルタである。
【０００８】
次に動作について説明する。
従来の音声符号化装置及び音声復号化装置では、５〜５０ｍｓ程度を１フレームとして、フレーム単位で処理を行う。
【０００９】
まず、音声符号化装置の線形予測分析手段１は、音声を入力すると、その入力音声を分析して、音声のスペクトル包絡情報である線形予測係数を抽出する。
線形予測係数符号化手段２は、線形予測分析手段１が線形予測係数を抽出すると、その線形予測係数を符号化し、その符号を多重化手段６に出力する。また、その符号に対応する量子化された線形予測係数を適応音源符号化手段３，駆動音源符号化手段４及びゲイン符号化手段５に出力する。
【００１０】
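The adaptive excitation coding means 3 incorporates an adaptive excitation codebook that stores past excitation signals of a predetermined length. For each adaptive excitation code generated internally (an adaptive excitation code is a binary value of several bits), it generates a time-series vector in which the past excitation signal is repeated periodically.
Next, after multiplying each time-series vector by an appropriate gain, it passes each vector through a synthesis filter that uses the linear prediction coefficients quantized by the linear prediction coefficient coding means 2, thereby generating a tentative synthesized speech.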
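The periodic repetition of the past excitation can be sketched as follows. This is only an illustration under stated assumptions: the function name `adaptive_vector` and the use of an integer lag are hypothetical, and the source does not specify how the adaptive excitation code maps to a lag.

```python
def adaptive_vector(past_excitation, lag, frame_len):
    """Repeat the last `lag` samples of the past excitation to fill one frame.

    Assumption: `lag` is an integer number of samples (the pitch period
    candidate associated with one adaptive excitation code).
    """
    if not 0 < lag <= len(past_excitation):
        raise ValueError("lag must lie within the stored past excitation")
    segment = past_excitation[-lag:]
    # When lag < frame_len the segment wraps around, giving a periodic vector.
    return [segment[n % lag] for n in range(frame_len)]
```

For a lag shorter than the frame, the same segment is repeated; for a lag at least as long as the frame, only the start of the segment is used.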
【００１１】
そして、適応音源符号化手段３は、仮の合成音と入力音声との距離を調査し、この距離を最小とする適応音源符号を選択して多重化手段６に出力するとともに、その選択した適応音源符号に対応する時系列ベクトルを適応音源信号として、ゲイン符号化手段５に出力する。また、入力音声から適応音源信号による合成音を差し引いた信号を符号化対象信号として、駆動音源符号化手段４に出力する。
【００１２】
駆動音源符号化手段４は、非雑音的又は雑音的な複数の時系列ベクトルである駆動符号ベクトルを格納する駆動音源符号帳を内蔵し、内部で発生させる各駆動音源符号（駆動音源符号は数ビットの２進数値で示される）に応じて、その駆動音源符号帳から時系列ベクトルの読み出しを順次実行する。
次に、各時系列ベクトルに適切なゲインを乗じた後、線形予測係数符号化手段２により量子化された線形予測係数を用いる合成フィルタに各時系列ベクトルを通すことにより、仮の合成音を生成する。
【００１３】
そして、駆動音源符号化手段４は、仮の合成音と、適応音源符号化手段３から出力された符号化対象信号との距離を調査し、この距離を最小とする駆動音源符号を選択して多重化手段６に出力するとともに、その選択した駆動音源符号に対応する時系列ベクトルを駆動音源信号として、ゲイン符号化手段５に出力する。
【００１４】
ゲイン符号化手段５は、ゲインベクトルを格納するゲイン符号帳を内蔵し、内部で発生させる各ゲイン符号（ゲイン符号は数ビットの２進数値で示される）に応じて、そのゲイン符号帳からゲインベクトルの読み出しを順次実行する。
そして、各ゲインベクトルの要素を、適応音源符号化手段３から出力された適応音源信号と、駆動音源符号化手段４から出力された駆動音源信号にそれぞれ乗算し、各乗算結果を相互に加算して音源信号を生成する。
次に、その音源信号を線形予測係数符号化手段２により量子化された線形予測係数を用いる合成フィルタに通すことにより、仮の合成音を生成する。
【００１５】
そして、ゲイン符号化手段５は、仮の合成音と入力音声との距離を調査し、この距離を最小とするゲイン符号を選択して多重化手段６に出力する。また、そのゲイン符号に対応する音源信号を適応音源符号化手段３に出力する。これにより、適応音源符号化手段３は、ゲイン符号化手段５により選択されたゲイン符号に対応する音源信号を用いて、内蔵する適応音源符号帳の更新を行う。
【００１６】
多重化手段６は、線形予測係数符号化手段２により符号化された線形予測係数と、適応音源符号化手段３から出力された適応音源符号と、駆動音源符号化手段４から出力された駆動音源符号と、ゲイン符号化手段５から出力されたゲイン符号とを多重化し、その多重化結果である音声符号を音声復号化装置に出力する。
【００１７】
音声復号化装置の分離手段１１は、音声符号化装置が音声符号を出力すると、その音声符号を分離して、線形予測係数の符号を線形予測係数復号化手段１２に出力し、適応音源符号を適応音源復号化手段１３に出力し、駆動音源符号を駆動音源復号化手段１４に出力し、ゲイン符号をゲイン復号化手段１５に出力する。線形予測係数復号化手段１２は、分離手段１１から線形予測係数の符号を受けると、その符号を復号化し、その復号結果を合成フィルタ１９のフィルタ係数に変換して、そのフィルタ係数を合成フィルタ１９に出力する。
【００１８】
適応音源復号化手段１３は、過去の所定長の音源信号を記憶する適応音源符号帳を内蔵し、分離手段１１から出力された適応音源符号に対応する適応音源信号（過去の音源信号が周期的に繰り返された時系列ベクトル）を出力する。
また、駆動音源復号化手段１４は、非雑音的又は雑音的な複数の時系列ベクトルである駆動符号信号を格納する駆動音源符号帳を内蔵し、分離手段１１から出力された駆動音源符号に対応する駆動音源信号を出力する。
ゲイン復号化手段１５は、ゲインベクトルを格納するゲイン符号帳を内蔵し、分離手段１１から出力されたゲイン符号に対応するゲインベクトルを出力する。
【００１９】
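The adaptive excitation signal output from the adaptive excitation decoding means 13 and the driving excitation signal output from the driving excitation decoding means 14 are then multiplied by the elements of the gain vector in the multipliers 16 and 17, and the outputs of the multipliers 16 and 17 are added together by the adder 18.
【００２０】
The synthesis filter 19 performs synthesis filtering on the excitation signal, which is the sum produced by the adder 18, to generate the output speech. The linear prediction coefficients decoded by the linear prediction coefficient decoding means 12 are used as the filter coefficients.
Finally, the adaptive excitation decoding means 13 updates its internal adaptive excitation codebook using this excitation signal.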
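The decoder path through the multipliers 16 and 17, the adder 18, and the synthesis filter 19 can be sketched as below. This is a minimal illustration, assuming an all-pole filter 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and zero initial filter state; the function names and the sign convention of the coefficients are assumptions, not taken from the source.

```python
def mix_excitation(adaptive, driving, gain_vector):
    """Multipliers 16/17 and adder 18: weighted sum of the two excitations."""
    g_adaptive, g_driving = gain_vector
    return [g_adaptive * a + g_driving * d for a, d in zip(adaptive, driving)]

def synthesis_filter(excitation, lpc):
    """All-pole filter: s[n] = e[n] - sum_k lpc[k-1] * s[n-k] (zero initial state)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out

def decode_frame(adaptive, driving, gain_vector, lpc):
    return synthesis_filter(mix_excitation(adaptive, driving, gain_vector), lpc)
```

A real decoder would carry the filter state across frames; that is omitted here to keep the sketch short.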
【００２１】
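Next, conventional techniques that improve on the speech coding and decoding apparatuses described above are explained.
片岡章俊、林伸二、守谷健弘、栗原祥子、間野一則, "CS-ACELPの基本アルゴリズム" (Basic algorithm of CS-ACELP), NTT R&D, Vol. 45, pp. 325-330, April 1996 (Document 1) discloses a CELP-type speech coding apparatus and speech decoding apparatus that introduce a pulse excitation for coding the driving excitation, mainly to reduce computation and memory. In this conventional configuration, the driving excitation is represented only by the position information and polarity information of a few pulses. Such an excitation is called an algebraic excitation; it gives good coding characteristics for its simple structure and has been adopted in many recent standards.
【００２２】
Fig. 19 is an excitation position table showing the pulse position candidates used in Document 1. It is installed in the driving excitation coding means 4 of the speech coding apparatus of Fig. 17 and in the driving excitation decoding means 14 of the speech decoding apparatus of Fig. 18.
In Document 1, the excitation coding frame length is 40 samples and the driving excitation consists of four pulses. As shown in Fig. 19, the position candidates for the pulses of excitation numbers 1 to 3 are each restricted to 8 positions, so each pulse position can be coded with 3 bits. The pulse of excitation number 4 is restricted to 16 positions, so its position can be coded with 4 bits. Constraining the pulse position candidates in this way reduces the number of coding bits, and reduces computation by reducing the number of combinations, while limiting the degradation of coding characteristics.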
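As a quick check of these bit counts, the sketch below derives the per-pulse bit widths and the total number of position combinations from the candidate counts stated for Document 1 (8, 8, 8 and 16). The names `CANDIDATES_PER_PULSE`, `position_bits`, and `total_combinations` are illustrative and do not come from the source.

```python
import math

# Number of position candidates for excitation numbers 1..4, as stated in the text.
CANDIDATES_PER_PULSE = (8, 8, 8, 16)

def position_bits(candidates=CANDIDATES_PER_PULSE):
    """Bits needed to index each pulse's candidate list."""
    return [(c - 1).bit_length() for c in candidates]

def total_combinations(candidates=CANDIDATES_PER_PULSE):
    """Number of distinct position combinations across all pulses."""
    return math.prod(candidates)
```

The widths come out as 3, 3, 3 and 4 bits, which sum to the 13 bits per combination and the 8192 total combinations that the description relies on later.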
【００２３】
この代数的音源の品質を維持して低ビットレート化を図る構成が、大室、間野「高速パルス探索型４ｋｂｉｔ／ｓ音声符号化」日本音響学会、１９９９年春季研究発表会講演論文集Ｉ、２１１〜２１２頁（文献２）に開示されている。
【００２４】
図２０は文献２で用いられているものと同様なパルス音源の位置候補と極性を示す音源位置・極性テーブルである。これは効率的に符号化ビット数を削減するために、隣接する音源位置における極性が反対となるように、音源の位置候補毎に採り得る極性に制約を与えるものである。
【００２５】
また、別の代数的音源の品質を改善する構成が、ＴａｄａｓｈｉＡｍａｄａ，ＫｉｍｉｏＭｉｓｅｋｉａｎｄＭａｓａｍｉＡｋａｍｉｎｅ “ＣＥＬＰｓｐｅｅｃｈｃｏｄｉｎｇｂａｓｅｄｏｎａｎａｄａｐｔｉｖｅｐｕｌｓｅｐｏｓｉｔｉｏｎｃｏｄｅｂｏｏｋ” １９９９ＩＥＥＥＩｎｔｅｒｎａｔｉｏｎａｌＣｏｎｆｅｒｅｎｃｅｏｎＡｃｏｕｓｔｉｃｓ，Ｓｐｅｅｃｈ，ａｎｄＳｉｇｎａｌＰｒｏｃｅｓｓｉｎｇ，ｖｏｌ．Ｉ，ｐｐ．１３−１６（Ｍａｒ１９９９）（文献３）及び土屋、天田、三関「適応パルス位置ＡＣＥＬＰ音声符号化の改善」日本音響学会、１９９９年春季研究発表会講演論文集Ｉ、２１３〜２１４頁（文献４）に開示されている。
【００２６】
文献３では、適応音源信号の振幅包絡の大きさが大きいところにパルス音源の位置候補が集まるようにフレーム毎に適応的にパルス音源の位置候補を設定するようにしている。これにより符号化特性が改善することが示されている。
【００２７】
文献４は文献３の改良に相当する。駆動音源信号（文献４中ではＡＣＥＬＰ音源）の生成部にピッチフィルタを内包させたときには、最初の１ピッチ周期の区間の音源位置が選択されやすい傾向があり、そのときにピッチ逆フィルタ処理を行った適応音源信号の振幅包絡の大きさに基づいて、フレーム毎に適応的にパルス音源の位置候補を設定するようにしている。
【００２８】
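【Problems to be solved by the invention】
Because the conventional speech coding and decoding apparatus (Document 1) is configured as described above, the position candidates for each excitation number are distributed uniformly within the frame. The only ways to lower the bit rate of the position candidates are therefore to reduce the number of pulses or to thin out the position candidates for each excitation number at uniform intervals. As a result, a driving excitation signal in which pulses are locally concentrated cannot be generated, and the coding characteristics deteriorate.
【００２９】
Document 2 discloses an efficient way of constraining the excitation position candidates and polarities that suppresses this degradation. However, the constraint rests on a heuristic rule, namely that adjacent excitation positions take opposite polarities, and no other relationship between position and polarity is permitted. Naturally uttered speech does not always follow this rule, and when it does not, quality degrades severely.
【００３０】
Documents 3 and 4 disclose adaptive thinning methods that suppress this degradation. However, when the periodicity of the input speech is disturbed or changing, adaptive thinning instead causes substantial degradation. In addition, if the adaptive excitation signal is not generated correctly because of a code transmission error on the channel, the adaptive thinning propagates the error to the driving excitation signal as well.
【００３１】
Furthermore, in Document 4, when a pitch filter is included in the driving excitation generator, concentrating the excitation position candidates in the first pitch period yields an improvement on average. However, in perceptually critical intervals such as speech onsets, the second half of the frame can matter more. The second half is then not reproduced well, and the perceived quality actually degrades.
【００３２】
This invention was made to solve the problems described above, and its object is to obtain a speech coding apparatus and a speech decoding apparatus that can lower the bit rate without degrading characteristics.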
【００３３】
【課題を解決するための手段】
この発明に係る音声符号化装置は、音源符号化手段が複数の音源位置の全組合せに対して個々に定まる音源位置の組合せに関する評価値が基準値より高い音源位置の組合せのみを抽出して、その音源位置の組合せを示す情報を記述したインデックステーブルを備え、そのインデックステーブルから任意の音源位置の組合せを選択し、その音源位置の組合せを用いて入力音声の音源情報を符号化するようにしたものである。
【００３４】
この発明に係る音声符号化装置は、インデックステーブルに記述されている音源位置の組合せを示す情報が、個別に符号化された音源位置の組合せ情報であるようにしたものである。
【００３５】
この発明に係る音声符号化装置は、音源符号化手段が少なくとも評価値が基準値より高い音源位置の組合せを含む音源位置の組合せを認識することが可能な場合、インデックステーブルに記述されている音源位置の組合せを示す情報が、音源位置の各組合せがインデックステーブルの要素であるか否かを示すフラグ情報であるようにしたものである。
【００３６】
この発明に係る音声符号化装置は、音源符号化手段がフレーム長と音源数とインデックス数から少なくとも評価値が基準値より高い音源位置の組合せを含む音源位置の組合せを認識するようにしたものである。
【００３７】
この発明に係る音声符号化装置は、音源符号化手段が音源位置及び極性から構成された複数の対データの全組合せに対して個々に定まる対データの組合せに関する評価値が基準値より高い対データの組合せのみを抽出して、その対データの組合せを示す情報を記述したインデックステーブルを備え、そのインデックステーブルから任意の対データの組合せを選択し、その対データの組合せを用いて入力音声の音源情報を符号化するようにしたものである。
【００３８】
この発明に係る音声符号化装置は、インデックステーブルに記述されている対データの組合せを示す情報が、個別に符号化された対データの組合せ情報であるようにしたものである。
【００３９】
この発明に係る音声符号化装置は、音源符号化手段が少なくとも評価値が基準値より高い対データの組合せを含む対データの組合せを認識することが可能な場合、インデックステーブルに記述されている対データの組合せを示す情報が、対データの各組合せがインデックステーブルの要素であるか否かを示すフラグ情報であるようにしたものである。
【００４０】
この発明に係る音声符号化装置は、音源符号化手段がフレーム長と音源数とインデックス数から少なくとも評価値が基準値より高い対データの組合せを含む対データの組合せを認識するようにしたものである。
【００４１】
この発明に係る音声符号化装置は、音源符号化手段が記述内容が相互に異なるインデックステーブルを複数個有し、任意のインデックステーブルを選択して使用するようにしたものである。
【００４２】
この発明に係る音声符号化装置は、音源符号化手段が入力音声を分析して所定のパラメータを抽出し、そのパラメータに対応するインデックステーブルを選択するようにしたものである。
【００４３】
この発明に係る音声符号化装置は、音源符号化手段がスペクトル包絡情報および音源情報の少なくともどちらか一方から所定のパラメータを抽出し、そのパラメータに対応するインデックステーブルを選択するようにしたものである。
【００４４】
この発明に係る音声復号化装置は、音源復号化手段が複数の音源位置の全組合せに対して個々に定まる音源位置の組合せに関する評価値が基準値より高い音源位置の組合せのみを抽出して、その音源位置の組合せを示す情報を記述したインデックステーブルを備え、そのインデックステーブルから音源情報に含まれている組合せを示す符号に基づいて音源位置の組合せを選択し、その音源位置の組合せを用いて入力音声の音源情報を復号化するようにしたものである。
【００４５】
この発明に係る音声復号化装置は、インデックステーブルに記述されている音源位置の組合せを示す情報が、個別に符号化された音源位置の組合せ情報であるようにしたものである。
【００４６】
この発明に係る音声復号化装置は、音源復号化手段が少なくとも評価値が基準値より高い音源位置の組合せを含む音源位置の組合せを認識することが可能な場合、インデックステーブルに記述されている音源位置の組合せを示す情報が、音源位置の各組合せがインデックステーブルの要素であるか否かを示すフラグ情報であるようにしたものである。
【００４７】
この発明に係る音声復号化装置は、音源復号化手段がフレーム長と音源数とインデックス数から少なくとも評価値が基準値より高い音源位置の組合せを含む音源位置の組合せを認識するようにしたものである。
【００４８】
この発明に係る音声復号化装置は、音源復号化手段が音源位置及び極性から構成された複数の対データの全組合せに対して個々に定まる対データの組合せに関する評価値が基準値より高い対データの組合せのみを抽出して、その対データの組合せを示す情報を記述したインデックステーブルを備え、そのインデックステーブルから音源情報に含まれている組合せを示す符号に基づいて対データの組合せを選択し、その対データの組合せを用いて入力音声の音源情報を復号化するようにしたものである。
【００４９】
この発明に係る音声復号化装置は、インデックステーブルに記述されている対データの組合せを示す情報が、個別に符号化された対データの組合せ情報であるようにしたものである。
【００５０】
この発明に係る音声復号化装置は、音源復号化手段が少なくとも評価値が基準値より高い対データの組合せを含む対データの組合せを認識することが可能な場合、インデックステーブルに記述されている対データの組合せを示す情報が、対データの各組合せがインデックステーブルの要素であるか否かを示すフラグ情報であるようにしたものである。
【００５１】
この発明に係る音声復号化装置は、音源復号化手段がフレーム長と音源数とインデックス数から少なくとも評価値が基準値より高い対データの組合せを含む対データの組合せを認識するようにしたものである。
【００５２】
この発明に係る音声復号化装置は、音源復号化手段が記述内容が相互に異なるインデックステーブルを複数個有し、音源情報に含まれている選択情報を示す符号に対応するインデックステーブルを選択して使用するようにしたものである。
【００５３】
この発明に係る音声復号化装置は、音源復号化手段が記述内容が相互に異なるインデックステーブルを複数個有し、スペクトル包絡情報および音源情報の少なくともどちらか一方から所定のパラメータを抽出し、そのパラメータに対応するインデックステーブルを選択して使用するようにしたものである。
【００５４】
【発明の実施の形態】
以下、この発明の実施の一形態を説明する。
実施の形態１．
図１はこの発明の実施の形態１による音声符号化装置を示す構成図であり、図において、２１は入力音声を分析して、その入力音声のスペクトル包絡情報である線形予測係数を抽出する線形予測分析手段、２２は線形予測分析手段２１により抽出された線形予測係数を符号化する線形予測係数符号化手段である。
なお、線形予測分析手段２１及び線形予測係数符号化手段２２から包絡情報符号化手段が構成されている。
【００５５】
２３は線形予測係数符号化手段２２により量子化された線形予測係数を用いて仮の合成音を生成し、仮の合成音と入力音声の距離が最小になる適応音源符号（音源情報）を選択して多重化手段２６に出力するとともに、その適応音源符号に対応する適応音源信号（過去の所定長の音源信号が周期的に繰り返された時系列ベクトル）をゲイン符号化手段２５に出力する適応音源符号化手段、２４は線形予測係数符号化手段２２により量子化された線形予測係数を用いて仮の合成音を生成し、仮の合成音と符号化対象信号（入力音声から適応音源信号による合成音を差し引いた信号）の距離が最小になる駆動音源符号（音源情報）を選択して多重化手段２６に出力するとともに、その駆動音源符号に対応する時系列ベクトルである駆動音源信号をゲイン符号化手段２５に出力する駆動音源符号化手段である。
【００５６】
２５は適応音源符号化手段２３から出力された適応音源信号と駆動音源符号化手段２４から出力された駆動音源信号にゲインベクトルの各要素を乗算し、各乗算結果を相互に加算して音源信号を生成する一方、線形予測係数符号化手段２２により量子化された線形予測係数を用いて、その音源信号から仮の合成音を生成し、仮の合成音と入力音声の距離が最小になるゲイン符号（音源情報）を選択して多重化手段２６に出力するゲイン符号化手段である。
なお、適応音源符号化手段２３，駆動音源符号化手段２４及びゲイン符号化手段２５から音源符号化手段が構成されている。
【００５７】
２６は線形予測係数符号化手段２２により符号化された線形予測係数の符号と、適応音源符号化手段２３から出力された適応音源符号と、駆動音源符号化手段２４から出力された駆動音源符号と、ゲイン符号化手段２５から出力されたゲイン符号とを多重化して、音声符号を出力する多重化手段である。
【００５８】
図２はこの発明の実施の形態１による音声復号化装置を示す構成図であり、図において、３１は音声符号化装置から出力された音声符号を分離して、線形予測係数の符号を線形予測係数復号化手段３２に出力し、適応音源符号を適応音源復号化手段３３に出力し、駆動音源符号を駆動音源復号化手段３４に出力し、ゲイン符号をゲイン復号化手段３５に出力する分離手段、３２は分離手段３１から出力された線形予測係数を復号化し、その復号結果を合成フィルタ３９のフィルタ係数に変換して、そのフィルタ係数を合成フィルタ３９に出力する線形予測係数復号化手段（包絡情報復号化手段）である。
【００５９】
３３は分離手段３１から出力された適応音源符号に対応する適応音源信号（過去の音源信号が周期的に繰り返された時系列ベクトル）を出力する適応音源復号化手段、３４は分離手段３１から出力された駆動音源符号に対応する時系列ベクトルである駆動音源信号を出力する駆動音源復号化手段、３５は分離手段３１から出力されたゲイン符号に対応するゲインベクトルを出力するゲイン復号化手段である。
【００６０】
３６はゲイン復号化手段３５から出力されたゲインベクトルの要素を適応音源復号化手段３３から出力された適応音源信号に乗算する乗算器、３７はゲイン復号化手段３５から出力されたゲインベクトルの要素を駆動音源復号化手段３４から出力された駆動音源信号に乗算する乗算器、３８は乗算器３６の乗算結果と乗算器３７の乗算結果を加算して、音源信号を生成する加算器、３９は加算器３８により生成された音源信号に対する合成フィルタリング処理を実行して出力音声を生成する合成フィルタである。
なお、適応音源復号化手段３３，駆動音源復号化手段３４，ゲイン復号化手段３５，乗算器３６，３７，加算器３８及び合成フィルタ３９から音源復号化手段が構成されている。
【００６１】
図３は音声符号化装置における駆動音源符号化手段２４の内部を示す構成図であり、図において、４１は複数の音源位置の全組合せのうち、使用頻度（音源位置の組合せに関する評価値）が基準頻度（基準値）より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、４２は音源位置組合せテーブル４１から任意の音源位置の組合せを選択し、その音源位置の組合せと適正な極性を用いて入力音声の駆動音源情報を符号化する代数的音源符号化手段である。
【００６２】
図４は音声復号化装置における駆動音源復号化手段３４の内部を示す構成図であり、図において、５１は複数の音源位置の全組合せのうち、使用頻度（音源位置の組合せに関する評価値）が基準頻度（基準値）より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、５２は音源位置組合せテーブル５１から駆動音源符号に含まれている音源位置符号（組合せを示す符号）に基づいて音源位置の組合せを選択し、その音源位置の組合せと極性（極性を特定する情報は駆動音源符号に含まれている）を用いて入力音声の音源情報を復号化する代数的音源復号化手段である。
【００６３】
次に動作について説明する。
音声符号化装置及び音声復号化装置では、５〜５０ｍｓ程度を１フレームとして、フレーム単位で処理を行う。
【００６４】
まず、音声符号化装置の線形予測分析手段２１は、音声を入力すると、その入力音声を分析して、音声のスペクトル包絡情報である線形予測係数を抽出する。
線形予測係数符号化手段２２は、線形予測分析手段２１が線形予測係数を抽出すると、その線形予測係数を符号化し、その符号を多重化手段２６に出力する。また、その符号に対応する量子化された線形予測係数を適応音源符号化手段２３，駆動音源符号化手段２４及びゲイン符号化手段２５に出力する。
【００６５】
適応音源符号化手段２３は、過去の所定長の音源信号を記憶する適応音源符号帳を内蔵し、内部で発生させる各適応音源符号（適応音源符号は数ビットの２進数値で示される）に応じて、過去の音源信号が周期的に繰り返された時系列ベクトルを生成する。
次に、各時系列ベクトルに適切なゲインを乗じた後、線形予測係数符号化手段２２により量子化された線形予測係数を用いる合成フィルタに各時系列ベクトルを通すことにより、仮の合成音を生成する。
【００６６】
そして、適応音源符号化手段２３は、仮の合成音と入力音声との距離を調査し、この距離を最小とする適応音源符号を選択して多重化手段２６に出力するとともに、その選択した適応音源符号に対応する時系列ベクトルを適応音源信号として、ゲイン符号化手段２５に出力する。また、入力音声から適応音源信号による合成音を差し引いた信号を符号化対象信号として、駆動音源符号化手段２４に出力する。
【００６７】
駆動音源符号化手段２４は、適応音源符号化手段２３から符号化対象信号を入力すると、音源位置組合せテーブル４１から任意の音源位置の組合せ（パルス音源の位置候補の組合せ）を選択し、そのパルス音源の位置候補の組合せと適正な極性を用いて入力音声の駆動音源情報を符号化する。具体的には以下に示す通りである。
【００６８】
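First, the excitation position combination table 41 of the driving excitation coding means 24 describes a plurality of combinations of pulse position candidates. These are obtained, for example, by extracting a desired number of only the frequently used combinations (those whose usage frequency exceeds the reference frequency) from all combinations of the pulse position candidates shown in Fig. 19 (see Fig. 5).
In an algebraic excitation, the combinations of pulse position candidates used for coding do not all occur equally often. Their frequency of occurrence is skewed, and excluding the rarely occurring combinations entirely has little effect on the quality of the synthesized speech. Exploiting this property to constrain the combinations of pulse position candidates reduces the number of bits needed to code the algebraic excitation while limiting the degradation of coding characteristics.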
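One way to picture the construction of such a table is a frequency count over training data. The sketch below is an assumption-laden illustration: it supposes that a list of the position combinations chosen by an unconstrained coder on training speech is available, and it keeps the most frequent ones. The function name and the "keep the N most frequent" policy are hypothetical stand-ins for the reference-frequency criterion in the text.

```python
from collections import Counter

def build_combination_table(used_combinations, table_size):
    """Keep the `table_size` most frequently used position combinations.

    used_combinations: iterable of tuples of pulse positions, one per coded
    frame of (hypothetical) training data.
    """
    counts = Counter(used_combinations)
    # most_common() orders by descending count; the list index then serves
    # as the table index that would be transmitted.
    return [combo for combo, _ in counts.most_common(table_size)]
```

For example, with `table_size = 512` the table index fits in 9 bits instead of the 13 bits of the unconstrained positions.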
【００６９】
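One way to store the information of the excitation position combination table 41 is to encode and store the pulse position candidates directly, for example by representing in binary the excitation position of each excitation number for each index, as shown in Fig. 6.
With this configuration, if the excitation position combination table 41 is created by extracting combinations from all combinations of the pulse position candidates shown in Fig. 19, each combination can be represented with 13 bits. Storing N of them (where N is the number of indices) therefore requires a storage area of (13 × N) bits. For example, N = 512 requires 6656 bits and N = 1024 requires 13312 bits, so the required capacity depends on the size of the table in use.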
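The direct storage method can be sketched as packing the four candidate indices into one 13-bit word per table entry. The field order (excitation number 1 in the most significant bits) and the function names are assumptions made for illustration; the source only states that each combination takes 13 bits.

```python
BIT_WIDTHS = (3, 3, 3, 4)  # bits per pulse position, 13 bits in total

def pack_entry(candidate_indices, widths=BIT_WIDTHS):
    """Pack per-pulse candidate indices into a single integer word."""
    word = 0
    for idx, width in zip(candidate_indices, widths):
        if not 0 <= idx < (1 << width):
            raise ValueError("candidate index out of range")
        word = (word << width) | idx
    return word

def unpack_entry(word, widths=BIT_WIDTHS):
    """Inverse of pack_entry."""
    indices = []
    for width in reversed(widths):
        indices.append(word & ((1 << width) - 1))
        word >>= width
    return tuple(reversed(indices))

def direct_storage_bits(n_indices, widths=BIT_WIDTHS):
    """Storage needed when every table entry is stored explicitly."""
    return sum(widths) * n_indices
```

`direct_storage_bits(512)` and `direct_storage_bits(1024)` reproduce the 6656-bit and 13312-bit figures given above.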
【００７０】
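Another way to store the information of the excitation position combination table 41 is shown in Fig. 7. A second excitation position combination table is constructed so that it contains at least all elements of the excitation position combination table 41, and for each combination of position candidates in that second table a 1-bit flag indicates whether the combination is an element of the table 41. In other words, rather than storing the pulse position candidates directly, this method stores whether each combination may be used. With this configuration, if the table 41 is created by extracting combinations from all combinations of the pulse position candidates shown in Fig. 19, use or non-use is represented with 1 bit for each of the 8192 combinations, so a storage area of 8192 bits is required. In this case the required storage area is constant regardless of the number of indices N of the table 41.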
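The flag-based storage can be sketched as a bitmap with one bit per combination of the full set. The sketch assumes each combination is identified by an integer code in the range 0 to 8191 and that the table index of a combination is its rank among the set flags; both conventions, and the function names, are illustrative assumptions.

```python
def build_flag_bitmap(selected_codes, total_combinations=8192):
    """One bit per combination: 1 if it belongs to the table, else 0."""
    bitmap = bytearray((total_combinations + 7) // 8)
    for code in selected_codes:
        bitmap[code >> 3] |= 1 << (code & 7)
    return bitmap

def table_from_bitmap(bitmap, total_combinations=8192):
    """Recover the table: the k-th set flag corresponds to table index k."""
    return [code for code in range(total_combinations)
            if (bitmap[code >> 3] >> (code & 7)) & 1]
```

The bitmap always occupies 8192 bits (1024 bytes) here, whatever the number of selected combinations.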
【００７１】
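Note that all combinations of pulse position candidates based on an algebraic excitation structure such as that of Fig. 19 can be computed from the frame length, the number of pulses, and the number of indices, so they need not actually be held as a table and require no storage area.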
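To illustrate computing the full set of combinations instead of storing it, the sketch below maps a 13-bit code to four sample positions. Fig. 19 is not reproduced in this text, so the interleaved layout used here (candidates spaced 5 samples apart, with excitation number 4 covering two interleaved tracks) is an assumption modeled on the commonly published CS-ACELP structure, and the bit-field order is likewise hypothetical.

```python
TRACK_STEP = 5  # assumed spacing between candidates of one pulse (40-sample frame)

def candidate_positions(pulse_no):
    """Candidate sample positions for pulse_no in 0..3 (assumed layout)."""
    if pulse_no < 3:
        return [pulse_no + TRACK_STEP * m for m in range(8)]
    # Assumed: the fourth pulse has 16 candidates on two interleaved tracks.
    return ([3 + TRACK_STEP * m for m in range(8)]
            + [4 + TRACK_STEP * m for m in range(8)])

def combination_from_code(code):
    """Map a code in 0..8191 (3+3+3+4 bits) to four sample positions."""
    i4 = code & 15
    i3 = (code >> 4) & 7
    i2 = (code >> 7) & 7
    i1 = (code >> 10) & 7
    return (candidate_positions(0)[i1], candidate_positions(1)[i2],
            candidate_positions(2)[i3], candidate_positions(3)[i4])
```

Because every combination is generated on demand from its code, only the selection information (for example a flag bitmap) has to be stored.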
【００７２】
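As described above, there are several ways to store the information of the excitation position combination table 41, and they differ in how the required storage capacity depends on the number of indices N. Choosing the method with the smaller capacity for the given N reduces the size of storage devices such as memory or hard disks and allows an efficient implementation.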
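The choice between the two storage methods reduces to comparing 13 × N bits against the fixed 8192 bits. A minimal sketch, with an illustrative function name and default values taken from the figures in the description:

```python
def cheaper_storage(n_indices, bits_per_entry=13, total_combinations=8192):
    """Return (method, bits) for the storage method needing fewer bits."""
    direct_bits = bits_per_entry * n_indices   # explicit entries
    flag_bits = total_combinations             # one flag per combination
    if direct_bits <= flag_bits:
        return ("direct", direct_bits)
    return ("flags", flag_bits)
```

With these numbers, direct storage is smaller up to N = 630 (8190 bits), and the flag bitmap is smaller from N = 631 onward.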
【００７３】
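The algebraic excitation coding means 42 of the driving excitation coding means 24 sequentially reads out the combinations of pulse position candidates stored in the excitation position combination table 41 and generates the tentative synthesized speech obtained when pulses of arbitrary polarity are placed at each position candidate.
It then computes the distance (signal error) between the coding target signal output from the adaptive excitation coding means 23 and the tentative synthesized speech, and searches for the combination of pulse position candidates and polarities that minimizes this distance.
When the search is complete, the algebraic excitation coding means 42 outputs the excitation position code representing the combination of pulse position candidates found, together with the polarities, to the multiplexing means 26 as the driving excitation code. It also outputs the time-series vector corresponding to this driving excitation code to the gain coding means 25 as the driving excitation signal.
【００７４】
The search in the algebraic excitation coding means 42 is performed in the same way as in the driving excitation coding means described in Document 1. As in Document 1, a pitch filter is introduced at the final stage of the driving excitation generator. That is, the signal in which pulses are placed at the pulse position candidates is passed through the pitch filter to form the driving excitation signal, and the corresponding tentative synthesized speech is generated.
Then the correlations between the tentative synthesized speech signals of the individual position candidates, and the correlations between the tentative synthesized speech of each position candidate and the coding target signal, are computed. These correlations are used to determine the polarity of each position candidate and to search the positions at high speed.
【００７５】
The result is a combination of pulse position candidates and the polarity of each pulse. The combination is converted into a corresponding code, for example by representing in binary its index in the excitation position combination table 41, and is output as the final excitation position code. An algebraic excitation is known to require little computation for the search because of its structure. Reducing the number of position combinations to be searched while retaining that structure reduces the computation further.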
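The table-restricted search can be illustrated with a brute-force version. This is deliberately not the fast correlation-based search of Document 1: it simply tries every table entry with every polarity pattern, omits the pitch filter, and scores each candidate by the standard CELP criterion (squared correlation over energy, which corresponds to minimum squared error under an optimal positive gain). The impulse response `h` of the synthesis filter and all function names are assumptions for illustration.

```python
from itertools import product

def synthesize(positions, signs, h, frame_len):
    """Tentative synthesized signal: signed pulses convolved with h, truncated."""
    y = [0.0] * frame_len
    for p, s in zip(positions, signs):
        for k, hk in enumerate(h):
            if p + k < frame_len:
                y[p + k] += s * hk
    return y

def search_table(table, target, h):
    """Return (table index, signs) maximizing corr^2 / energy, or None."""
    best = None
    for index, positions in enumerate(table):
        for signs in product((1, -1), repeat=len(positions)):
            y = synthesize(positions, signs, h, len(target))
            corr = sum(t * v for t, v in zip(target, y))
            energy = sum(v * v for v in y)
            if energy == 0.0 or corr <= 0.0:
                continue  # assumes a positive gain is applied afterwards
            score = corr * corr / energy
            if best is None or score > best[0]:
                best = (score, index, signs)
    return None if best is None else (best[1], best[2])
```

The returned table index is what would be written in binary as the excitation position code; with N entries it needs `(N - 1).bit_length()` bits.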
【００７６】
上記のようにして、駆動音源符号化手段２４が駆動音源信号を出力すると、ゲイン符号化手段２５は、内部で発生させる各ゲイン符号（ゲイン符号は数ビットの２進数値で示される）に応じて、ゲイン符号帳からゲインベクトルの読み出しを順次実行する。
そして、各ゲインベクトルの要素を、適応音源符号化手段２３から出力された適応音源信号と、駆動音源符号化手段２４から出力された駆動音源信号にそれぞれ乗算し、各乗算結果を相互に加算して音源信号を生成する。
次に、その音源信号を線形予測係数符号化手段２２により量子化された線形予測係数を用いる合成フィルタに通すことにより、仮の合成音を生成する。
【００７７】
そして、ゲイン符号化手段２５は、仮の合成音と入力音声との距離を調査し、この距離を最小とするゲイン符号を選択して多重化手段２６に出力する。また、そのゲイン符号に対応する音源信号を適応音源符号化手段２３に出力する。これにより、適応音源符号化手段２３は、ゲイン符号化手段２５により選択されたゲイン符号に対応する音源信号を用いて、内蔵する適応音源符号帳の更新を行う。
【００７８】
多重化手段２６は、線形予測係数符号化手段２２により符号化された線形予測係数の符号と、適応音源符号化手段２３から出力された適応音源符号と、駆動音源符号化手段２４から出力された駆動音源符号（音源位置符号と極性を含む）と、ゲイン符号化手段２５から出力されたゲイン符号とを多重化し、その多重化結果である音声符号を音声復号化装置に出力する。
【００７９】
次に、音声復号化装置の分離手段３１は、音声符号化装置が音声符号を出力すると、その音声符号を分離して、線形予測係数の符号を線形予測係数復号化手段３２に出力し、適応音源符号を適応音源復号化手段３３に出力し、駆動音源符号を駆動音源復号化手段３４に出力し、ゲイン符号をゲイン復号化手段３５に出力する。
線形予測係数復号化手段３２は、分離手段３１から線形予測係数の符号を受けると、その符号を復号化し、その復号結果を合成フィルタ３９のフィルタ係数に変換して、そのフィルタ係数を合成フィルタ３９に出力する。
【００８０】
適応音源復号化手段３３は、過去の所定長の音源信号を記憶する適応音源符号帳を内蔵し、分離手段３１から出力された適応音源符号に対応する適応音源信号（過去の音源信号が周期的に繰り返された時系列ベクトル）を出力する。
【００８１】
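When the driving excitation decoding means 34 receives from the separating means 31 the driving excitation code containing the excitation position code and the polarities, the algebraic excitation decoding means 52 reads the combination of pulse position candidates corresponding to the excitation position code from the excitation position combination table 51 (which holds the same contents as the excitation position combination table 41). It then passes the signal in which pulses with those polarities are placed at the position candidates through the pitch filter to generate the driving excitation signal, and outputs that signal.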
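The decoding step can be sketched as a table lookup followed by pulse placement and pitch filtering. The source does not give the form of the pitch filter, so the one-tap recursive comb used here (x[n] += gain * x[n - lag]) is an assumption, as are the function and parameter names.

```python
def decode_driving_excitation(table, position_code, signs, frame_len,
                              pitch_lag=None, pitch_gain=0.0):
    """Rebuild the driving excitation from a table index and pulse polarities."""
    positions = table[position_code]       # same table as on the coder side
    x = [0.0] * frame_len
    for p, s in zip(positions, signs):
        x[p] += s                          # unit pulse with the given polarity
    if pitch_lag is not None:
        # Assumed pitch filter: one-tap comb, applied only within the frame.
        for n in range(pitch_lag, frame_len):
            x[n] += pitch_gain * x[n - pitch_lag]
    return x
```

Because the table is fixed and identical on both sides, the decoder needs nothing beyond the received index and polarities to place the pulses.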
【００８２】
ゲイン復号化手段３５は、ゲインベクトルを格納するゲイン符号帳を内蔵し、分離手段３１から出力されたゲイン符号に対応するゲインベクトルを出力する。そして、適応音源復号化手段３３から出力された適応音源信号と駆動音源復号化手段３４から出力された駆動音源信号は、乗算器３６，３７により当該ゲインベクトルの要素が乗算され、加算器３８により乗算器３６，３７の乗算結果が相互に加算される。
【００８３】
合成フィルタ３９は、加算器３８の加算結果である音源信号に対する合成フィルタリング処理を実行して出力音声を生成する。なお、フィルタ係数としては、線形予測係数復号化手段３２により復号化された線形予測係数を用いる。
最後に、適応音源復号化手段３３は、上記音源信号を用いて、内蔵する適応音源符号帳の更新を行う。
【００８４】
以上で明らかなように、この実施の形態１によれば、複数の音源位置の全組合せのうち、使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル４１，５１から音源位置の組合せを選択し、その音源位置の組合せを用いて入力音声の音源情報を符号化又は復号化するように構成したので、特性の劣化を招くことなく、低ビットレート化を図ることができる音声符号化装置及び音声復号化装置が得られる効果を奏する。
【００８５】
また、音源位置組合せテーブル４１，５１に記述する音源位置の組合せは、その使用頻度により選択して抽出するなどの統計的な手法によるなど、実際に入力音声の音源情報を符号化又は復号化するのに則した方法で構成できるので、ヒューリスティックなルールを用いた場合のような不自然な制約がなく、低ビットレートであっても品質のよい音声符号化装置及び音声復号化装置が得られる効果を奏する。
【００８６】
また、固定的な音源位置の組合せを用いているので、通信路での符号伝送誤りに対する強い耐性を維持しながら、特性を改善することができる効果を奏する。
【００８７】
さらに、複数の音源位置の組合せを音源位置組合せテーブル４１，５１に記述する際、各音源位置を個別に符号化して記述するようにしたので、あるいは、音源位置組合せテーブル４１，５１の各要素を第２の音源位置組合せテーブルの要素とし、第２の音源位置組合せテーブルの各要素に対する使用の可否を示すフラグ情報を記述するようにしたので、インデックス数Ｎに応じて必要な記憶容量を小さくすることができ、装置化規模が小さい効率的な音声符号化装置及び音声復号化装置が得られる効果を奏する。
【００８８】
さらに、第２の音源位置組合せテーブルの音源位置の組合せはフレーム長と音源数とインデックス数とから生成するようにしたので、これに要する記憶容量を不要にすることができ、装置化規模が小さい効率的な音声符号化装置及び音声復号化装置が得られる効果を奏する。
【００８９】
なお、この実施の形態１では、駆動音源信号の生成部にピッチフィルタを導入しているが、これを駆動音源復号化手段３４においてのみ導入したり、駆動音源符号化手段２４と駆動音源復号化手段３４の両方で導入しない構成も可能である。
【００９０】
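Embodiment 1 uses the usage frequency as the evaluation value for a combination of excitation positions, but the evaluation value is not limited to this. For example, an expected value of the reduction in coding distortion or the like may be used as the evaluation value instead.
This expected value can be obtained, for example, by coding training speech data using all combinations of excitation positions and summing, for each combination, the coding distortion removed by the driving excitation component whenever that combination was used.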
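The distortion-based evaluation value can be sketched as a per-combination sum. The input format (pairs of a combination and the distortion reduction it achieved in one training frame) and the function name are hypothetical; the text only states that the reductions are summed per combination.

```python
from collections import defaultdict

def rank_by_distortion_reduction(records, table_size):
    """Keep the combinations with the largest summed distortion reduction.

    records: iterable of (combination, reduction) pairs, one per training
    frame in which that combination was selected (assumed format).
    """
    totals = defaultdict(float)
    for combination, reduction in records:
        totals[combination] += reduction
    ranked = sorted(totals, key=lambda c: totals[c], reverse=True)
    return ranked[:table_size]
```

Unlike a plain frequency count, this ranking favors a combination that is used rarely but removes a large amount of distortion when it is used.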
【００９１】
実施の形態２．
図８はこの発明の実施の形態２による音声符号化装置における駆動音源符号化手段２４の内部を示す構成図であり、図において、４３は音源位置及び極性から構成された複数の対データの全組合せのうち、使用頻度（対データの組合せに関する評価値）が基準頻度（基準値）より高い対データの組合せを示す情報が記述された音源位置・極性組合せテーブル（インデックステーブル）、４４は音源位置・極性組合せテーブル４３から任意の対データの組合せを選択し、その対データの組合せを用いて入力音声の駆動音源情報を符号化する代数的音源符号化手段である。
【００９２】
図９はこの発明の実施の形態２による音声復号化装置における駆動音源復号化手段３４の内部を示す構成図であり、図において、５３は複数の対データの全組合せのうち、使用頻度（対データの組合せに関する評価値）が基準頻度（基準値）より高い対データの組合せを示す情報が記述された音源位置・極性組合せテーブル（インデックステーブル）、５４は音源位置・極性組合せテーブル５３から駆動音源符号に含まれている音源位置・極性符号（組合せを示す符号）に基づいて対データの組合せを選択し、その対データの組合せを用いて入力音声の音源情報を復号化する代数的音源復号化手段である。
【００９３】
次に動作について説明する。ただし、駆動音源符号化手段２４及び駆動音源復号化手段３４以外は上記実施の形態１と同様であるため、駆動音源符号化手段２４及び駆動音源復号化手段３４の動作のみを説明する。
【００９４】
まず、駆動音源符号化手段２４の音源位置・極性組合せテーブル４３には、例えば、図１９に示すパルス音源の位置候補及び各音源の極性の全組合せの中から使用頻度が高い組合せ（基準頻度より高い組合せ）のみを所望数抽出するなどして、複数の対データの組合せが記述されている（図１０を参照）。
代数的音源において、符号化に用いられるパルス音源の対データの組合せは全組合せが均等に出現するわけではなく、その発生の頻度には偏りがあり、発生頻度が低い組合せを一切使用しないとしても合成音の品質に与える影響は小さい。この特性を利用してパルス音源の対データの組合せに制約を与えることにより、符号化特性の劣化を抑えつつ、代数的音源に要する符号化ビット数を削減する。
【００９５】
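One way to store the information of the excitation position/polarity combination table 43 is to encode and store the pair data directly, for example by representing in binary the pair data of each excitation number for each index, as shown in Fig. 11.
With this configuration, if the excitation position/polarity combination table 43 is created by extracting pair-data combinations from all combinations of the pulse position candidates shown in Fig. 19 and the polarities of the pulses, each pair-data combination can be represented with 17 bits. Storing N of them (where N is the number of indices) therefore requires a storage area of (17 × N) bits. For example, N = 4096 requires 69632 bits and N = 8192 requires 139264 bits, so the required capacity depends on the size of the table in use.
【００９６】
Another way to store the information of the excitation position/polarity combination table 43 is shown in Fig. 12. A second excitation position/polarity combination table is constructed so that it contains at least all elements of the table 43, and for each pair-data combination in that second table a 1-bit flag indicates whether the combination is an element of the table 43. In other words, rather than storing the pair data directly, this method stores whether each pair-data combination may be used.
With this configuration, if the table 43 is created by extracting pair-data combinations from all combinations of the pulse position candidates shown in Fig. 19 and the polarities of the pulses, use or non-use is represented with 1 bit for each of the 131072 combinations, so a storage area of 131072 bits is required. In this case the required storage area is constant regardless of the number of indices N of the table 43.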
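The 17-bit pair-data code can be pictured as the 13-bit position code followed by one polarity bit per pulse. That split, the bit order, and the function names are assumptions for illustration; the source states only the 17-bit total and the 131072 combinations.

```python
N_PULSES = 4  # one polarity bit per pulse, appended to the 13-bit position code

def pack_pair_code(position_code, signs):
    """Combine a position code with polarity bits (1 = negative polarity)."""
    sign_bits = 0
    for s in signs:
        sign_bits = (sign_bits << 1) | (1 if s < 0 else 0)
    return (position_code << N_PULSES) | sign_bits

def unpack_pair_code(pair_code):
    """Inverse of pack_pair_code."""
    position_code = pair_code >> N_PULSES
    signs = tuple(-1 if (pair_code >> (N_PULSES - 1 - k)) & 1 else 1
                  for k in range(N_PULSES))
    return position_code, signs

def pair_storage_bits(n_indices, bits_per_entry=17):
    """Storage for explicitly stored pair-data entries."""
    return bits_per_entry * n_indices
```

Under these assumptions the flag bitmap (131072 bits) becomes smaller than direct storage once N exceeds 7710, since 17 × 7710 = 131070.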
【００９７】
なお、図１９に示すような代数的音源構造に基づくパルス音源の位置候補及び各音源の極性の全組合せは、フレーム長とパルス数とインデックス数から演算により求めることができるので、実際にはテーブルとして持つ必要はなく、記憶領域は不要である。
【００９８】
上述したように、音源位置・極性組合せテーブル４３の情報を記憶する方法は複数あり、インデックス数Ｎと必要な記憶容量との関係が異なるので、インデックス数Ｎに応じて、より記憶容量が小さい方法を選択すれば、メモリやハードディスクなどの記憶装置の規模を小さくできるなど、効率的な装置化が可能となる。
【００９９】
駆動音源符号化手段２４の代数的音源符号化手段４４は、音源位置・極性組合せテーブル４３に格納されているパルス音源の対データの組合せを順次読み出して、その対データの各位置候補に対データの各極性でパルスを立てたときの仮の合成音を生成する。
そして、適応音源符号化手段２３から出力された符号化対象信号と仮の合成音との距離を計算して、その距離を最小にするパルス音源の対データの組合せを探索する。
代数的音源符号化手段４４は、その探索が完了すると、その探索結果であるパルス音源の対データの組合せを表す音源位置・極性符号を、駆動音源符号として多重化手段２６に出力するとともに、この駆動音源符号に対応する時系列ベクトルを、駆動音源信号としてゲイン符号化手段２５に出力する。
【０１００】
この代数的音源符号化手段４４における探索動作は、文献１に示されている駆動音源符号化手段と同様に行う。また、文献１に示されているように駆動音源の生成部の最終段にピッチフィルタを導入する。即ち、各パルス音源の位置候補にパルスを配置した信号にピッチフィルタを施して駆動音源信号とし、これに対する仮の合成音を生成する。
そして、各位置候補毎の仮の合成音同士の相関と、各位置候補毎の仮の合成音と符号化対象信号の相関を計算し、これらの相関を用いて対データの探索を高速に行う。
【０１０１】
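The result is a combination of pair data for the pulses. The combination is converted into a corresponding code, for example by representing in binary its index in the excitation position/polarity combination table 43, and is output as the final excitation position/polarity code. An algebraic excitation is known to require little computation for the search because of its structure. Reducing the number of pair-data combinations to be searched while retaining that structure reduces the computation further.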
【０１０２】
次に、駆動音源復号化手段３４の代数的音源復号化手段５４は、分離手段３１から音源位置・極性符号を受けると、音源位置・極性組合せテーブル５３（音源位置・極性組合せテーブル４３と同一内容がテーブル化されている）から、音源位置・極性符号に対応するパルス音源の対データの組合せを読み出し、その対データの各位置候補に前記極性を付与したパルスを配置した信号にピッチフィルタを施して駆動音源信号を生成し、その駆動音源信号を出力する。
【０１０３】
以上で明らかなように、この実施の形態２によれば、複数の対データの全組合せのうち、使用頻度が基準頻度より高い対データの組合せを示す情報が記述された音源位置・極性組合せテーブル４３，５３から対データの組合せを選択し、その対データの組合せを用いて入力音声の音源情報を符号化又は復号化するように構成したので、特性の劣化を招くことなく、低ビットレート化を図ることができる音声符号化装置及び音声復号化装置が得られる効果を奏する。
【０１０４】
また、音源位置・極性組合せテーブル４３，５３に記述する対データの組合せは、その使用頻度により選択して抽出するなどの統計的な手法によるなど、実際に入力音声の音源情報を符号化又は復号化するのに則した方法で構成できるので、ヒューリスティックなルールを用いた場合のような不自然な制約がなく、低ビットレートであっても品質のよい音声符号化装置及び音声復号化装置が得られる効果を奏する。
【０１０５】
また、固定的な対データの組合せを用いているので、通信路での符号伝送誤りに対する強い耐性を維持しながら、特性を改善することができる効果を奏する。
【０１０６】
さらに、複数の対データの組合せを音源位置・極性組合せテーブル４３，５３に記述する際、各対データを個別に符号化して記述するようにしたので、あるいは、音源位置・極性組合せテーブル４３，５３の各要素を第２の音源位置・極性組合せテーブルの要素とし、第２の音源位置・極性組合せテーブルの各要素に対する使用の可否を示すフラグ情報を記述するようにしたので、インデックス数Ｎに応じて必要な記憶容量を小さくすることができ、装置化規模が小さい効率的な音声符号化装置及び音声復号化装置が得られる効果を奏する。
【０１０７】
さらに、第２の音源位置・極性組合せテーブルの対データの組合せはフレーム長と音源数とインデックス数とから生成するようにしたので、これに要する記憶容量を不要にすることができ、装置化規模が小さい効率的な音声符号化装置及び音声復号化装置が得られる効果を奏する。
【０１０８】
なお、この実施の形態２では、駆動音源信号の生成部にピッチフィルタを導入しているが、これを駆動音源復号化手段３４においてのみ導入したり、駆動音源符号化手段２４と駆動音源復号化手段３４の両方で導入しない構成も可能である。
【０１０９】
また、この実施の形態２では、対データの組合せに関する評価値として使用頻度を用いるものについて示したが、これに限るものではなく、例えば、符号化歪み等を小さくする期待値などを対データの組合せに関する評価値として用いるようにしてもよい。
【０１１０】
実施の形態３．
図１３はこの発明の実施の形態３による音声符号化装置における駆動音源符号化手段２４の内部を示す構成図であり、図において、６１は複数の音源位置の全組合せのうち、有音の立ち上がり区間で使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、６２は複数の音源位置の全組合せのうち、有音の立ち上がり以外の区間で使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、６３は入力音声を分析して所定のパラメータを抽出し、そのパラメータに対応する音源位置組合せテーブル６１（または６２）を選択する選択手段、６４は選択手段６３により選択された音源位置組合せテーブル６１（または６２）から任意の音源位置の組合せを選択し、その音源位置の組合せと適正な極性を用いて入力音声の音源情報を符号化する代数的音源符号化手段である。
【０１１１】
図１４はこの発明の実施の形態３による音声復号化装置における駆動音源復号化手段３４の内部を示す構成図であり、図において、７１は複数の音源位置の全組合せのうち、有音の立ち上がり区間で使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、７２は複数の音源位置の全組合せのうち、有音の立ち上がり以外の区間で使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、７３は駆動音源符号に含まれている選択情報を示す符号に対応する音源位置組合せテーブル７１（または７２）を選択する選択手段、７４は選択手段７３により選択された音源位置組合せテーブル７１（または７２）から駆動音源符号に含まれている音源位置符号に基づいて音源位置の組合せを選択し、その音源位置の組合せと極性（極性を特定する情報は駆動音源符号に含まれている）を用いて入力音声の音源情報を復号化する代数的音源復号化手段である。
【０１１２】
次に動作について説明する。ただし、駆動音源符号化手段２４及び駆動音源復号化手段３４以外は上記実施の形態１と同様であるため、駆動音源符号化手段２４及び駆動音源復号化手段３４の動作のみを説明する。
【０１１３】
まず、駆動音源符号化手段２４の音源位置組合せテーブル６１には、例えば、図１９に示すパルス音源の位置候補の全組合せの中から有声の立上り区間で使用頻度が高い組合せ（基準頻度より高い組合せ）のみを所望数抽出するなどして、複数のパルス音源の位置候補の組合せが記述されている。また、音源位置組合せテーブル６２には、例えば、図１９に示すパルス音源の位置候補の全組合せの中から有声の立上り以外の区間で使用頻度が高い組合せ（基準頻度より高い組合せ）のみを所望数抽出するなどして、複数のパルス音源の位置候補の組合せが記述されている。
【０１１４】
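In an algebraic excitation, the combinations of pulse position candidates used for coding do not all occur equally often. Their frequency of occurrence is skewed, and excluding the rarely occurring combinations entirely has little effect on the quality of the synthesized speech.
For example, when the power in the second half of the frame is larger than in the first half, as in a voiced onset interval, the pulse positions also tend to concentrate in the second half of the frame. In intervals other than voiced onsets, partly because a pitch filter is used, the pulse positions tend either to concentrate in the first half of the frame or to appear evenly across the frame.
Exploiting this property to constrain the combinations of pulse position candidates according to the characteristics of the input speech reduces the number of bits needed to code the algebraic excitation while limiting the degradation of coding characteristics.
【０１１５】
The selecting means 63 analyzes the input speech and, based on the analysis result, selects and switches the excitation position combination table to be used. For example, it selects the excitation position combination table 61 in a voiced onset interval and the excitation position combination table 62 in any other interval.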
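A power-based onset decision of the kind described can be sketched as below. The source does not specify the analysis, so the half-frame energy comparison, the threshold value of 4.0, and the names are all hypothetical.

```python
ONSET_TABLE = 61      # table tuned for voiced onset intervals
DEFAULT_TABLE = 62    # table for all other intervals

def select_table(frame, ratio_threshold=4.0):
    """Pick a table number from the energy balance of the two frame halves.

    Assumption: a frame counts as a voiced onset when the second-half energy
    exceeds `ratio_threshold` times the first-half energy.
    """
    half = len(frame) // 2
    first = sum(x * x for x in frame[:half])
    second = sum(x * x for x in frame[half:])
    return ONSET_TABLE if second > ratio_threshold * first else DEFAULT_TABLE
```

Since this decision uses the input speech, which the decoder does not have, the selection has to be transmitted as part of the driving excitation code (one bit suffices for two tables).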
【０１１６】
代数的音源符号化手段６４は、選択手段６３により選択された音源位置組合せテーブル６１（または６２）に格納されているパルス音源の位置候補の組合せを順次読み出して、各位置候補に任意の極性でパルスを立ててピッチフィルタを施したときの仮の合成音を生成する。
そして、適応音源符号化手段２３から出力された符号化対象信号と仮の合成音との距離を計算して、その距離を最小にするパルス音源の位置候補の組合せと極性を探索する。
【０１１７】
代数的音源符号化手段６４は、その探索が完了すると、その探索結果であるパルス音源の位置候補の組合せを表す音源位置符号と極性とを、駆動音源符号として多重化手段２６に出力するとともに、この駆動音源符号に対応する時系列ベクトルを、駆動音源信号としてゲイン符号化手段２５に出力する。なお、選択手段６３から出力された選択情報も、駆動音源符号に含められて多重化手段２６に出力される。
【０１１８】
次に、駆動音源復号化手段３４の選択手段７３は、駆動音源符号に含まれている選択情報を示す符号に対応する音源位置組合せテーブル、即ち、音源位置組合せテーブル７１又は音源位置組合せテーブル７２を選択する。ただし、音源位置組合せテーブル７１は音源位置組合せテーブル６１と同一内容がテーブル化され、音源位置組合せテーブル７２は音源位置組合せテーブル６２と同一内容がテーブル化されている。
【０１１９】
代数的音源復号化手段７４は、選択手段７３により選択された音源位置組合せテーブル７１（または７２）から、音源位置符号に対応するパルス音源の位置候補の組合せを読み出し、各位置候補に前記極性を付与したパルスを配置した信号にピッチフィルタを施して駆動音源信号を生成し、その駆動音源信号を出力する。
【０１２０】
以上で明らかなように、この実施の形態３によれば、記述内容が相互に異なる音源位置組合せテーブル６１，６２（または７１，７２）を有し、任意の音源位置組合せテーブルを選択して使用するように構成したので、上記実施の形態１と同様の効果を奏することができるとともに、特性の劣化を効果的に抑制することができる効果を奏する。
【０１２１】
なお、この実施の形態３では、駆動音源信号の生成部にピッチフィルタを導入しているが、これを駆動音源復号化手段３４においてのみ導入したり、駆動音源符号化手段２４と駆動音源復号化手段３４の両方で導入しない構成も可能である。
【０１２２】
また、この実施の形態３では、有声の立上りか否かにより音源位置組合せテーブルを切り換えているが、母音部か否か、雑音区間か音声区間か、あるいは、ピッチ長の大小に応じて切り換えるなど、他の基準を用いる構成も可能である。
さらに、これらの基準を複数組み合せて用いる構成も可能である。
【０１２３】
また、この実施の形態３では、有声の立上りか否かを判定しているので、主に入力音声のパワー情報をパラメータとして用いて音源位置組合せテーブルを切り換えることになるが、入力音声を分析して得られるピッチ情報やスペクトル情報など、他のパラメータを用いる構成も可能である。さらに、これらのパラメータを複数組み合せて用いる構成も可能である。
【０１２４】
この実施の形態３では、２つの音源位置組合せテーブルを切り換えているが、３つ以上の音源位置組合せテーブルを切り換える構成も可能である。
また、この実施の形態３では、複数の音源位置組合せテーブルを切り換えているが、複数の音源位置・極性組合せテーブルを切り換える構成も可能である。
【０１２５】
実施の形態４．
図１５はこの発明の実施の形態４による音声符号化装置における駆動音源符号化手段２４の内部を示す構成図であり、図において、８１は複数の音源位置の全組合せのうち、ピッチ周期がフレーム長より短い場合に使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、８２は複数の音源位置の全組合せのうち、ピッチ周期がフレーム長より長い場合に使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、８３は適応音源符号からピッチ周期を求め、そのピッチ周期に対応する音源位置組合せテーブル８１（または８２）を選択する選択手段、８４は選択手段８３により選択された音源位置組合せテーブル８１（または８２）から任意の音源位置の組合せを選択し、その音源位置の組合せと適正な極性を用いて入力音声の音源情報を符号化する代数的音源符号化手段である。
【０１２６】
図１６はこの発明の実施の形態４による音声復号化装置における駆動音源復号化手段３４の内部を示す構成図であり、図において、９１は複数の音源位置の全組合せのうち、ピッチ周期がフレーム長より短い場合に使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、９２は複数の音源位置の全組合せのうち、ピッチ周期がフレーム長より長い場合に使用頻度が基準頻度より高い音源位置の組合せを示す情報が記述された音源位置組合せテーブル（インデックステーブル）、９３は適応音源符号からピッチ周期を求め、そのピッチ周期に対応する音源位置組合せテーブル９１（または９２）を選択する選択手段、９４は選択手段９３により選択された音源位置組合せテーブル９１（または９２）から駆動音源符号に含まれている音源位置符号に基づいて音源位置の組合せを選択し、その音源位置の組合せと極性（極性を特定する情報は駆動音源符号に含まれている）を用いて入力音声の音源情報を復号化する代数的音源復号化手段である。
【０１２７】
次に動作について説明する。ただし、駆動音源符号化手段２４及び駆動音源復号化手段３４以外は上記実施の形態１と同様であるため、駆動音源符号化手段２４及び駆動音源復号化手段３４の動作のみを説明する。
【０１２８】
まず、駆動音源符号化手段２４の音源位置組合せテーブル８１には、例えば、図１９に示すパルス音源の位置候補の全組合せの中からピッチ周期がフレーム長より短い場合に使用頻度が高い組合せ（基準頻度より高い組合せ）のみを所望数抽出するなどして、複数のパルス音源の位置候補の組合せが記述されている。また、音源位置組合せテーブル８２には、例えば、図１９に示すパルス音源の位置候補の全組合せの中からピッチ周期がフレーム長より長い場合に使用頻度が高い組合せ（基準頻度より高い組合せ）のみを所望数抽出するなどして、複数のパルス音源の位置候補の組合せが記述されている。
【０１２９】
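In an algebraic excitation, the combinations of pulse position candidates used for coding do not all occur equally often. Their frequency of occurrence is skewed, and excluding the rarely occurring combinations entirely has little effect on the quality of the synthesized speech.
For example, when the pitch period is shorter than the frame length, partly because a pitch filter is used, the pulse positions tend to concentrate in the first half of the frame. When the pitch period is longer than the frame length, the pulse positions tend to appear evenly across the frame.
Exploiting this property to constrain the combinations of pulse position candidates according to the characteristics of the input speech reduces the number of bits needed to code the algebraic excitation while limiting the degradation of coding characteristics.
【０１３０】
The selecting means 83 derives the pitch period from the adaptive excitation code and, based on that period, selects and switches the excitation position combination table to be used. For example, it selects the excitation position combination table 81 when the pitch period is shorter than the frame length and the excitation position combination table 82 when the pitch period is longer than the frame length.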
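The pitch-based selection can be sketched as follows. The mapping from adaptive excitation code to pitch period (`MIN_LAG + code`) is hypothetical, and the source does not say which table applies when the pitch period equals the frame length; this sketch assigns that case to the long-pitch table as an assumption.

```python
MIN_LAG = 20  # hypothetical smallest pitch lag represented by code 0

def pitch_from_adaptive_code(adaptive_code):
    """Assumed mapping from adaptive excitation code to pitch period in samples."""
    return MIN_LAG + adaptive_code

def select_table_by_pitch(adaptive_code, frame_len, short_table=81, long_table=82):
    """Choose a table number from the pitch period implied by the adaptive code."""
    pitch_period = pitch_from_adaptive_code(adaptive_code)
    return short_table if pitch_period < frame_len else long_table
```

A decoder can run the same function with its own table numbers (for example `short_table=91, long_table=92`) on the received adaptive excitation code, so no selection information needs to be transmitted.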
【０１３１】
代数的音源符号化手段８４は、選択手段８３により選択された音源位置組合せテーブル８１（または８２）に格納されているパルス音源の位置候補の組合せを順次読み出して、各位置候補に任意の極性でパルスを立ててピッチフィルタを施したときの仮の合成音を生成する。
そして、適応音源符号化手段２３から出力された符号化対象信号と仮の合成音との距離を計算して、その距離を最小にするパルス音源の位置候補の組合せと極性を探索する。
代数的音源符号化手段８４は、その探索が完了すると、その探索結果であるパルス音源の位置候補の組合せを表す音源位置符号と極性とを、駆動音源符号として多重化手段２６に出力するとともに、この駆動音源符号に対応する時系列ベクトルを、駆動音源信号としてゲイン符号化手段２５に出力する。
【０１３２】
次に、駆動音源復号化手段３４の選択手段９３は、駆動音源符号化手段２４の選択手段８３と同様にして、音源位置組合せテーブル９１又は音源位置組合せテーブル９２を選択する。ただし、音源位置組合せテーブル９１は音源位置組合せテーブル８１と同一内容がテーブル化され、音源位置組合せテーブル９２は音源位置組合せテーブル８２と同一内容がテーブル化されている。
【０１３３】
代数的音源復号化手段９４は、選択手段９３により選択された音源位置組合せテーブル９１（または９２）から、音源位置符号に対応するパルス音源の位置候補の組合せを読み出し、各位置候補に前記極性を付与したパルスを配置した信号にピッチフィルタを施して駆動音源信号を生成し、その駆動音源信号を出力する。
【０１３４】
以上で明らかなように、この実施の形態４によれば、記述内容が相互に異なる音源位置組合せテーブル８１，８２（または９１，９２）を有し、任意の音源位置組合せテーブルを選択して使用するように構成したので、上記実施の形態１と同様の効果を奏することができるとともに、特性の劣化を効果的に抑制することができる効果を奏する。
また、この実施の形態４では、適応音源符号より求めることができるピッチ周期に基づいて音源位置組合せテーブルを選択するようにしているので、使用対象の音源位置組合せテーブルを特定する選択情報の符号化が不要になる効果も奏する。
【０１３５】
なお、この実施の形態４では、駆動音源信号の生成部にピッチフィルタを導入しているが、これを駆動音源復号化手段３４においてのみ導入したり、駆動音源符号化手段２４と駆動音源復号化手段３４の両方で導入しない構成も可能である。
【０１３６】
また、この実施の形態４では、適応音源符号から求まるピッチ周期に応じて音源位置組合せテーブルを切り換えているが、線形予測係数の符号から求まるスペクトル様態に応じて切り換えるなど、他のパラメータを用いる構成も可能である。さらに、これらのパラメータを複数組み合せて用いる構成も可能である。
【０１３７】
また、この実施の形態４では、現フレームで求められた符号を用いて音源位置組合せテーブルを切り換えるためのパラメータを求めているが、過去のフレームにおける符号を用いて音源位置組合せテーブルを切り換えるためのパラメータを求める構成も可能である。
【０１３８】
この実施の形態４では、符号に基づいて音源位置組合せテーブルを切り換えるためのパラメータを求めているが、過去に生成された音源信号や出力音声など、音声符号化装置及び音声復号化装置に共通に生成可能な信号を分析してパラメータを求める構成も可能である。
【０１３９】
また、この実施の形態４では、２つの音源位置組合せテーブルを切り換えているが、３つ以上の音源位置組合せテーブルを切り換える構成も可能である。
さらに、この実施の形態４では、複数の音源位置組合せテーブルを切り換えているが、複数の音源位置・極性組合せテーブルを切り換える構成も可能である。
【０１４０】
【発明の効果】
以上のように、この発明によれば、音源符号化手段が複数の音源位置の全組合せに対して個々に定まる音源位置の組合せに関する評価値が基準値より高い音源位置の組合せのみを抽出して、その音源位置の組合せを示す情報を記述したインデックステーブルを備え、そのインデックステーブルから任意の音源位置の組合せを選択し、その音源位置の組合せを用いて入力音声の音源情報を符号化するように構成したので、特性の劣化を招くことなく、低ビットレート化を図ることができる効果がある。
また、予め作成して備えているインデックステーブルに存在する固定的な音源位置の組合せを用いているので、通信路での符号伝送誤りに対する強い耐性を維持しながら、特性を改善することができる効果がある。
【０１４１】
この発明によれば、インデックステーブルに記述されている音源位置の組合せを示す情報が、個別に符号化された音源位置の組合せ情報であるように構成したので、記憶容量が小さい効率的な音声符号化装置が得られる効果がある。
【０１４２】
この発明によれば、音源符号化手段が少なくとも評価値が基準値より高い音源位置の組合せを含む音源位置の組合せを認識することが可能な場合、インデックステーブルに記述されている音源位置の組合せを示す情報が、音源位置の各組合せがインデックステーブルの要素であるか否かを示すフラグ情報であるように構成したので、記憶容量が小さい効率的な音声符号化装置が得られる効果がある。
【０１４３】
この発明によれば、音源符号化手段がフレーム長と音源数とインデックス数から少なくとも評価値が基準値より高い音源位置の組合せを含む音源位置の組合せを認識するように構成したので、記憶容量が小さい効率的な音声符号化装置が得られる効果がある。
【０１４４】
この発明によれば、音源符号化手段が音源位置及び極性から構成された複数の対データの全組合せに対して個々に定まる対データの組合せに関する評価値が基準値より高い対データの組合せのみを抽出して、その対データの組合せを示す情報を記述したインデックステーブルを備え、そのインデックステーブルから任意の対データの組合せを選択し、その対データの組合せを用いて入力音声の音源情報を符号化するように構成したので、特性の劣化を招くことなく、低ビットレート化を図ることができる効果がある。
また、予め作成して備えているインデックステーブルに存在する固定的な対データの組合せを用いているので、通信路での符号伝送誤りに対する強い耐性を維持しながら、特性を改善することができる効果がある。
【０１４５】
この発明によれば、インデックステーブルに記述されている対データの組合せを示す情報が、個別に符号化された対データの組合せ情報であるように構成したので、記憶容量が小さい効率的な音声符号化装置が得られる効果がある。
【０１４６】
この発明によれば、音源符号化手段が少なくとも評価値が基準値より高い対データの組合せを含む対データの組合せを認識することが可能な場合、インデックステーブルに記述されている対データの組合せを示す情報が、対データの各組合せがインデックステーブルの要素であるか否かを示すフラグ情報であるように構成したので、記憶容量が小さい効率的な音声符号化装置が得られる効果がある。
【０１４７】
この発明によれば、音源符号化手段がフレーム長と音源数とインデックス数から少なくとも評価値が基準値より高い対データの組合せを含む対データの組合せを認識するように構成したので、記憶容量が小さい効率的な音声符号化装置が得られる効果がある。
【０１４８】
この発明によれば、音源符号化手段が記述内容が相互に異なるインデックステーブルを複数個有し、任意のインデックステーブルを選択して使用するように構成したので、低ビットレート化を図ることができる効果がある。また、特性の劣化を効果的に抑制することができる効果がある。
【０１４９】
この発明によれば、音源符号化手段が入力音声を分析して所定のパラメータを抽出し、そのパラメータに対応するインデックステーブルを選択するように構成したので、複雑な処理を実施することなく、インデックステーブルを選択することができる効果がある。
【０１５０】
この発明によれば、音源符号化手段がスペクトル包絡情報および音源情報の少なくともどちらか一方から所定のパラメータを抽出し、そのパラメータに対応するインデックステーブルを選択するように構成したので、使用対象のインデックステーブルを特定する選択情報の符号化が不要になる効果がある。
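[0151]
According to the present invention, the sound source decoding means is provided with an index table describing information that indicates combinations of sound source positions, the table being obtained by extracting, from all combinations of a plurality of sound source positions, only those combinations whose evaluation value, determined individually for each combination, is higher than a reference value. A combination of sound source positions is selected from the index table based on the code, contained in the sound source information, that indicates the combination, and the sound source information of the input speech is decoded using that combination. This makes it possible to lower the bit rate without degrading the characteristics.
In addition, because fixed combinations of sound source positions held in an index table prepared in advance are used, the characteristics can be improved while strong resistance to code transmission errors on the communication channel is maintained.
[0152]
According to the present invention, the information indicating the combinations of sound source positions described in the index table is individually encoded combination information of sound source positions, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0153]
According to the present invention, when the sound source decoding means can recognize a set of combinations of sound source positions that includes at least the combinations whose evaluation value is higher than the reference value, the information indicating the combinations of sound source positions described in the index table is flag information indicating whether each combination of sound source positions is an element of the index table, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0154]
According to the present invention, the sound source decoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, a set of combinations of sound source positions that includes at least the combinations whose evaluation value is higher than the reference value, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0155]
According to the present invention, the sound source decoding means is provided with an index table describing information that indicates combinations of paired data, each paired datum consisting of a sound source position and a polarity, the table being obtained by extracting, from all combinations of a plurality of paired data, only those combinations whose evaluation value, determined individually for each combination, is higher than a reference value. A combination of paired data is selected from the index table based on the code, contained in the sound source information, that indicates the combination, and the sound source information of the input speech is decoded using that combination. This makes it possible to lower the bit rate without degrading the characteristics.
In addition, because fixed combinations of paired data held in an index table prepared in advance are used, the characteristics can be improved while strong resistance to code transmission errors on the communication channel is maintained.
[0156]
According to the present invention, the information indicating the combinations of paired data described in the index table is individually encoded combination information of paired data, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0157]
According to the present invention, when the sound source decoding means can recognize a set of combinations of paired data that includes at least the combinations whose evaluation value is higher than the reference value, the information indicating the combinations of paired data described in the index table is flag information indicating whether each combination of paired data is an element of the index table, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0158]
According to the present invention, the sound source decoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, a set of combinations of paired data that includes at least the combinations whose evaluation value is higher than the reference value, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0159]
According to the present invention, the sound source decoding means has a plurality of index tables whose described contents differ from one another, and selects and uses the index table corresponding to the code, contained in the sound source information, that indicates the selection information, so that the bit rate can be lowered. In addition, degradation of the characteristics can be suppressed effectively.
[0160]
According to the present invention, the sound source decoding means has a plurality of index tables whose described contents differ from one another, extracts a predetermined parameter from at least one of the spectral envelope information and the sound source information, and selects and uses the index table corresponding to that parameter, so that the bit rate can be lowered. In addition, degradation of the characteristics can be suppressed effectively.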
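[Brief Description of the Drawings]
FIG. 1 is a block diagram showing a speech coding apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing a speech decoding apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the inside of the driving excitation encoding means in the speech coding apparatus according to Embodiment 1 of the present invention.
FIG. 4 is a block diagram showing the inside of the driving excitation decoding means in the speech decoding apparatus according to Embodiment 1 of the present invention.
FIG. 5 is an explanatory diagram showing a sound source position combination table.
FIG. 6 is an explanatory diagram showing a method of storing information in the sound source position combination table.
FIG. 7 is an explanatory diagram showing a method of storing information in the sound source position combination table.
FIG. 8 is a block diagram showing the inside of the driving excitation encoding means in the speech coding apparatus according to Embodiment 2 of the present invention.
FIG. 9 is a block diagram showing the inside of the driving excitation decoding means in the speech decoding apparatus according to Embodiment 2 of the present invention.
FIG. 10 is an explanatory diagram showing a sound source position/polarity combination table.
FIG. 11 is an explanatory diagram showing a method of storing information in the sound source position/polarity combination table.
FIG. 12 is an explanatory diagram showing a method of storing information in the sound source position/polarity combination table.
FIG. 13 is a block diagram showing the inside of the driving excitation encoding means in the speech coding apparatus according to Embodiment 3 of the present invention.
FIG. 14 is a block diagram showing the inside of the driving excitation decoding means in the speech decoding apparatus according to Embodiment 3 of the present invention.
FIG. 15 is a block diagram showing the inside of the driving excitation encoding means in the speech coding apparatus according to Embodiment 4 of the present invention.
FIG. 16 is a block diagram showing the inside of the driving excitation decoding means in the speech decoding apparatus according to Embodiment 4 of the present invention.
FIG. 17 is a block diagram showing a speech coding apparatus using the conventional CELP method.
FIG. 18 is a block diagram showing a speech decoding apparatus using the conventional CELP method.
FIG. 19 is an explanatory diagram showing a sound source position table.
FIG. 20 is an explanatory diagram showing a sound source position/polarity table.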
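[Explanation of Symbols]
21 linear prediction analysis means (envelope information encoding means), 22 linear prediction coefficient encoding means (envelope information encoding means), 23 adaptive excitation encoding means (sound source coding means), 24 driving excitation encoding means (sound source coding means), 25 gain encoding means (sound source coding means), 26 multiplexing means, 31 separating means, 32 linear prediction coefficient decoding means (envelope information decoding means), 33 adaptive excitation decoding means (sound source decoding means), 34 driving excitation decoding means (sound source decoding means), 35 gain decoding means (sound source decoding means), 36, 37 multipliers (sound source decoding means), 38 adder (sound source decoding means), 39 synthesis filter, 41, 51, 71, 72, 81, 82, 91, 92 sound source position combination tables (index tables), 42, 44, 64, 84 algebraic sound source encoding means, 43, 53 sound source position/polarity combination tables (index tables), 52, 54, 74, 94 algebraic sound source decoding means, 61, 62 sound source position combination tables (index tables), 63, 73, 83, 93 selection means.
[0001]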
BACKGROUND OF THE INVENTION
The present invention relates to a speech coding apparatus that compresses a digital speech signal to a small amount of information, and a speech decoding apparatus that decodes a speech code generated by the speech coding apparatus and reproduces a digital speech signal.
[0002]
[Prior art]
In many conventional speech coding apparatuses and speech decoding apparatuses, the speech coding apparatus divides the input speech into spectral envelope information and sound source information and encodes each of them in units of frames of a predetermined length to generate a speech code. The speech decoding apparatus decodes this speech code and generates decoded speech by combining the spectral envelope information and the sound source information in a synthesis filter. The most typical speech coding and decoding apparatuses are those using the code-excited linear prediction (CELP) method.
[0003]
FIG. 17 is a block diagram showing a speech coding apparatus using the conventional CELP method. In FIG. 17, 1 is linear prediction analysis means that analyzes the input speech and extracts linear prediction coefficients, which are the spectral envelope information of the input speech, and 2 is linear prediction coefficient encoding means that encodes the linear prediction coefficients extracted by the linear prediction analysis means 1. 3 is adaptive excitation encoding means that generates a temporary synthesized sound using the linear prediction coefficients quantized by the linear prediction coefficient encoding means 2, selects the adaptive excitation code that minimizes the distance between the temporary synthesized sound and the input speech, outputs that code to the multiplexing means 6, and outputs the adaptive excitation signal corresponding to that code (a time-series vector in which a past excitation signal of a predetermined length is periodically repeated) to the gain encoding means 5. 4 is driving excitation encoding means that generates a temporary synthesized sound using the linear prediction coefficients quantized by the linear prediction coefficient encoding means 2, selects the driving excitation code that minimizes the distance between the temporary synthesized sound and the encoding target signal (the signal obtained by subtracting the synthesized sound of the adaptive excitation signal from the input speech), outputs that code to the multiplexing means 6, and outputs the driving excitation signal, which is the time-series vector corresponding to that code, to the gain encoding means 5.
[0004]
5 is gain encoding means that multiplies the adaptive excitation signal output from the adaptive excitation encoding means 3 and the driving excitation signal output from the driving excitation encoding means 4 by the respective elements of a gain vector, adds the multiplication results to generate a sound source signal, generates a temporary synthesized sound from that sound source signal using the linear prediction coefficients quantized by the linear prediction coefficient encoding means 2, selects the gain code that minimizes the distance between the temporary synthesized sound and the input speech, and outputs that code to the multiplexing means 6. 6 is multiplexing means that multiplexes the code of the linear prediction coefficients encoded by the linear prediction coefficient encoding means 2, the adaptive excitation code output from the adaptive excitation encoding means 3, the driving excitation code output from the driving excitation encoding means 4, and the gain code output from the gain encoding means 5, and outputs a speech code.
[0005]
FIG. 18 is a block diagram showing a speech decoding apparatus using the conventional CELP method. In FIG. 18, 11 is separating means that separates the speech code output from the speech coding apparatus, outputs the code of the linear prediction coefficients to the linear prediction coefficient decoding means 12, outputs the adaptive excitation code to the adaptive excitation decoding means 13, outputs the driving excitation code to the driving excitation decoding means 14, and outputs the gain code to the gain decoding means 15. 12 is linear prediction coefficient decoding means that decodes the code of the linear prediction coefficients output from the separating means 11, converts the decoding result into filter coefficients of the synthesis filter 19, and outputs the filter coefficients to the synthesis filter 19.
[0006]
13 is adaptive excitation decoding means that outputs the adaptive excitation signal corresponding to the adaptive excitation code output from the separating means 11 (a time-series vector in which a past excitation signal is periodically repeated), 14 is driving excitation decoding means that outputs the driving excitation signal, which is the time-series vector corresponding to the driving excitation code output from the separating means 11, and 15 is gain decoding means that outputs the gain vector corresponding to the gain code output from the separating means 11.
[0007]
16 is a multiplier that multiplies the adaptive excitation signal output from the adaptive excitation decoding means 13 by an element of the gain vector output from the gain decoding means 15, 17 is a multiplier that multiplies the driving excitation signal output from the driving excitation decoding means 14 by an element of the gain vector output from the gain decoding means 15, 18 is an adder that adds the multiplication results of the multipliers 16 and 17 to generate a sound source signal, and 19 is a synthesis filter that performs synthesis filtering on the sound source signal generated by the adder 18 to generate output speech.
[0008]
Next, the operation will be described.
In a conventional speech encoding device and speech decoding device, processing is performed in units of frames, with about 5 to 50 ms being one frame.
[0009]
First, when a speech is input, the linear prediction analysis unit 1 of the speech encoding apparatus analyzes the input speech and extracts linear prediction coefficients that are speech spectral envelope information.
When the linear prediction analysis means 1 extracts a linear prediction coefficient, the linear prediction coefficient encoding means 2 encodes the linear prediction coefficient and outputs the code to the multiplexing means 6. In addition, the quantized linear prediction coefficient corresponding to the code is output to the adaptive excitation encoding means 3, the drive excitation encoding means 4 and the gain encoding means 5.
[0010]
The adaptive excitation encoding means 3 has a built-in adaptive excitation codebook that stores a past excitation signal of a predetermined length and, for each internally generated adaptive excitation code (each code is represented by a binary value of several bits), generates a time-series vector in which the past excitation signal is periodically repeated.
Next, it multiplies each time-series vector by an appropriate gain and passes it through a synthesis filter that uses the linear prediction coefficients quantized by the linear prediction coefficient encoding means 2 to generate a temporary synthesized sound.
[0011]
Then, the adaptive excitation encoding means 3 examines the distance between the temporary synthesized sound and the input speech, selects the adaptive excitation code that minimizes this distance, outputs it to the multiplexing means 6, and outputs the time-series vector corresponding to the selected adaptive excitation code to the gain encoding means 5 as the adaptive excitation signal. It also outputs the signal obtained by subtracting the synthesized sound of the adaptive excitation signal from the input speech to the driving excitation encoding means 4 as the encoding target signal.
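The adaptive excitation search of paragraphs [0010] and [0011] can be sketched as follows. This is a minimal illustration, not the apparatus itself: it assumes that the "distance" is the squared error, that the "appropriate gain" is the least-squares gain for each candidate, that the adaptive excitation code is simply a lag value, and that the synthesis filter is an all-pole filter starting from a zero state. The function names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter: s[n] = e[n] - sum_k lpc[k] * s[n-1-k], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc -= a * out[n - 1 - k]
        out.append(acc)
    return out


def adaptive_vector(past, lag, frame_len):
    """Repeat the last `lag` samples of the past excitation periodically over one frame."""
    segment = past[-lag:]
    return [segment[i % lag] for i in range(frame_len)]


def search_adaptive_code(target, past, lpc, lags):
    """Return (lag, gain, error) for the candidate with the smallest squared error."""
    best = None
    for lag in lags:
        synth = synthesize(adaptive_vector(past, lag, len(target)), lpc)
        energy = sum(y * y for y in synth)
        if energy == 0.0:
            continue  # a silent candidate cannot match anything
        gain = sum(t * y for t, y in zip(target, synth)) / energy
        error = sum((t - gain * y) ** 2 for t, y in zip(target, synth))
        if best is None or error < best[2]:
            best = (lag, gain, error)
    return best
```

The encoding target signal passed on to the driving excitation search would then be the input minus `gain` times the synthesized vector of the winning lag.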
[0012]
The driving excitation encoding means 4 has a built-in driving excitation codebook that stores driving code vectors, which are a plurality of non-noise-like or noise-like time-series vectors, and sequentially reads time-series vectors from the driving excitation codebook according to each internally generated driving excitation code (each code is represented by a binary value of several bits).
Next, it multiplies each time-series vector by an appropriate gain and passes it through a synthesis filter that uses the linear prediction coefficients quantized by the linear prediction coefficient encoding means 2 to generate a temporary synthesized sound.
[0013]
Then, the driving excitation encoding means 4 investigates the distance between the temporary synthesized sound and the encoding target signal output from the adaptive excitation encoding means 3, and selects the driving excitation code that minimizes this distance. While outputting to the multiplexing means 6, the time series vector corresponding to the selected driving excitation code is output to the gain encoding means 5 as a driving excitation signal.
[0014]
The gain encoding means 5 has a built-in gain codebook that stores gain vectors, and sequentially reads gain vectors from the gain codebook according to each internally generated gain code (each code is represented by a binary value of several bits).
It then multiplies the adaptive excitation signal output from the adaptive excitation encoding means 3 and the driving excitation signal output from the driving excitation encoding means 4 by the respective elements of each gain vector, and adds the multiplication results to generate a sound source signal.
Next, the sound source signal is passed through a synthesis filter that uses the linear prediction coefficient quantized by the linear prediction coefficient encoding means 2 to generate a temporary synthesized sound.
[0015]
Then, the gain encoding unit 5 investigates the distance between the temporary synthesized sound and the input speech, selects a gain code that minimizes this distance, and outputs it to the multiplexing unit 6. In addition, the excitation signal corresponding to the gain code is output to the adaptive excitation encoding means 3. As a result, the adaptive excitation encoding unit 3 updates the built-in adaptive excitation codebook using the excitation signal corresponding to the gain code selected by the gain encoding unit 5.
[0016]
The multiplexing means 6 multiplexes the code of the linear prediction coefficients encoded by the linear prediction coefficient encoding means 2, the adaptive excitation code output from the adaptive excitation encoding means 3, the driving excitation code output from the driving excitation encoding means 4, and the gain code output from the gain encoding means 5, and outputs the resulting speech code to the speech decoding apparatus.
[0017]
When the speech coding apparatus outputs the speech code, the separating means 11 of the speech decoding apparatus separates the speech code, outputs the code of the linear prediction coefficients to the linear prediction coefficient decoding means 12, outputs the adaptive excitation code to the adaptive excitation decoding means 13, outputs the driving excitation code to the driving excitation decoding means 14, and outputs the gain code to the gain decoding means 15. On receiving the code of the linear prediction coefficients from the separating means 11, the linear prediction coefficient decoding means 12 decodes the code, converts the decoding result into filter coefficients of the synthesis filter 19, and outputs the filter coefficients to the synthesis filter 19.
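The pairing of the multiplexing means 6 and the separating means 11 can be illustrated with a small bit-packing sketch. The field order and the bit widths below are purely hypothetical (the text does not give them); the point is only that both sides must agree on a fixed layout so that the four codes can be recovered exactly.

```python
# Hypothetical field layout: (name, bit width). Not taken from the source.
FIELDS = (("lpc", 18), ("adaptive", 8), ("drive", 17), ("gain", 7))


def multiplex(codes):
    """Pack the four codes into one integer speech code, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = codes[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} code does not fit in {width} bits")
        word = (word << width) | value
    return word


def separate(word):
    """Split a speech code back into its four codes (inverse of multiplex)."""
    codes = {}
    for name, width in reversed(FIELDS):
        codes[name] = word & ((1 << width) - 1)
        word >>= width
    return codes
```

With this layout one frame occupies the sum of the field widths, so any bits saved in the driving excitation code translate directly into a lower bit rate.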
[0018]
The adaptive excitation decoding means 13 has a built-in adaptive excitation codebook that stores a past excitation signal of a predetermined length, and outputs the adaptive excitation signal corresponding to the adaptive excitation code output from the separating means 11 (a time-series vector in which the past excitation signal is periodically repeated).
The driving excitation decoding means 14 has a built-in driving excitation codebook that stores driving code vectors, which are a plurality of non-noise-like or noise-like time-series vectors, and outputs the driving excitation signal corresponding to the driving excitation code output from the separating means 11.
The gain decoding unit 15 has a built-in gain codebook for storing the gain vector, and outputs a gain vector corresponding to the gain code output from the separating unit 11.
[0019]
The adaptive excitation signal output from the adaptive excitation decoding means 13 and the driving excitation signal output from the driving excitation decoding means 14 are multiplied by the elements of the gain vector by multipliers 16 and 17, and the adder 18 The multiplication results of the multipliers 16 and 17 are added to each other.
[0020]
The synthesis filter 19 performs synthesis filtering processing on the sound source signal that is the addition result of the adder 18 to generate output sound. As the filter coefficient, the linear prediction coefficient decoded by the linear prediction coefficient decoding unit 12 is used.
Finally, adaptive excitation decoding means 13 updates the built-in adaptive excitation codebook using the excitation signal.
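The decoder-side steps of paragraphs [0019] and [0020] can be sketched for one frame as follows. This is an illustrative sketch under stated assumptions: the gain vector is taken to have two elements, the synthesis filter is an all-pole filter whose memory is carried between frames, and the adaptive excitation codebook is a fixed-length, non-empty buffer of past sound source samples. The function name and argument layout are hypothetical.

```python
def decode_frame(adaptive, drive, gains, lpc, filter_memory, past_excitation):
    """Rebuild one frame: scale, add, synthesis-filter, then update the adaptive codebook."""
    g_adaptive, g_drive = gains
    # Multipliers 16 and 17 followed by adder 18.
    excitation = [g_adaptive * a + g_drive * d for a, d in zip(adaptive, drive)]
    # Synthesis filter 19: s[n] = e[n] - sum_k lpc[k] * s[n-1-k].
    history = list(filter_memory)  # most recent output sample first, len(lpc) entries
    speech = []
    for e in excitation:
        s = e - sum(a * h for a, h in zip(lpc, history))
        speech.append(s)
        if history:
            history = [s] + history[:-1]
    # Adaptive codebook update: append the new excitation, keep the buffer length fixed.
    updated_past = (list(past_excitation) + excitation)[-len(past_excitation):]
    return speech, history, updated_past
```

Returning the filter memory and the updated buffer makes the per-frame state explicit, which is also where a transmission error would propagate into later frames.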
[0021]
Next, a conventional technique in which the above-described speech encoding device and speech decoding device are improved will be described.
Akitoshi Kataoka, Shinji Hayashi, Takehiro Moriya, Shoko Kurihara, and Kazunori Mano, "CS-ACELP Basic Algorithm," NTT R&D, Vol. 45, pp. 325-330, April 1996 (Document 1), discloses a CELP speech coding apparatus and speech decoding apparatus in which a pulse sound source is introduced into the coding of the driving sound source, mainly to reduce the amount of calculation and the amount of memory. In this conventional configuration, the driving sound source is expressed only by the position information and polarity information of several pulses. Such a sound source is called an algebraic sound source; it offers good coding characteristics for its simple structure and has been adopted in many recent standard systems.
[0022]
FIG. 19 is a sound source position table showing the pulse sound source position candidates used in Document 1. The table is installed in the driving excitation encoding means 4 of the speech coding apparatus in FIG. 17 and in the driving excitation decoding means 14 of the speech decoding apparatus in FIG. 18.
In Document 1, the excitation encoding frame length is 40 samples, and the driving excitation is composed of four pulses. As shown in FIG. 19, the position candidates of the sound source numbers 1 to 3 are restricted to eight positions, and each pulse position can be encoded by 3 bits. The pulse of sound source number 4 is restricted to 16 positions, and the pulse position can be encoded by 4 bits. By constraining the pulse sound source position candidates, it is possible to reduce the amount of calculation by reducing the number of encoded bits and the number of combinations while suppressing deterioration of the encoding characteristics.
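The bit count described above can be checked with a short sketch. The position candidates below follow the interleaved track layout commonly associated with CS-ACELP; FIG. 19 is not reproduced in this text, so treating it as this layout is an assumption, as are the function names.

```python
# Assumed position candidates per sound source number (interleaved tracks, 40-sample frame).
POSITION_TABLE = {
    1: list(range(0, 40, 5)),
    2: list(range(1, 40, 5)),
    3: list(range(2, 40, 5)),
    4: sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),
}


def position_bits(table):
    """Bits needed to index each candidate list."""
    return {number: (len(cands) - 1).bit_length() for number, cands in table.items()}


def encode_positions(positions, table):
    """Concatenate the per-pulse candidate indices into one position code."""
    code = 0
    for number in sorted(table):
        cands = table[number]
        code = code * len(cands) + cands.index(positions[number - 1])
    return code


def decode_positions(code, table):
    """Recover the pulse positions from a position code."""
    positions = []
    for number in sorted(table, reverse=True):
        cands = table[number]
        code, index = divmod(code, len(cands))
        positions.append(cands[index])
    return positions[::-1]
```

Under this layout the four pulse positions take 3 + 3 + 3 + 4 = 13 bits, and every one of the 8 x 8 x 8 x 16 combinations is representable whether or not it is ever useful, which is the redundancy the index table of the invention targets.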
[0023]
A configuration that lowers the bit rate while maintaining the quality of this algebraic sound source is disclosed in Omuro and Mano, "High-speed pulse search type 4 kbit/s speech coding," Proceedings of the 1999 Spring Meeting of the Acoustical Society of Japan, Vol. I, pp. 211-212 (Document 2).
[0024]
FIG. 20 is a sound source position / polarity table showing the position candidates and polarity of pulse sound sources similar to those used in Document 2. In order to reduce the number of encoded bits efficiently, this restricts the polarity that can be taken for each sound source position candidate so that the polarities at adjacent sound source positions are opposite.
[0025]
Configurations that improve the quality of the algebraic sound source in another way are disclosed in Tadashi Amada, Kimio Miseki, and Masami Akamine, "CELP speech coding based on an adaptive pulse position codebook," ICASSP '99, Vol. I, pp. 13-16 (Mar. 1999) (Document 3), and in Tsuchiya, Amada, and Miseki, "Improvement of adaptive pulse position ACELP speech coding," Proceedings of the 1999 Spring Meeting of the Acoustical Society of Japan, Vol. I, pp. 213-214 (Document 4).
[0026]
In Document 3, pulse sound source position candidates are adaptively set for each frame so that pulse sound source position candidates are gathered where the amplitude envelope of the adaptive sound source signal is large. It has been shown that this improves the coding characteristics.
[0027]
Document 4 is an improvement of Document 3. It notes that when a pitch filter is included in the generation unit of the driving sound source signal (the ACELP sound source in Document 4), sound source positions in the first one-pitch period tend to be selected, and it sets the pulse sound source position candidates adaptively for each frame based on the magnitude of the amplitude envelope of the adaptive sound source signal after pitch inverse filtering.
[0028]
[Problems to be solved by the invention]
Because the conventional speech coding apparatus and speech decoding apparatus (Document 1) are configured as described above, the position candidates for each sound source number are distributed evenly within the frame. To lower the bit rate, the only options are therefore to reduce the number of pulses or to thin out the position candidates for each sound source number at equal intervals. As a result, a driving sound source signal in which pulses are locally concentrated cannot be generated, and the encoding characteristics deteriorate.
[0029]
Document 2 discloses a method of efficiently constraining the position candidates and polarities of the sound sources so as to suppress this characteristic deterioration. However, the constraint relies on a heuristic rule that reverses the polarity at adjacent sound source positions, and no other relationship between sound source position and polarity is permitted. Naturally uttered speech does not always follow this rule, and when it does not, large quality degradation results.
[0030]
Documents 3 and 4 disclose adaptive thinning methods that suppress this characteristic deterioration. However, when the periodicity of the input speech is disturbed or changing, the adaptive thinning instead causes large characteristic deterioration. In addition, when the adaptive sound source signal is not generated correctly because of a code transmission error on the communication channel, the adaptive thinning lets that error affect the driving sound source signal as well.
[0031]
Further, in Document 4, when a pitch filter is included in the driving sound source signal generation unit, an average improvement in characteristics is achieved by concentrating the sound source position candidates in the first one-pitch-period section. However, in the rising section of speech, which is the most important, the latter half of the frame matters more; because the latter half cannot be reproduced well, the characteristics deteriorate and the perceived quality is degraded as well.
[0032]
The present invention has been made to solve the above-described problems, and an object of the present invention is to provide a speech coding apparatus and a speech decoding apparatus that can achieve a low bit rate without degrading the characteristics.
[0033]
[Means for Solving the Problems]
In the speech coding apparatus according to the present invention, the sound source coding means is provided with an index table describing information that indicates combinations of sound source positions, the table being obtained by extracting, from all combinations of a plurality of sound source positions, only those combinations whose evaluation value, determined individually for each combination, is higher than a reference value; an arbitrary combination of sound source positions is selected from the index table, and the sound source information of the input speech is encoded using that combination.
[0034]
In the speech coding apparatus according to the present invention, the information indicating the combination of the sound source positions described in the index table is the combination information of the sound source positions encoded individually.
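A minimal sketch of such an index table follows. It assumes, in line with the abstract, that the evaluation value of a combination is how often it was used (here counted over hypothetical training data), and that the table stores each retained combination explicitly so that the transmitted code is simply the table index. All names are illustrative.

```python
from collections import Counter


def build_index_table(training_combinations, reference_count):
    """Keep only the combinations used more often than the reference count."""
    counts = Counter(tuple(sorted(c)) for c in training_combinations)
    return sorted(combo for combo, count in counts.items() if count > reference_count)


def encode_combination(combo, table):
    """The code for a combination is its index in the table."""
    return table.index(tuple(sorted(combo)))


def decode_combination(index, table):
    """The decoder holds the same table and looks the combination up."""
    return table[index]


def index_bits(table):
    """Bits needed to transmit one table index."""
    return max(1, (len(table) - 1).bit_length())
```

Because the table is fixed in advance and shared by both sides, a corrupted index affects only the frame in which it occurs, unlike schemes that derive the candidates from previously decoded signals.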
[0035]
In the speech coding apparatus according to the present invention, when the sound source coding means can recognize a combination of sound source positions including at least a combination of sound source positions whose evaluation value is higher than the reference value, the sound source described in the index table The information indicating the combination of positions is flag information indicating whether each combination of the sound source positions is an element of the index table.
[0036]
In the speech coding apparatus according to the present invention, the sound source coding means recognizes, from the frame length, the number of sound sources, and the number of indexes, a set of combinations of sound source positions that includes at least the combinations whose evaluation value is higher than the reference value.
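The flag-based storage of paragraphs [0035] and [0036] can be sketched as below. The sketch assumes that the full set of combinations is every unordered choice of distinct positions within the frame, regenerated in a fixed order from the frame length and the number of sound sources; how the number of indexes enters is not modeled here. The index of a combination is then its rank among the set flags. Function names are hypothetical.

```python
from itertools import combinations


def all_combinations(frame_len, num_sources):
    """Every combination of positions, in a fixed order both sides can regenerate."""
    return list(combinations(range(frame_len), num_sources))


def make_flags(selected, frame_len, num_sources):
    """One flag per combination: True if it is an element of the index table."""
    chosen = {tuple(sorted(c)) for c in selected}
    return [c in chosen for c in all_combinations(frame_len, num_sources)]


def combination_for_index(index, flags, frame_len, num_sources):
    """Return the combination held by the index-th set flag."""
    flagged = [c for c, f in zip(all_combinations(frame_len, num_sources), flags) if f]
    return flagged[index]
```

Storing one bit per possible combination can be smaller than storing every retained combination explicitly when a large fraction of the combinations is retained, which is one way to read the storage saving claimed for this form.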
[0037]
In the speech coding apparatus according to the present invention, the sound source coding means is provided with an index table describing information that indicates combinations of paired data, each paired datum consisting of a sound source position and a polarity, the table being obtained by extracting, from all combinations of a plurality of paired data, only those combinations whose evaluation value, determined individually for each combination, is higher than a reference value; an arbitrary combination of paired data is selected from the index table, and the sound source information of the input speech is encoded using that combination.
[0038]
In the speech coding apparatus according to the present invention, the information indicating the combination of paired data described in the index table is the combination information of paired data encoded individually.
[0039]
In the speech coding apparatus according to the present invention, when the excitation coding unit can recognize at least a paired data combination including a paired data combination whose evaluation value is higher than the reference value, the pair described in the index table is used. The information indicating the combination of data is flag information indicating whether each combination of the paired data is an element of the index table.
[0040]
In the speech coding apparatus according to the present invention, the sound source coding means recognizes, from the frame length, the number of sound sources, and the number of indexes, a set of combinations of paired data that includes at least the combinations whose evaluation value is higher than the reference value.
[0041]
In the speech encoding apparatus according to the present invention, the excitation encoding means has a plurality of index tables having different description contents, and an arbitrary index table is selected and used.
[0042]
In the speech coding apparatus according to the present invention, the sound source coding means analyzes the input speech, extracts a predetermined parameter, and selects an index table corresponding to the parameter.
[0043]
In the speech coding apparatus according to the present invention, the sound source coding means extracts a predetermined parameter from at least one of spectrum envelope information and sound source information, and selects an index table corresponding to the parameter.
[0044]
In the speech decoding apparatus according to the present invention, the sound source decoding means is provided with an index table describing information that indicates combinations of sound source positions, the table being obtained by extracting, from all combinations of a plurality of sound source positions, only those combinations whose evaluation value, determined individually for each combination, is higher than a reference value; a combination of sound source positions is selected from the index table based on the code, contained in the sound source information, that indicates the combination, and the sound source information of the input speech is decoded using that combination.
[0045]
In the speech decoding apparatus according to the present invention, the information indicating the combination of excitation positions described in the index table is the combination information of excitation positions encoded individually.
[0046]
In the speech decoding apparatus according to the present invention, when the sound source decoding means can recognize a combination of sound source positions including at least a combination of sound source positions whose evaluation value is higher than the reference value, the sound source described in the index table The information indicating the combination of positions is flag information indicating whether each combination of the sound source positions is an element of the index table.
[0047]
In the speech decoding apparatus according to the present invention, the sound source decoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, a set of combinations of sound source positions that includes at least the combinations whose evaluation value is higher than the reference value.
[0048]
In the speech decoding apparatus according to the present invention, the sound source decoding means is provided with an index table describing information that indicates combinations of paired data, each paired datum consisting of a sound source position and a polarity, the table being obtained by extracting, from all combinations of a plurality of paired data, only those combinations whose evaluation value, determined individually for each combination, is higher than a reference value; a combination of paired data is selected from the index table based on the code, contained in the sound source information, that indicates the combination, and the sound source information of the input speech is decoded using that combination.
[0049]
In the speech decoding apparatus according to the present invention, the information indicating the combination of paired data described in the index table is the combination information of paired data encoded individually.
[0050]
In the speech decoding apparatus according to the present invention, when the sound source decoding means is capable of recognizing a paired data combination including at least a paired data combination whose evaluation value is higher than the reference value, the pair described in the index table is used. The information indicating the combination of data is flag information indicating whether each combination of the paired data is an element of the index table.
[0051]
In the speech decoding apparatus according to the present invention, the sound source decoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, a set of combinations of paired data that includes at least the combinations whose evaluation value is higher than the reference value.
[0052]
In the speech decoding apparatus according to the present invention, the sound source decoding means has a plurality of index tables having different description contents, and selects and uses the index table corresponding to the code, included in the sound source information, that indicates the selection information.
[0053]
In the speech decoding apparatus according to the present invention, the sound source decoding means has a plurality of index tables having different description contents, extracts a predetermined parameter from at least one of spectrum envelope information and sound source information, and selects and uses the index table corresponding to that parameter.
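The parameter-driven table selection of paragraphs [0043] and [0053] can be sketched as follows. The choice of parameter and the thresholds are hypothetical; the sketch only shows that when the parameter is derived from information the decoder also has (for example a value already decoded from the spectral envelope or sound source codes), both sides arrive at the same table without any selection bits being transmitted.

```python
def select_table(tables, thresholds, parameter):
    """Return (table number, table) for the first threshold the parameter falls below.

    `tables` must hold one more entry than `thresholds`; the last table is used
    when the parameter is not below any threshold.
    """
    if len(tables) != len(thresholds) + 1:
        raise ValueError("need exactly one more table than thresholds")
    for number, limit in enumerate(thresholds):
        if parameter < limit:
            return number, tables[number]
    return len(thresholds), tables[-1]
```

Running the same function in the encoder and the decoder on the same decoded parameter keeps the two in step, at the cost that an error in that parameter also changes the table.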
[0054]
DETAILED DESCRIPTION OF THE INVENTION
An embodiment of the present invention will be described below.
Embodiment 1.
FIG. 1 is a block diagram showing a speech coding apparatus according to Embodiment 1 of the present invention. In FIG. 1, 21 is linear prediction analysis means that analyzes the input speech and extracts linear prediction coefficients, which are the spectral envelope information of the input speech, and 22 is linear prediction coefficient encoding means that encodes the linear prediction coefficients extracted by the linear prediction analysis means 21.
The linear prediction analysis means 21 and the linear prediction coefficient encoding means 22 constitute an envelope information encoding means.
[0055]
23 is adaptive excitation encoding means that generates a temporary synthesized sound using the linear prediction coefficients quantized by the linear prediction coefficient encoding means 22, selects the adaptive excitation code (sound source information) that minimizes the distance between the temporary synthesized sound and the input speech, outputs that code to the multiplexing means 26, and outputs the adaptive excitation signal corresponding to that code (a time-series vector in which a past excitation signal of a predetermined length is periodically repeated) to the gain encoding means 25. 24 is driving excitation encoding means that generates a temporary synthesized sound using the linear prediction coefficients quantized by the linear prediction coefficient encoding means 22, selects the driving excitation code (sound source information) that minimizes the distance between the temporary synthesized sound and the encoding target signal (the signal obtained by subtracting the synthesized sound of the adaptive excitation signal from the input speech), outputs that code to the multiplexing means 26, and outputs the driving excitation signal, which is the time-series vector corresponding to that code, to the gain encoding means 25.
[0056]
Reference numeral 25 denotes gain encoding means that generates a sound source signal by multiplying the adaptive excitation signal output from the adaptive excitation encoding means 23 and the driving excitation signal output from the driving excitation encoding means 24 by the respective elements of a gain vector and adding the products, generates a temporary synthesized sound from that sound source signal using the linear prediction coefficients quantized by the linear prediction coefficient encoding means 22, selects the gain code (sound source information) that minimizes the distance between the temporary synthesized sound and the input speech, and outputs it to the multiplexing means 26.
The adaptive excitation encoding means 23, the driving excitation encoding means 24, and the gain encoding means 25 constitute sound source information encoding means.
[0057]
Reference numeral 26 denotes multiplexing means that multiplexes the linear prediction coefficient code produced by the linear prediction coefficient encoding means 22, the adaptive excitation code output from the adaptive excitation encoding means 23, the driving excitation code output from the driving excitation encoding means 24, and the gain code output from the gain encoding means 25, and outputs the result as a speech code.
[0058]
FIG. 2 is a block diagram showing a speech decoding apparatus according to Embodiment 1 of the present invention. In FIG. 2, reference numeral 31 denotes separating means that separates the speech code output from the speech encoding apparatus and outputs the linear prediction coefficient code to the linear prediction coefficient decoding means 32, the adaptive excitation code to the adaptive excitation decoding means 33, the driving excitation code to the driving excitation decoding means 34, and the gain code to the gain decoding means 35. Reference numeral 32 denotes linear prediction coefficient decoding means (envelope information decoding means) that decodes the linear prediction coefficient code output from the separating means 31, converts the decoding result into filter coefficients for the synthesis filter 39, and outputs those filter coefficients to the synthesis filter 39.
[0059]
Reference numeral 33 denotes adaptive excitation decoding means that outputs the adaptive excitation signal corresponding to the adaptive excitation code output from the separating means 31 (a time-series vector in which a past sound source signal is periodically repeated), 34 denotes driving excitation decoding means that outputs the driving excitation signal, which is a time-series vector corresponding to the driving excitation code output from the separating means 31, and 35 denotes gain decoding means that outputs the gain vector corresponding to the gain code output from the separating means 31.
[0060]
Reference numeral 36 denotes a multiplier that multiplies the adaptive excitation signal output from the adaptive excitation decoding means 33 by an element of the gain vector output from the gain decoding means 35, 37 denotes a multiplier that multiplies the driving excitation signal output from the driving excitation decoding means 34 by an element of the gain vector output from the gain decoding means 35, 38 denotes an adder that generates the sound source signal by adding the output of the multiplier 36 to the output of the multiplier 37, and 39 denotes a synthesis filter that applies synthesis filtering to the sound source signal generated by the adder 38 to produce the output speech.
The adaptive excitation decoding means 33, the driving excitation decoding means 34, the gain decoding means 35, the multipliers 36 and 37, the adder 38, and the synthesis filter 39 constitute sound source information decoding means.
[0061]
FIG. 3 is a block diagram showing the inside of the driving excitation encoding means 24 in the speech encoding apparatus. In FIG. 3, reference numeral 41 denotes a sound source position combination table (index table) describing information that indicates, among all combinations of a plurality of sound source positions, those combinations whose use frequency (an evaluation value for the combination of sound source positions) is higher than a reference frequency (reference value), and 42 denotes algebraic excitation encoding means that selects a combination of sound source positions from the sound source position combination table 41 and encodes the driving excitation information of the input speech using that combination and an appropriate polarity.
[0062]
FIG. 4 is a block diagram showing the inside of the driving excitation decoding means 34 in the speech decoding apparatus. In FIG. 4, reference numeral 51 denotes a sound source position combination table (index table) describing information that indicates, among all combinations of a plurality of sound source positions, those combinations whose use frequency (an evaluation value for the combination of sound source positions) is higher than a reference frequency (reference value), and 52 denotes algebraic excitation decoding means that selects a combination of sound source positions from the sound source position combination table 51 based on the sound source position code (a code indicating the combination) included in the driving excitation code, and decodes the sound source information of the input speech using that combination and the polarity (information specifying the polarity is included in the driving excitation code).
[0063]
Next, the operation will be described.
In the speech encoding device and speech decoding device, processing is performed in units of frames, with about 5 to 50 ms as one frame.
[0064]
First, when speech is input, the linear prediction analysis unit 21 of the speech encoding apparatus analyzes the input speech and extracts linear prediction coefficients, which are the spectral envelope information of the speech.
When the linear prediction analysis unit 21 extracts the linear prediction coefficient, the linear prediction coefficient encoding unit 22 encodes the linear prediction coefficient and outputs the code to the multiplexing unit 26. Further, the quantized linear prediction coefficient corresponding to the code is output to the adaptive excitation encoding unit 23, the driving excitation encoding unit 24, and the gain encoding unit 25.
[0065]
The adaptive excitation encoding means 23 has a built-in adaptive excitation codebook that stores past sound source signals of a predetermined length and, for each internally generated adaptive excitation code (an adaptive excitation code is a binary value of several bits), generates a time-series vector in which the past sound source signal is periodically repeated.
Next, each time-series vector is multiplied by an appropriate gain and passed through a synthesis filter that uses the linear prediction coefficients quantized by the linear prediction coefficient encoding means 22, thereby generating a temporary synthesized sound.
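The periodic repetition described above can be illustrated with a minimal Python sketch. It assumes that an adaptive excitation code maps to a repetition period `lag` in samples; that mapping, the function name `adaptive_vector`, and the list-based buffer are illustrative assumptions and are not taken from this description.

```python
def adaptive_vector(past_excitation, lag, frame_len):
    """Fill one frame by periodically repeating the last `lag` past samples.

    `lag` is a hypothetical parameter standing in for the value that an
    adaptive excitation code would select.
    """
    if not 1 <= lag <= len(past_excitation):
        raise ValueError("lag must lie within the stored past excitation")
    segment = past_excitation[-lag:]
    return [segment[i % lag] for i in range(frame_len)]


# Usage: the two most recent samples (4, 5) are repeated over five samples.
print(adaptive_vector([1, 2, 3, 4, 5], lag=2, frame_len=5))
```

In an encoder of this kind, one such vector would be generated for every candidate code and each would be passed through the synthesis filter for comparison with the input speech.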
[0066]
Then, the adaptive excitation encoding unit 23 examines the distance between the temporary synthesized sound and the input speech, selects the adaptive excitation code that minimizes this distance, outputs it to the multiplexing unit 26, and outputs the time-series vector corresponding to the selected adaptive excitation code to the gain encoding means 25 as the adaptive excitation signal. It also outputs the signal obtained by subtracting the synthesized sound of the adaptive excitation signal from the input speech to the driving excitation encoding means 24 as the encoding target signal.
[0067]
When the encoding target signal is input from the adaptive excitation encoding unit 23, the driving excitation encoding unit 24 selects a combination of sound source positions (a combination of pulse sound source position candidates) from the sound source position combination table 41 and encodes the driving excitation information of the input speech using that combination of pulse sound source position candidates and an appropriate polarity. The details are as follows.
[0068]
First, the sound source position combination table 41 of the driving excitation encoding means 24 describes combinations of position candidates for a plurality of pulse sound sources, obtained, for example, by extracting only a desired number of frequently used combinations (those whose use frequency is higher than the reference frequency) from all combinations of pulse sound source position candidates shown in FIG. 19 (see FIG. 5).
In an algebraic sound source, the combinations of pulse sound source position candidates used for encoding do not all appear equally often. Their frequency of occurrence is biased, and even if combinations with a low frequency of occurrence are not used at all, the effect on the quality of the synthesized sound is small. By exploiting this property to limit the combinations of pulse sound source position candidates, the number of coding bits required for the algebraic sound source is reduced while deterioration of the coding characteristics is suppressed.
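One way to picture the frequency-based extraction is the following Python sketch. It assumes that a list of the position combinations actually chosen while encoding training speech is available; the function name `build_position_table` and the use of a fixed table size in place of an explicit reference frequency are illustrative assumptions.

```python
from collections import Counter


def build_position_table(observed_combinations, table_size):
    """Keep the `table_size` most frequently used position combinations.

    `observed_combinations` is a hypothetical log of the combinations that
    an unrestricted encoder selected on training speech.
    """
    counts = Counter(observed_combinations)
    return [combo for combo, _ in counts.most_common(table_size)]


# Usage with a toy log: (0, 1) occurs three times, (2, 3) twice, (4, 5) once.
log = [(0, 1), (0, 1), (2, 3), (0, 1), (2, 3), (4, 5)]
print(build_position_table(log, table_size=2))
```

The position of each combination in the returned list can then serve as its index in the table.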
[0069]
As a method for storing the information of the sound source position combination table 41, there is, for example, a method of directly encoding and storing the information representing the pulse sound source position candidates, for instance by expressing the position of the sound source of each sound source number for each index as a binary number, as shown in FIG. 6.
With this configuration, when the sound source position combination table 41 is created by extracting combinations from all combinations of pulse sound source position candidates shown in FIG. 19, for example, each combination of pulse sound source position candidates can be expressed in 13 bits. Storing N such combinations, where N is the number of indexes, therefore requires a storage area of (13 × N) bits. The required storage capacity thus depends on the size of the sound source position combination table 41 that is used: 6656 bits for N = 512 and 13312 bits for N = 1024, for example.
[0070]
As another method of storing the information of the sound source position combination table 41, there is, for example, the method shown in FIG. 7, which stores whether each combination of pulse sound source position candidates can be used instead of directly storing information representing the position candidates. For every combination of position candidates in a second sound source position combination table, which is configured to include at least all the elements of the sound source position combination table 41, flag information indicating whether that combination is an element of the sound source position combination table 41 is represented by 1 bit. With this configuration, when the sound source position combination table 41 is created by extracting combinations from all combinations of pulse sound source position candidates shown in FIG. 19, for example, use or non-use is represented by 1 bit for each of the 8192 combinations in total, so a storage area of 8192 bits is required. In this case, the required storage area is constant regardless of the number of indexes N of the sound source position combination table 41.
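The flag-based storage can be sketched in Python as a bit array with one bit per combination. The sketch assumes that every combination in the full set has an integer identifier from 0 to 8191 and that the table index of a usable combination is its rank among the set flags; both conventions, and all function names, are illustrative assumptions.

```python
def make_flags(total, usable_ids):
    """Build a bit array with one flag per combination in the full set."""
    flags = bytearray((total + 7) // 8)
    for cid in usable_ids:
        flags[cid >> 3] |= 1 << (cid & 7)
    return flags


def is_usable(flags, cid):
    """Return True if the flag for combination `cid` is set."""
    return bool(flags[cid >> 3] & (1 << (cid & 7)))


def index_to_combination_id(flags, total, index):
    """Return the id of the `index`-th usable combination (assumed ordering)."""
    seen = 0
    for cid in range(total):
        if is_usable(flags, cid):
            if seen == index:
                return cid
            seen += 1
    raise IndexError("index exceeds the number of usable combinations")


# Usage: three usable combinations out of 8192, stored in 8192 bits.
flags = make_flags(8192, [5, 100, 8191])
print(len(flags) * 8, index_to_combination_id(flags, 8192, 2))
```

The storage size depends only on the size of the full set, which matches the statement that it is independent of N.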
[0071]
Note that all combinations of pulse sound source position candidates based on an algebraic sound source structure such as that shown in FIG. 19 can be computed from the frame length, the number of pulses, and the number of indexes, so the second sound source position combination table does not actually need to be held as a table and no storage area is required for it.
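The following Python sketch shows how a full set of combinations might be computed rather than stored. FIG. 19 is not reproduced here, so the layout is an assumption: a 40-sample frame with four pulses on interleaved tracks of 8, 8, 8, and 16 candidate positions, chosen only because it yields 8192 combinations (13 bits). The names `TRACKS` and `combination_from_id` are hypothetical.

```python
from math import prod

FRAME_LEN = 40  # assumed frame length in samples

# Assumed track layout: three tracks of 8 positions and one track of 16.
TRACKS = [
    list(range(0, FRAME_LEN, 5)),
    list(range(1, FRAME_LEN, 5)),
    list(range(2, FRAME_LEN, 5)),
    sorted(list(range(3, FRAME_LEN, 5)) + list(range(4, FRAME_LEN, 5))),
]


def total_combinations(tracks):
    """Number of position combinations, one position per track."""
    return prod(len(track) for track in tracks)


def combination_from_id(cid, tracks):
    """Decode an integer id into one position per track (mixed radix)."""
    positions = []
    for track in reversed(tracks):
        cid, k = divmod(cid, len(track))
        positions.append(track[k])
    return tuple(reversed(positions))


# Usage: any of the 8192 combinations is recovered by arithmetic alone.
print(total_combinations(TRACKS), combination_from_id(8191, TRACKS))
```

Because each combination is recovered from its identifier by arithmetic, only the flags or the selected identifiers would need to be stored.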
[0072]
As described above, there are several methods for storing the information of the sound source position combination table 41, and they differ in how the required storage capacity depends on the number of indexes N. Selecting the method with the smaller storage capacity for a given N therefore makes it possible to reduce the scale of storage devices such as memory or hard disks and to realize an efficient apparatus.
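The storage trade-off between the two methods can be checked with a short Python sketch that uses the figures given in this description (13 bits per combination, 8192 combinations in total). The function names are illustrative.

```python
def direct_bits(n_indexes, bits_per_entry=13):
    """Storage for directly encoded entries: bits_per_entry x N bits."""
    return bits_per_entry * n_indexes


def flag_bits(bits_per_entry=13):
    """Storage for one flag per combination in the full set."""
    return 1 << bits_per_entry


def cheaper_method(n_indexes, bits_per_entry=13):
    """Name the method that needs less storage for a given N."""
    if direct_bits(n_indexes, bits_per_entry) <= flag_bits(bits_per_entry):
        return "direct"
    return "flags"


# Usage: N = 512 favors direct storage, N = 1024 favors flags.
for n in (512, 1024):
    print(n, direct_bits(n), flag_bits(), cheaper_method(n))
```

With these figures, direct storage is smaller up to N = 630 and flag storage is smaller from N = 631 upward.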
[0073]
The algebraic excitation encoding means 42 of the driving excitation encoding means 24 sequentially reads out the combinations of pulse sound source position candidates stored in the sound source position combination table 41 and generates the temporary synthesized sound obtained when a pulse of a given polarity is placed at each position candidate.
Then, the distance (signal error) between the encoding target signal output from the adaptive excitation encoding unit 23 and the temporary synthesized sound is calculated, and the combination of pulse sound source position candidates and the polarities that minimize this distance are searched for.
When the search is completed, the algebraic excitation encoding means 42 outputs the sound source position code, which represents the combination of pulse sound source position candidates found by the search, together with the polarity to the multiplexing means 26 as the driving excitation code, and outputs the time-series vector corresponding to this driving excitation code to the gain encoding means 25 as the driving excitation signal.
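The table-restricted search can be pictured with the following brute-force Python sketch. It is a simplification under stated assumptions: the synthesis filter is represented by a truncated impulse response, the gain is fixed at 1, no pitch filter is applied, and every polarity pattern is tried exhaustively. The function names are hypothetical.

```python
from itertools import product


def synthesize(positions, signs, impulse_response, frame_len):
    """Place signed unit pulses and convolve with the impulse response."""
    out = [0.0] * frame_len
    for pos, sign in zip(positions, signs):
        for k, h in enumerate(impulse_response):
            if pos + k >= frame_len:
                break
            out[pos + k] += sign * h
    return out


def search_table(table, target, impulse_response):
    """Return (index, signs, squared error) of the best table entry."""
    frame_len = len(target)
    best = None
    for index, positions in enumerate(table):
        for signs in product((1, -1), repeat=len(positions)):
            synth = synthesize(positions, signs, impulse_response, frame_len)
            error = sum((t - s) ** 2 for t, s in zip(target, synth))
            if best is None or error < best[2]:
                best = (index, signs, error)
    return best


# Usage: the target was built from entry 1 with polarities (+, -).
table = [(0, 2), (1, 3)]
h = [1.0, 0.5]
target = synthesize((1, 3), (1, -1), h, 5)
print(search_table(table, target, h))
```

Only the entries of the table are visited, so the search cost scales with the table size N rather than with the full number of combinations.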
[0074]
The search operation in the algebraic excitation encoding means 42 is performed in the same manner as the driving excitation encoding means shown in Document 1. Also, as shown in Document 1, a pitch filter is introduced at the final stage of the drive sound source generation unit. That is, a pitch filter is applied to a signal in which a pulse is arranged at a position candidate of each pulse sound source to obtain a drive sound source signal, and a provisional synthesized sound is generated.
Then, the correlations among the temporary synthesized sounds for the individual position candidates, and the correlation between the temporary synthesized sound for each position candidate and the encoding target signal, are calculated, and these correlations are used to search for the polarity and position of each pulse at high speed.
[0075]
As a result, a combination of pulse sound source position candidates and the polarity of each sound source are obtained. The combination of pulse sound source position candidates is converted into a corresponding code, for example by expressing its index in the sound source position combination table 41 in binary, and is output as the final sound source position code. An algebraic sound source is known to require only a small amount of computation for the search because of its structure, and reducing the number of combinations of sound source positions to be searched while maintaining that structure yields a further reduction in computation.
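The conversion of a table index into a binary sound source position code can be sketched in Python as follows. The fixed-width code with the minimum number of bits for N entries is an assumption, as are the function names.

```python
def position_code_bits(n_indexes):
    """Minimum number of bits that can address n_indexes table entries."""
    return max(1, (n_indexes - 1).bit_length())


def encode_index(index, n_indexes):
    """Express a table index as a fixed-width binary string."""
    if not 0 <= index < n_indexes:
        raise ValueError("index outside the table")
    return format(index, "0{}b".format(position_code_bits(n_indexes)))


def decode_index(code):
    """Recover the table index from its binary string."""
    return int(code, 2)


# Usage: a 512-entry table needs 9 bits per position code.
print(position_code_bits(512), encode_index(5, 512))
```

Under this assumption, a table with N = 512 would use 9 bits for the position code where the unrestricted 8192 combinations need 13.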
[0076]
When the driving excitation encoding unit 24 outputs the driving excitation signal as described above, the gain encoding unit 25 sequentially reads from the gain codebook the gain vector corresponding to each internally generated gain code (a gain code is a binary value of several bits).
Then, the adaptive excitation signal output from the adaptive excitation encoding unit 23 and the driving excitation signal output from the driving excitation encoding unit 24 are multiplied by the respective elements of each gain vector, and the products are added together to generate a sound source signal.
Next, the sound source signal is passed through a synthesis filter that uses the linear prediction coefficient quantized by the linear prediction coefficient encoding means 22 to generate a temporary synthesized sound.
[0077]
Then, the gain encoding unit 25 investigates the distance between the temporary synthesized sound and the input speech, selects a gain code that minimizes this distance, and outputs it to the multiplexing unit 26. Further, the excitation signal corresponding to the gain code is output to the adaptive excitation encoding means 23. As a result, the adaptive excitation encoding unit 23 updates the built-in adaptive excitation codebook using the excitation signal corresponding to the gain code selected by the gain encoding unit 25.
[0078]
The multiplexing unit 26 multiplexes the linear prediction coefficient code produced by the linear prediction coefficient encoding unit 22, the adaptive excitation code output from the adaptive excitation encoding unit 23, the driving excitation code output from the driving excitation encoding unit 24 (including the sound source position code and the polarity), and the gain code output from the gain encoding means 25, and outputs the resulting speech code to the speech decoding apparatus.
[0079]
Next, when the speech encoding apparatus outputs the speech code, the separating means 31 of the speech decoding apparatus separates the speech code and outputs the linear prediction coefficient code to the linear prediction coefficient decoding means 32, the adaptive excitation code to the adaptive excitation decoding means 33, the driving excitation code to the driving excitation decoding means 34, and the gain code to the gain decoding means 35.
When the linear prediction coefficient decoding unit 32 receives the linear prediction coefficient code from the separating means 31, it decodes the code, converts the decoding result into filter coefficients for the synthesis filter 39, and outputs the filter coefficients to the synthesis filter 39.
[0080]
The adaptive excitation decoding means 33 has a built-in adaptive excitation codebook that stores past sound source signals of a predetermined length, and outputs the adaptive excitation signal corresponding to the adaptive excitation code output from the separating means 31 (a time-series vector in which the past sound source signal is periodically repeated).
[0081]
When the driving excitation decoding means 34 receives the driving excitation code, which includes the sound source position code and the polarity, from the separating means 31, the algebraic excitation decoding means 52 reads from the sound source position combination table 51 (which stores the same contents as the sound source position combination table 41) the combination of pulse sound source position candidates corresponding to the sound source position code, generates the driving excitation signal by applying a pitch filter to the signal in which a pulse of the specified polarity is placed at each position candidate, and outputs that driving excitation signal.
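The decoding step can be illustrated with the following Python sketch. The form of the pitch filter is not specified in this section, so a one-tap recursive filter v[n] += pitch_gain * v[n - pitch_lag] is used here purely as an assumption; `pitch_lag`, `pitch_gain`, and the function name are hypothetical.

```python
def decode_driving_excitation(table, index, signs, frame_len,
                              pitch_lag=None, pitch_gain=0.0):
    """Rebuild the driving excitation from a table index and polarities."""
    v = [0.0] * frame_len
    # Place a unit pulse of the given polarity at each looked-up position.
    for pos, sign in zip(table[index], signs):
        v[pos] += sign
    # Assumed one-tap pitch filter; skipped when no lag is given.
    if pitch_lag is not None:
        for n in range(pitch_lag, frame_len):
            v[n] += pitch_gain * v[n - pitch_lag]
    return v


# Usage: entry 1 of a toy table, polarities (+, -), lag 2, gain 0.5.
table = [(0, 2), (1, 3)]
print(decode_driving_excitation(table, 1, (1, -1), 6, pitch_lag=2, pitch_gain=0.5))
```

The decoder needs the same table contents as the encoder, since only the index is transmitted.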
[0082]
The gain decoding unit 35 has a built-in gain codebook for storing the gain vector, and outputs a gain vector corresponding to the gain code output from the separating unit 31. The adaptive excitation signal output from the adaptive excitation decoding unit 33 and the driving excitation signal output from the driving excitation decoding unit 34 are multiplied by elements of the gain vector by multipliers 36 and 37, and the adder 38 The multiplication results of the multipliers 36 and 37 are added to each other.
[0083]
The synthesis filter 39 performs synthesis filtering processing on the sound source signal that is the addition result of the adder 38 to generate output sound. Note that the linear prediction coefficient decoded by the linear prediction coefficient decoding unit 32 is used as the filter coefficient.
Finally, the adaptive excitation decoding means 33 updates the built-in adaptive excitation codebook using the excitation signal.
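The multiply, add, and synthesis filtering steps of the decoder can be sketched in Python as follows. The sketch assumes an all-pole synthesis filter 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p; this sign convention and the function names are assumptions, not taken from this description.

```python
def mix_excitation(adaptive, driving, gain_vector):
    """Scale both excitation signals by their gains and add them."""
    g_adaptive, g_driving = gain_vector
    return [g_adaptive * a + g_driving * d for a, d in zip(adaptive, driving)]


def synthesis_filter(excitation, lpc, memory=None):
    """All-pole filter: y[n] = x[n] - sum_k lpc[k-1] * y[n-k]."""
    order = len(lpc)
    # history[0] holds the most recent output sample.
    history = list(memory) if memory is not None else [0.0] * order
    out = []
    for x in excitation:
        y = x - sum(a * h for a, h in zip(lpc, history))
        out.append(y)
        history = [y] + history[:-1] if order else history
    return out


# Usage: mix two short signals, then filter an impulse with one coefficient.
print(mix_excitation([1.0, 2.0], [3.0, 4.0], (0.5, 2.0)))
print(synthesis_filter([1.0, 0.0, 0.0], [-0.5]))
```

The mixed excitation is also the signal that would be written back into the adaptive excitation codebook for the next frame.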
[0084]
As is apparent from the above, according to the first embodiment, a combination of sound source positions is selected from the sound source position combination tables 41 and 51, which describe information indicating, among all combinations of a plurality of sound source positions, those combinations whose use frequency is higher than the reference frequency, and the sound source information of the input speech is encoded or decoded using that combination. This yields a speech encoding apparatus and a speech decoding apparatus that can lower the bit rate without causing deterioration of characteristics.
[0085]
Further, the combinations of sound source positions described in the sound source position combination tables 41 and 51 are determined by a statistical method, for example by selecting and extracting them according to how frequently they are used when the sound source information of input speech is actually encoded or decoded. Consequently, there are no unnatural restrictions of the kind that arise when heuristic rules are used, and a high-quality speech encoding apparatus and speech decoding apparatus can be obtained even at a low bit rate.
[0086]
In addition, since a fixed combination of sound source positions is used, there is an effect that characteristics can be improved while maintaining a strong resistance against a code transmission error in a communication path.
[0087]
Furthermore, when combinations of a plurality of sound source positions are described in the sound source position combination tables 41 and 51, either each sound source position is individually encoded and described, or each element of the sound source position combination tables 41 and 51 is treated as an element of a second sound source position combination table and flag information indicating whether each element of the second table can be used is described. The required storage capacity can therefore be reduced according to the number of indexes N, which yields an efficient speech encoding apparatus and speech decoding apparatus with a small apparatus scale.
[0088]
Further, since the combinations of sound source positions in the second sound source position combination table are generated from the frame length, the number of sound sources, and the number of indexes, the storage capacity that would otherwise be required for that table can be eliminated, which yields an efficient speech encoding apparatus and speech decoding apparatus with a small apparatus scale.
[0089]
In the first embodiment, the pitch filter is introduced into the driving excitation signal generation part, but a configuration in which it is introduced only in the driving excitation decoding means 34, or in neither the driving excitation encoding means 24 nor the driving excitation decoding means 34, is also possible.
[0090]
In the first embodiment, the use frequency is used as the evaluation value for a combination of sound source positions. However, the present invention is not limited to this; for example, an expected value of the reduction in encoding distortion may be used as the evaluation value for a combination of sound source positions.
For example, when learning speech data is encoded using all combinations of sound source positions, the expected value for each combination of sound source positions can be taken as the sum, over the occasions on which that combination was used, of the reduction in coding distortion contributed by the driving excitation signal component.
[0091]
Embodiment 2.
FIG. 8 is a block diagram showing the inside of the driving excitation encoding means 24 in the speech encoding apparatus according to Embodiment 2 of the present invention. In FIG. 8, reference numeral 43 denotes a sound source position / polarity combination table (index table) describing information that indicates, among all combinations of a plurality of pair data each consisting of a sound source position and a polarity, those combinations of pair data whose use frequency (an evaluation value for the combination of pair data) is higher than a reference frequency (reference value), and 44 denotes algebraic excitation encoding means that selects a combination of pair data from the sound source position / polarity combination table 43 and encodes the driving excitation information of the input speech using that combination of pair data.
[0092]
FIG. 9 is a block diagram showing the inside of the driving excitation decoding means 34 in the speech decoding apparatus according to Embodiment 2 of the present invention. In FIG. 9, reference numeral 53 denotes a sound source position / polarity combination table (index table) describing information that indicates, among all combinations of a plurality of pair data each consisting of a sound source position and a polarity, those combinations of pair data whose use frequency (an evaluation value for the combination of pair data) is higher than a reference frequency (reference value), and 54 denotes algebraic excitation decoding means that selects a combination of pair data from the sound source position / polarity combination table 53 based on the sound source position / polarity code (a code indicating the combination) included in the driving excitation code, and decodes the sound source information of the input speech using that combination of pair data.
[0093]
Next, the operation will be described. However, since operations other than the driving excitation encoding unit 24 and the driving excitation decoding unit 34 are the same as those in the first embodiment, only the operations of the driving excitation encoding unit 24 and the driving excitation decoding unit 34 will be described.
[0094]
First, the sound source position / polarity combination table 43 of the driving excitation encoding means 24 describes combinations of a plurality of pair data, obtained, for example, by extracting only a desired number of frequently used combinations (those whose use frequency is higher than the reference frequency) from all combinations of the pulse sound source position candidates shown in FIG. 19 and the polarities of the individual sound sources (see FIG. 10).
In an algebraic sound source, the combinations of pulse sound source pair data used for encoding do not all appear equally often. Their frequency of occurrence is biased, and even if combinations with a low frequency of occurrence are not used at all, the effect on the quality of the synthesized sound is small. By exploiting this property to limit the combinations of pulse sound source pair data, the number of coding bits required for the algebraic sound source is reduced while deterioration of the coding characteristics is suppressed.
[0095]
As a method for storing the information of the sound source position / polarity combination table 43, there is, for example, a method of directly encoding and storing the information representing the pair data, for instance by expressing the pair data of each sound source number for each index as a binary number, as shown in FIG. 11.
With this configuration, when the sound source position / polarity combination table 43 is created by extracting combinations of pair data from all combinations of the pulse sound source position candidates shown in FIG. 19 and the polarities of the individual sound sources, for example, each combination of pair data can be expressed in 17 bits. Storing N such combinations, where N is the number of indexes, therefore requires a storage area of (17 × N) bits. The required storage capacity thus depends on the size of the sound source position / polarity combination table 43 that is used: 69632 bits for N = 4096 and 139264 bits for N = 8192, for example.
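One possible 17-bit encoding of a combination of pair data is sketched below in Python. The bit layout is an assumption (FIG. 11 is not reproduced here): four pulses whose position indexes take 3, 3, 3, and 4 bits, each followed by one polarity bit, giving 13 + 4 = 17 bits. The names `pack_pairs` and `unpack_pairs` are hypothetical.

```python
INDEX_BITS = (3, 3, 3, 4)  # assumed bits per position index, one per pulse


def pack_pairs(pairs, index_bits=INDEX_BITS):
    """Pack (position index, polarity) pairs into a single integer word."""
    word = 0
    for (pos_index, sign), bits in zip(pairs, index_bits):
        if not 0 <= pos_index < (1 << bits):
            raise ValueError("position index does not fit its field")
        word = (word << bits) | pos_index
        word = (word << 1) | (0 if sign > 0 else 1)
    return word


def unpack_pairs(word, index_bits=INDEX_BITS):
    """Invert pack_pairs, returning the list of (position index, polarity)."""
    pairs = []
    for bits in reversed(index_bits):
        sign = -1 if word & 1 else 1
        word >>= 1
        pos_index = word & ((1 << bits) - 1)
        word >>= bits
        pairs.append((pos_index, sign))
    return list(reversed(pairs))


# Usage: a round trip through the 17-bit word.
pairs = [(7, 1), (0, -1), (3, 1), (15, -1)]
word = pack_pairs(pairs)
print(word.bit_length() <= 17, unpack_pairs(word) == pairs)
```

Storing N such words directly gives the 17 × N bits stated above.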
[0096]
As another method for storing the information of the sound source position / polarity combination table 43, there is, for example, the method shown in FIG. 12, which stores whether each combination of pulse sound source pair data can be used instead of directly storing information representing the pair data. For every combination of pair data in a second sound source position / polarity combination table, which is configured to include at least all the elements of the sound source position / polarity combination table 43, flag information indicating whether that combination is an element of the sound source position / polarity combination table 43 is represented by 1 bit.
With this configuration, for example, when the combination of paired data is extracted from all combinations of pulse sound source position candidates and polarity of each sound source shown in FIG. 19 and the sound source position / polarity combination table 43 is created, the total number of combinations Since use / not use for (131072) is represented by 1 bit, a storage area of 131072 bits is required. In this case, the necessary storage area is constant regardless of the index number N of the sound source position / polarity combination table 43.
[0097]
Note that all combinations of pulse sound source position candidates and the polarity of each sound source based on an algebraic sound source structure such as that shown in FIG. 19 can be obtained by calculation from the frame length, the number of pulses, and the number of indexes, so no storage area needs to be provided for them.
[0098]
As described above, there are a plurality of methods for storing the information of the sound source position / polarity combination table 43, and the relationship between the number of indexes N and the required storage capacity is different, so that the storage capacity is smaller according to the number of indexes N. If this is selected, the scale of a storage device such as a memory or a hard disk can be reduced, and an efficient device can be realized.
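As a worked check of this trade-off, using only the figures stated in this embodiment (17 bits per directly stored combination, 131072 flags in total), the direct method needs less storage exactly when:

```latex
17N < 131072
\;\Longleftrightarrow\;
N < \frac{131072}{17} \approx 7710.1
\;\Longleftrightarrow\;
N \le 7710
```

Under these figures, N = 4096 (69632 bits) favors direct storage, while N = 8192 (139264 bits) favors the flag method.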
[0099]
The algebraic excitation encoding means 44 of the driving excitation encoding means 24 sequentially reads out the combinations of pulse sound source pair data stored in the sound source position / polarity combination table 43 and generates the temporary synthesized sound obtained when a pulse is placed at the position candidate of each pair data with the polarity of that pair data.
Then, the distance between the encoding target signal output from the adaptive excitation encoding unit 23 and the provisional synthesized sound is calculated, and a combination of pulse excitation pair data that minimizes the distance is searched.
When the search is completed, the algebraic excitation encoding unit 44 outputs the excitation position / polarity code representing the combination of pulse excitation pair data, which is the search result, to the multiplexing unit 26 as a driving excitation code. A time series vector corresponding to the driving excitation code is output to the gain encoding means 25 as a driving excitation signal.
[0100]
The search operation in the algebraic excitation encoding unit 44 is performed in the same manner as the driving excitation encoding unit described in Document 1. Also, as shown in Document 1, a pitch filter is introduced at the final stage of the drive sound source generation unit. That is, a pitch filter is applied to a signal in which a pulse is arranged at a position candidate of each pulse sound source to obtain a drive sound source signal, and a provisional synthesized sound is generated.
Then, the correlations among the temporary synthesized sounds for the individual position candidates and the correlation between the temporary synthesized sound for each position candidate and the encoding target signal are calculated, and the pair data is searched at high speed using these correlations.
[0101]
As a result, a combination of pulse sound source pair data is obtained. The combination of pulse sound source pair data is converted into a corresponding code, for example by expressing its index in the sound source position / polarity combination table 43 in binary, and is output as the final sound source position / polarity code. An algebraic sound source is known to require only a small amount of computation for the search because of its structure, and reducing the number of combinations of pair data to be searched while maintaining that structure yields a further reduction in computation.
[0102]
Next, when the algebraic excitation decoding means 54 of the driving excitation decoding means 34 receives the sound source position / polarity code from the separating means 31, it reads from the sound source position / polarity combination table 53 (which stores the same contents as the sound source position / polarity combination table 43) the combination of pulse sound source pair data corresponding to that code, generates the driving excitation signal by applying a pitch filter to the signal in which a pulse of the polarity given by each pair data is placed at the position candidate of that pair data, and outputs the driving excitation signal.
[0103]
As is apparent from the above, according to the second embodiment, a combination of paired data is selected from the sound source position / polarity combination tables 43 and 53, which describe information indicating those combinations of paired data whose use frequency is higher than the reference frequency among all combinations of the plural paired data, and the sound source information of the input speech is encoded or decoded using that combination. This provides a speech encoding device and a speech decoding device that achieve a lower bit rate without degrading the characteristics.
[0104]
Further, the paired-data combinations described in the sound source position / polarity combination tables 43 and 53 are determined by a statistical method, such as selecting and extracting combinations according to how frequently they are used when the sound source information of input speech is actually encoded or decoded. There is therefore no unnatural restriction of the kind that heuristic rules would impose, and a high-quality speech encoding device and speech decoding device can be obtained even at a low bit rate.
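The statistical table construction can be illustrated with a short sketch. It assumes that a training run has already produced a list of the combinations chosen frame by frame; the function name, the relative-frequency threshold, and the cap on the number of entries are hypothetical choices for the example.

```python
from collections import Counter


def build_index_table(observed_combinations, reference_frequency, max_entries):
    """Keep combinations used more often than the reference frequency, most frequent first."""
    counts = Counter(observed_combinations)
    total = sum(counts.values())
    frequent = [(combo, n) for combo, n in counts.items()
                if n / total > reference_frequency]
    # Sort by descending count; ties are broken by the combination itself for determinism.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [combo for combo, _ in frequent[:max_entries]]
```

Capping the table at a power of two entries would let every index be coded with a fixed number of bits.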
[0105]
In addition, since fixed combinations of paired data are used, the characteristics can be improved while maintaining strong resistance to code transmission errors on the communication path.
[0106]
Furthermore, when combinations of plural paired data are described in the sound source position / polarity combination tables 43 and 53, either each item of paired data is individually encoded and described, or a second sound source position / polarity combination table that contains the contents of tables 43 and 53 as elements is used, and flag information indicating whether each element of the second table may be used is described. This reduces the required storage capacity and yields an efficient speech encoding device and speech decoding device with a small implementation scale.
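The flag-based description can be sketched as below. The sketch assumes that both sides can enumerate the second table in the same order, so that only one bit per element needs to be stored; the names `second_table`, `pack_flags`, and `unpack_flags` and the bit ordering are illustrative assumptions.

```python
def pack_flags(flags):
    """Pack one usable/unusable flag per second-table element into bytes (LSB first)."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, usable in enumerate(flags):
        if usable:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_flags(packed, count):
    """Recover the list of flags from the packed bytes."""
    return [bool((packed[i // 8] >> (i % 8)) & 1) for i in range(count)]


def flags_to_index_table(second_table, flags):
    """Rebuild the index table as the flagged subset of the second table."""
    return [combo for combo, usable in zip(second_table, flags) if usable]
```

Storing one bit per candidate instead of the positions and polarities of every entry is what saves memory in this arrangement.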
[0107]
Further, since the combinations of paired data in the second sound source position / polarity combination table are generated from the frame length, the number of sound sources, and the number of indexes, no storage capacity is needed for that table, and an efficient speech encoding device and speech decoding device with a small implementation scale can be obtained.
[0108]
In the second embodiment, a pitch filter is introduced into the driving sound source signal generation unit. However, a configuration in which the pitch filter is introduced only in the driving excitation decoding means 34, or in neither the driving excitation encoding means 24 nor the driving excitation decoding means 34, is also possible.
[0109]
In the second embodiment, the use frequency is used as the evaluation value for a paired-data combination. However, the present invention is not limited to this; for example, the expected reduction in coding distortion may be used as the evaluation value for a paired-data combination.
[0110]
Embodiment 3.
FIG. 13 is a block diagram showing the inside of the driving excitation encoding means 24 in the speech encoding apparatus according to Embodiment 3 of the present invention. In the figure, 61 is a sound source position combination table (index table) describing information indicating those combinations of sound source positions, among all combinations of the plural sound source positions, whose use frequency in voiced rising sections is higher than the reference frequency; 62 is a sound source position combination table (index table) describing information indicating those combinations of sound source positions, among all combinations of the plural sound source positions, whose use frequency in sections other than voiced rising sections is higher than the reference frequency; 63 is selection means that analyzes the input speech to extract a predetermined parameter and selects the sound source position combination table 61 (or 62) corresponding to that parameter; and 64 is algebraic excitation encoding means that selects an arbitrary combination of sound source positions from the sound source position combination table 61 (or 62) selected by the selection means 63 and encodes the sound source information of the input speech using that combination of sound source positions and an appropriate polarity.
[0111]
FIG. 14 is a block diagram showing the inside of the driving excitation decoding means 34 in the speech decoding apparatus according to Embodiment 3 of the present invention. In the figure, 71 is a sound source position combination table (index table) describing information indicating those combinations of sound source positions, among all combinations of the plural sound source positions, whose use frequency in voiced rising sections is higher than the reference frequency; 72 is a sound source position combination table (index table) describing information indicating those combinations of sound source positions, among all combinations of the plural sound source positions, whose use frequency in sections other than voiced rising sections is higher than the reference frequency; 73 is selection means that selects the sound source position combination table 71 (or 72) corresponding to the code indicating the selection information included in the driving excitation code; and 74 is algebraic excitation decoding means that selects a combination of sound source positions from the sound source position combination table 71 (or 72) selected by the selection means 73, based on the sound source position code included in the driving excitation code, and decodes the sound source information of the input speech using that combination of sound source positions and the polarity (the information specifying the polarity is included in the driving excitation code).
[0112]
Next, the operation will be described. However, since operations other than the driving excitation encoding unit 24 and the driving excitation decoding unit 34 are the same as those in the first embodiment, only the operations of the driving excitation encoding unit 24 and the driving excitation decoding unit 34 will be described.
[0113]
First, the sound source position combination table 61 of the driving excitation encoding means 24 describes combinations of position candidates for plural pulse sound sources, obtained for example by extracting, from all combinations of the pulse sound source position candidates shown in FIG., only a desired number of combinations that are frequently used in voiced rising sections (combinations whose use frequency is higher than the reference frequency). Likewise, the sound source position combination table 62 describes combinations of position candidates for plural pulse sound sources, obtained for example by extracting, from all combinations of the pulse sound source position candidates shown in FIG., only a desired number of combinations that are frequently used in sections other than voiced rising sections (combinations whose use frequency is higher than the reference frequency).
[0114]
In an algebraic sound source, the combinations of pulse sound source position candidates used for encoding do not all appear equally often. Their frequency of occurrence is biased, and even if combinations with a low frequency of occurrence are not used at all, the effect on the quality of the synthesized sound is small.
For example, when the power in the second half of the frame is higher than in the first half, as in a voiced rising section, the pulse sound source positions tend to concentrate in the second half of the frame. In sections other than voiced rising sections, a pitch filter may be used, and the pulse sound source positions tend either to concentrate in the first half of the frame or to appear evenly across the frame.
Exploiting this property, the combinations of pulse sound source position candidates are constrained according to the characteristics of the input speech, which reduces the number of bits needed to encode the algebraic sound source while suppressing degradation of the encoding characteristics.
[0115]
The selection means 63 analyzes the input speech and, based on the analysis result, selects and switches the sound source position combination table to be used. For example, it selects the sound source position combination table 61 for a voiced rising section and the sound source position combination table 62 for any other section.
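One way such a selection could work is sketched below. The source states only that the input speech is analyzed and that power information is the main parameter; the specific rule used here (second-half power exceeding the first-half power by a fixed ratio) and the threshold value are assumptions, as are the function names and the meaning of the selection bit.

```python
def half_frame_powers(frame):
    """Return the signal power of the first and second halves of a frame."""
    half = len(frame) // 2
    first = sum(x * x for x in frame[:half])
    second = sum(x * x for x in frame[half:])
    return first, second


def select_table_by_onset(frame, onset_table, other_table, ratio_threshold=4.0):
    """Choose the onset table when the second half is much stronger than the first."""
    first, second = half_frame_powers(frame)
    is_onset = second > 0.0 and second > ratio_threshold * first
    selection_bit = 0 if is_onset else 1  # assumed convention: 0 = onset table
    return selection_bit, (onset_table if is_onset else other_table)
```

The returned selection bit stands for the selection information that this embodiment transmits to the decoder alongside the position code.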
[0116]
The algebraic excitation encoding means 64 sequentially reads out the combinations of pulse sound source position candidates stored in the sound source position combination table 61 (or 62) selected by the selection means 63, and generates a provisional synthesized sound for the case where a pulse of arbitrary polarity is placed at each position candidate.
It then calculates the distance between the encoding target signal output from the adaptive excitation encoding means 23 and the provisional synthesized sound, and searches for the combination of pulse sound source position candidates and the polarities that minimize this distance.
[0117]
When the search is completed, the algebraic excitation encoding means 64 outputs the sound source position code representing the resulting combination of pulse sound source position candidates, together with the polarities, to the multiplexing means 26 as the driving excitation code, and outputs the time-series vector corresponding to this driving excitation code to the gain encoding means 25 as the driving sound source signal. The selection information output by the selection means 63 is also included in the driving excitation code output to the multiplexing means 26.
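The contents of the driving excitation code in this embodiment can be pictured with a small packing sketch. The field order (selection bit, then position index, then one polarity bit per pulse), the bit-string representation, and the function names are assumptions for illustration only; the source does not define a bit layout.

```python
def pack_excitation_code(selection_bit, position_index, polarities, index_bits):
    """Concatenate selection info, position code, and polarity bits into one bit string."""
    bits = str(selection_bit)
    bits += format(position_index, "0{}b".format(index_bits))
    bits += "".join("0" if p > 0 else "1" for p in polarities)  # assumed: 0 = positive
    return bits


def unpack_excitation_code(bits, index_bits, num_pulses):
    """Split the bit string back into selection bit, position index, and polarities."""
    selection_bit = int(bits[0])
    position_index = int(bits[1:1 + index_bits], 2)
    polarity_bits = bits[1 + index_bits:1 + index_bits + num_pulses]
    polarities = [1 if b == "0" else -1 for b in polarity_bits]
    return selection_bit, position_index, polarities
```

The decoder reads the selection bit first, because it determines which table the position index refers to.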
[0118]
Next, the selection means 73 of the driving excitation decoding means 34 selects the sound source position combination table corresponding to the code indicating the selection information included in the driving excitation code, that is, either the sound source position combination table 71 or the sound source position combination table 72. The sound source position combination table 71 holds the same contents as the sound source position combination table 61, and the sound source position combination table 72 holds the same contents as the sound source position combination table 62.
[0119]
The algebraic excitation decoding means 74 reads the combination of pulse sound source position candidates corresponding to the sound source position code from the sound source position combination table 71 (or 72) selected by the selection means 73, applies a pitch filter to the signal in which pulses of the specified polarities are placed at the respective position candidates to generate a driving sound source signal, and outputs that signal.
[0120]
As is apparent from the above, according to the third embodiment, plural sound source position combination tables 61 and 62 (or 71 and 72) with different contents are provided, and one of them is selected and used. The same effects as in the first embodiment are therefore obtained, and degradation of the characteristics is suppressed more effectively.
[0121]
In the third embodiment, a pitch filter is introduced into the driving sound source signal generation unit. However, a configuration in which the pitch filter is introduced only in the driving excitation decoding means 34, or in neither the driving excitation encoding means 24 nor the driving excitation decoding means 34, is also possible.
[0122]
Further, in the third embodiment, the sound source position combination table is switched depending on whether or not the section is a voiced rising section. However, configurations using other criteria are also possible, such as switching according to whether the section is a vowel, whether it is a noise section or a speech section, or according to the pitch length.
Furthermore, a configuration in which a plurality of these standards are used in combination is also possible.
[0123]
In the third embodiment, because the decision is whether or not the section is a voiced rising section, the sound source position combination table is switched mainly using the power information of the input speech as the parameter. However, a configuration using other parameters obtained by analyzing the input speech, such as pitch information and spectrum information, is also possible. Furthermore, a configuration in which a plurality of these parameters are used in combination is also possible.
[0124]
In the third embodiment, two sound source position combination tables are switched, but a configuration in which three or more sound source position combination tables are switched is also possible.
In the third embodiment, a plurality of sound source position combination tables are switched, but a configuration in which a plurality of sound source position / polarity combination tables are switched is also possible.
[0125]
Embodiment 4.
FIG. 15 is a block diagram showing the inside of the driving excitation encoding means 24 in the speech encoding apparatus according to Embodiment 4 of the present invention. In the figure, 81 is a sound source position combination table (index table) describing information indicating those combinations of sound source positions, among all combinations of the plural sound source positions, whose use frequency is higher than the reference frequency when the pitch period is shorter than the frame length; 82 is a sound source position combination table (index table) describing information indicating those combinations of sound source positions, among all combinations of the plural sound source positions, whose use frequency is higher than the reference frequency when the pitch period is longer than the frame length; 83 is selection means that selects the sound source position combination table 81 (or 82) corresponding to the pitch period obtained from the adaptive excitation code; and 84 is algebraic excitation encoding means that selects an arbitrary combination of sound source positions from the sound source position combination table 81 (or 82) selected by the selection means 83 and encodes the sound source information of the input speech using that combination of sound source positions and an appropriate polarity.
[0126]
FIG. 16 is a block diagram showing the inside of the driving excitation decoding means 34 in the speech decoding apparatus according to Embodiment 4 of the present invention. In the figure, 91 is a sound source position combination table (index table) describing information indicating those combinations of sound source positions, among all combinations of the plural sound source positions, whose use frequency is higher than the reference frequency when the pitch period is shorter than the frame length; 92 is a sound source position combination table (index table) describing information indicating those combinations of sound source positions, among all combinations of the plural sound source positions, whose use frequency is higher than the reference frequency when the pitch period is longer than the frame length; 93 is selection means that obtains the pitch period from the adaptive excitation code and selects the sound source position combination table 91 (or 92) corresponding to that pitch period; and 94 is algebraic excitation decoding means that selects a combination of sound source positions from the sound source position combination table 91 (or 92) selected by the selection means 93, based on the sound source position code included in the driving excitation code, and decodes the sound source information of the input speech using that combination of sound source positions and the polarity (the information specifying the polarity is included in the driving excitation code).
[0127]
Next, the operation will be described. However, since operations other than the driving excitation encoding unit 24 and the driving excitation decoding unit 34 are the same as those in the first embodiment, only the operations of the driving excitation encoding unit 24 and the driving excitation decoding unit 34 will be described.
[0128]
First, the sound source position combination table 81 of the driving excitation encoding means 24 describes combinations of position candidates for plural pulse sound sources, obtained for example by extracting, from all combinations of the pulse sound source position candidates shown in FIG., only a desired number of combinations that are frequently used when the pitch period is shorter than the frame length (combinations whose use frequency is higher than the reference frequency). Likewise, the sound source position combination table 82 describes combinations of position candidates for plural pulse sound sources, obtained for example by extracting, from all combinations of the pulse sound source position candidates shown in FIG., only a desired number of combinations that are frequently used when the pitch period is longer than the frame length (combinations whose use frequency is higher than the reference frequency).
[0129]
In an algebraic sound source, the combinations of pulse sound source position candidates used for encoding do not all appear equally often. Their frequency of occurrence is biased, and even if combinations with a low frequency of occurrence are not used at all, the effect on the quality of the synthesized sound is small.
For example, when the pitch period is shorter than the frame length, a pitch filter may be used, and the pulse sound source positions tend to concentrate in the first half of the frame; when the pitch period is longer than the frame length, the pulse sound source positions tend to appear evenly across the frame.
Exploiting this property, the combinations of pulse sound source position candidates are constrained according to the characteristics of the input speech, which reduces the number of bits needed to encode the algebraic sound source while suppressing degradation of the encoding characteristics.
[0130]
The selection means 83 obtains the pitch period from the adaptive excitation code and, based on that pitch period, selects and switches the sound source position combination table to be used. For example, it selects the sound source position combination table 81 when the pitch period is shorter than the frame length and the sound source position combination table 82 when the pitch period is longer than the frame length.
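The pitch-based selection rule is simple enough to sketch directly. The function name is hypothetical, and the treatment of a pitch period exactly equal to the frame length (here grouped with the long-pitch case) is an assumption, since the source does not address it.

```python
def select_table_by_pitch(pitch_period, frame_length, short_pitch_table, long_pitch_table):
    """Select the table from the pitch period decoded out of the adaptive excitation code."""
    if pitch_period < frame_length:
        return short_pitch_table
    # Assumption: a pitch period equal to the frame length is treated as "long".
    return long_pitch_table
```

Because the pitch period is available from the adaptive excitation code on both sides, the encoder and decoder can run the same rule and arrive at the same table without any selection bits being transmitted.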
[0131]
The algebraic excitation encoding means 84 sequentially reads out the combinations of pulse sound source position candidates stored in the sound source position combination table 81 (or 82) selected by the selection means 83, and generates a provisional synthesized sound for the case where a pulse of arbitrary polarity is placed at each position candidate.
It then calculates the distance between the encoding target signal output from the adaptive excitation encoding means 23 and the provisional synthesized sound, and searches for the combination of pulse sound source position candidates and the polarities that minimize this distance.
When the search is completed, the algebraic excitation encoding means 84 outputs the sound source position code representing the resulting combination of pulse sound source position candidates, together with the polarities, to the multiplexing means 26 as the driving excitation code, and outputs the time-series vector corresponding to this driving excitation code to the gain encoding means 25 as the driving sound source signal.
[0132]
Next, the selection unit 93 of the driving excitation decoding unit 34 selects the excitation position combination table 91 or the excitation position combination table 92 in the same manner as the selection unit 83 of the driving excitation encoding unit 24. However, the sound source position combination table 91 is tabulated with the same contents as the sound source position combination table 81, and the sound source position combination table 92 is tabulated with the same contents as the sound source position combination table 82.
[0133]
The algebraic excitation decoding means 94 reads the combination of pulse sound source position candidates corresponding to the sound source position code from the sound source position combination table 91 (or 92) selected by the selection means 93, applies a pitch filter to the signal in which pulses of the specified polarities are placed at the respective position candidates to generate a driving sound source signal, and outputs that signal.
[0134]
As is apparent from the above, according to the fourth embodiment, plural sound source position combination tables 81 and 82 (or 91 and 92) with different contents are provided, and one of them is selected and used. The same effects as in the first embodiment are therefore obtained, and degradation of the characteristics is suppressed more effectively.
In addition, in the fourth embodiment the sound source position combination table is selected based on the pitch period, which can be obtained from the adaptive excitation code, so there is no need to encode selection information specifying which sound source position combination table is to be used.
[0135]
In the fourth embodiment, a pitch filter is introduced into the driving sound source signal generation unit. However, a configuration in which the pitch filter is introduced only in the driving excitation decoding means 34, or in neither the driving excitation encoding means 24 nor the driving excitation decoding means 34, is also possible.
[0136]
In the fourth embodiment, the sound source position combination table is switched according to the pitch period obtained from the adaptive excitation code. However, a configuration using other parameters is also possible, such as switching according to the spectral state obtained from the code of the linear prediction coefficients. Furthermore, a configuration in which a plurality of these parameters are used in combination is also possible.
[0137]
In the fourth embodiment, the parameter for switching the sound source position combination table is obtained from the code obtained in the current frame. However, a configuration in which the switching parameter is obtained from the code of a past frame is also possible.
[0138]
In the fourth embodiment, the parameter for switching the sound source position combination table is obtained from the code. However, a configuration is also possible in which the parameter is obtained by analyzing a signal that the speech encoding device and the speech decoding device can both generate, such as a previously generated sound source signal or output speech.
[0139]
In the fourth embodiment, two sound source position combination tables are switched, but a configuration in which three or more sound source position combination tables are switched is also possible.
Furthermore, in the fourth embodiment, a plurality of sound source position combination tables are switched, but a configuration in which a plurality of sound source position / polarity combination tables are switched is also possible.
[0140]
【Effects of the Invention】
As described above, according to the present invention, the sound source encoding means is provided with an index table describing information indicating those combinations of sound source positions, among all combinations of plural sound source positions, whose evaluation value is higher than the reference value, selects an arbitrary combination of sound source positions from the index table, and encodes the sound source information of the input speech using that combination. The bit rate can therefore be reduced without degrading the characteristics.
In addition, since fixed combinations of sound source positions held in an index table prepared in advance are used, the characteristics can be improved while maintaining strong resistance to code transmission errors on the communication path.
[0141]
According to the present invention, since the information indicating the combination of the sound source positions described in the index table is the combination information of the sound source positions encoded separately, an efficient speech code with a small storage capacity can be obtained. There is an effect that an apparatus is obtained.
[0142]
According to the present invention, when the sound source encoding means can recognize a set of sound source position combinations that includes at least the combinations whose evaluation value is higher than the reference value, the information described in the index table is flag information indicating, for each combination in that set, whether or not it is an element of the index table. An efficient speech encoding apparatus with a small storage capacity can therefore be obtained.
[0143]
According to the present invention, the excitation encoding means is configured to recognize a combination of excitation positions including at least a combination of excitation positions whose evaluation value is higher than the reference value from the frame length, the number of excitations, and the number of indexes. There is an effect that a small and efficient speech coding apparatus can be obtained.
[0144]
According to the present invention, the sound source encoding means is provided with an index table describing information indicating those combinations of paired data, among all combinations of plural paired data each consisting of a sound source position and a polarity, whose evaluation value is higher than the reference value, selects an arbitrary combination of paired data from the index table, and encodes the sound source information of the input speech using that combination. The bit rate can therefore be reduced without degrading the characteristics.
In addition, since fixed combinations of paired data held in an index table prepared in advance are used, the characteristics can be improved while maintaining strong resistance to code transmission errors on the communication path.
[0145]
According to the present invention, the information indicating the combinations of paired data described in the index table is combination information in which each item of paired data is individually encoded, so an efficient speech encoding apparatus with a small storage capacity can be obtained.
[0146]
According to the present invention, when the sound source encoding means can recognize a set of paired-data combinations that includes at least the combinations whose evaluation value is higher than the reference value, the information described in the index table is flag information indicating, for each combination in that set, whether or not it is an element of the index table. An efficient speech encoding apparatus with a small storage capacity can therefore be obtained.
[0147]
According to the present invention, the excitation encoding means is configured to recognize a combination of paired data including a combination of paired data whose evaluation value is higher than the reference value based on the frame length, the number of excitations, and the number of indexes. There is an effect that a small and efficient speech coding apparatus can be obtained.
[0148]
According to the present invention, the sound source encoding means has a plurality of index tables with different contents and selects and uses one of them, so the bit rate can be reduced. Further, degradation of the characteristics can be suppressed effectively.
[0149]
According to the present invention, the sound source encoding means analyzes the input speech, extracts a predetermined parameter, and selects the index table corresponding to that parameter, so the index table can be selected without performing complicated processing.
[0150]
According to the present invention, the excitation encoding unit is configured to extract a predetermined parameter from at least one of the spectrum envelope information and the excitation information, and select an index table corresponding to the parameter. There is an effect that the selection information for specifying the table need not be encoded.
[0151]
According to the present invention, the sound source decoding means is provided with an index table describing information indicating those combinations of sound source positions, among all combinations of plural sound source positions, whose evaluation value is higher than the reference value, selects a combination of sound source positions from the index table based on the code indicating the combination included in the sound source information, and decodes the sound source information of the input speech using that combination. The bit rate can therefore be reduced without degrading the characteristics.
In addition, since fixed combinations of sound source positions held in an index table prepared in advance are used, the characteristics can be improved while maintaining strong resistance to code transmission errors on the communication path.
[0152]
According to the present invention, since the information indicating the combination of sound source positions described in the index table is configured to be individually encoded combination information of sound source positions, efficient speech decoding with a small storage capacity is possible. There is an effect that an apparatus is obtained.
[0153]
According to the present invention, when the sound source decoding means can recognize a set of sound source position combinations that includes at least the combinations whose evaluation value is higher than the reference value, the information described in the index table is flag information indicating, for each combination in that set, whether or not it is an element of the index table. An efficient speech decoding apparatus with a small storage capacity can therefore be obtained.
[0154]
According to the present invention, the sound source decoding means is configured to recognize a combination of sound source positions including at least a combination of sound source positions whose evaluation value is higher than the reference value from the frame length, the number of sound sources, and the number of indexes. There is an effect that a small and efficient speech decoding apparatus can be obtained.
[0155]
According to the present invention, the sound source decoding means is provided with an index table that, from among all combinations of a plurality of paired data each composed of a sound source position and a polarity, extracts only the paired data combinations whose evaluation value, determined individually for each paired data combination, is higher than the reference value, and describes information indicating those paired data combinations. A paired data combination is selected from the index table based on the code indicating the combination included in the sound source information, and the sound source information of the input speech is decoded using that paired data combination. As a result, the bit rate can be reduced without incurring a deterioration of characteristics.
In addition, since fixed paired data combinations held in an index table prepared in advance are used, the characteristics can be improved while maintaining strong resistance to code transmission errors in the communication path.
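The table construction and the decoding step described above can be sketched as follows. The evaluation values are hypothetical (for example, usage counts gathered from training speech), and the unit-amplitude pulses, the sorting used to fix the table order, and the function names are assumptions of this example rather than details taken from the specification.

```python
def build_pair_table(evaluations, reference):
    """Keep only combinations whose evaluation value exceeds the reference."""
    kept = [comb for comb, value in evaluations.items() if value > reference]
    return sorted(kept)  # fixed order shared by encoder and decoder

def decode_excitation(table, index, frame_length):
    """Place signed unit pulses for the combination addressed by `index`."""
    excitation = [0] * frame_length
    for position, polarity in table[index]:
        excitation[position] += polarity
    return excitation

# Hypothetical evaluation values for three (position, polarity) combinations.
evaluations = {
    ((0, +1), (3, -1)): 120,
    ((1, +1), (2, +1)): 4,
    ((4, -1), (6, +1)): 75,
}
table = build_pair_table(evaluations, reference=50)
```

Because the table is fixed in advance and identical on both sides, a corrupted index affects only the frame in which it occurs; nothing in the table adapts to previously received codes.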
[0156]
According to the present invention, the information indicating the paired data combinations described in the index table is individually encoded combination information of paired data, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0157]
According to the present invention, when the sound source decoding means can recognize a set of paired data combinations that includes at least the paired data combinations whose evaluation value is higher than the reference value, the information indicating the paired data combinations described in the index table is flag information indicating whether or not each of those paired data combinations is an element of the index table, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0158]
According to the present invention, the sound source decoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, a set of paired data combinations that includes at least the paired data combinations whose evaluation value is higher than the reference value, so that an efficient speech decoding apparatus with a small storage capacity is obtained.
[0159]
According to the present invention, the sound source decoding means has a plurality of index tables having different description contents, and selects and uses the index table corresponding to the code indicating the selection information included in the sound source information, so that the bit rate can be reduced. In addition, deterioration of characteristics can be effectively suppressed.
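A minimal sketch of selecting among several index tables by a transmitted selection code is given below. It makes strong simplifying assumptions: the encoder compares unit-pulse excitations directly with a target vector using a squared error, with no synthesis filter or gain, and the table contents and function names are hypothetical.

```python
def pulses(combination, frame_length):
    """Build a unit-pulse excitation vector for one combination of positions."""
    vector = [0.0] * frame_length
    for position in combination:
        vector[position] += 1.0
    return vector

def encode(tables, target):
    """Search every table and return (selection_code, combination_code)."""
    best = None
    for selection_code, table in enumerate(tables):
        for combination_code, comb in enumerate(table):
            candidate = pulses(comb, len(target))
            error = sum((t - c) ** 2 for t, c in zip(target, candidate))
            if best is None or error < best[0]:
                best = (error, selection_code, combination_code)
    return best[1], best[2]

def decode(tables, selection_code, combination_code, frame_length):
    """Pick the table named by the selection code, then the combination."""
    return pulses(tables[selection_code][combination_code], frame_length)

# Hypothetical tables with different contents.
tables = [
    [(0, 1, 2), (5, 6, 7)],     # pulses concentrated locally
    [(0, 4, 8), (3, 7, 11)],    # pulses spread over the frame
]
```

The decoder needs only the two codes: the selection code names the table and the combination code names the entry within it.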
[0160]
According to the present invention, the sound source decoding means has a plurality of index tables having different description contents, extracts a predetermined parameter from at least one of the spectrum envelope information and the sound source information, and selects and uses the index table corresponding to that parameter, so that the bit rate can be reduced. In addition, deterioration of characteristics can be effectively suppressed.
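When the table is chosen from a parameter that the decoder can derive by itself, no selection bits need to be transmitted. The sketch below assumes, purely for illustration, that the parameter is a pitch lag obtained from the adaptive excitation code and that a single threshold separates two tables; the parameter, the threshold, the table contents, and the function names are all hypothetical.

```python
from math import ceil, log2

def select_table_index(pitch_lag, threshold=40):
    """Hypothetical rule: short pitch lag selects table 0, otherwise table 1."""
    return 0 if pitch_lag < threshold else 1

def decode_positions(tables, pitch_lag, combination_code):
    """Decoder side: derive the table from the parameter, then look up the code."""
    return tables[select_table_index(pitch_lag)][combination_code]

def combination_code_bits(table):
    """Bits spent on the combination code; no bits are spent on table selection."""
    return ceil(log2(len(table)))

# Hypothetical tables with different contents.
tables = [
    [(0, 1, 2), (0, 2, 4), (1, 3, 5), (2, 4, 6)],
    [(0, 10, 20), (5, 15, 25), (9, 19, 29), (3, 13, 23)],
]
```

The encoder would apply the same `select_table_index` rule to the same already-encoded parameter, so both sides arrive at the same table.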
[Brief description of the drawings]
FIG. 1 is a block diagram showing a speech encoding apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing a speech decoding apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the inside of the drive excitation encoding means in the speech encoding apparatus according to Embodiment 1 of the present invention.
FIG. 4 is a block diagram showing the inside of drive excitation decoding means in the speech decoding apparatus according to Embodiment 1 of the present invention.
FIG. 5 is an explanatory diagram showing a sound source position combination table.
FIG. 6 is an explanatory diagram showing an information storage method of a sound source position combination table.
FIG. 7 is an explanatory diagram showing an information storage method of a sound source position combination table.
FIG. 8 is a block diagram showing the inside of the drive excitation encoding means in a speech encoding apparatus according to Embodiment 2 of the present invention.
FIG. 9 is a block diagram showing the inside of drive excitation decoding means in a speech decoding apparatus according to Embodiment 2 of the present invention.
FIG. 10 is an explanatory diagram showing a sound source position / polarity combination table.
FIG. 11 is an explanatory diagram showing an information storage method of a sound source position / polarity combination table.
FIG. 12 is an explanatory diagram showing an information storage method of a sound source position / polarity combination table.
FIG. 13 is a block diagram showing the inside of the drive excitation encoding means in a speech encoding apparatus according to Embodiment 3 of the present invention.
FIG. 14 is a block diagram showing the inside of drive excitation decoding means in a speech decoding apparatus according to Embodiment 3 of the present invention.
FIG. 15 is a block diagram showing the inside of the drive excitation encoding means in a speech encoding apparatus according to Embodiment 4 of the present invention.
FIG. 16 is a block diagram showing the inside of drive excitation decoding means in a speech decoding apparatus according to Embodiment 4 of the present invention.
FIG. 17 is a block diagram showing a speech encoding apparatus using a conventional CELP method.
FIG. 18 is a block diagram showing a speech decoding apparatus using a conventional CELP method.
FIG. 19 is an explanatory diagram showing a sound source position table.
FIG. 20 is an explanatory diagram showing a sound source position / polarity table.
[Explanation of symbols]
21 linear prediction analysis means (envelope information encoding means), 22 linear prediction coefficient encoding means (envelope information encoding means), 23 adaptive excitation encoding means (sound source encoding means), 24 drive excitation encoding means (sound source encoding means), 25 gain encoding means (sound source encoding means), 26 multiplexing means, 31 separation means, 32 linear prediction coefficient decoding means (envelope information decoding means), 33 adaptive excitation decoding means (sound source decoding means), 34 drive excitation decoding means (sound source decoding means), 35 gain decoding means (sound source decoding means), 36, 37 multiplier (sound source decoding means), 38 adder (sound source decoding means), 39 synthesis filter, 41, 51, 71, 72, 81, 82, 91, 92 sound source position combination table (index table), 42, 44, 64, 84 algebraic excitation encoding means, 43, 53 sound source position / polarity combination table (index table), 52, 54, 74, 94 algebraic excitation decoding means, 61, 62 sound source position combination table (index table), 63, 73, 83, 93 selection means.

Claims

1. A speech encoding apparatus comprising: envelope information encoding means for extracting spectrum envelope information of input speech and encoding the spectrum envelope information; sound source encoding means for encoding sound source information of the input speech; and multiplexing means for multiplexing the spectrum envelope information encoded by the envelope information encoding means and the sound source information encoded by the sound source encoding means and outputting a speech code, wherein the sound source encoding means includes an index table that, from among all combinations of a plurality of sound source positions, extracts only the combinations of sound source positions whose evaluation value, determined individually for each combination of sound source positions, is higher than a reference value, and describes information indicating those combinations of sound source positions, selects any one combination of sound source positions from the index table, and encodes the sound source information of the input speech using that combination of sound source positions.

2. The speech encoding apparatus according to claim 1, wherein the information indicating the combination of sound source positions described in the index table is combination information of sound source positions encoded individually.

3. The speech encoding apparatus according to claim 1, wherein, when the sound source encoding means can recognize a set of combinations of sound source positions that includes at least the combinations of sound source positions whose evaluation value is higher than the reference value, the information indicating the combinations of sound source positions described in the index table is flag information indicating whether or not each of the recognized combinations of sound source positions is an element of the index table.

4. The speech encoding apparatus according to claim 3, wherein the sound source encoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, the set of combinations of sound source positions that includes at least the combinations of sound source positions whose evaluation value is higher than the reference value.

5. A speech encoding apparatus comprising: envelope information encoding means for extracting spectrum envelope information of input speech and encoding the spectrum envelope information; sound source encoding means for encoding sound source information of the input speech; and multiplexing means for multiplexing the spectrum envelope information encoded by the envelope information encoding means and the sound source information encoded by the sound source encoding means, wherein the sound source encoding means includes an index table that, from among all combinations of a plurality of paired data each composed of a sound source position and a polarity, extracts only the paired data combinations whose evaluation value, determined individually for each paired data combination, is higher than a reference value, and describes information indicating those paired data combinations, selects any one paired data combination from the index table, and encodes the sound source information of the input speech using that paired data combination.

6. The speech encoding apparatus according to claim 5, wherein the information indicating the paired data combinations described in the index table is individually encoded combination information of paired data.

7. The speech encoding apparatus according to claim 5, wherein, when the sound source encoding means can recognize a set of paired data combinations that includes at least the paired data combinations whose evaluation value is higher than the reference value, the information indicating the paired data combinations described in the index table is flag information indicating whether or not each of the recognized paired data combinations is an element of the index table.

8. The speech encoding apparatus according to claim 7, wherein the sound source encoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, the set of paired data combinations that includes at least the paired data combinations whose evaluation value is higher than the reference value.

9. The speech encoding apparatus, wherein the sound source encoding means includes a plurality of index tables having different description contents, and selects and uses any one of the index tables.

10. The speech encoding apparatus according to claim 9, wherein the sound source encoding means analyzes the input speech, extracts a predetermined parameter, and selects the index table corresponding to the parameter.

11. The speech encoding apparatus according to claim 9, wherein the sound source encoding means extracts a predetermined parameter from at least one of the spectrum envelope information and the sound source information, and selects the index table corresponding to the parameter.

12. A speech decoding apparatus comprising: separating means for separating spectrum envelope information and sound source information of input speech from a speech code; envelope information decoding means for decoding the spectrum envelope information separated by the separating means; and sound source decoding means for decoding the sound source information separated by the separating means with reference to the spectrum envelope information decoded by the envelope information decoding means, wherein the sound source decoding means includes an index table that, from among all combinations of a plurality of sound source positions, extracts only the combinations of sound source positions whose evaluation value, determined individually for each combination of sound source positions, is higher than a reference value, and describes information indicating those combinations of sound source positions, selects a combination of sound source positions from the index table based on a code indicating the combination included in the sound source information, and decodes the sound source information of the input speech using that combination of sound source positions.

13. The speech decoding apparatus according to claim 12, wherein the information indicating the combination of sound source positions described in the index table is sound source position combination information encoded individually.

14. The speech decoding apparatus according to claim 12, wherein, when the sound source decoding means can recognize a set of combinations of sound source positions that includes at least the combinations of sound source positions whose evaluation value is higher than the reference value, the information indicating the combinations of sound source positions described in the index table is flag information indicating whether or not each of the recognized combinations of sound source positions is an element of the index table.

15. The speech decoding apparatus according to claim 14, wherein the sound source decoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, the set of combinations of sound source positions that includes at least the combinations of sound source positions whose evaluation value is higher than the reference value.

16. A speech decoding apparatus comprising: separating means for separating spectrum envelope information and sound source information of input speech from a speech code; envelope information decoding means for decoding the spectrum envelope information separated by the separating means; and sound source decoding means for decoding the sound source information separated by the separating means with reference to the spectrum envelope information decoded by the envelope information decoding means, wherein the sound source decoding means includes an index table that, from among all combinations of a plurality of paired data each composed of a sound source position and a polarity, extracts only the paired data combinations whose evaluation value, determined individually for each paired data combination, is higher than a reference value, and describes information indicating those paired data combinations, selects a paired data combination from the index table based on a code indicating the combination included in the sound source information, and decodes the sound source information of the input speech using that paired data combination.

17. The speech decoding apparatus according to claim 16, wherein the information indicating the paired data combinations described in the index table is individually encoded combination information of paired data.

18. The speech decoding apparatus according to claim 16, wherein, when the sound source decoding means can recognize a set of paired data combinations that includes at least the paired data combinations whose evaluation value is higher than the reference value, the information indicating the paired data combinations described in the index table is flag information indicating whether or not each of the recognized paired data combinations is an element of the index table.

19. The speech decoding apparatus according to claim 18, wherein the sound source decoding means recognizes, from the frame length, the number of sound sources, and the number of indexes, the set of paired data combinations that includes at least the paired data combinations whose evaluation value is higher than the reference value.

20. The speech decoding apparatus according to any one of claims 12 to 19, wherein the sound source decoding means has a plurality of index tables having different description contents, and selects and uses the index table corresponding to a code indicating selection information included in the sound source information.

21. The speech decoding apparatus according to any one of claims 12 to 19, wherein the sound source decoding means has a plurality of index tables having different description contents, extracts a predetermined parameter from at least one of the spectrum envelope information and the sound source information, and selects and uses the index table corresponding to the parameter.