JP4228630B2

JP4228630B2 - Speech coding apparatus and speech coding program

Info

Publication number: JP4228630B2
Application number: JP2002255476A
Authority: JP
Inventors: 圭見市
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-08-30
Filing date: 2002-08-30
Publication date: 2009-02-25
Anticipated expiration: 2022-08-30
Also published as: JP2004093946A

Abstract

PROBLEM TO BE SOLVED: To provide a speech coding apparatus and a speech coding program that can encode speech signals in a relatively short time even on a processor with a relatively low operating frequency.
SOLUTION: In the autocorrelation matrix of the impulse response of the perceptually weighted synthesis filter, region 11 holds the significant computation results, region 12 is the mirror image of region 11 about the main diagonal, and region 13 holds the results of virtual computations that are not used afterwards. The autocorrelation matrix of the impulse response is computed by assigning the calculation to parallel arithmetic means as often as possible. To that end, the impulse response elements of the perceptually weighted synthesis filter are stored contiguously in a memory region, and virtual impulse response elements are stored immediately after them.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
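【Technical Field of the Invention】
The present invention relates to a speech coding apparatus and a speech coding program, and in particular to a speech coding apparatus and a speech coding program that encode speech in, for example, a telephone terminal or a telephone exchange.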
【０００２】
【従来の技術】
近年、携帯電話機の爆発的な普及に伴って伝送路を流れる音声トラフィック量が急激に増大している。この増大しているトラフィックに対応するために通信事業者は伝送設備の増強を行っている。しかしながら伝送設備の増強には多大なコストが必要であるため無尽蔵に増強することはできない。
【０００３】
また近年の通信の分野ではアナログ信号を送受するアナログ通信に代わり、ディジタル信号を送受するディジタル通信が主流である。ディジタル通信は、雑音の影響を受けにくい点や複数の信号を多重しやすい点でアナログ通信よりも優れているため急速に普及してきた。電話機を用いた電話システムにおいてもこの傾向は同じであり、アナログ信号である音声信号をディジタル化して送受することが主流となっている。
【０００４】
アナログ信号をディジタル信号に変換するためには標本化と量子化および符号化が必要である。標本化とは時間的に連続な信号波形を時間的に離れた時点の値で表現することである。また量子化とは波形の値を有限個の値の中の１つで近似的に表現することである。そして符号化とは具体的にどのように表現するかということで、通常は２進数で表現する。
【０００５】
アナログ信号である音声信号をディジタル化する技術としては、従来より波形の振幅をサンプリング定理に基づいて量子化する線形パルス符号化（Pulse Code Modulation：PCM）技術がよく知られている。この線形パルス符号化はアナログ信号を一様なステップで量子化するもので、通常“Ａ／Ｄ変換”と呼ばれているものと同じである。
【０００６】
電話端末を用いて通話を行うシステムに対して利用者が求める大きなポイントは、“誰とでもいつでもつながること”と“明瞭に会話ができること”である。このうちの“誰とでもいつでもつながること”を必ず実現しようとすると、すべての電話端末同士の接続の組み合わせに対する伝送路を予め確保しておく必要がある。しかしこれは膨大な組み合わせ数となる上に実際に通話をしていない場合であっても伝送路を占有することになり、経済性を大きく損ねることになるため現実的ではない。そこで伝送路を予め確保しておくことは行わずに、接続が必要となってから空いている伝送路を探してこれを確保して使用するようにしている。つまり伝送路は複数の利用者の共有資源という位置付けになる。したがって他の利用者によって伝送路が完全に占有されている場合には新たな接続を確立させることができないが、伝送路に空きが残っていれば新たな接続を確立することができる。つまりそれぞれの接続を確立するために必要な伝送路の占有量を小さくすることができれば、数多くの接続を同時に確立することができる。そこでそれぞれの接続が占有する伝送路の容量を小さくする技術が検討されている。
【０００７】
また“明瞭に会話ができること”が実現されるためには互いの音声信号ができるだけ元の信号から変化することなく相手に届けられる必要がある。音声信号を伝送のためにディジタル化するにあたって、その音声信号の変化を小さくするためには標本化と量子化と符号化をできるだけ細やかに行う必要がある。標本化の細やかさは時間的な離れ度合いを小さくすることで実現できる。つまりこれは標本化の周期を短くすることになる。また量子化の細やかさは波形の値を表現する有限個の近似的な値の変化ステップを小さくするとともに値の個数を多くすることで実現できる。つまり量子化のステップを小さくするとともに、近似値の範囲と近似値を表現する符号の取りうる値の範囲を大きくすることが必要である。
【０００８】
標本化の周期を短くすることによって単位時間当たりの標本の数は増加する。また量子化のステップを小さくして近似値を表わす符号の範囲を大きくすることによって符号の桁数あるいはビット数も増大する。伝送する情報の量は単位時間当たりの標本の数と符号の桁数あるいはビット数の積に比例するので、“明瞭に会話できること”を実現しようとすると伝送する情報の量が増大することになる。ところがこれは“誰とでもいつでもつながること”を実現するために必要である“それぞれの接続を確立するために必要な伝送路の占有量を小さくすること”とは相反した要求である。
【０００９】
そこでこれらの相反した要求を同時に満足するために、増大した音声情報を伝送する際に情報量を小さくする音声符号化の技術が検討されている。
【００１０】
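To reduce the amount of speech information to be transmitted, it is also important not to put unnecessary information into the transmitted data before speech coding is applied. For sampling, the amount of transmitted information can be reduced by limiting the sampling period to a fineness that is necessary and sufficient for digitizing speech. The upper limit of the frequency band needed for a telephone conversation is generally said to be 4 kHz, so it suffices to sample at a rate that can reproduce signals up to 4 kHz. By Shannon's sampling theorem, the sampling frequency in this case is twice that value, 8 kHz, which corresponds to a sampling period of 125 microseconds.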
【００１１】
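For quantization, there is a technique that uses the statistical properties of speech amplitude to compress the amplitude by a logarithmic transformation. The formulas known as μ-law and A-law are widely used for this transformation. With logarithmic compression, a speech waveform can be quantized more finely for the same number of quantization steps. There is also a technique called adaptive quantization, which varies the quantization step width over time to match the way the speech amplitude changes.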
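As an illustration of the logarithmic compression described above, the following Python sketch applies the μ-law formula before a uniform quantizer. The value μ = 255, the 8-bit quantizer, and all function names are assumptions made for this example; none of them is taken from this specification.

```python
import math

MU = 255.0  # assumed companding constant, chosen for illustration


def mu_law_compress(x: float, mu: float = MU) -> float:
    """Map an amplitude x in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: float = MU) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(y: float, bits: int = 8) -> float:
    """Uniform quantizer on [-1, 1] with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(levels - 1, int((y + 1.0) / step))
    return -1.0 + (index + 0.5) * step


def companded_roundtrip(x: float, bits: int = 8) -> float:
    """Compress, quantize uniformly, then expand."""
    return mu_law_expand(quantize(mu_law_compress(x), bits))
```

For a small amplitude such as 0.01, the companded path reproduces the value more closely than the same 8-bit quantizer applied directly, which is the effect the paragraph above refers to.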
【００１２】
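Because speech signals are periodic, there is correlation not only between adjacent samples but also between samples that are far apart. Information can be compressed by coding the difference between adjacent samples, or the prediction difference between the actual sample value and a value predicted using this correlation. Differential pulse code modulation (DPCM) is a technique that simply transmits the difference. One form of it allows the difference to take only the value “0” or “1”. This is called delta modulation (DM): “1” is transmitted when the difference between the current data and the immediately preceding quantized data exceeds a fixed value, and “0” is transmitted otherwise. This keeps the amount of difference information needed for transmission very small. An example of a technique that transmits the prediction difference is adaptive differential pulse code modulation (ADPCM). Data coded with ADPCM can be compressed to about half the size of the original data.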
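The one-bit scheme described above can be sketched in Python as follows. The step size, the threshold of zero, and the function names are assumptions for illustration only; the decoder follows the conventional delta-modulation rule of stepping up on “1” and down on “0”, which this specification does not spell out.

```python
def delta_modulate(samples, step=0.25, threshold=0.0):
    """Emit 1 when the sample exceeds the running estimate by more than
    `threshold`, otherwise 0, and move the estimate by one step."""
    bits, estimate = [], 0.0
    for x in samples:
        bit = 1 if x - estimate > threshold else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def delta_demodulate(bits, step=0.25):
    """Rebuild the staircase approximation from the bit stream."""
    out, estimate = [], 0.0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out
```

Each sample costs a single bit, at the price of a staircase approximation whose accuracy depends on the assumed step size.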
【００１３】
また、音声波形や音声の周波数成分情報等をディジタル化する際にそれぞれの標本化値ごとに量子化せずに複数の標本化値の組をまとめて１個の符号で表現する量子化技術がある。これはベクトル量子化と呼ばれるもので、複数の情報をまとめて１個の符号で表現することによって情報を圧縮することができる。
【００１４】
これまで説明した差分量子化技術や適応差分量子化技術は、音声信号の波形そのものをできるだけ忠実に表現しようとする技術である。これとは別に音声というものがいかにして生成されるのかをモデル化して、この生成モデルに基づいて音声をパラメータに変換して表現する“分析合成技術”と呼ばれる技術がある。音声の分析合成技術を用いた通信では音声信号波形そのものを伝送する代わりに音声を合成して作り出す際に必要な音源の性質を表わす情報と、その音源の出力信号に周波数フィルタ処理を行う際の周波数フィルタの性質を表わす情報を符号化して伝送する。この符号化された信号を受信した側では符号を復号して元の情報に変換した後、音源の性質を表わす情報と周波数フィルタの性質を表わす情報を基に音声信号を合成して再現する。
【００１５】
音声は声帯が振動することによって生じる周期性をもった有声音と、空気の流れによって生じる周期性をもたない無声音で構成されている。つまりこれらの周期性をもった音源情報と周期性を持たない音源情報の両方を伝送する必要がある。また人の声帯で生成された空気の振動は、のどや口の中といった太さの異なる複数の管を通った後に放射される。これは声帯で生成した信号に複数の周波数フィルタをかけて出力信号を得ていることに相当する。そこでこの信号通過経路を複数の周波数フィルタを合成した“合成フィルタ”と考える。この合成フィルタの性質は複数の周波数についての、その周波数フィルタの入出力特性で表わすことができる。
【００１６】
人の声が言葉として聞き分けられているということは、音声信号が時間の経過とともに変化しておりその変化を感じ取っていることになる。つまりこの音声信号の音源の性質とフィルタの性質も時間の経過とともに変化しており一様ではない。したがって分析合成技術を用いた音声信号の伝送においても、この時間の経過とともに変化した、それぞれ異なった情報を逐一伝送する必要がある。
【００１７】
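One example of speech analysis-synthesis is a technique called multipulse coding. Multipulse coding analyzes the input speech signal, extracts and transmits spectral envelope information, which indicates the strength of each frequency band making up the input speech signal, together with excitation information, and synthesizes speech on the receiving side from this information. The spectral envelope information is generally obtained by linear predictive analysis. The excitation information here is the so-called residual signal obtained by removing the spectral envelope information from the input speech signal, and it is represented by a pulse train made up of multiple pulses whose amplitudes and positions can be chosen freely. A proposal for improving transmission efficiency with multipulse coding is given in, for example, Japanese Unexamined Patent Application Publication No. H4-84200. Another example of speech analysis-synthesis is the technique called code excited linear prediction (CELP).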
【００１８】
図７は、符号励振線形予測技術を用いた装置の例として２台の携帯電話端末の音声入出力部分を表わしたものである。ここでは符号励振線形予測技術の概要を説明する。通常それぞれの携帯電話端末には音声を符号化する音声符号化装置と符号化された信号を音声に復号する音声復号化装置の両方が搭載されている。しかし、ここでは説明を容易にするために音声を送信する側に音声符号化装置１０１のみを示し、音声を受信する側に音声復号化装置１０２のみを示している。音声を送信する側では利用者が発した音声を携帯電話端末のマイクロフォン１０３で検出し、これを電気信号に変換して音声符号化装置１０１の音声入力手段１０４に入力する。音声入力手段１０４は音声を所定の周期で分割した音声信号１０５を生成する。分割された音声信号１０５は、線形予測分析器１０６に入力され線形予測分析が実施されて周波数フィルタ特性１０７が求められ、合成フィルタ１０８に入力される。この周波数フィルタ特性１０７に基づいて合成フィルタ１０８は、入力された音声信号１０５のもつフィルタ特性の反映された周波数フィルタとして動作する。また、合成フィルタ１０８には音源として用いることができる信号のパターンが各種格納されている符号帳１０９から、聴感重み付け誤差最小化器１１０からの制御信号に従って、音源に相当する音源信号１１１が入力される。符号帳１０９にはこの音源として用いることができる信号のパターンが各種格納されており、コードブックとも呼ばれている。また、符号帳１０９に格納された各種信号のパターンには、それぞれ符号が付与されている。入力された音源信号１１１は、周波数フィルタ特性１０７が反映された合成音声信号１１２となる。また音声信号１０５は、加算器１１３にも入力され合成音声信号１１２と加算される。この時、合成音声信号１１２を加算器１１３の負の入力とすることによって、加算器１１３からの出力信号は、音声信号１０５と合成音声信号１１２との差分信号１１４となる。この差分信号１１４が最小となる符号帳の出力を探索するために、聴感重み付け誤差最小化器１１０は、符号帳１０９に対して新たな音源信号１１１の出力を要求する。この要求を繰り返して、差分信号１１４が最小となった際に符号帳１０９から出力された音源信号１１１に付与されていた符号１１５が、伝送路１１６に対して出力される。また、これと併せて線形予測分析器１０６から出力された周波数フィルタ特性１０７が伝送路１１６に対して出力される。
【００１９】
音声を受信する側の音声復号化装置１０２は、符号帳１１７と合成フィルタ１１８と合成の完了した音声に周波数ごとの強弱を付加するポストフィルタ１１９で構成されている。伝送路１１６から受信した符号１２１は、音声を送信する側が送信した符号１１５と同一のものである。また、伝送路１１６から受信した周波数フィルタ特性１２２は、音声を送信する側が送信した周波数フィルタ特性１０７と同一のものである。音声復号化装置１０２は伝送路１１６から受信した符号１２１を符号帳１１７に入力して得た音源信号１２３を合成フィルタ１１８に入力する。符号帳１１７は、音声を送信する側がもつ符号帳１０９と同一のものであり、同じ符号が符号帳に入力された場合に出力される音源信号は同じものとなる。この音源信号１２３は、伝送路１１６から受信した周波数フィルタ特性１２２に従って周波数入出力特性が調整された合成フィルタ１１８に入力されて、合成音声１２４が出力される。この合成音声１２４はポストフィルタ１１９に入力されて周波数ごとの強弱を加えられて聴感上聞き取り易い合成音声１２５に変換された後スピーカ１２０から出力されるようになっている。
【００２０】
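In code excited linear prediction, various excitation patterns are stored in advance and a code is assigned to each. The speech signal input to the speech coding apparatus is divided at fixed time intervals, its spectral envelope is obtained by linear predictive analysis, and the synthesis filter coefficients are derived from it. For each excitation pattern in the codebook, synthesis filtering is performed with these coefficients as input parameters to generate synthesized speech. The synthesized speech is compared with the input speech signal, and the code assigned to the excitation pattern giving the smallest error is used as the code for the input speech signal. The finer and more varied the excitation patterns in the codebook, the more likely it is that a pattern closely resembling the input speech signal can be selected, and the better the speech is reproduced. However, creating a huge number of patterns and storing them in the speech coding apparatus increases both the storage required and the amount of comparison processing, so there is a limit. A technique has therefore been proposed that generates the patterns algebraically and so substitutes for a wide variety of patterns with little storage. Code excited linear prediction using this technique is called algebraic code excited linear prediction (ACELP) and is used in mobile telephone systems. A proposal for implementing ACELP with even less storage is given in, for example, Japanese Translation of PCT International Application Publication No. H10-502191.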
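The analysis-by-synthesis search summarized above can be sketched as an exhaustive loop in Python. This is a toy model under stated assumptions: the all-pole filter form, the per-pattern least-squares gain, and every name in the code are illustrative and are not the procedure of any particular recommendation.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter: y[n] = x[n] - sum_k lpc[k] * y[n-1-k]."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                acc -= a * out[n - 1 - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, lpc):
    """Return (index, gain, error) of the pattern whose gain-scaled
    synthesized output is closest to `target` in the squared-error sense."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, lpc)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero pattern cannot match anything
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

Only the winning index (together with the gain and the filter parameters) needs to be transmitted, which is where the reduction in information comes from; the cost is one synthesis per codebook entry, the computation that the rest of this description sets out to accelerate.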
【００２１】
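A technique for coding speech signals and the technique for decoding them back cannot reproduce the original speech signal correctly unless the matching pair is specified in detail. For a coding technique to spread to many devices, its detailed specification must also be widely known. Coding techniques are therefore specified in detail under standards bodies and published as recommendations. For algebraic code excited linear prediction, the international standards body ITU (International Telecommunication Union) has issued “ITU Recommendation G.723.1, low-rate mode (5.3 kbit/s)”. Another international standards body, 3GPP (3rd Generation Partnership Project), has issued “GSM-AMR” (Global System for Mobile Communications, Adaptive Multi-Rate Speech Transcoding). Both are coding techniques based on algebraic code excited linear prediction and are known for reproducing high-quality speech while greatly reducing the amount of code.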
【００２２】
これらの音声信号の符号化技術は電話システムにおいて利用者が持つ携帯電話端末で用いられる他に、電話端末間の接続を仲介する電話交換機にも用いられる。電話交換機においては、異なった音声符号化技術を採用している電話端末同士の通話を成立させるために符号化された音声信号の相互変換を行う。そこで、異なった音声符号化技術間で共通に取り扱える信号に変換するために音声の符号化と復号化を行う必要がある。電話交換機は同時に多数の電話端末間の接続を仲介する装置であるため、音声符号化の相互変換を行う回路は同時に多数の電話端末に対応できるものでなければならない。
【００２３】
図８は、符号励振線形予測技術を用いた音声符号化装置の要部を表わしたものである。この音声符号化装置を用いて、符号励振線形予測技術の概要を説明する。この音声符号化装置はフィルタ部位１３１と音源部位１３２と比較器部位１３３で構成されている。フレーム分割された音声信号１３４はまずフィルタ部位１３１に入力されて線形予測分析および量子化器１３５によって線形予測分析が行われて周波数フィルタ特性が求められた後、この周波数フィルタ特性が量子化される。量子化された周波数フィルタ特性は合成フィルタ１３６に入力されて入力音声のもつフィルタ特性をもったフィルタが形成される。
【００２４】
音源部位１３２は、音声のもつ声の高さに相当する周期成分の信号を生成する音源となる適応コードブック１３７と音声のもつ周期成分以外の信号を生成する音源となる雑音コードブック１３８をもつ。そしてこれらの音源から出力された信号の振幅を制御するためのゲインコードブック１３９と、適応コードブック１３７の生成した信号の振幅を調整する適応コードブック信号増幅器１４０と、雑音コードブック１３８の生成した信号の振幅を調整する雑音コードブック信号増幅器１４１をもつ。振幅の調整されたこれらの音源信号は、音源加算器１４２によって加算された後、フィルタ部位１３１の合成フィルタ１３６に入力されて周波数ごとの強弱を付加された合成音声１４３として生成される。
【００２５】
比較器部位１３３には音声信号１３４と合成音声１４３が入力されて、加算器１４４で加算される。ここで合成信号１４３を負の形式で入力することで差分を求める。この差分に対して聴感重み付けフィルタ１４５によって、聴感上の聞こえにくい周波数成分を強めるとともに、聴感上聞こえやすい周波数成分を弱めるフィルタ処理を行う。この処理によって人の聴感上で冗長な情報を削減し情報量を低減させる。この後誤差最小化器１４６は、合成音声１４３と音声信号１３４の最小二乗誤差を算出することで誤差を求めると共に、音源部位１３２のそれぞれのコードブックに対して別の信号出力を実行させる。この処理を繰り返し実行し、誤差が最小となった際のコードブックからの出力信号が目標の信号であるとみなし、この信号に付与されている符号を音声符号化装置の出力として得る。この、コードブックから目標の出力信号を見つけ出す処理は“パターン探索”、“パルス探索”あるいは“コードブック探索”等と呼ばれている。
【００２６】
図９は、人の聴感上で知覚されにくい情報を省き冗長な情報を削除する周波数ごとの強弱処理を行う聴感重み付けを実施した聴感重み付け音声信号と聴感重み付け合成音声信号とこれら２つの音声信号の誤差信号をベクトルで表わしたものである。ここで聴感重み付け音声信号１５１をｒ、聴感重み付け合成音声信号１５２をＧＨν_ξとすると、誤差信号１５３はｒ−ＧＨν_ξとして表わすことができる。ここで音声符号化に関する勧告の一例である“ＩＴＵ勧告Ｇ７２３．１”の記述を用いて代数符号励振線形予測で行うパターン探索原理を詳しく説明する。パターン探索は入力音声信号と合成音声信号の平均二乗誤差Ｅ_ξを求めて、この誤差が最小となるパターンを目的のパターンとするものである。この平均二乗誤差を求める演算を次の（１）式に示す。
【数１】
$$E_\xi = \lVert r - G H \nu_\xi \rVert^{2} \quad \cdots (1)$$
【００２７】
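Here, minimizing the error between the perceptually weighted speech signal r and the perceptually weighted synthesized speech signal GHν_ξ is the same as minimizing the mean square error E_ξ of equation (1). Expanding the right-hand side of equation (1) gives the following equation (2).
【００２８】
$$E_\xi = r^{T}\,(r - G H \nu_\xi) - (G H \nu_\xi)^{T}\,(r - G H \nu_\xi) \quad \cdots (2)$$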
【００２９】
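Consider the condition for minimum error in terms of “r−GHν_ξ”, which represents the error signal 153 in FIG. 9: the error is smallest when the perceptually weighted synthesized speech signal 152 and the error signal 153 are at right angles. This follows from the principle that the inner product of two orthogonal vectors is “0”. Therefore, when the error is at its minimum, the second term of equation (2) is “0”. In GHν_ξ, G is the codebook gain, that is, the gain applied to the excitation signal generated from each codebook pattern, and ν_ξ is the algebraic codeword representing the pattern at index ξ. H is a matrix whose elements are the impulse response of the synthesis filter, that is, the response obtained when an impulse signal containing a wide band of frequency components in equal measure is input to the synthesis filter formed by applying perceptual weighting to the spectral envelope information obtained by linear predictive analysis of the speech signal. The impulse response elements h(n) are arranged as a lower-triangular Toeplitz convolution matrix with h(0) on the main diagonal and h(1), ..., h(L−1) on the successive lower diagonals. Here n equals the index of each sample obtained when the input speech signal is divided into fixed-length intervals and sampled, and L is the number of samples in one such interval. A Toeplitz matrix is a matrix in which all elements on any line parallel to the main diagonal are equal. “Lower-triangular convolution” means that elements are set on and below the main diagonal and that “0” is set everywhere else.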
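The structure of the matrix H described above can be made concrete with a short Python sketch. The function names are hypothetical; the point shown is that multiplying a codeword by the lower-triangular Toeplitz matrix is the same as convolving it with h(n) and keeping the first L output samples.

```python
def toeplitz_lower(h):
    """Lower-triangular Toeplitz convolution matrix: H[i][j] = h[i - j] for i >= j."""
    n = len(h)
    return [[h[i - j] if i >= j else 0.0 for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def convolve_truncated(h, v):
    """First len(h) samples of the convolution of h and v."""
    n = len(h)
    return [sum(h[k] * v[i - k] for k in range(i + 1)) for i in range(n)]
```

For h = [1, 2, 3] the matrix is [[1, 0, 0], [2, 1, 0], [3, 2, 1]]: h(0) on the main diagonal and h(1), h(2) on the diagonals below it.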
【００３０】
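Based on the principle that the inner product of orthogonal vectors is “0”, the second term of equation (2) is set to “0”. The equation is then rearranged by differentiating with respect to the codebook gain G, the only element that can be varied as a parameter, and setting the derivative to “0”. Collecting terms in the codebook gain G gives the following equation (3).
【数２】
$$G = \frac{r^{T} H \nu_\xi}{\nu_\xi^{T} H^{T} H \nu_\xi} \quad \cdots (3)$$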
【００３１】
これを（１）式に代入すると平均二乗誤差Ｅ_ξは、次の（４）式のように求められる。
【数３】
$$E_\xi = r^{T} r - \frac{(r^{T} H \nu_\xi)^{2}}{\nu_\xi^{T} H^{T} H \nu_\xi} \quad \cdots (4)$$
【００３２】
この平均二乗誤差Ｅ_ξを最小化するには、（４）式の第２項を最大化すればよい。そこで（４）式の第２項をτ_ξとすると、次の（５）式が求められる。
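【数４】
$$\tau_\xi = \frac{C_\xi^{2}}{\varepsilon_\xi} = \frac{(d^{T} \nu_\xi)^{2}}{\nu_\xi^{T} \Phi\, \nu_\xi} \quad \cdots (5)$$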
【００３３】
ここでは（４）式からこの（５）式を求めるにあたり、次の（６）式を用いた。ここでｄは聴感重み付け音声信号と聴感重み付け合成フィルタのインパルス応答の相関を表わすベクトルである。
【００３４】
ｄ＝Ｈ^Tｒ ……（６）
【００３５】
また、（４）式から（５）式を求めるにあたり、次の（７）式を用いる。Φは聴感重み付け合成フィルタのインパルス応答の共分散行列である。
【００３６】
Φ＝Ｈ^TＨ ……（７）
【００３７】
そして、このベクトルｄと行列Φをコードブック探索に先だって演算によって求める。このベクトルｄのそれぞれの要素は、次の（８）式で表わすことができる。
【数５】
$$d[j] = \sum_{n=j}^{N-1} r(n)\, h(n-j) \quad \cdots (8)$$
【００３８】
ここで“ｊ”は“０”以上でかつ“Ｎ−１”以下である。また、“Ｎ”は聴感重み付け合成フィルタのインパルス応答の要素数であり、また音声をフレーム分割した際の１フレームあたりの標本数である。そして“ｎ”と“ｉ”と“ｊ”は聴感重み付け合成フィルタのインパルス応答の要素と標本化された音声の要素と聴感重み付け合成フィルタのインパルス応答の共分散行列の要素を番号付けするための値である。
【００３９】
また、行列Φ（ｉ，ｊ）は次の（９）式で表わすことができる。
【数６】
$$\Phi(i,j) = \sum_{n=j}^{N-1} h(n-i)\, h(n-j) \quad \cdots (9)$$
【００４０】
ここでｉは“０”以上でかつ“Ｎ−１”以下であり、ｉはｊ以下である。
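The relationship between the matrix forms d = H^T r and Φ = H^T H and their element-wise sums over the impulse response can be checked with a small Python sketch. All function names are hypothetical, and the sums are written out from the definition of the lower-triangular Toeplitz matrix H rather than copied from any recommendation.

```python
def toeplitz_lower(h):
    """H[i][j] = h[i - j] for i >= j, otherwise 0."""
    n = len(h)
    return [[h[i - j] if i >= j else 0.0 for j in range(n)] for i in range(n)]


def correlation_d(h, r):
    """d[j] = sum over n = j .. N-1 of r(n) * h(n - j)."""
    n = len(h)
    return [sum(r[k] * h[k - j] for k in range(j, n)) for j in range(n)]


def phi_elementwise(h):
    """Phi(i, j) = sum over n = j .. N-1 of h(n - i) * h(n - j) for j >= i,
    mirrored into the lower triangle."""
    n = len(h)
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = sum(h[k - i] * h[k - j] for k in range(j, n))
            phi[i][j] = phi[j][i] = value
    return phi


def phi_matrix_product(h):
    """Phi computed literally as the matrix product H^T H."""
    big_h = toeplitz_lower(h)
    n = len(h)
    return [[sum(big_h[k][i] * big_h[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

For h = [1, 2, 3] both routes give Φ = [[14, 8, 3], [8, 5, 2], [3, 2, 1]], and with r = [1, 1, 1] the correlation vector is d = [6, 3, 1].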
【００４１】
この“ＩＴＵ勧告Ｇ７２３．１”の例では、パターン探索は４本のパルスの位置と極性を探索する。そのために、それぞれのパルス位置に対応した４重のループ処理が実行される。そして、それぞれのループ処理によって新しいパルスの寄与分が加算される。そこで、（５）式の相関Ｃは以下の（１０）式で表わすことができる。
【００４２】
Ｃ＝α₀ｄ［ｍ₀］＋α₁ｄ［ｍ₁］＋α₂ｄ［ｍ₂］＋α₃ｄ［ｍ₃］……（１０）
【００４３】
ここで、ｍ_kはｋ番目のパルスの位置を示しており、α_kはｋ番目のパルスの極性を示している。また、ベクトルｄは（８）式にて求めた値である。
【００４４】
ここで探索しているパルスは、振幅が零でないことを意味する“非零パルス”と呼ばれるものである。この非零パルスの１サブフレームあたりの本数は、それぞれの代数符号励振線形予測技術の種類ごとに定められている。そして、決められたパルス位置の複数の候補の中から演算によって最適な位置を決定する。一例として示した勧告Ｇ．７２３．１で規定されている２種類の伝送速度のうちの低ビットレートの場合、（１０）式に４個の項がありそれぞれがパルスの位置を示しているように非零パルスの本数は最大４本である。また、勧告Ｇ．７２３．１ではフレーム長が３０ミリ秒、フレームを４分割したサブフレーム長が７．５ミリ秒、サンプリング周波数が８キロヘルツなので、１サブフレーム中のサンプル数は“６０”となる。非零パルス位置候補は、演算量削減のために偶数番目のみと考えている。非零パルス部分以外は計算をしないことにより、演算量を大幅に低減している。サンプル数“６０”の中の偶数番目は３０本あり、この中から４本を選ぶため、それぞれは８カ所の中から選ぶものとする。つまり４本の非零パルスはおのおの８カ所の位置候補のうち最適な場所を演算で選ぶ。しかし“８カ所”の４倍は“３２カ所”となり、“３０”を“２”超過している。この時の３番目と４番目のパルスはフレーム内に収まらずに、結果としてパルスの本数が減ることもあり得る。この時は、非零パルスの本数は４本よりも少なくなる。別の例として、勧告のＧＳＭ−ＡＭＲでは８種類の伝送速度（ビットレート）が存在し、非零パルスの本数は、一部重複するものを含めて６種類のパターンがある。
【００４５】
（５）式における偶数パルス位置にパルスのあるパターンのベクトルのもつエネルギーを求める。このエネルギーεは、次の（１１）式で表わすことができる。
【００４６】
ε＝Φ（ｍ₀，ｍ₀）
＋Φ（ｍ₁，ｍ₁）＋２α₀α₁Φ（ｍ₀，ｍ₁）
＋Φ（ｍ₂，ｍ₂）＋２［α₀α₂Φ（ｍ₀，ｍ₂）＋α₁α₂Φ（ｍ₁，ｍ₂）］
＋Φ（ｍ₃，ｍ₃）＋２［α₀α₃Φ（ｍ₀，ｍ₃）＋α₁α₃Φ（ｍ₁，ｍ₃）＋α₂α₃Φ（ｍ₂，ｍ₃）］ ……（１１）
【００４７】
演算量を削減するために、パルスは偶数番目に存在するものと仮定して計算を行い、奇数番目のパルス位置にパルスのあるパターンのベクトルについては、近似を用いて算出する。奇数番目のパルス位置にパルスのあるパターンのベクトルを計算するために、ベクトルｄ［ｊ］と対称行列Φ（ｍ₁，ｍ₂）を変形して、新たに式ｓ［ｊ］を定義してベクトルｄ’［ｊ］を構成する。
【００４８】
ｓ［２ｊ］＝ｓ［２ｊ＋１］＝ｓｉｇｎ（ｄ［２ｊ］）
｜ｄ［２ｊ］｜＞｜ｄ［２ｊ＋１］｜の場合
ｓ［２ｊ］＝ｓ［２ｊ＋１］＝ｓｉｇｎ（ｄ［２ｊ＋１］）
それ以外の場合 ……（１２）
【００４９】
ｄ’［ｊ］＝ｄ［ｊ］×ｓ［ｊ］ ……（１３）
Φ’（ｉ，ｊ）＝ｓ［ｉ］×ｓ［ｊ］×Φ（ｉ，ｊ） ……（１４）
【００５０】
ここで、（１３）式におけるベクトルｄ’［ｊ］と（１４）式における対称行列Φ’（ｉ，ｊ）は、ベクトルｄ［ｊ］と対称行列Φ（ｉ，ｊ）に、式ｓ［ｊ］を用いてパルスの極性要素を取り込んだものである。そのため、（１０）式と（１１）式では、すべてのパルスの極性αは“１”とみなすことができる。したがって、（１０）式と（１１）式は、以下の（１５）式と（１６）式のように表わすことができる。
【００５１】
Ｃ＝ｄ’［ｍ₀］＋ｄ’［ｍ₁］＋ｄ’［ｍ₂］＋ｄ’［ｍ₃］……（１５）
ε＝Φ’（ｍ₀，ｍ₀）
＋Φ’（ｍ₁，ｍ₁）＋２Φ’（ｍ₀，ｍ₁）
＋Φ’（ｍ₂，ｍ₂）＋２［Φ’（ｍ₀，ｍ₂）＋Φ’（ｍ₁，ｍ₂）］
＋Φ’（ｍ₃，ｍ₃）＋２［Φ’（ｍ₀，ｍ₃）＋Φ’（ｍ₁，ｍ₃）＋Φ’（ｍ₂，ｍ₃）］……（１６）
【００５２】
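The principle of the pattern search is to substitute the energy ε obtained here and the correlation C obtained from equation (15) for the energy ε_ξ and the correlation C_ξ in equation (5), and to find the pulse positions at which the ratio τ_ξ of squared correlation to energy is largest. To carry out a pattern search on this principle, algebraic code excited linear prediction computes the N-by-N two-dimensional matrix Φ(i, j) as the correlation of the lower-triangular Toeplitz convolution matrix H, built from the impulse response h(n) of the perceptually weighted synthesis filter, with its own transpose.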
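The search criterion built from equations (12) to (16) can be sketched in Python as follows. This is an exhaustive toy search over hypothetical position tracks, not the nested-loop search of any recommendation; the function names and the track layout are assumptions. The sketch selects the positions with the largest ratio C²/ε, in line with the statement accompanying equation (5) that the second term of equation (4) is to be maximized.

```python
from itertools import product


def pair_signs(d):
    """Equation (12): each pair (2j, 2j+1) takes the sign of whichever of
    d[2j], d[2j+1] has the larger magnitude. len(d) must be even."""
    s = [0.0] * len(d)
    for j in range(0, len(d) - 1, 2):
        ref = d[j] if abs(d[j]) > abs(d[j + 1]) else d[j + 1]
        s[j] = s[j + 1] = 1.0 if ref >= 0 else -1.0
    return s


def absorb_signs(d, phi):
    """Equations (13) and (14): fold the pulse signs into d and Phi."""
    s = pair_signs(d)
    n = len(d)
    d_p = [d[j] * s[j] for j in range(n)]
    phi_p = [[s[i] * s[j] * phi[i][j] for j in range(n)] for i in range(n)]
    return d_p, phi_p


def criterion(positions, d_p, phi_p):
    """Equations (15) and (16) combined into tau = C**2 / epsilon."""
    c = sum(d_p[m] for m in positions)
    e = sum(phi_p[m][m] for m in positions)
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            e += 2.0 * phi_p[positions[a]][positions[b]]
    return c * c / e


def best_positions(tracks, d_p, phi_p):
    """Try every combination of one position per track; keep the largest tau."""
    return max(product(*tracks), key=lambda pos: criterion(pos, d_p, phi_p))
```

Every evaluation of the criterion reads several elements of Φ', which is why the whole matrix is prepared before the search begins.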
【００５３】
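Taking Recommendation GSM-AMR as an example of algebraic code excited linear prediction, the speech coding process is now explained with concrete figures. In GSM-AMR the unit in which the speech signal is coded is a subframe 5 ms long. A frame groups four subframes and is 20 ms long. Since the sampling frequency is 8 kHz, the number of samples N in one subframe is “40”. The pattern search is carried out on these 40 samples.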
【００５４】
パターン探索を行うにあたり、まず符号器のもつ聴感重み付け合成フィルタのインパルス応答ｈ（ｎ）を用いた下三角テプリッツ畳み込み行列Ｈと、行列Ｈの転置行列の相関演算を実施し、Ｎ行Ｎ列の二次元行列Φ（ｉ，ｊ）を求める。ここでサンプル数Ｎが４０であるので、Ｎ行Ｎ列の二次元行列Φ（ｉ，ｊ）の要素数は“４０×４０”の１６００個となる。また、４０個の要素を番号付けする値となる“ｎ”の値の範囲は０〜３９となる。
【００５５】
二次元行列Φ（ｉ，ｊ）は、（９）式に“ｎ”の値の範囲を反映させて、次の（１７）式で表わすことができる。
【数７】
$$\Phi(i,j) = \sum_{n=j}^{39} h(n-i)\, h(n-j) \quad \cdots (17)$$
【００５６】
ここでｊはｉ以上であり、ｉは“０”以上でかつ“３９”以下である。
【００５７】
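These operations would have to be carried out for every element of the N-by-N two-dimensional matrix Φ(i, j). However, because Φ(i, j) is the product of a lower-triangular Toeplitz convolution matrix and its transpose, it is a symmetric matrix. Using this property, the elements are obtained not with N² computations but with “N(N+1)/2” computations. Various proposals of this kind have been made for reducing the amount of computation in the pattern search.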
【００５８】
たとえば、特開平１１−３２７５９９公報では符号励振線形予測技術を用いた音声符号化装置でコードブックからパターン探索する際に用いる聴感重み付け合成フィルタのインパルス応答行列の長さを音声の周期成分の２分の１以下の長さに短縮している。つまりインパルス応答行列の長さを短縮することで、インパルス応答行列の相関演算量を削減している。そして、この削減した部分は近似処理を行うことで置き換えている。しかしこの近似処理を用いた置き換えは、近似処理をせずに相関演算を行った場合とは異なった結果となる可能性があり音声品質を劣化させる要因となりかねず、また、各勧告とのビットイグザクトを保証できなくなる恐れがある。
【００５９】
また、特開平０３−１８９７００公報では符号励振線形予測技術を用いた音声符号化装置でコードブックからパターン探索に用いる聴感重み付け合成フィルタのインパルス応答行列の相関行列の演算を初期要素のみに対して行っている。そして、それ以外の要素については１つ前に求めた要素を基に再帰的に求めている。これにより演算の重複を削減して演算量を削減している。しかしこの再帰的な演算は、再帰的な演算をせずに相関演算を実施した場合とは異なった結果となる可能性があり音声品質を劣化させる要因となりかねず、また、各勧告とのビットイグザクトを保証できなくなる恐れがある。
【００６０】
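【Problems to Be Solved by the Invention】
As described above, a speech coding apparatus using code excited linear prediction can reproduce high-quality speech from a small amount of transmitted information. However, a large amount of computation must be carried out in a short time to extract the features of the speech and to find, in the codebook, the information suited to representing them. In particular, a line-concentrating device such as a telephone exchange, which performs speech coding for many telephone terminals, must carry out this large amount of computation once for each of those terminals. To cope with this, it has been necessary to provide many processors with a high operating frequency, or processors with an even higher operating frequency.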
【００６１】
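An object of the present invention is therefore to provide a speech coding apparatus and a speech coding program that can encode speech signals in a relatively short time even on a processor with a relatively low operating frequency.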
【００６２】
【課題を解決するための手段】
本発明では、（イ）通話者が発する音声を入力する音声入力手段と、（ロ）この音声入力手段で入力された音声を代数符号励振線形予測による音声符号化アルゴリズムごとに、Ｌ個（Ｌは自然数）の標本を持つサブフレームｑ個（ｑは自然数）を持つ、所定の単位時間長としてのフレームに分割する分割手段と、（ハ）この分割手段で分割された音声のフレームごとの周波数強度分布を表わすスペクトラム包絡に基づく周波数入出力特性をもち、前記した音声符号化アルゴリズムで音源信号と共に音声合成モデルを構成する合成フィルタを生成する合成フィルタ生成手段と、（ニ）この合成フィルタ生成手段によって生成された合成フィルタに広帯域の周波数成分を均等に含むインパルス信号を入力した際の応答であり、音声符号化アルゴリズムごとに定まるＬ個の要素をもつ前記した合成フィルタのインパルス応答を求めるインパルス応答取得手段と、（ホ）このインパルス応答取得手段によって求めたＬ個の合成フィルタのインパルス応答を、下三角テプリッツ畳み込み行列の要素として、前記した下三角テプリッツ畳み込み行列とその転置行列の積を、その積が対称行列となる性質を利用して作成し、前記した下三角テプリッツ畳み込み行列とその転置行列の自己相関演算を合成フィルタのインパルス応答のＬ個の要素のうちのこれよりも少ないｎ個（ｎは自然数）の要素ずつ取り出して１処理周期あたりｎ個の要素を並列して実行する並列演算で、ｎ個の合成フィルタのインパルス応答の自己相関値を求める並列演算手段と、（へ）この並列演算手段によるｎ個の要素に関する前記した並列演算を繰り返してＬ×Ｌ個の合成フィルタのインパルス応答の自己相関値を求める繰り返し手段と、（ト）この繰り返し手段によるｎ個の要素に関する前記した並列演算の繰り返しにより、演算対象の合成フィルタのインパルス応答の要素をＬ個を越える要素数に仮想的に拡張して前記した並列演算による自己相関演算を実施する演算要素拡張手段と、（チ）前記した繰り返し手段と前記した演算要素拡張手段によって求めた自己相関値から、本来のインパルス応答の値を用いて演算した結果を有意なものとして選択する自己相関値選択手段と、（リ）前記したインパルス応答取得手段で求めた合成フィルタのインパルス応答と音声の相互相関値演算を行う相互相関演算手段と、（ヌ）前記した自己相関値選択手段によって選択した有意な自己相関値と前記した相互相関演算手段によって求めた相互相関値のうちの前記した音声符号化アルゴリズムごとに設定されるサブフレームについての前記したＬ個の要素に応じて定まる所定の個数の要素を選び出して比較演算を行い音声の構成要素の１つである空気の流れによって生じる周期性をもたない無声音成分を表わす情報を求める無声音解析手段と、（ル）この無声音解析手段によって求めた無声音成分を表わす情報を音声の非周期成分の要素として通話者の通話相手に向けて出力する出力手段とを音声符号化装置に具備させ、
前記した並列演算手段は、（ホ−１）前記したインパルス応答の配列ｈ（ｎ）の前記した下三角テプリッツ畳み込み行列としての自己相関行列Φ（ｉ，ｊ）における“ｉ”と“ｊ”を“０”で初期設定する初期設定手段と、（ホ−２）この演算を実装置環境上で実施するにあたって前記したインパルス応答の配列ｈ（ｉ）が１データ当たりａビット（ａは自然数）の要素の配列であり、演算処理に使用するレジスタのビット幅がｂビット（ｂは自然数）で、全部でｃ個（ｃは自然数）のレジスタファイルが前記した並列演算のための演算器に配置されているものとし、前記した演算器は１処理周期に最大ｎ個の演算を並列して実行できるものとしたとき、前記したレジスタにはｂをａで割った商であるｄ個（ｄは自然数）のａビットデータを一括して読み込むことが可能なことにより、前記した演算器にまずインパルス応答の配列ｈ（ｉ）〜ｈ（ｉ＋（ｄ−１））を読み込み、次にインパルス応答の配列ｈ（ｊ）〜ｈ（ｊ＋（ｄ−１））を読み込む読込処理手段と、（ホ−３）次のｎ個の演算式を１処理周期で並列して、それぞれ積算結果を求める演算手段と、
ｈ（ｉ）×ｈ（ｊ）
ｈ（ｉ＋１）×ｈ（ｊ＋１）
ｈ（ｉ＋２）×ｈ（ｊ＋２）
……
ｈ（ｉ＋（ｎ−１））×ｈ（ｊ＋（ｎ−１））
（ホ−４）前記した“ｉ”の値および前記した“ｊ”の値に“ｎ”の値をそれぞれ加算して、前記した演算器の前記したｃ個のレジスタファイルに格納する積算結果の個数がｂを２ａで割った商にｃを掛けたレジスタファイル満杯値に達するまで前記した演算手段による演算を繰り返す演算繰り返し制御手段と、（ホ−５）この演算繰り返し制御手段によって前記したレジスタファイル満杯値に到達したとき、これら全積算結果を前記したｃ個のレジスタファイルに連続で格納するレジスタファイル格納制御手段を具備し、
前記した演算要素拡張手段は、前記した下三角テプリッツ畳み込み行列がインパルス応答の要素数をＮとするとき、Ｎ行Ｎ列の対称行列である性質を利用して、自己相関行列Φ（ｉ，ｊ）を算出するにあたり全要素ｎの二乗個ではなく、「ｎ×（ｎ＋１）／２」個を計算し、要素Φ（ｉ，ｊ）と要素Φ（ｊ，ｉ）が等しいものである関係を利用して、計算を行っていない要素部分を補間することを特徴としている。
【００６４】
また本発明では、（イ）通話者が発する音声を入力する音声入力手段と、（ロ）この音声入力手段で入力された音声を代数符号励振線形予測による音声符号化アルゴリズムごとに、Ｌ個（Ｌは自然数）の標本を持つサブフレームｑ個（ｑは自然数）を持つ、所定の単位時間長としてのフレームに分割する分割手段と、（ハ）この分割手段で分割された音声のフレームごとの周波数強度分布を表わすスペクトラム包絡に基づく周波数入出力特性をもち、前記した音声符号化アルゴリズムで音源信号と共に音声合成モデルを構成する合成フィルタを生成する合成フィルタ生成手段と、（ニ）この合成フィルタ生成手段によって生成された合成フィルタに広帯域の周波数成分を均等に含むインパルス信号を入力した際の応答であり、音声符号化アルゴリズムごとに定まるＬ個の要素をもつ前記した合成フィルタのインパルス応答を求めるインパルス応答取得手段と、（ホ）このインパルス応答取得手段によって求めたＬ個の合成フィルタのインパルス応答を、下三角テプリッツ畳み込み行列の要素として、前記した下三角テプリッツ畳み込み行列とその転置行列の積を、その積が対称行列となる性質を利用して作成し、前記した下三角テプリッツ畳み込み行列とその転置行列の自己相関演算を合成フィルタのインパルス応答のＬ個の要素のうちのこれよりも少ないｎ個（ｎは自然数）の要素ずつ取り出して１処理周期あたりｎ個の要素を並列して実行する並列演算で、ｎ個の合成フィルタのインパルス応答の自己相関値を求める並列演算手段と、（へ）この並列演算手段によるｎ個の要素に関する前記した並列演算を繰り返してＬ×Ｌ個の合成フィルタのインパルス応答の自己相関値を求める繰り返し手段と、（ト）この繰り返し手段によるｎ個の要素に関する前記した並列演算の繰り返しにより、演算対象の合成フィルタのインパルス応答の要素にＬ−１個の仮想的な合成フィルタのインパルス応答要素を追加拡張して前記した並列演算を実施する演算要素拡張手段と、（チ）前記した繰り返し手段と前記した演算要素拡張手段によって求めた自己相関値から、本来のインパルス応答の値を用いて演算した結果を有意なものとして選択する自己相関値選択手段と、（リ）前記したインパルス応答取得手段で求めた合成フィルタのインパルス応答と音声の相互相関値演算を行う相互相関演算手段と、（ヌ）前記した自己相関値選択手段によって選択した有意な自己相関値と前記した相互相関演算手段によって求めた相互相関値のうちの前記した音声符号化アルゴリズムごとに設定されるサブフレームについての前記したＬ個の要素に応じて定まる所定の個数の要素を選び出して比較演算を行い音声の構成要素の１つである空気の流れによって生じる周期性をもたない無声音成分を表わす情報を求める無声音解析手段と、（ル）この無声音解析手段によって求めた無声音成分を表わす情報を音声の非周期成分の要素として通話者の通話相手に向けて出力する出力手段とを音声符号化装置に具備させ、
前記した並列演算手段は、（ホ−１）前記したインパルス応答の配列ｈ（ｎ）の前記した下三角テプリッツ畳み込み行列としての自己相関行列Φ（ｉ，ｊ）における“ｉ”と“ｊ”を“０”で初期設定する初期設定手段と、（ホ−２）この演算を実装置環境上で実施するにあたって前記したインパルス応答の配列ｈ（ｉ）が１データ当たりａビット（ａは自然数）の要素の配列であり、演算処理に使用するレジスタのビット幅がｂビット（ｂは自然数）で、全部でｃ個（ｃは自然数）のレジスタファイルが前記した並列演算のための演算器に配置されているものとし、前記した演算器は１処理周期に最大ｎ個の演算を並列して実行できるものとしたとき、前記したレジスタにはｂをａで割った商であるｄ個（ｄは自然数）のａビットデータを一括して読み込むことが可能なことにより、前記した演算器にまずインパルス応答の配列ｈ（ｉ）〜ｈ（ｉ＋（ｄ−１））を読み込み、次にインパルス応答の配列ｈ（ｊ）〜ｈ（ｊ＋（ｄ−１））を読み込む読込処理手段と、（ホ−３）次のｎ個の演算式を１処理周期で並列して、それぞれ積算結果を求める演算手段と、
ｈ（ｉ）×ｈ（ｊ）
ｈ（ｉ＋１）×ｈ（ｊ＋１）
ｈ（ｉ＋２）×ｈ（ｊ＋２）
……
ｈ（ｉ＋（ｎ−１））×ｈ（ｊ＋（ｎ−１））
（ホ−４）前記した“ｉ”の値および前記した“ｊ”の値に“ｎ”の値をそれぞれ加算して、前記した演算器の前記したｃ個のレジスタファイルに格納する積算結果の個数がｂを２ａで割った商にｃを掛けたレジスタファイル満杯値に達するまで前記した演算手段による演算を繰り返す演算繰り返し制御手段と、（ホ−５）この演算繰り返し制御手段によって前記したレジスタファイル満杯値に到達したとき、これら全積算結果を前記したｃ個のレジスタファイルに連続で格納するレジスタファイル格納制御手段を具備し、
前記した演算要素拡張手段は、前記した下三角テプリッツ畳み込み行列がインパルス応答の要素数をＮとするとき、Ｎ行Ｎ列の対称行列である性質を利用して、自己相関行列Φ（ｉ，ｊ）を算出するにあたり全要素ｎの二乗個ではなく、「ｎ×（ｎ＋１）／２」個を計算し、要素Φ（ｉ，ｊ）と要素Φ（ｊ，ｉ）が等しいものである関係を利用して、計算を行っていない要素部分を補間することを特徴としている。
【００６８】
更に本発明では、コンピュータに、（イ）音声入力手段で入力された音声を代数符号励振線形予測による音声符号化アルゴリズムごとに、Ｌ個（Ｌは自然数）の標本を持つサブフレームｑ個（ｑは自然数）を持つ、所定の単位時間長としてのフレームに分割する分割処理と、（ロ）この分割処理で分割された音声のフレームごとの周波数強度分布を表わすスペクトラム包絡に基づく周波数入出力特性をもち、前記した音声符号化アルゴリズムで音源信号と共に音声合成モデルを構成する合成フィルタを生成する合成フィルタ生成処理と、（ハ）この合成フィルタ生成処理によって生成された合成フィルタに広帯域の周波数成分を均等に含むインパルス信号を入力した際の応答であり、音声符号化アルゴリズムごとに定まるＬ個の要素をもつ前記した合成フィルタのインパルス応答を求めるインパルス応答取得処理と、（ニ）このインパルス応答取得処理によって求めたＬ個の合成フィルタのインパルス応答を、下三角テプリッツ畳み込み行列の要素として、前記した下三角テプリッツ畳み込み行列とその転置行列の積を、その積が対称行列となる性質を利用して作成し、前記した下三角テプリッツ畳み込み行列とその転置行列の自己相関演算を合成フィルタのインパルス応答のＬ個の要素のうちのこれよりも少ないｎ個（ｎは自然数）の要素ずつ取り出して１処理周期あたりｎ個の要素を並列して実行する並列演算で、ｎ個の合成フィルタのインパルス応答の自己相関値を求める並列演算処理と、（ホ）この並列演算処理によるｎ個の要素に関する前記した並列演算を繰り返してＬ×Ｌ個の合成フィルタのインパルス応答の自己相関値を求める繰り返し処理と、（へ）この繰り返し処理によるｎ個の要素に関する前記した並列演算の繰り返しにより、演算対象の合成フィルタのインパルス応答の要素をＬ個を越える要素数に仮想的に拡張して前記した並列演算による自己相関演算を実施する演算要素拡張処理と、（ト）前記した繰り返し処理と前記した演算要素拡張処理によって求めた自己相関値から、本来のインパルス応答の値を用いて演算した結果を有意なものとして選択する自己相関値選択処理と、（チ）前記したインパルス応答取得処理で求めた合成フィルタのインパルス応答と音声の相互相関値演算を行う相互相関演算処理と、（リ）前記した自己相関値選択処理によって選択した有意な自己相関値と前記した相互相関演算処理によって求めた相互相関値のうちの前記した音声符号化アルゴリズムごとに設定されるサブフレームについての前記したＬ個の要素に応じて定まる所定の個数の要素を選び出して比較演算を行い音声の構成要素の１つである空気の流れによって生じる周期性をもたない無声音成分を表わす情報を求める無声音解析処理とを音声符号化プログラムとして実行させ、
前記した並列演算処理は、（ニ−１）前記したインパルス応答の配列ｈ（ｎ）の前記した下三角テプリッツ畳み込み行列としての自己相関行列Φ（ｉ，ｊ）における“ｉ”と“ｊ”を“０”で初期設定する初期設定処理と、（ニ−２）この演算を実装置環境上で実施するにあたって前記したインパルス応答の配列ｈ（ｉ）が１データ当たりａビット（ａは自然数）の要素の配列であり、演算処理に使用するレジスタのビット幅がｂビット（ｂは自然数）で、全部でｃ個（ｃは自然数）のレジスタファイルが前記した並列演算のための演算器に配置されているものとし、前記した演算器は１処理周期に最大ｎ個の演算を並列して実行できるものとしたとき、前記したレジスタにはｂをａで割った商であるｄ個（ｄは自然数）のａビットデータを一括して読み込むことが可能なことにより、前記した演算器にまずインパルス応答の配列ｈ（ｉ）〜ｈ（ｉ＋（ｄ−１））を読み込み、次にインパルス応答の配列ｈ（ｊ）〜ｈ（ｊ＋（ｄ−１））を読み込む読込処理と、（ニ−３）次のｎ個の演算式を１処理周期で並列して、それぞれ積算結果を求める演算処理と、
ｈ（ｉ）×ｈ（ｊ）
ｈ（ｉ＋１）×ｈ（ｊ＋１）
ｈ（ｉ＋２）×ｈ（ｊ＋２）
……
ｈ（ｉ＋（ｎ−１））×ｈ（ｊ＋（ｎ−１））
（ニ−４）前記した“ｉ”の値および前記した“ｊ”の値に“ｎ”の値をそれぞれ加算して、前記した演算器の前記したｃ個のレジスタファイルに格納する積算結果の個数がｂを２ａで割った商にｃを掛けたレジスタファイル満杯値に達するまで前記した演算手段による演算を繰り返す演算繰り返し制御処理と、（ニ−５）この演算繰り返し制御手段によって前記したレジスタファイル満杯値に到達したとき、これら全積算結果を前記したｃ個のレジスタファイルに連続で格納するレジスタファイル格納制御処理とを実行させ、
前記した演算要素拡張処理は、前記した下三角テプリッツ畳み込み行列がインパルス応答の要素数をＮとするとき、Ｎ行Ｎ列の対称行列である性質を利用して、自己相関行列Φ（ｉ，ｊ）を算出するにあたり全要素ｎの二乗個ではなく、「ｎ×（ｎ＋１）／２」個を計算し、要素Φ（ｉ，ｊ）と要素Φ（ｊ，ｉ）が等しいものである関係を利用して、計算を行っていない要素部分を補間する処理であることを特徴としている。
【００７０】
更にまた本発明では、コンピュータに、（イ）音声入力手段で入力された音声を代数符号励振線形予測による音声符号化アルゴリズムごとに、Ｌ個（Ｌは自然数）の標本を持つサブフレームｑ個（ｑは自然数）を持つ、所定の単位時間長としてのフレームに分割する分割処理と、（ロ）この分割処理で分割された音声のフレームごとの周波数強度分布を表わすスペクトラム包絡に基づく周波数入出力特性をもち、前記した音声符号化アルゴリズムで音源信号と共に音声合成モデルを構成する合成フィルタを生成する合成フィルタ生成処理と、（ハ）この合成フィルタ生成処理によって生成された合成フィルタに広帯域の周波数成分を均等に含むインパルス信号を入力した際の応答であり、音声符号化アルゴリズムごとに定まるＬ個の要素をもつ前記した合成フィルタのインパルス応答を求めるインパルス応答取得処理と、（ニ）このインパルス応答取得処理によって求めたＬ個の合成フィルタのインパルス応答を、下三角テプリッツ畳み込み行列の要素として、前記した下三角テプリッツ畳み込み行列とその転置行列の積を、その積が対称行列となる性質を利用して作成し、前記した下三角テプリッツ畳み込み行列とその転置行列の自己相関演算を合成フィルタのインパルス応答のＬ個の要素のうちのこれよりも少ないｎ個（ｎは自然数）の要素ずつ取り出して１処理周期あたりｎ個の要素を並列して実行する並列演算で、ｎ個の合成フィルタのインパルス応答の自己相関値を求める並列演算処理と、（ホ）この並列演算処理によるｎ個の要素に関する前記した並列演算を繰り返してＬ×Ｌ個の合成フィルタのインパルス応答の自己相関値を求める繰り返し処理と、（へ）この繰り返し処理によるｎ個の要素に関する前記した並列演算の繰り返しにより、演算対象の合成フィルタのインパルス応答の要素にＬ−１個の仮想的な合成フィルタのインパルス応答要素を追加拡張して前記した並列演算を実施する演算要素拡張処理と、（ト）前記した繰り返し処理と前記した演算要素拡張処理によって求めた自己相関値から、本来のインパルス応答の値を用いて演算した結果を有意なものとして選択する自己相関値選択処理と、（チ）前記したインパルス応答取得処理で求めた合成フィルタのインパルス応答と音声の相互相関値演算を行う相互相関演算処理と、（リ）前記した自己相関値選択処理によって選択した有意な自己相関値と前記した相互相関演算処理によって求めた相互相関値のうちの前記した音声符号化アルゴリズムごとに設定されるサブフレームについての前記したＬ個の要素に応じて定まる所定の個数の要素を選び出して比較演算を行い音声の構成要素の１つである空気の流れによって生じる周期性をもたない無声音成分を表わす情報を求める無声音解析処理とを音声符号化プログラムとして実行させ、
前記した並列演算処理は、（ニ−１）前記したインパルス応答の配列ｈ（ｎ）の前記した下三角テプリッツ畳み込み行列としての自己相関行列Φ（ｉ，ｊ）における“ｉ”と“ｊ”を“０”で初期設定する初期設定処理と、（ニ−２）この演算を実装置環境上で実施するにあたって前記したインパルス応答の配列ｈ（ｉ）が１データ当たりａビット（ａは自然数）の要素の配列であり、演算処理に使用するレジスタのビット幅がｂビット（ｂは自然数）で、全部でｃ個（ｃは自然数）のレジスタファイルが前記した並列演算のための演算器に配置されているものとし、前記した演算器は１処理周期に最大ｎ個の演算を並列して実行できるものとしたとき、前記したレジスタにはｂをａで割った商であるｄ個（ｄは自然数）のａビットデータを一括して読み込むことが可能なことにより、前記した演算器にまずインパルス応答の配列ｈ（ｉ）〜ｈ（ｉ＋（ｄ−１））を読み込み、次にインパルス応答の配列ｈ（ｊ）〜ｈ（ｊ＋（ｄ−１））を読み込む読込処理と、（ニ−３）次のｎ個の演算式を１処理周期で並列して、それぞれ積算結果を求める演算処理と、
ｈ（ｉ）×ｈ（ｊ）
ｈ（ｉ＋１）×ｈ（ｊ＋１）
ｈ（ｉ＋２）×ｈ（ｊ＋２）
……
ｈ（ｉ＋（ｎ−１））×ｈ（ｊ＋（ｎ−１））
（ニ−４）前記した“ｉ”の値および前記した“ｊ”の値に“ｎ”の値をそれぞれ加算して、前記した演算器の前記したｃ個のレジスタファイルに格納する積算結果の個数がｂを２ａで割った商にｃを掛けたレジスタファイル満杯値に達するまで前記した演算手段による演算を繰り返す演算繰り返し制御処理と、（ニ−５）この演算繰り返し制御手段によって前記したレジスタファイル満杯値に到達したとき、これら全積算結果を前記したｃ個のレジスタファイルに連続で格納するレジスタファイル格納制御処理とを実行させ、
前記した演算要素拡張処理は、前記した下三角テプリッツ畳み込み行列がインパルス応答の要素数をＮとするとき、Ｎ行Ｎ列の対称行列である性質を利用して、自己相関行列Φ（ｉ，ｊ）を算出するにあたり全要素ｎの二乗個ではなく、「ｎ×（ｎ＋１）／２」個を計算し、要素Φ（ｉ，ｊ）と要素Φ（ｊ，ｉ）が等しいものである関係を利用して、計算を行っていない要素部分を補間する処理であることを特徴としている。
【００７２】
【実施例】
以下実施例につき本発明を詳細に説明する。
【００７３】
図１は、本発明の一実施例における音声符号化装置を表わしたものである。この音声符号化装置は、フィルタ部位５１と音源部位５２と比較器部位５３で構成されている。フレーム分割された音声信号５４はまずフィルタ部位５１に入力されて線形予測分析および量子化器５５によって線形予測分析が行われて周波数フィルタ特性が求められた後、この周波数フィルタ特性が量子化される。量子化された周波数フィルタ特性は合成フィルタ５６に入力されて入力音声のもつフィルタ特性をもったフィルタが形成される。
【００７４】
音源部位５２は、音声のもつ声の高さに相当する周期成分の信号を生成する音源となる適応コードブック５７と音声のもつ周期成分以外の信号を生成する音源となる雑音コードブック５８をもつ。そしてこれらの音源から出力された信号の振幅を制御するためのゲインコードブック５９と、適応コードブック５７の生成した信号の振幅を調整する適応コードブック信号増幅器６０と、雑音コードブック５８の生成した信号の振幅を調整する雑音コードブック信号増幅器６１をもつ。振幅の調整されたこれらの音源信号は、音源加算器６２によって加算された後、フィルタ部位５１の合成フィルタ５６に入力されて周波数ごとの強弱を付加された合成音声６３として生成される。
【００７５】
比較器部位５３には音声信号５４と合成音声６３が入力されて、加算器６４で加算される。ここで合成信号６３を負の形式で入力することで差分を求める。この差分に対して聴感重み付けフィルタ６５で、聴感上の聞こえにくい周波数成分を強めるとともに、聴感上聞こえやすい周波数成分を弱めるフィルタ処理を行う。この処理によって人の聴感上で冗長な情報を削減し情報量を低減させる。聴感重み付けフィルタ６５で周波数成分の調整された信号は誤差最小化器６６に入力される。誤差最小化器６６は、合成音声６３と音声信号５４の最小二乗誤差を算出することで誤差を求めると共に、音源部位５２のそれぞれのコードブックに対して別の信号出力を実行させる。この処理を繰り返し実行し、誤差が最小となった際のコードブックからの出力信号が目標の信号であるとみなし、この信号に付与されている符号を音声符号化装置の出力として得る。誤差最小化器６６では、図示しないマイクロプロセッサを用いてこの最小二乗誤差を求める演算を行う。そのために合成フィルタ５６がもつ周波数フィルタ特性に基づくインパルス応答の自己相関演算を行う。このインパルス応答の自己相関演算は演算量が膨大であり処理に要する時間が比較的長くなる傾向がある。そこで、この処理を効率良く短時間で実施するために以下の手法を用いる。
【００７６】
図２は、聴感重み付け合成フィルタのインパルス応答ｈ（ｎ）の自己相関行列Φ（ｉ，ｊ）を示している。領域１１は有意な演算結果が設定される領域である。また、領域１２は領域１１の対称行列である領域である。そして領域１３はその後の演算には用いない仮想的な演算結果が設定される領域である。つまり、領域１１と領域１２をまとめて示した範囲１４がその後の演算に必要な中間値であり、範囲１５がその後の演算には用いない仮想的な値である。
【００７７】
図２に示しているＶＡ〜ＶＦの値を以下に具体的に表わす。
ＶＡ＝ｈ（０）² ……（１８）
【００７８】
ＶＢ＝ｈ（１）²＋ｈ（０）² ……（１９）
【００７９】
【数８】
$$VC = \sum_{k=0}^{n-1} h(k)^{2} \quad \cdots (20)$$
【００８０】
ＶＤ＝ｈ（０）×ｈ（１） ……（２１）
【００８１】
【数９】
$$VE = \sum_{k=0}^{n-2} h(k)\, h(k+1) \quad \cdots (22)$$
【００８２】
ＶＦ＝ｈ（０）×ｈ（ｎ−１） ……（２３）
【００８３】
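The way the autocorrelation matrix Φ(i, j) of the impulse response h(n) of the perceptually weighted synthesis filter is obtained is as follows. The matrix formed from the impulse response is an N-by-N lower-triangular Toeplitz convolution matrix, and the autocorrelation matrix Φ(i, j) is an N-by-N symmetric matrix. Using the symmetry, not all N² elements of Φ(i, j) are computed but only “N×(N+1)/2” of them. The elements that are not actually computed are then filled in using the relation that element Φ(i, j) equals element Φ(j, i). In other words, because Φ(i, j) is symmetric, all the elements of regions 11 and 12 are obtained by computing region 11 alone. Each element is computed by accumulating sums of products from the lower right of FIG. 2 in the direction of the arrows. The loop is repeated once per arrow, shifting the value of h(n) that is multiplied each time. The first pass of the loop computes the N elements on the main diagonal. The first term can be expressed by the following equation (24).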
ｈ（０）² ……（２４）
【００８４】
第２項は、次の（２５）式で表わすことができる。
ｈ（０）²＋ｈ（１）² ……（２５）
【００８５】
以降第ｎ項は、次の（２６）式のように表わすことができる。
【数１０】
$$\sum_{k=0}^{n-1} h(k)^{2} \quad \cdots (26)$$
【００８６】
２周目のループ処理では、積算するｈ（ｉ）を１つずらした“３９”項を右下から左上の矢印方向に算出する。２周目のループ処理の第１項は次の（２７）式で表わすことができる。
ｈ（０）×ｈ（１） ……（２７）
【００８７】
２周目のループ処理の第２項は、次の（２８）式で表わすことができる。
ｈ（０）×ｈ（１）＋ｈ（１）×ｈ（２） ……（２８）
【００８８】
そして２周目のループ処理の第ｎ−１項は、次の（２９）式で表わすことができる。
【数１１】
$$\sum_{k=0}^{n-2} h(k)\, h(k+1) \quad \cdots (29)$$
【００８９】
更にｎ周目のループ処理での第１項は、次の（３０）式で表わすことができる。
ｈ（０）×ｈ（ｎ−１） ……（３０）
【００９０】
以上の計算により図２の領域１１の各要素を求めた後、領域１１に含まれる行列の対角線部分に線対称となるように領域１１の各要素を領域１２に複写する。これによって行列Φ（ｉ，ｊ）の全要素を得ることができる。
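The basic procedure just described, accumulating along each diagonal from the lower right and then mirroring across the main diagonal, can be sketched in Python as follows. The function name and the returned product counter are illustrative additions.

```python
def phi_by_diagonals(h):
    """Build Phi by running sums along each diagonal, starting at the
    lower-right end, and mirror each value across the main diagonal.

    Pass p (0-based) uses the products h(k) * h(k + p); the running sum
    after term k is Phi(N-1-k-p, N-1-k).
    """
    n = len(h)
    phi = [[0.0] * n for _ in range(n)]
    products = 0
    for p in range(n):                # one pass of the loop per diagonal
        acc = 0.0
        for k in range(n - p):        # N - p terms on this diagonal
            acc += h[k] * h[k + p]
            products += 1
            j = n - 1 - k
            i = j - p
            phi[i][j] = acc           # upper triangle (region 11)
            phi[j][i] = acc           # mirrored copy (region 12)
    return phi, products
```

For N = 3 the function performs 6 = N(N+1)/2 multiplications, and for N = 40 it would perform 820 instead of 1600.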
【００９１】
以上が行列Φ（ｉ，ｊ）を求める際の基本的な処理になる。これを演算時間を短縮するために以下の処理を行う。
【００９２】
まず、Ｎ行Ｎ列の二次元行列で表わされていた自己相関行列Φ（ｉ，ｊ）を１次元配列ｏｄａ［ｉｎｄｋ］として表わす。この１次元配列ｏｄａ［ｉｎｄｋ］は、次の（３１）式で表わすことができる。
【数１２】

【００９３】
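Taking Recommendation GSM-AMR as a concrete example, the specific values and ranges of the variables are as follows. In GSM-AMR, the speech signal is divided into frames 20 ms long, each frame is further divided into four 5 ms subframes, and coding is performed with the subframe as the unit. The speech signal is sampled at 8 kHz, the rate commonly used in speech signal processing. The number of samples in one subframe is the product of 5 ms and 8 kHz, that is, “40” samples. This number, “40”, is the number of elements in each row and each column of the N-by-N matrix for which the autocorrelation is obtained. The N-by-N matrix is thus a 40-by-40 matrix with “40” × “40” = “1600” elements. Since the one-dimensional array with these 1600 elements is oda[indk], the index indk that numbers its elements ranges from 0 up to but not including 1600. The value p in equation (31) is the quotient when indk is divided by the number of elements in a row, “40”, and the value m in equation (31) is the remainder of the same division.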
【００９４】
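When this one-dimensional array oda[indk] is computed, the range of the impulse response array h(n) of the perceptually weighted synthesis filter is extended, and the computation is carried out including values that are not used afterwards. That is, without the extension, “N−j” products from h(0)×h(j) to h(N−1−j)×h(N−1) would be computed, whereas with the extension N products from h(0)×h(j) to h(N−1)×h(N−1+j) are computed. Specifically, the impulse response array of the perceptually weighted synthesis filter, which ends at h(N−1), is extended so that it ends at h(2×N−2); the number of elements is thus extended from N to “2×N−1”. Although the extended part is used in the computation, the results computed from it are never used, so it is enough to declare the storage; the values of its elements need not be specified.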
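The effect of extending the array can be illustrated with the Python sketch below. The index layout used here, indk = p × N + m with p the offset between the two factors and m the position of the last product accumulated, is an assumption made for the example, since the image of equation (31) is not reproduced in this text; the function names and the `filler` parameter are likewise hypothetical.

```python
def oda_extended(h, filler=0.0):
    """Always accumulate N products per offset p by reading past the end of
    h into N - 1 extension slots whose contents do not matter."""
    n = len(h)
    h_ext = list(h) + [filler] * (n - 1)   # 2N - 1 elements in total
    oda = [0.0] * (n * n)
    for p in range(n):
        acc = 0.0
        for m in range(n):                 # fixed trip count, no end-of-row test
            acc += h_ext[m] * h_ext[m + p]
            oda[p * n + m] = acc
    return oda


def is_significant(indk, n):
    """True when the entry was computed from genuine impulse response
    elements only (region 11); False for the virtual entries (region 13)."""
    p, m = divmod(indk, n)
    return m + p <= n - 1
```

Changing the filler value alters only the entries that `is_significant` rejects, which mirrors the statement above that the extended storage need only be declared. The fixed inner trip count is what allows the products to be issued in uniform groups.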
【００９５】
図３および図４は聴感重み付け合成フィルタのインパルス応答の配列ｈ（ｎ）の自己相関行列を求める処理を示したものである。
【００９６】
まず、聴感重み付け合成フィルタのインパルス応答の配列ｈ（ｎ）の各要素の符号を求め、これを格納する。（図３ステップＳ５１）
【００９７】
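FIG. 5 shows the area in which only the sign bits of the impulse response array h(n) of the perceptually weighted synthesis filter are extracted and stored. As FIG. 5 shows, the sign bits alone are taken from h(n) and stored in order in area sig_n[0] and area sig_n[1]. In sig_n[0] the lower 8 bits are used as the significant area; in sig_n[1] all 32 bits are used as the significant area.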
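One possible way of packing the 40 sign bits into the two areas is sketched below in Python. The text states only that the bits are stored “in order”, so the assignment of elements 0 to 7 to sig_n[0] and elements 8 to 39 to sig_n[1], the least-significant-bit-first ordering, and the helper names are all assumptions.

```python
def pack_sign_bits(h):
    """Pack the sign bit of each of 40 elements into two words:
    sig_n[0] uses its lower 8 bits, sig_n[1] uses all 32 bits."""
    if len(h) != 40:
        raise ValueError("expected 40 impulse response elements")
    sig_n = [0, 0]
    for n, value in enumerate(h):
        bit = 1 if value < 0 else 0      # 1 marks a negative element
        if n < 8:
            sig_n[0] |= bit << n
        else:
            sig_n[1] |= bit << (n - 8)
    return sig_n


def sign_bit(sig_n, n):
    """Read back the sign bit stored for element n."""
    if n < 8:
        return (sig_n[0] >> n) & 1
    return (sig_n[1] >> (n - 8)) & 1
```

The 8 + 32 split accounts for exactly the 40 samples of one subframe.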
【００９８】
聴感重み付け合成フィルタのインパルス応答の配列の各要素を算出する際にｈ（ｉ）の拡張した部分にはみ出す形で、常にＮ個の要素を算出する。図２の領域１３が、このはみ出した部分である。この領域１３は、後続の計算では不要となる。聴感重み付け合成フィルタのインパルス応答の自己相関を求める処理の前半は、各要素間の積算処理である。この積算は、“ｈ（ｉ）×ｈ（ｊ）”を“ｉ”と“ｊ”を漸増させながら実施する。聴感重み付け合成フィルタのインパルス応答の配列の要素数はＮ個なので、“ｉ”の値の範囲は“０”以上で“Ｎ−１”以下である。また、“ｊ”は“ｉ”以上で“Ｎ−１”以下であるが、“ｉ”の値にかかわらず“ｈ（ｉ）×ｈ（ｊ）”の演算をＮ回繰り返すことができるように、聴感重み付け合成フィルタのインパルス応答の配列ｈ（ｉ）に関して“ｊ”の値の範囲を“ｉ”以上で、“ｉ＋Ｎ−１”以下と拡張する。
【００９９】
図３に戻って説明を続ける。聴感重み付け合成フィルタのインパルス応答の配列ｈ（ｎ）を番号付けする値である“ｉ”と“ｊ”を“０”で初期設定する（ステップＳ５２）。
【０１００】
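The impulse response array h(i) of the perceptually weighted synthesis filter is assumed to be an array of 16-bit elements, stored contiguously in memory. The registers that the arithmetic unit of the microprocessor (not shown) uses for computation are assumed to be 128 bits wide, and the arithmetic unit is assumed to execute up to four operations in parallel in one processing cycle. Eight consecutive 16-bit data items can therefore be loaded into a register of the arithmetic unit at once. Accordingly, the first load into the arithmetic unit reads impulse response elements h(i) to h(i+7) (step S53), and the second load reads impulse response elements h(j) to h(j+7) (step S54). The following four expressions (32) to (35) are then executed in parallel in one processing cycle (step S55).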
【０１０１】
ｈ（ｉ）×ｈ（ｊ） ……（３２）
【０１０２】
ｈ（ｉ＋１）×ｈ（ｊ＋１） ……（３３）
【０１０３】
ｈ（ｉ＋２）×ｈ（ｊ＋２） ……（３４）
【０１０４】
ｈ（ｉ＋３）×ｈ（ｊ＋３） ……（３５）
【０１０５】
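Each operation is “16-bit data × 16-bit data”, so storing each result requires a storage area of up to “16 bits + 16 bits” = 32 bits. The storage area for the four results is therefore uniformly “32 bits × 4” wide. Ten such “32 bits × 4” areas are reserved consecutively in ten 128-bit-wide register files, which can exchange data with the registers of the arithmetic unit at high speed. After the four operations have been carried out, “4” is added to the value of “i” and “4” is added to the value of “j” in preparation for the next operation (step S56). The four expressions (32) to (35) are then executed again in parallel in one processing cycle (step S57). This is repeated until it has been done 10 times (step S59: N), with “4” added to “i” and “4” added to “j” at each repetition (step S58). When the 10 repetitions are complete (step S59: Y), a total of 40 products has been obtained. These are stored consecutively in the ten 128-bit register files, four data items in each, 40 data items in all. The products are then used in the addition process. Before the addition process, the index value that numbers the ten register files is initialized to “0” (step S60 in FIG. 4).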
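The multiplication pass of FIG. 3 can be modelled in Python as below. Each inner list stands for one 128-bit register file holding four 32-bit products; the sequential list comprehension only imitates what the arithmetic unit is described as doing in a single cycle. The function name, the `j_start` parameter, and the use of Python lists are assumptions for illustration.

```python
LANES = 4  # products issued together in one processing cycle


def multiply_pass(h_ext, j_start, n=40):
    """Return n // LANES register files, each holding LANES products
    h(i + lane) * h(j + lane), with i starting at 0 and j at j_start.

    h_ext must already contain the virtual extension (2n - 1 elements),
    so no bounds test is needed inside the loop.
    """
    register_file = []
    i, j = 0, j_start
    while len(register_file) < n // LANES:
        register_file.append(
            [h_ext[i + lane] * h_ext[j + lane] for lane in range(LANES)]
        )
        i += LANES
        j += LANES
    return register_file
```

With n = 40 the loop body runs exactly 10 times whatever the value of `j_start`, which is the regularity that the virtual extension of h(n) makes possible.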
【０１０６】
図６は、これまでに実施した４個の並列積算をして求めた結果を用いて行う加算処理の対象となるレジスタと加算処理の流れを示すものである。この加算処理も、４個を並列して実行する。
【０１０７】
図６のレジスタ２１₁〜２１₃は、物理的には同一のレジスタであるが、処理サイクルの進展に伴う内容の変化を表わすためにそれぞれ別の符号を付している。レジスタ２２₁〜２２₃と、レジスタ２３₁〜２３₃についても同様である。また、各レジスタは１２８ビット幅で、それぞれ３２ビットの４個のレジスタ要素で構成されている。たとえばレジスタ２１₁はレジスタ要素３１₁〜３１₄で構成されており、レジスタ２２₁はレジスタ要素３２₁〜３２₄で構成されている。また、レジスタ２３₁は、レジスタ要素３３₁〜３３₄で構成されている。これらの各レジスタ要素についても、処理サイクルの進展に伴う内容の変化を表わすために、物理的には同一のレジスタ要素であるが処理サイクルごとに別の符号を付している。たとえば、レジスタ要素３１₁とレジスタ要素３１₅とレジスタ要素３１₉は物理的には同一であり、同様にレジスタ要素３２₁とレジスタ要素３２₅とレジスタ要素３２₉は物理的に同一である。他のレジスタ要素についても同様である。
【０１０８】
図４とともに加算処理の流れを説明する。加算を行う第１の処理サイクルでは、まずレジスタ２１₁〜２１₃の全レジスタ要素に対して、“０”を設定して初期化する（図４ステップＳ６０）。そしてレジスタ２１₁に積算結果の格納されているレジスタファイルをインデクス値で番号付けして求めた４個の積算結果を書き込む（図４ステップＳ６１）。具体例を示すと、レジスタ要素３１₄に“ｈ（ｉ）×ｈ（ｊ）”を書き込み、レジスタ要素３１₃に“ｈ（ｉ＋１）×ｈ（ｊ＋１）”を書き込む。更に、レジスタ要素３１₂に“ｈ（ｉ＋２）×ｈ（ｊ＋２）”を書き込み、レジスタ要素３１₁に“ｈ（ｉ＋３）×ｈ（ｊ＋３）”を書き込む。この書き込みが完了した後に、積算結果の格納されているレジスタファイルのインデクス値に“１”を加算する（図４ステップＳ６２）。この、加算結果の書き込みは、積算結果を格納した１０個のレジスタファイルに順に上書きするものとする。このようにすることで、他のメモリアクセスをせずに３２ビットデータを４０個、後続の演算に引き継ぐ。
【０１０９】
加算を行う第２のサイクルでは、加算を行う第１のサイクルで設定したレジスタの内容を使用して加算を行う。レジスタ２１₁と２２₁のそれぞれの内容を加算してレジスタ２２₂に書き込む。また、レジスタ２１₂には加算を行う第１の処理サイクルでレジスタ２１₁に設定した次の積算結果である３２ビットデータを４個、１２８ビット分まとめて書き込む。レジスタ２１₁のレジスタ要素３１₁とレジスタ２１₁のレジスタ要素３１₂とレジスタ２１₁のレジスタ要素３１₃を加算した結果をレジスタ２２₂のレジスタ要素３２₅に設定する。またこれと並列して、レジスタ２１₁のレジスタ要素３１₂とレジスタ２１₁のレジスタ要素３１₃を加算した結果をレジスタ２２₂のレジスタ要素３２₆に設定する。またこれと並列して、レジスタ２１₁のレジスタ要素３１₃をレジスタ２２₂のレジスタ要素３２₇に設定する。また、これと並列して、レジスタ２１₁のレジスタ要素３１₄とレジスタ２２₁のレジスタ要素３２₁とレジスタ２２₁のレジスタ要素３２₄を加算して、レジスタ２２₂のレジスタ要素３２₈に設定する。更にこれらと平行して、レジスタ２１₂のレジスタ要素３１₅、３１₆、３１₇、３１₈に対して、前のサイクルでレジスタ２１₁に設定した積算結果の次の積算結果データ４個をレジスタ２１ ₂ に設定する（図４ステップＳ６３）。この４個の積算結果は、レジスタファイルを現在のインデクス値で番号付けして求める。この書き込みが完了した後に、積算結果の格納されているレジスタファイルのインデクス値に“１”を加算する（図４ステップＳ６４）。
【０１１０】
加算を行う第３の処理サイクルでは、加算を行う第２のサイクルで設定したレジスタの内容を使用して加算を行う。レジスタ２１₂と２２₂のそれぞれの内容を加算してレジスタ２２₃に書き込む。また、レジスタ２１₃には加算を行う第２の処理サイクルでレジスタ２１₂に設定した次の積算結果である３２ビットデータを４個、１２８ビット分まとめて書き込む。レジスタ２２₂のレジスタ要素３２₅とレジスタ２２₂のレジスタ要素３２₈を加算して、レジスタ２３₃のレジスタ要素３３₉に設定する。またこれと並列して、レジスタ２２₂のレジスタ要素３２₆とレジスタ２２₂のレジスタ要素３２₈を加算して、レジスタ２３₃のレジスタ要素３３₁₀に設定する。またこれと並列して、レジスタ２２₂のレジスタ要素３２₇とレジスタ２２₂のレジスタ要素３２₈を加算して、レジスタ２３₃のレジスタ要素３３₁₁に設定する。またこれと並列して、レジスタ２２₂のレジスタ要素３２₈を、レジスタ２３₃のレジスタ要素３３₁₂に設定する。またこれと並列して、レジスタ２１₂のレジスタ要素３１₅とレジスタ２１₂のレジスタ要素３１₆とレジスタ２１₂のレジスタ要素３１₇を加算した結果をレジスタ２２₃のレジスタ要素３２₉に設定する。またこれと並列して、レジスタ２１₂のレジスタ要素３１₆とレジスタ２１₂のレジスタ要素３１₇を加算した結果をレジスタ２２₃のレジスタ要素３２₁₀に設定する。またこれと並列して、レジスタ２１₂のレジスタ要素３１₇をレジスタ２２₃のレジスタ要素３２₁₁に設定する。また、これと並列して、レジスタ２１₂のレジスタ要素３１₈とレジスタ２２₂のレジスタ要素３２₅とレジスタ２２₂のレジスタ要素３２₈を加算して、レジスタ２２₃のレジスタ要素３２₁₂に設定する。更にこれらと並列して、レジスタ２１₃の３１₉、３１₁₀、３１₁₁、３１₁₂に、前のサイクルでレジスタ２１₂に設定した積算結果の次の格納領域の４個の積算結果を設定する（図４ステップＳ６５）。この４個の積算結果は、レジスタファイルを現在のインデクス値で番号付けして求める。この書き込みが完了した後に、積算結果の格納されているレジスタファイルのインデクス値に“１”を加算する（図４ステップＳ６６）。そしてレジスタ２３₃の値が書き込まれる１２８ビット幅のレジスタファイルへの書き出し先を１レジスタファイル分進める。
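[0111]
This third processing cycle of the addition is repeated nine more times, for a total of ten times (step S67 in FIG. 4: N). As a result, the addition processing is performed on all of the “10 × 4 data items” obtained in the multiplication processing. By adding the multiplication results of the flowchart shown in FIG. 3 according to the flowchart shown in FIG. 4 in this way, a total of twelve cycles of addition processing are performed, and the forty required sum-of-products results are stored in ten 128-bit register files (step S67: Y).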
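The two addition stages described above can be read as a blocked running sum over the products h(i)×h(j), h(i+1)×h(j+1), and so on. The following is a minimal sketch of that reading in plain Python, not the register-level implementation: the function name and the lane variables `a` to `d` are hypothetical, and the mapping of lanes to the register elements 32₅ to 32₈ and 33₉ to 33₁₂ is an interpretation of the text.

```python
from itertools import accumulate


def running_sums_4lane(products):
    """Running sums of `products`, computed four lanes at a time.

    Stage 1 builds three partial sums that do not depend on earlier
    blocks, plus one lane that folds the carried total into the first
    product. Stage 2 adds that lane to the other three.
    """
    if len(products) % 4:
        raise ValueError("length must be a multiple of 4")
    sums = []
    carry = 0  # total of all products in earlier blocks
    for base in range(0, len(products), 4):
        p0, p1, p2, p3 = products[base:base + 4]
        # Stage 1 (four independent additions).
        a = p1 + p2 + p3
        b = p1 + p2
        c = p1
        d = p0 + carry
        # Stage 2 (four independent additions).
        sums.extend([d, c + d, b + d, a + d])
        carry = a + d  # total through the end of this block
    return sums


if __name__ == "__main__":
    demo = list(range(1, 41))  # stand-in for 40 products
    print(running_sums_4lane(demo) == list(accumulate(demo)))
```

Because stage 1 of one block and stage 2 of the previous block touch different values, the two stages can overlap in a pipeline, which is consistent with ten blocks finishing in twelve cycles.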
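[0112]
Then the signs held in advance in area sig_n[0] and area sig_n[1] are attached to each of these forty sum-of-products results (step S68).
[0113]
The above processing, “repeat the 4-parallel multiplication ten times, then perform twelve cycles of 4-parallel addition, and attach the sign data to the result”, is repeated “40” times, which is the number of elements of the impulse response h(n) of the perceptual weighting synthesis filter (step S69: N). In each repetition, the index value “i” that numbers the impulse response h(n) of the perceptual weighting synthesis filter is set to “0”. At the same time, “1” is added to the index value “j” so that the starting point of “j”, which also numbers the impulse response h(n), advances by “1” with each repetition (step S70). When this repetition has been performed 40 times, all 1600 elements of the autocorrelation matrix Φ(i, j) of the impulse response of the perceptual weighting synthesis filter have been obtained in the form of a one-dimensional array (step S69: Y).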
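[0114]
In this way, the autocorrelation computation is carried out by the parallel operation means of the arithmetic unit. Furthermore, the repetitions of this autocorrelation computation are executed with as few breaks as possible. A break occurs when, in loading impulse response elements of the synthesis filter into the arithmetic unit, the location where those elements are stored has to be calculated. The elements of the impulse response of the synthesis filter are therefore stored contiguously so that their storage locations need not be calculated. However, because a finite number of elements is used repeatedly, it may still be necessary to calculate a location partway through the storage area. The frequency of this calculation is therefore reduced. To that end, the number of impulse response elements of the synthesis filter is virtually increased by L−1, and the autocorrelation computation is performed as if those elements were real operands. This adds extra computation, but the cost is offset by the reduction in the number of autocorrelation operations through parallel computation and in the number of storage-location calculations, so that the overall processing time is shortened.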
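The idea of appending L−1 virtual elements so that every block of operands can be fetched without a boundary calculation can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the virtual elements are assumed to be zeros, the function names are hypothetical, and the per-lag sum stands in for the full matrix computation.

```python
def autocorr_with_virtual_tail(h, lanes=4):
    """Autocorrelation per lag using fixed-size blocks of `lanes` products.

    L-1 virtual elements (assumed here to be zeros) follow the real
    impulse response, so the inner loop never tests an array bound.
    """
    n = len(h)
    if n % lanes:
        raise ValueError("length must be a multiple of the lane count")
    padded = list(h) + [0] * (n - 1)
    result = []
    for lag in range(n):
        total = 0
        for base in range(0, n, lanes):
            # Always `lanes` products; products that reach into the
            # virtual tail contribute zero.
            total += sum(padded[base + t] * padded[base + lag + t]
                         for t in range(lanes))
        result.append(total)
    return result


def autocorr_reference(h):
    """Same quantity with an explicit, lag-dependent loop bound."""
    n = len(h)
    return [sum(h[m] * h[m + lag] for m in range(n - lag))
            for lag in range(n)]


if __name__ == "__main__":
    h = [3, -1, 4, 1, -5, 9, 2, -6]
    print(autocorr_with_virtual_tail(h) == autocorr_reference(h))
```

The padded version performs more multiplications than the reference, which is the extra computation the text mentions; the trade is a uniform inner loop that maps directly onto fixed-width parallel hardware.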
[0115]
Note that, as a matter of course, what is referred to as a frame in this specification includes subframes as well as frames.
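[0116]
[Effects of the invention]
As described above, according to the invention of claim 1 or claim 2, the parallel operation means capable of n parallel operations executes the autocorrelation computation of the impulse response elements of the perceptual weighting synthesis filter, obtained from the input speech, as continuously as possible. This makes effective use of the processing capability of the parallel operation means. Therefore, speech signals can be encoded in a relatively short time even by a processor with a relatively low operating frequency.
[0117]
According to the invention of claim 3, the parallel operation means performs parallel operations at the maximum number that can be performed in one processing cycle. This makes effective use of the processing capability of the arithmetic unit. Therefore, speech signals can be encoded in a relatively short time even by a processor with a relatively low operating frequency.
[0118]
Furthermore, according to the invention of claim 4 or claim 5, parallel operation processing capable of n parallel operations executes the autocorrelation computation of the impulse response elements of the perceptual weighting synthesis filter, obtained from the input speech, as continuously as possible. This makes effective use of the processing capability of the parallel operation processing. Therefore, speech signals can be encoded in a relatively short time even by a processor with a relatively low operating frequency.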
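[Brief description of the drawings]
[FIG. 1] A block diagram showing the main part of a speech coding apparatus using the code-excited linear prediction technique in an embodiment of the present invention.
[FIG. 2] An explanatory diagram showing the autocorrelation matrix of the impulse response of the perceptual weighting synthesis filter of this embodiment.
[FIG. 3] A flowchart showing the flow of the multiplication processing for obtaining the autocorrelation matrix of the impulse response array h(n) of the perceptual weighting synthesis filter of this embodiment.
[FIG. 4] A flowchart showing the flow of the addition processing for obtaining the autocorrelation matrix of the impulse response array h(n) of the perceptual weighting synthesis filter of this embodiment.
[FIG. 5] An explanatory diagram showing the areas that store the sign bits of the impulse response array h(n) of the perceptual weighting synthesis filter of this embodiment.
[FIG. 6] A block diagram showing the registers subject to the parallel addition processing of this embodiment and the addition processing.
[FIG. 7] A block diagram showing the main parts of a speech coding apparatus and a speech decoding apparatus using the code-excited linear prediction technique.
[FIG. 8] A block diagram showing the main part of a speech coding apparatus using the code-excited linear prediction technique.
[FIG. 9] An explanatory diagram representing, as vector signals, the speech signal, the estimated signal searched from the codebook, and the difference between these two signals.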
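[Explanation of symbols]
11 Region (region where significant computation results are set)
12 Region (region that is the symmetric matrix of region 11)
13 Region (region where virtual computation results are set)
14 Range (intermediate values needed for subsequent operations)
15 Range (virtual values)
[0001]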
[Technical field of the invention]
The present invention relates to a speech coding apparatus and a speech coding program, for example ones that encode speech in a telephone terminal or a telephone exchange.
[0002]
[Prior art]
In recent years, with the explosive spread of mobile phones, the amount of voice traffic flowing through transmission paths has increased rapidly. To cope with this growing traffic, communication carriers are expanding their transmission facilities. However, expanding transmission facilities is very costly, so they cannot be expanded without limit.
[0003]
In recent years, in the field of communication, digital communication that transmits and receives digital signals is the mainstream instead of analog communication that transmits and receives analog signals. Digital communication has rapidly become popular because it is superior to analog communication in that it is less susceptible to noise and is easy to multiplex a plurality of signals. This tendency is the same in a telephone system using a telephone, and it is the mainstream to digitize and transmit an analog audio signal.
[0004]
Converting an analog signal to a digital signal requires sampling, quantization, and encoding. Sampling expresses a waveform that is continuous in time by its values at discrete points in time. Quantization approximates each waveform value by one of a finite number of values. Encoding determines how each value is concretely represented, usually in binary.
[0005]
A well-known technique for digitizing an audio signal, which is an analog signal, is linear pulse code modulation (PCM), which quantizes the amplitude of the waveform based on the sampling theorem. Linear PCM quantizes an analog signal in uniform steps and is the same as what is usually called “A/D conversion”.
[0006]
The major points that users demand for a system that uses a telephone terminal to make a call are “connect with anyone at any time” and “be able to talk clearly”. In order to realize “always connect with anyone at any time”, it is necessary to secure a transmission path for a combination of connections between all telephone terminals in advance. However, this is not realistic because the number of combinations becomes enormous and the transmission path is occupied even when a call is not actually made, which greatly impairs the economy. Therefore, a transmission line is not reserved in advance, but an empty transmission line is searched for after connection is required, and this is secured and used. In other words, the transmission path is positioned as a shared resource for a plurality of users. Therefore, a new connection cannot be established when the transmission path is completely occupied by another user, but a new connection can be established if there is a vacancy in the transmission path. That is, if the transmission path occupation amount necessary for establishing each connection can be reduced, a large number of connections can be established simultaneously. Therefore, a technique for reducing the capacity of the transmission line occupied by each connection has been studied.
[0007]
In addition, to realize “a clear conversation”, each party's audio signal must reach the other party with as little change from the original as possible. When an audio signal is digitized for transmission, sampling, quantization, and encoding must be as fine as possible to keep this change small. Finer sampling is achieved by shortening the time interval between samples, that is, by shortening the sampling period. Finer quantization is achieved by reducing the step between the finite approximate values that express the waveform value and increasing the number of such values. In other words, the quantization step must be made smaller, and the range of approximate values, and hence the range of values that the codes representing them can take, must be widened.
[0008]
Shortening the sampling period increases the number of samples per unit time. Reducing the quantization step and widening the range of the code that represents the approximate value increases the number of digits, or bits, of the code. Since the amount of information to be transmitted is proportional to the product of the number of samples per unit time and the number of bits per code, achieving “being able to talk clearly” increases the amount of information to be transmitted. However, this conflicts with “reducing the occupation amount of the transmission line necessary for establishing each connection”, which is needed to realize “connecting with anyone at any time”.
[0009]
Therefore, in order to satisfy these conflicting requirements at the same time, a speech coding technique for reducing the amount of information when transmitting increased speech information has been studied.
[0010]
To reduce the amount of audio information to be transmitted, it is important not to include unnecessary information in the data before speech encoding is performed. If the sampling period is made only as fine as is needed to digitize voice information, the amount of information to be transmitted can be reduced. In general, the upper limit of the frequency band necessary for a telephone call is said to be 4 kilohertz. That is, it suffices to sample at a rate at which signals up to 4 kHz can be reproduced. According to Shannon's sampling theorem, the sampling frequency is then twice that value, 8 kHz.
[0011]
As for quantization, there is a technique in which the amplitude is logarithmically converted and compressed using the statistical property of speech amplitude. As a logarithmic transformation formula, an expression called μ-law or A-law is widely used. With this logarithmic compression, the speech waveform can be finely quantized with the same number of quantization steps. Furthermore, there is a technique called adaptive quantization in which the quantization step width is changed with time in accordance with the characteristics of the change in speech amplitude.
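As an illustration of logarithmic compression, the sketch below applies the continuous μ-law curve before uniform quantization and inverts it afterwards. It is a simplified example: the value μ = 255 and the 8-bit quantizer are assumptions not stated in the text, the function names are hypothetical, and deployed codecs use a segmented approximation of this curve rather than the formula directly.

```python
import math

MU = 255  # assumed value; commonly used for mu-law telephony


def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] on a logarithmic scale."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def quantize(y, bits=8):
    """Uniform quantizer on [-1, 1] with 2**(bits-1) - 1 steps per side."""
    levels = 2 ** (bits - 1) - 1
    return round(y * levels) / levels


if __name__ == "__main__":
    x = 0.01  # a quiet sample
    uniform_error = abs(quantize(x) - x)
    companded_error = abs(mu_law_expand(quantize(mu_law_compress(x))) - x)
    print(uniform_error, companded_error)
```

With the same number of quantization steps, the companded path spends more of them on small amplitudes, which is where speech samples are most often found.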
[0012]
Further, because an audio signal is periodic, there is correlation between adjacent samples as well as between distant samples. Information can be compressed by encoding the difference between adjacent samples, or the prediction difference between a value predicted using the correlation and the actual sample value. A technique that simply transmits the difference is differential quantization (Differential Pulse Code Modulation: DPCM). One form of differential quantization allows the difference to take only the value “0” or “1”; this is called delta modulation (DM). If the difference between the immediately preceding quantized data and the current data exceeds a certain value, “1” is transmitted; otherwise “0” is transmitted. This keeps the amount of difference information required for transmission very small. An example of a technique that transmits a prediction difference is adaptive differential quantization (Adaptive Differential Pulse Code Modulation: ADPCM). Data encoded with ADPCM can be compressed to about half the size of the original data.
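Delta modulation as described here can be sketched in a few lines. The example assumes a threshold of zero (a “1” is sent when the sample is above the running estimate), a fixed step size, and an initial estimate of zero; these choices and the function names are illustrative and are not taken from the text.

```python
def dm_encode(samples, step):
    """One bit per sample: 1 if the sample is above the running estimate."""
    bits = []
    estimate = 0
    for sample in samples:
        bit = 1 if sample > estimate else 0
        # The encoder tracks the same estimate the decoder will rebuild.
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def dm_decode(bits, step):
    """Rebuild the staircase approximation from the bit stream."""
    out = []
    estimate = 0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


if __name__ == "__main__":
    samples = [0, 2, 4, 6, 8, 8, 8, 6, 4]
    bits = dm_encode(samples, step=2)
    print(bits)
    print(dm_decode(bits, step=2))
```

Each sample costs a single bit, at the price of a staircase approximation whose accuracy depends on how well the step size matches the slope of the signal.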
[0013]
There is also a quantization technique that, when digitizing a speech waveform or the frequency component information of speech, expresses a set of several sampled values by a single code instead of quantizing each sampled value individually. This is called vector quantization; information is compressed by grouping several pieces of information and expressing them with one code.
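A minimal sketch of vector quantization follows: a group of values is replaced by the index of the nearest entry in a codebook. The codebook contents, the squared Euclidean distance measure, and the function names are assumptions chosen for illustration.

```python
def vq_encode(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def vq_decode(index, codebook):
    """Look the code back up; the result is an approximation."""
    return codebook[index]


if __name__ == "__main__":
    codebook = [(0, 0), (1, 1), (4, 0), (0, 4)]  # hypothetical entries
    index = vq_encode((0.9, 1.2), codebook)
    print(index, vq_decode(index, codebook))
```

With four entries, two sample values are carried by a 2-bit index, which is the compression the paragraph refers to.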
[0014]
The differential quantization and adaptive differential quantization techniques described so far express the waveform of the audio signal itself as faithfully as possible. In contrast, there is a technique called “analysis-synthesis” in which the way speech is produced is modeled and speech is converted into parameters based on this production model. In communication using speech analysis-synthesis, the speech waveform itself is not transmitted. Instead, information representing the properties of the sound source needed to synthesize speech, and information representing the properties of the frequency filter applied to the output signal of that sound source, are encoded and transmitted. The receiving side decodes the code to recover this information, and then synthesizes and reproduces the audio signal from the sound source information and the frequency filter information.
[0015]
The voice is composed of a voiced sound having a periodicity caused by vibration of the vocal cords and a voiceless sound having no periodicity caused by the flow of air. That is, it is necessary to transmit both sound source information having periodicity and sound source information having no periodicity. The vibration of air generated in the human vocal cords is radiated after passing through a plurality of tubes having different thicknesses such as the throat and mouth. This corresponds to obtaining an output signal by applying a plurality of frequency filters to a signal generated in the vocal cords. Therefore, the signal passing path is considered as a “synthesis filter” obtained by synthesizing a plurality of frequency filters. The property of this synthesis filter can be expressed by the input / output characteristics of the frequency filter for a plurality of frequencies.
[0016]
The fact that a human voice is recognized as a word means that the voice signal is changing over time and the change is felt. In other words, the sound source property and the filter property of the audio signal change with time and are not uniform. Therefore, even in the transmission of audio signals using the analysis / synthesis technique, it is necessary to transmit different information one by one, which has changed over time.
[0017]
One example of speech analysis-synthesis is a technique called multipulse coding. Multipulse coding analyzes the input speech signal, extracts and transmits the spectral envelope information, which indicates the strength of each frequency band making up the input speech signal, together with the sound source information, and synthesizes speech on the receiving side from this information. Spectral envelope information is generally obtained by linear prediction analysis. The sound source information here is the so-called residual signal, obtained by removing the spectral envelope information from the input speech signal, and is represented by a pulse train of several pulses whose amplitudes and positions are free. For example, Japanese Patent Laid-Open No. 4-84200 proposes improving transmission efficiency using this multipulse coding technique. Another example of speech analysis-synthesis is a technique called Code Excited Linear Prediction (CELP).
[0018]
FIG. 7 shows the voice input and output portions of two mobile phone terminals as an example of apparatus using the code-excited linear prediction technique. The outline of the technique is described here. Normally, each mobile phone terminal is equipped with both a speech encoding apparatus that encodes speech and a speech decoding apparatus that decodes the encoded signal into speech. For ease of explanation, however, only the speech encoding apparatus 101 is shown on the transmitting side and only the speech decoding apparatus 102 is shown on the receiving side. On the transmitting side, the voice uttered by the user is picked up by the microphone 103 of the mobile phone terminal, converted into an electric signal, and input to the voice input means 104 of the speech encoding apparatus 101. The voice input means 104 generates a speech signal 105 by dividing the voice at a predetermined period. The divided speech signal 105 is input to the linear prediction analyzer 106, where linear prediction analysis yields a frequency filter characteristic 107 that is input to the synthesis filter 108. Based on the frequency filter characteristic 107, the synthesis filter 108 operates as a frequency filter that reflects the filter characteristic of the input speech signal 105. The synthesis filter 108 also receives a sound source signal 111 from a codebook 109 under the control of a control signal from the perceptual weighting error minimizer 110. The codebook 109 stores various signal patterns that can be used as sound sources, and each stored signal pattern is assigned a code. The input sound source signal 111 becomes a synthesized speech signal 112 in which the frequency filter characteristic 107 is reflected.
The audio signal 105 is also input to the adder 113 and added to the synthesized audio signal 112. At this time, by making the synthesized voice signal 112 a negative input of the adder 113, the output signal from the adder 113 becomes a difference signal 114 between the voice signal 105 and the synthesized voice signal 112. In order to search for the output of the codebook that minimizes the difference signal 114, the perceptual weighting error minimizer 110 requests the codebook 109 to output a new sound source signal 111. By repeating this request, the code 115 assigned to the sound source signal 111 output from the codebook 109 when the difference signal 114 becomes minimum is output to the transmission line 116. At the same time, the frequency filter characteristic 107 output from the linear prediction analyzer 106 is output to the transmission line 116.
[0019]
The speech decoding apparatus 102 on the receiving side includes a codebook 117, a synthesis filter 118, and a post filter 119 that adjusts the strength of each frequency of the synthesized speech. The code 121 received from the transmission line 116 is the same as the code 115 sent by the transmitting side. Likewise, the frequency filter characteristic 122 received from the transmission line 116 is the same as the frequency filter characteristic 107 sent by the transmitting side. The speech decoding apparatus 102 inputs the received code 121 to the codebook 117 and feeds the resulting sound source signal 123 to the synthesis filter 118. The codebook 117 is identical to the codebook 109 on the transmitting side, so the same code yields the same sound source signal. The sound source signal 123 is input to the synthesis filter 118, whose frequency input/output characteristics are adjusted according to the frequency filter characteristic 122 received from the transmission line 116, and synthesized speech 124 is output. The synthesized speech 124 is input to the post filter 119, converted into synthesized speech 125 that is perceptually easier to hear, and output from the speaker 120.
[0020]
Thus, in code excitation linear prediction, various sound source patterns are stored in advance and a code is assigned to each pattern. The speech signal input to the speech coding apparatus is divided at regular time intervals, the frequency spectrum envelope is obtained by linear prediction analysis, and the coefficients of the synthesis filter based on this are obtained. A synthesis filter process using the synthesis filter coefficient as an input parameter is performed on each excitation pattern of the codebook to generate synthesized speech. The synthesized speech and the input speech signal are compared, and the code assigned to the sound source pattern having the smallest error is used as the code corresponding to the input speech signal. The finer and more abundant types of sound source patterns that the codebook has, the more likely it is to select a pattern with a high degree of similarity to the input audio signal, and the higher the reproducibility of the audio. However, creating a huge number of patterns and storing them in the speech encoding apparatus is limited because it leads to an increase in the storage medium of the apparatus and an increase in the amount of comparison processing. Therefore, a technique has been proposed in which this pattern is generated algebraically and various patterns are substituted with a small amount of storage medium. Code-excited linear prediction using this technique is called Algebraic Code Excited Linear Prediction (ACELP) and is used in mobile phone systems. A proposal for realizing this algebraic code-excited linear prediction with a smaller amount of storage medium is shown, for example, in Japanese Translation of PCT International Publication No. 10-502191.
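The search loop summarized above, in which each codebook pattern is passed through the synthesis filter and compared with the input, can be sketched as follows. This is a simplified illustration: the synthesis filter is modeled by a short impulse response `h`, perceptual weighting is omitted, the gain is chosen per pattern by least squares, and all names and data are hypothetical.

```python
def synthesize(h, excitation):
    """Filter `excitation` with impulse response `h`, truncated to the frame."""
    n = len(excitation)
    return [sum(h[k] * excitation[t - k] for k in range(min(t + 1, len(h))))
            for t in range(n)]


def search_codebook(target, h, codebook):
    """Return (index, gain, error) of the pattern with the smallest error."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(h, code)
        energy = sum(v * v for v in y)
        if energy == 0:
            continue  # a silent pattern cannot match anything
        gain = sum(t * v for t, v in zip(target, y)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


if __name__ == "__main__":
    h = [1.0, 0.5, 0.25]
    codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, -1, 0]]
    target = synthesize(h, [0, 2, 0, 0])
    print(search_codebook(target, h, codebook))
```

Only the winning index and gain need to be transmitted, because the decoder holds the same codebook and filter description.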
[0021]
Unless an encoding technique for audio signals and the decoding technique that restores the original are specified in detail as a matched pair, the original audio signal cannot be reproduced correctly. Further, for an encoding technique to spread to many devices, its detailed specifications must be widely known. Encoding techniques are therefore defined in detail by standardization organizations and published as recommendations. Among these, algebraic code-excited linear prediction is recommended by the international standardization organization “ITU” (International Telecommunication Union) as “ITU Recommendation G.723.1 low rate method (5.3 kbit/s)”. It is also recommended as “GSM-AMR” (Global System for Mobile Communications, Adaptive Multi-Rate Speech Transcoding) by “3GPP” (3rd Generation Partnership Project), another international standardization organization. These are coding techniques based on the principle of algebraic code-excited linear prediction, and are known as techniques capable of reproducing high-quality speech while greatly reducing the amount of code.
[0022]
These voice signal encoding techniques are used not only in mobile phone terminals owned by users in a telephone system but also in telephone exchanges that mediate connections between telephone terminals. In a telephone exchange, encoded speech signals are interconverted to establish a call between telephone terminals that employ different speech encoding techniques. Therefore, it is necessary to perform encoding and decoding of speech in order to convert it into a signal that can be handled in common between different speech encoding techniques. Since a telephone exchange is a device that mediates connections between a large number of telephone terminals at the same time, a circuit that performs interconversion of speech coding must be compatible with a large number of telephone terminals at the same time.
[0023]
FIG. 8 shows the main part of a speech encoding apparatus using the code-excited linear prediction technique. An outline of the technique will be described using this speech encoding apparatus. The apparatus includes a filter part 131, a sound source part 132, and a comparator part 133. The audio signal 134, divided into frames, is first input to the filter part 131, where the linear prediction analyzer and quantizer 135 performs linear prediction analysis to obtain a frequency filter characteristic and then quantizes that characteristic. The quantized frequency filter characteristic is input to the synthesis filter 136 to form a filter having the filter characteristic of the input voice.
[0024]
The sound source part 132 has an adaptive codebook 137, a sound source that generates the periodic component corresponding to the pitch of the voice, and a noise codebook 138, a sound source that generates the components of the voice other than the periodic component. It also has a gain codebook 139 for controlling the amplitude of the signals output from these sound sources, an adaptive codebook signal amplifier 140 for adjusting the amplitude of the signal generated by the adaptive codebook 137, and a noise codebook signal amplifier 141 for adjusting the amplitude of the signal generated by the noise codebook 138. The amplitude-adjusted sound source signals are added by the sound source adder 142 and then input to the synthesis filter 136 of the filter part 131, which generates synthesized speech 143 with the strength of each frequency applied.
[0025]
The voice signal 134 and the synthesized speech 143 are input to the comparator part 133 and added by the adder 144. The difference is obtained by inputting the synthesized speech 143 with a negative sign. The auditory weighting filter 145 then applies to this difference a filter process that intensifies frequency components that are difficult to hear in terms of perception and weakens frequency components that are easy to hear in terms of perception. This process reduces information that is redundant for human hearing and so reduces the amount of information. Thereafter, the error minimizer 146 computes the squared error between the synthesized speech 143 and the voice signal 134 and causes each codebook of the sound source part 132 to output another signal. This process is repeated; the codebook output signal for which the error is smallest is taken as the target signal, and the code assigned to this signal becomes the output of the speech encoding apparatus. This process of finding the target output signal from the codebook is called “pattern search”, “pulse search”, “codebook search”, or the like.
[0026]
FIG. 9 represents as vectors the perceptually weighted speech signal, the perceptually weighted synthesized speech signal, and the error signal between them, where perceptual weighting is the per-frequency intensity processing that removes redundant information that is difficult for human hearing to perceive. Here, when the perceptually weighted speech signal 151 is $r$ and the perceptually weighted synthesized speech signal 152 is $GH\nu_\xi$, the error signal 153 can be expressed as $r - GH\nu_\xi$. The principle of the pattern search performed in algebraic code-excited linear prediction will now be described in detail using the description in “ITU Recommendation G.723.1”, which is one example of a recommendation on speech coding. The pattern search computes the mean square error $E_\xi$ between the input speech signal and the synthesized speech signal and determines the pattern with the smallest error as the target pattern. The calculation for obtaining the mean square error is shown in the following equation (1).
[Expression 1]
$$E_\xi = \lVert r - GH\nu_\xi \rVert^2 \qquad (1)$$
[0027]
Here, minimizing the error between the perceptually weighted speech signal $r$ and the perceptually weighted synthesized speech signal $GH\nu_\xi$ is the same as minimizing the mean square error $E_\xi$ of equation (1). Expanding the right side of equation (1) gives the following equation (2).
[0028]
$$E_\xi = r \cdot (r - GH\nu_\xi) - (GH\nu_\xi) \cdot (r - GH\nu_\xi) \qquad (2)$$
[0029]
The error is minimized when the angle between the perceptually weighted synthesized speech signal 152 and the error signal 153, $r - GH\nu_\xi$, in FIG. 9 is a right angle. When two vectors are orthogonal, their inner product is “0”. Therefore, when the error is minimal, the second term of equation (2) is “0”. In $GH\nu_\xi$, $G$ is the codebook gain, corresponding to the gain applied to the sound source signal generated from each pattern of the codebook, and $\nu_\xi$ is the algebraic codeword representing the pattern at index $\xi$. $H$ is a matrix whose elements are the impulse response of the synthesis filter, that is, the response obtained when an impulse signal that uniformly contains wideband frequency components is input to the synthesis filter obtained by applying perceptual weighting to the spectral envelope information from the linear prediction analysis of the speech signal. $H$ holds the impulse response elements h(n) in the form of a lower triangular Toeplitz convolution matrix whose diagonal is h(0) and whose lower diagonals are h(1), ..., h(L−1). Here, n is the index value of each sample when the input audio signal is divided at fixed time intervals and sampled, and L is the number of samples in one such section. A Toeplitz matrix is a matrix in which all elements on each line parallel to the main diagonal are equal. Lower triangular convolution means that elements are set in the triangular area on and below the diagonal of the matrix and “0” is set elsewhere.
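The structure of the lower triangular Toeplitz convolution matrix can be made concrete with a short sketch. The function names and the three-element impulse response are hypothetical; the point shown is only that multiplying by H is the same as convolving with h(n) and truncating to the frame length.

```python
def toeplitz_lower(h):
    """Lower triangular Toeplitz matrix with h[0] on the diagonal."""
    n = len(h)
    return [[h[row - col] if row >= col else 0 for col in range(n)]
            for row in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


if __name__ == "__main__":
    H = toeplitz_lower([1, 2, 3])
    for row in H:
        print(row)
    # A codeword with unit pulses at positions 0 and 2.
    print(matvec(H, [1, 0, 1]))
```

Each pulse in the codeword contributes a shifted copy of the impulse response to the filtered output, which is why sparse codewords keep the search inexpensive.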
[0030]
Based on the principle that the inner product of orthogonal vectors is “0”, the second term of equation (2) is set to “0”. Furthermore, the expression is transformed so as to minimize the mean square error $E_\xi$. Differentiating equation (2) with respect to the codebook gain $G$, which is the only parameter that can be varied, and setting the derivative to “0” gives the following equation (3).
[Expression 2]
$$G = \frac{r^{T} H \nu_\xi}{\nu_\xi^{T} H^{T} H \nu_\xi} \qquad (3)$$
[0031]
Substituting this into equation (1), the mean square error $E_\xi$ is given by the following equation (4).
[Equation 3]
$$E_\xi = r^{T} r - \frac{(r^{T} H \nu_\xi)^2}{\nu_\xi^{T} H^{T} H \nu_\xi} \qquad (4)$$
[0032]
This mean square error $E_\xi$ is minimized by maximizing the second term of equation (4). Denoting the second term of equation (4) by $\tau_\xi$ gives the following equation (5).
[Expression 4]
$$\tau_\xi = \frac{C_\xi^2}{\varepsilon_\xi} = \frac{(d^{T} \nu_\xi)^2}{\nu_\xi^{T} \Phi \nu_\xi} \qquad (5)$$
[0033]
Here, the following equation (6) was used to obtain the equation (5) from the equation (4). Here, d is a vector representing the correlation between the perceptually weighted audio signal and the impulse response of the perceptually weighted synthesis filter.
[0034]
$$d = H^{T} r \qquad (6)$$
[0035]
Further, the following equation (7) is used to obtain the equation (5) from the equation (4). Φ is a covariance matrix of the impulse response of the perceptual weighting synthesis filter.
[0036]
$$\Phi = H^{T} H \qquad (7)$$
[0037]
Then, the vector d and the matrix Φ are obtained by calculation prior to the code book search. Each element of the vector d can be expressed by the following equation (8).
[Equation 5]
$$d(j) = \sum_{n=j}^{N-1} r(n)\, h(n-j) \qquad (8)$$
[0038]
Here, “j” is at least “0” and at most “N−1”. “N” is the number of impulse response elements of the perceptual weighting synthesis filter, which equals the number of samples per frame when the speech is divided into frames. “n”, “i”, and “j” are index values that number the elements of the impulse response of the perceptual weighting synthesis filter, the elements of the sampled speech, and the elements of the covariance matrix of the impulse response of the perceptual weighting synthesis filter.
[0039]
The matrix Φ (i, j) can be expressed by the following equation (9).
[Formula 6]
$$\Phi(i, j) = \sum_{n=j}^{N-1} h(n-i)\, h(n-j) \qquad (9)$$
[0040]
Here, i is “0” or more and “N−1” or less, and i is j or less.
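Assuming that equations (8) and (9) are the element-wise expansions of d = HᵀR and Φ = HᵀH for a lower triangular Toeplitz H, which is what equations (6) and (7) imply, the vector d and the matrix Φ can be computed directly from h(n) without forming H. The sketch below is an illustration with hypothetical function names and a three-sample frame.

```python
def correlation_vector(r, h):
    """d(j) = sum over n = j..N-1 of r(n) * h(n - j)."""
    n = len(h)
    return [sum(r[k] * h[k - j] for k in range(j, n)) for j in range(n)]


def covariance_matrix(h):
    """Phi(i, j) = sum over n = j..N-1 of h(n - i) * h(n - j) for i <= j.

    The lower half is filled in by symmetry.
    """
    n = len(h)
    phi = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = sum(h[k - i] * h[k - j] for k in range(j, n))
            phi[i][j] = phi[j][i] = value
    return phi


if __name__ == "__main__":
    h = [1, 2, 3]
    print(correlation_vector([1, 0, 2], h))
    for row in covariance_matrix(h):
        print(row)
```

Only the elements with i ≤ j need to be summed; the remaining half follows from symmetry.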
[0041]
In the example of “ITU recommendation G723.1”, the pattern search searches for the position and polarity of four pulses. For this purpose, quadruple loop processing corresponding to each pulse position is executed. Then, a new pulse contribution is added by each loop process. Therefore, the correlation C in the equation (5) can be expressed by the following equation (10).
[0042]
$$C = \alpha_0 d[m_0] + \alpha_1 d[m_1] + \alpha_2 d[m_2] + \alpha_3 d[m_3] \qquad (10)$$
[0043]
Here, $m_k$ indicates the position of the k-th pulse and $\alpha_k$ indicates the polarity of the k-th pulse. The vector d is the value obtained by equation (8).
[0044]
The pulse searched for here is a so-called “non-zero pulse”, meaning that its amplitude is not zero. The number of non-zero pulses per subframe is fixed for each type of algebraic code-excited linear prediction technique. The optimum position is then determined by calculation from among several candidates for each pulse position. For the lower of the two bit rates defined in Recommendation G.723.1, taken here as an example, the number of non-zero pulses is at most four, as indicated by the four terms of equation (10), each of which gives the position of one pulse. In Recommendation G.723.1, the frame length is 30 milliseconds, the subframe length obtained by dividing the frame into four is 7.5 milliseconds, and the sampling frequency is 8 kHz, so the number of samples in one subframe is “60”. To reduce the amount of calculation, only even-numbered positions are considered as candidates for non-zero pulse positions. The amount of calculation is greatly reduced by not calculating anything other than the non-zero pulse portions. There are 30 even numbers among the “60” samples, and four pulses are selected from them, so each pulse is selected from eight locations. That is, for each of the four non-zero pulses, the optimum location among eight position candidates is selected by calculation. However, four times “8 locations” is “32 locations”, which exceeds “30” by “2”. In that case the third and fourth pulses may not fit within the frame, and as a result the number of pulses may decrease, so that the number of non-zero pulses is less than four. As another example, the GSM-AMR recommendation has eight transmission rates (bit rates) and six patterns for the number of non-zero pulses, some of which overlap.
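The arithmetic in this paragraph can be checked with a few lines of code. The function name and dictionary keys are hypothetical; the numbers are the ones given in the text.

```python
def subframe_layout(frame_ms=30, subframes=4, sample_rate_hz=8000,
                    pulses=4, candidates_per_pulse=8):
    """Derive the sample and candidate counts quoted in the text."""
    samples = sample_rate_hz * frame_ms // 1000 // subframes
    even_positions = list(range(0, samples, 2))
    slots = pulses * candidates_per_pulse
    return {
        "samples_per_subframe": samples,
        "even_positions": len(even_positions),
        "candidate_slots": slots,
        "slots_outside_subframe": slots - len(even_positions),
    }


if __name__ == "__main__":
    for key, value in subframe_layout().items():
        print(key, value)
```

The two surplus candidate slots are the ones that can fall outside the subframe, which is why fewer than four pulses may remain.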
[0045]
The energy of the vector of the pattern having a pulse at the even pulse position in the equation (5) is obtained. This energy ε can be expressed by the following equation (11).
[0046]
ε = Φ(m₀, m₀)
+ Φ(m₁, m₁) + 2α₀α₁Φ(m₀, m₁)
+ Φ(m₂, m₂) + 2[α₀α₂Φ(m₀, m₂) + α₁α₂Φ(m₁, m₂)]
+ Φ(m₃, m₃) + 2[α₀α₃Φ(m₀, m₃) + α₁α₃Φ(m₁, m₃) + α₂α₃Φ(m₂, m₃)] …… (11)
[0047]
To reduce the amount of calculation, the calculation is performed on the assumption that pulses exist at even-numbered positions, and the vector of a pattern having pulses at odd-numbered positions is obtained by approximation. To calculate the vector of a pattern having pulses at odd-numbered positions, a new quantity s[j] is defined and applied to the vector d[j] and the symmetric matrix Φ(m₁, m₂) to form a vector d′[j].
[0048]
s[2j] = s[2j + 1] = sign(d[2j])       if |d[2j]| > |d[2j + 1]|
s[2j] = s[2j + 1] = sign(d[2j + 1])   otherwise …… (12)
[0049]
d′[j] = d[j] × s[j] …… (13)
Φ′(i, j) = s[i] × s[j] × Φ(i, j) …… (14)
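Equations (12) to (14) can be illustrated with a few lines of code. This is a hedged sketch rather than the implementation of any recommendation: the names `sign` and `fold_signs` are hypothetical, the input length is assumed to be even, and treating zero as positive in `sign` is an assumption made only for this example.

```python
def sign(x):
    # Assumption for this sketch: zero is treated as positive.
    return 1 if x >= 0 else -1

def fold_signs(d, phi):
    """Apply equations (12)-(14) and return (s, d_prime, phi_prime)."""
    n = len(d)  # assumed to be even
    s = [0] * n
    for j in range(0, n - 1, 2):
        # Equation (12): each even/odd pair shares the sign of whichever
        # element of the pair has the larger magnitude.
        dominant = d[j] if abs(d[j]) > abs(d[j + 1]) else d[j + 1]
        s[j] = s[j + 1] = sign(dominant)
    # Equation (13): d'[j] = d[j] * s[j]
    d_prime = [d[j] * s[j] for j in range(n)]
    # Equation (14): Phi'(i, j) = s[i] * s[j] * Phi(i, j)
    phi_prime = [[s[i] * s[j] * phi[i][j] for j in range(n)]
                 for i in range(n)]
    return s, d_prime, phi_prime
```

Because s[i] × s[j] is symmetric in i and j, Φ′ remains symmetric whenever Φ is.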
[0050]
Here, the vector d′[j] of equation (13) and the symmetric matrix Φ′(i, j) of equation (14) are obtained from the vector d[j] and the symmetric matrix Φ(i, j) by using s[j] to absorb the polarity of the pulses. Consequently, the polarity α of every pulse in equations (10) and (11) can be regarded as "1", and equations (10) and (11) can be rewritten as the following equations (15) and (16).
[0051]
C = d′[m₀] + d′[m₁] + d′[m₂] + d′[m₃] …… (15)
ε = Φ′(m₀, m₀)
+ Φ′(m₁, m₁) + 2Φ′(m₀, m₁)
+ Φ′(m₂, m₂) + 2[Φ′(m₀, m₂) + Φ′(m₁, m₂)]
+ Φ′(m₃, m₃) + 2[Φ′(m₀, m₃) + Φ′(m₁, m₃) + Φ′(m₂, m₃)] …… (16)
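As a check on this rewriting, the sketch below evaluates the correlation and the energy once with explicit polarities, as in equations (10) and (11), and once with the polarities folded into d′ and Φ′, as in equations (15) and (16), taking α_k = s[m_k]. The data values and the function name `correlation_and_energy` are hypothetical and chosen only for illustration; the matrix used is merely symmetric and is not a real autocorrelation matrix.

```python
from itertools import combinations

def correlation_and_energy(d, phi, positions, polarities):
    """Evaluate C and epsilon for pulses at `positions` with signs `polarities`."""
    c = sum(a * d[m] for a, m in zip(polarities, positions))
    e = sum(phi[m][m] for m in positions)
    # Cross terms: 2 * alpha_k * alpha_l * Phi(m_k, m_l) for every pair k < l.
    for (a_k, m_k), (a_l, m_l) in combinations(list(zip(polarities, positions)), 2):
        e += 2 * a_k * a_l * phi[m_k][m_l]
    return c, e

# Hypothetical data: eight samples, a symmetric matrix, pulses at even positions.
d = [3, -5, 2, -1, 4, 6, -7, 2]
phi = [[10 - abs(i - j) for j in range(8)] for i in range(8)]
s = [-1, -1, 1, 1, 1, 1, -1, -1]          # result of equation (12) for this d
positions = [0, 2, 4, 6]

# Equations (10) and (11) with alpha_k = s[m_k].
c_signed, e_signed = correlation_and_energy(d, phi, positions,
                                            [s[m] for m in positions])

# Equations (15) and (16): polarities folded into d' and Phi', all alpha = 1.
d_prime = [d[j] * s[j] for j in range(8)]
phi_prime = [[s[i] * s[j] * phi[i][j] for j in range(8)] for i in range(8)]
c_folded, e_folded = correlation_and_energy(d_prime, phi_prime, positions,
                                            [1, 1, 1, 1])

print((c_signed, e_signed) == (c_folded, e_folded))
```

The two evaluations agree because s[m]² = 1 on the diagonal terms and s[m_k] × s[m_l] reproduces α_k × α_l on the cross terms.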
[0052]
The energy ε obtained here and the correlation C obtained by equation (15) are substituted for the energy ε_ξ and the correlation C_ξ of equation (5), and the principle of the pattern search is to find the pulse positions at which the correlation-to-energy ratio τ_ξ is at its minimum. To perform a pattern search according to this principle, the algebraic code-excited linear prediction technique calculates a two-dimensional matrix Φ(i, j) of N rows and N columns by a correlation operation between the lower triangular Toeplitz convolution matrix H, built from the impulse response h(n) of the perceptual weighting synthesis filter, and the transpose of H.
[0053]
The speech encoding process will now be described with specific numerical values, taking as an example the GSM-AMR recommendation, which uses algebraic code-excited linear prediction. In GSM-AMR, the unit for encoding a speech signal is a subframe with a time width of 5 milliseconds, and a frame has a time width of 20 milliseconds and consists of four subframes. Because the sampling frequency is 8 kHz, the number of samples N in one subframe is 40. The pattern search is performed on these 40 samples.
[0054]
In the pattern search, a correlation operation is first performed between the lower triangular Toeplitz convolution matrix H, built from the impulse response h(n) of the perceptual weighting synthesis filter of the encoder, and the transpose of H, yielding a two-dimensional matrix Φ(i, j) of N rows and N columns. Because the number of samples N is 40, Φ(i, j) has 40 × 40 = 1600 elements. The value n used to number the 40 elements ranges from 0 to 39.
[0055]
The two-dimensional matrix Φ (i, j) can be expressed by the following equation (17) by reflecting the range of the value “n” in the equation (9).
[Expression 7]

[0056]
Here, j is greater than or equal to i, and i ranges from 0 to 39.
[0057]
These operations would have to be performed for every element of the N-row, N-column two-dimensional matrix Φ(i, j). However, because Φ(i, j) is the product of the lower triangular Toeplitz convolution matrix and its transpose, it is a symmetric matrix, and this property is used to compute it with N(N + 1)/2 operations rather than N². As this shows, various proposals have been made for reducing the amount of calculation in the pattern search.
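The saving obtained from symmetry can be illustrated as follows. This is a minimal sketch under stated assumptions: the element formula Φ(i, j) = Σ h(n − i) h(n − j), summed over n from j to N − 1 for j ≥ i, is the form that follows from multiplying the transpose of H by H and is assumed here because equation (17) is not reproduced in the text; the function names and the test data are hypothetical.

```python
def toeplitz_lower(h):
    """Lower triangular Toeplitz convolution matrix H with H[r][c] = h[r - c]."""
    n = len(h)
    return [[h[r - c] if r >= c else 0 for c in range(n)] for r in range(n)]

def autocorr_full(h):
    """Phi as the product of the transpose of H and H, all N * N elements."""
    n = len(h)
    H = toeplitz_lower(h)
    return [[sum(H[r][i] * H[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]

def autocorr_symmetric(h):
    """Compute only the elements with j >= i and mirror the rest."""
    n = len(h)
    phi = [[0] * n for _ in range(n)]
    computed = 0
    for i in range(n):
        for j in range(i, n):
            phi[i][j] = sum(h[k - i] * h[k - j] for k in range(j, n))
            phi[j][i] = phi[i][j]   # Phi(j, i) = Phi(i, j)
            computed += 1
    return phi, computed

# Hypothetical 40-element impulse response (N = 40 as in the GSM-AMR example).
h = [(7 * k) % 11 - 5 for k in range(40)]
phi_sym, computed = autocorr_symmetric(h)
print(phi_sym == autocorr_full(h), computed, len(h) ** 2)
```

For N = 40 the symmetric version evaluates 820 elements instead of 1600 and produces the same matrix.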
[0058]
For example, in Japanese Patent Laid-Open No. 11-327599, a speech coding apparatus using the code-excited linear prediction technique shortens the impulse response matrix of the perceptual weighting synthesis filter, which is used when searching the code book for a pattern, to a length of one half or less of the period of the periodic component of the speech. That is, the amount of correlation calculation for the impulse response matrix is reduced by shortening the matrix, and the omitted portion is replaced by an approximation. However, this approximation may give a result different from that of a correlation operation performed without approximation, which may degrade voice quality, so accuracy cannot be guaranteed.
[0059]
Further, in Japanese Patent Laid-Open No. 03-189700, a speech coding apparatus using the code-excited linear prediction technique calculates directly only the initial element of the correlation matrix of the impulse response matrix of the perceptual weighting synthesis filter used for the pattern search of the code book. The other elements are obtained recursively from the element obtained immediately before. This reduces the amount of calculation by eliminating duplicated calculation. However, this recursive operation may give a result different from that of a correlation operation performed without recursion, which may degrade voice quality, so accuracy cannot be guaranteed.
[0060]
[Problems to be solved by the invention]
As described above, a speech coding apparatus using code-excited linear prediction technology can reproduce high-quality speech with a small amount of transmission information. However, a large amount of computations must be performed in a short time in order to obtain information suitable for extracting and expressing the features of speech from the codebook. In particular, in a concentrator such as a telephone exchange that performs voice encoding processing for a large number of telephone terminals, this large amount of calculations must be performed for a larger number of telephone terminals. Therefore, in order to cover this, it has been necessary to provide a large number of processors having a high operating frequency or a processor having a higher operating frequency.
[0061]
Accordingly, an object of the present invention is to provide a speech encoding apparatus and a speech encoding program that can encode speech signals in a relatively short time even with a processor having a relatively low operating frequency.
[0062]
[Means for Solving the Problems]
The present invention provides a speech encoding apparatus comprising:
(a) voice input means for inputting a voice uttered by a caller;
(b) dividing means for dividing the voice input by the voice input means into frames of a predetermined unit time length, each frame consisting of q subframes (q is a natural number) and each subframe having L samples (L is a natural number), as determined for each speech encoding algorithm based on algebraic code-excited linear prediction;
(c) synthesis filter generating means for generating, for each frame of the voice divided by the dividing means, a synthesis filter that has frequency input/output characteristics based on a spectral envelope representing the frequency intensity distribution and that, together with a sound source signal, constitutes the speech synthesis model of the speech encoding algorithm;
(d) impulse response acquisition means for obtaining the impulse response of the synthesis filter, which is the response produced when an impulse signal containing wideband frequency components uniformly is input to the synthesis filter generated by the synthesis filter generating means and which has L elements as determined for each speech encoding algorithm;
(e) parallel calculation means for taking the L impulse response elements obtained by the impulse response acquisition means as the elements of a lower triangular Toeplitz convolution matrix and, using the property that the product of this matrix and its transpose is a symmetric matrix, performing the autocorrelation operation between the lower triangular Toeplitz convolution matrix and its transpose as a parallel operation in which n elements (n is a natural number) are taken from the L impulse response elements and n operations are executed in parallel per processing cycle, thereby obtaining the autocorrelation values for n impulse response elements of the synthesis filter;
(f) repetition means for repeating the parallel operation on n elements by the parallel calculation means to obtain the L × L autocorrelation values of the impulse response of the synthesis filter;
(g) element expansion means for virtually expanding the impulse response elements of the synthesis filter to a number exceeding L and performing the autocorrelation operation by the parallel operation;
(h) autocorrelation value selection means for selecting as significant, from the autocorrelation values obtained by the repetition means and the element expansion means, the results calculated using the values of the original impulse response;
(i) cross-correlation calculating means for calculating the cross-correlation value between the voice and the impulse response of the synthesis filter obtained by the impulse response acquisition means;
(j) unvoiced sound analysis means for selecting, from the significant autocorrelation values selected by the autocorrelation value selection means and the cross-correlation values obtained by the cross-correlation calculating means, a predetermined number of elements determined according to the L elements of the subframe set for each speech encoding algorithm, and performing a comparison operation on them to obtain information representing the unvoiced sound component, which is one of the constituent elements of voice, is generated by the flow of air, and has no periodicity; and
(k) output means for outputting the information representing the unvoiced sound component to the other party of the caller as an element of the non-periodic component of the voice.
In this speech encoding apparatus, the parallel calculation means comprises:
(e-1) initial setting means for initializing to "0" the indices i and j of the autocorrelation matrix Φ(i, j) formed from the lower triangular Toeplitz convolution matrix of the impulse response array h(n);
(e-2) read processing means which, given that in an actual device environment the impulse response array h(i) is an array of elements of a bits each (a is a natural number), the register used for arithmetic processing is b bits wide (b is a natural number), a total of c register files (c is a natural number) are arranged in the arithmetic unit for the parallel operation, and the arithmetic unit can execute at most n operations in parallel in one processing cycle, so that d items of a-bit data can be read into the register at once, d (a natural number) being the quotient of b divided by a, first reads the impulse response array elements h(i) to h(i + (d − 1)) into the arithmetic unit and then reads the impulse response array elements h(j) to h(j + (d − 1));
(e-3) arithmetic means for executing the following n expressions in parallel in one processing cycle and obtaining the accumulated result,
h (i) × h (j)
h (i + 1) × h (j + 1)
h (i + 2) × h (j + 2)
......
h (i + (n−1)) × h (j + (n−1))
(e-4) calculation repeat control means for adding the value of n to each of the values of i and j and repeating the calculation by the arithmetic means until the number of accumulated results stored in the c register files of the arithmetic unit reaches the register-file-full value, which is obtained by multiplying by c the quotient of b divided by 2a; and
(e-5) register file storage control means for storing all of these accumulated results contiguously in the c register files when the register-file-full value is reached under the calculation repeat control means.
The element expansion means uses the property that, when the number of impulse response elements is N, the autocorrelation matrix Φ(i, j) obtained from the lower triangular Toeplitz convolution matrix is a symmetric matrix of N rows and N columns, calculates Φ(i, j) with N × (N + 1) / 2 operations instead of the square of the number of elements N, and interpolates the elements that have not been calculated by using the relationship that Φ(i, j) and Φ(j, i) are equal.
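The multiply-accumulate pattern h(i) × h(j), h(i + 1) × h(j + 1), …, h(i + (n − 1)) × h(j + (n − 1)) and the role of the virtual elements can be sketched in software as follows. This is only an illustration under explicit assumptions: the virtual elements are taken to be zeros (the text states only that results computed from original values are selected as significant), n − 1 virtual elements are appended because that is enough for this loop, the bit widths are hypothetical, and a Python `sum` over n products stands in for one parallel processing cycle. The function names are not taken from the specification.

```python
def pad_with_virtual_elements(h, extra):
    # Assumption: virtual elements are zeros, so they add nothing to a sum.
    return list(h) + [0] * extra

def parallel_mac(hp, i, j, n, length):
    """Accumulate h[i+k] * h[j+k] in groups of n lanes, for j >= i.

    `hp` is the padded array and `length` is the number of original elements.
    """
    acc = 0
    while j < length:
        # One "processing cycle": n products evaluated together.
        acc += sum(hp[i + k] * hp[j + k] for k in range(n))
        i += n   # the value of n is added to i and to j
        j += n
    return acc

# Hypothetical register parameters: 16-bit data in a 64-bit register.
a_bits, b_bits = 16, 64
elements_per_load = b_bits // a_bits   # d = quotient of b divided by a

h = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # hypothetical impulse response, L = 10
n = 4
hp = pad_with_virtual_elements(h, n - 1)
print(parallel_mac(hp, 0, 3, n, len(h)))
```

Without the appended elements, the last group of n products would read past the end of the array whenever the remaining length is not a multiple of n; with them, every cycle can run at full width and the lanes that touch virtual elements do not change the result.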
[0064]
The present invention also provides a speech encoding apparatus comprising:
(a) voice input means for inputting a voice uttered by a caller;
(b) dividing means for dividing the voice input by the voice input means into frames of a predetermined unit time length, each frame consisting of q subframes (q is a natural number) and each subframe having L samples (L is a natural number), as determined for each speech encoding algorithm based on algebraic code-excited linear prediction;
(c) synthesis filter generating means for generating, for each frame of the voice divided by the dividing means, a synthesis filter that has frequency input/output characteristics based on a spectral envelope representing the frequency intensity distribution and that, together with a sound source signal, constitutes the speech synthesis model of the speech encoding algorithm;
(d) impulse response acquisition means for obtaining the impulse response of the synthesis filter, which is the response produced when an impulse signal containing wideband frequency components uniformly is input to the synthesis filter generated by the synthesis filter generating means and which has L elements as determined for each speech encoding algorithm;
(e) parallel calculation means for taking the L impulse response elements obtained by the impulse response acquisition means as the elements of a lower triangular Toeplitz convolution matrix and, using the property that the product of this matrix and its transpose is a symmetric matrix, performing the autocorrelation operation between the lower triangular Toeplitz convolution matrix and its transpose as a parallel operation in which n elements (n is a natural number) are taken from the L impulse response elements and n operations are executed in parallel per processing cycle, thereby obtaining the autocorrelation values for n impulse response elements of the synthesis filter;
(f) repetition means for repeating the parallel operation on n elements by the parallel calculation means to obtain the L × L autocorrelation values of the impulse response of the synthesis filter;
(g) element expansion means for additionally appending L − 1 virtual impulse response elements of the synthesis filter to the impulse response elements to be operated on as the repetition means repeats the parallel operation on n elements, and performing the parallel operation;
(h) autocorrelation value selection means for selecting as significant, from the autocorrelation values obtained by the repetition means and the element expansion means, the results calculated using the values of the original impulse response;
(i) cross-correlation calculating means for calculating the cross-correlation value between the voice and the impulse response of the synthesis filter obtained by the impulse response acquisition means;
(j) unvoiced sound analysis means for selecting, from the significant autocorrelation values selected by the autocorrelation value selection means and the cross-correlation values obtained by the cross-correlation calculating means, a predetermined number of elements determined according to the L elements of the subframe set for each speech encoding algorithm, and performing a comparison operation on them to obtain information representing the unvoiced sound component, which is one of the constituent elements of voice, is generated by the flow of air, and has no periodicity; and
(k) output means for outputting the information representing the unvoiced sound component obtained by the unvoiced sound analysis means to the other party as a non-periodic component of the voice.
In this speech encoding apparatus, the parallel calculation means comprises:
(e-1) initial setting means for initializing to "0" the indices i and j of the autocorrelation matrix Φ(i, j) formed from the lower triangular Toeplitz convolution matrix of the impulse response array h(n);
(e-2) read processing means which, given that in an actual device environment the impulse response array h(i) is an array of elements of a bits each (a is a natural number), the register used for arithmetic processing is b bits wide (b is a natural number), a total of c register files (c is a natural number) are arranged in the arithmetic unit for the parallel operation, and the arithmetic unit can execute at most n operations in parallel in one processing cycle, so that d items of a-bit data can be read into the register at once, d (a natural number) being the quotient of b divided by a, first reads the impulse response array elements h(i) to h(i + (d − 1)) into the arithmetic unit and then reads the impulse response array elements h(j) to h(j + (d − 1));
(e-3) arithmetic means for executing the following n expressions in parallel in one processing cycle and obtaining the accumulated result,
h (i) × h (j)
h (i + 1) × h (j + 1)
h (i + 2) × h (j + 2)
......
h (i + (n−1)) × h (j + (n−1))
(e-4) calculation repeat control means for adding the value of n to each of the values of i and j and repeating the calculation by the arithmetic means until the number of accumulated results stored in the c register files of the arithmetic unit reaches the register-file-full value, which is obtained by multiplying by c the quotient of b divided by 2a; and
(e-5) register file storage control means for storing all of these accumulated results contiguously in the c register files when the register-file-full value is reached under the calculation repeat control means.
The element expansion means uses the property that, when the number of impulse response elements is N, the autocorrelation matrix Φ(i, j) obtained from the lower triangular Toeplitz convolution matrix is a symmetric matrix of N rows and N columns, calculates Φ(i, j) with N × (N + 1) / 2 operations instead of the square of the number of elements N, and interpolates the elements that have not been calculated by using the relationship that Φ(i, j) and Φ(j, i) are equal.
[0068]
Further, the present invention provides a speech encoding program that executes:
(a) a division process for dividing voice into frames of a predetermined unit time length, each frame consisting of q subframes (q is a natural number) and each subframe having L samples (L is a natural number), as determined for each speech encoding algorithm based on algebraic code-excited linear prediction;
(b) a synthesis filter generation process for generating, for each frame of the voice divided by the division process, a synthesis filter that has frequency input/output characteristics based on a spectral envelope representing the frequency intensity distribution and that, together with a sound source signal, constitutes the speech synthesis model of the speech encoding algorithm;
(c) an impulse response acquisition process for obtaining the impulse response of the synthesis filter, which is the response produced when an impulse signal containing wideband frequency components uniformly is input to the synthesis filter generated by the synthesis filter generation process and which has L elements as determined for each speech encoding algorithm;
(d) a parallel operation process for taking the L impulse response elements obtained by the impulse response acquisition process as the elements of a lower triangular Toeplitz convolution matrix and, using the property that the product of this matrix and its transpose is a symmetric matrix, performing the autocorrelation operation between the lower triangular Toeplitz convolution matrix and its transpose as a parallel operation in which n elements (n is a natural number), fewer than the L impulse response elements, are taken out and n operations are executed in parallel per processing cycle, thereby obtaining the autocorrelation values for n impulse response elements of the synthesis filter;
(e) an iterative process for repeating the parallel operation on n elements by the parallel operation process to obtain the L × L autocorrelation values of the impulse response of the synthesis filter;
(f) an element expansion process for virtually expanding the impulse response elements to be operated on to a number exceeding L as the iterative process repeats the parallel operation on n elements, and performing the autocorrelation operation by the parallel operation;
(g) an autocorrelation value selection process for selecting as significant, from the autocorrelation values obtained by the iterative process and the element expansion process, the results calculated using the values of the original impulse response;
(h) a cross-correlation calculation process for calculating the cross-correlation value between the voice and the impulse response of the synthesis filter obtained by the impulse response acquisition process; and
(i) an unvoiced sound analysis process for selecting, from the significant autocorrelation values selected by the autocorrelation value selection process and the cross-correlation values obtained by the cross-correlation calculation process, a predetermined number of elements determined according to the L elements of the subframe set for each speech encoding algorithm, and performing a comparison operation on them to obtain information representing the unvoiced sound component, which is one of the constituent elements of voice, is generated by the flow of air, and has no periodicity.
In this speech encoding program, the parallel operation process executes:
(d-1) an initial setting process for initializing to "0" the indices i and j of the autocorrelation matrix Φ(i, j) formed from the lower triangular Toeplitz convolution matrix of the impulse response array h(n);
(d-2) a reading process which, given that in an actual device environment the impulse response array h(i) is an array of elements of a bits each (a is a natural number), the register used for arithmetic processing is b bits wide (b is a natural number), a total of c register files (c is a natural number) are arranged in the arithmetic unit for the parallel operation, and the arithmetic unit can execute at most n operations in parallel in one processing cycle, so that d items of a-bit data can be read into the register at once, d (a natural number) being the quotient of b divided by a, first reads the impulse response array elements h(i) to h(i + (d − 1)) into the arithmetic unit and then reads the impulse response array elements h(j) to h(j + (d − 1));
(d-3) an arithmetic process for executing the following n expressions in parallel in one processing cycle and obtaining the accumulated result,
h (i) × h (j)
h (i + 1) × h (j + 1)
h (i + 2) × h (j + 2)
......
h (i + (n−1)) × h (j + (n−1))
(d-4) an arithmetic repeat control process for adding the value of n to each of the values of i and j and repeating the arithmetic process until the number of accumulated results stored in the c register files of the arithmetic unit reaches the register-file-full value, which is obtained by multiplying by c the quotient of b divided by 2a; and
(d-5) a register file storage control process for storing all of the accumulated results contiguously in the c register files when the register-file-full value is reached under the arithmetic repeat control process.
The element expansion process uses the property that, when the number of impulse response elements is N, the autocorrelation matrix Φ(i, j) obtained from the lower triangular Toeplitz convolution matrix is a symmetric matrix of N rows and N columns, calculates Φ(i, j) with N × (N + 1) / 2 operations instead of the square of the number of elements N, and interpolates the elements that have not been calculated by using the relationship that Φ(i, j) and Φ(j, i) are equal.
[0070]
Furthermore, the present invention provides a speech encoding program that executes:
(a) a division process for dividing voice into frames of a predetermined unit time length, each frame consisting of q subframes (q is a natural number) and each subframe having L samples (L is a natural number), as determined for each speech encoding algorithm based on algebraic code-excited linear prediction;
(b) a synthesis filter generation process for generating, for each frame of the voice divided by the division process, a synthesis filter that has frequency input/output characteristics based on a spectral envelope representing the frequency intensity distribution and that, together with a sound source signal, constitutes the speech synthesis model of the speech encoding algorithm;
(c) an impulse response acquisition process for obtaining the impulse response of the synthesis filter, which is the response produced when an impulse signal containing wideband frequency components uniformly is input to the synthesis filter generated by the synthesis filter generation process and which has L elements as determined for each speech encoding algorithm;
(d) a parallel operation process for taking the L impulse response elements obtained by the impulse response acquisition process as the elements of a lower triangular Toeplitz convolution matrix and, using the property that the product of this matrix and its transpose is a symmetric matrix, performing the autocorrelation operation between the lower triangular Toeplitz convolution matrix and its transpose as a parallel operation in which n elements (n is a natural number), fewer than the L impulse response elements, are taken out and n operations are executed in parallel per processing cycle, thereby obtaining the autocorrelation values for n impulse response elements of the synthesis filter;
(e) an iterative process for repeating the parallel operation on n elements by the parallel operation process to obtain the L × L autocorrelation values of the impulse response of the synthesis filter;
(f) an element expansion process for additionally appending L − 1 virtual impulse response elements of the synthesis filter to the impulse response elements to be operated on as the iterative process repeats the parallel operation on n elements, and performing the parallel operation;
(g) an autocorrelation value selection process for selecting as significant, from the autocorrelation values obtained by the iterative process and the element expansion process, the results calculated using the values of the original impulse response;
(h) a cross-correlation calculation process for calculating the cross-correlation value between the voice and the impulse response of the synthesis filter obtained by the impulse response acquisition process; and
(i) an unvoiced sound analysis process for selecting, from the significant autocorrelation values selected by the autocorrelation value selection process and the cross-correlation values obtained by the cross-correlation calculation process, a predetermined number of elements determined according to the L elements of the subframe set for each speech encoding algorithm, and performing a comparison operation on them to obtain information representing the unvoiced sound component, which is one of the constituent elements of voice, is generated by the flow of air, and has no periodicity.
In this speech encoding program, the parallel operation process executes:
(d-1) an initial setting process for initializing to "0" the indices i and j of the autocorrelation matrix Φ(i, j) formed from the lower triangular Toeplitz convolution matrix of the impulse response array h(n);
(d-2) a reading process which, given that in an actual device environment the impulse response array h(i) is an array of elements of a bits each (a is a natural number), the register used for arithmetic processing is b bits wide (b is a natural number), a total of c register files (c is a natural number) are arranged in the arithmetic unit for the parallel operation, and the arithmetic unit can execute at most n operations in parallel in one processing cycle, so that d items of a-bit data can be read into the register at once, d (a natural number) being the quotient of b divided by a, first reads the impulse response array elements h(i) to h(i + (d − 1)) into the arithmetic unit and then reads the impulse response array elements h(j) to h(j + (d − 1));
(d-3) an arithmetic process for executing the following n expressions in parallel in one processing cycle and obtaining the accumulated result,
h (i) × h (j)
h (i + 1) × h (j + 1)
h (i + 2) × h (j + 2)
......
h (i + (n−1)) × h (j + (n−1))
(d-4) an arithmetic repeat control process for adding the value of n to each of the values of i and j and repeating the arithmetic process until the number of accumulated results stored in the c register files of the arithmetic unit reaches the register-file-full value, which is obtained by multiplying by c the quotient of b divided by 2a; and
(d-5) a register file storage control process for storing all of the accumulated results contiguously in the c register files when the register-file-full value is reached under the arithmetic repeat control process.
The element expansion process uses the property that, when the number of impulse response elements is N, the autocorrelation matrix Φ(i, j) obtained from the lower triangular Toeplitz convolution matrix is a symmetric matrix of N rows and N columns, calculates Φ(i, j) with N × (N + 1) / 2 operations instead of the square of the number of elements N, and interpolates the elements that have not been calculated by using the relationship that Φ(i, j) and Φ(j, i) are equal.
[0072]
[Examples]
Hereinafter, the present invention will be described in detail with reference to examples.
[0073]
FIG. 1 shows a speech encoding apparatus according to an embodiment of the present invention. This speech encoding apparatus is composed of a filter part 51, a sound source part 52 and a comparator part 53. The frame-divided speech signal 54 is first input to the filter part 51, where the linear prediction analysis and quantization unit 55 performs linear prediction analysis to obtain a frequency filter characteristic and then quantizes that characteristic. The quantized frequency filter characteristic is input to the synthesis filter 56, which thereby forms a filter having the filter characteristic of the input speech.
[0074]
The sound source part 52 has an adaptive codebook 57, which is a sound source that generates the periodic component corresponding to the pitch of the speech, and a noise codebook 58, which is a sound source that generates the components of the speech other than the periodic component. It further has a gain codebook 59 for controlling the amplitude of the signals output from these sound sources, an adaptive codebook signal amplifier 60 for adjusting the amplitude of the signal generated by the adaptive codebook 57, and a noise codebook signal amplifier 61 for adjusting the amplitude of the signal generated by the noise codebook 58. The amplitude-adjusted sound source signals are added by a sound source adder 62 and then input to the synthesis filter 56 of the filter part 51, which gives them a strength for each frequency and outputs the result as synthesized speech 63.
[0075]
The speech signal 54 and the synthesized speech 63 are input to the comparator part 53 and added by an adder 64. Because the synthesized speech 63 is input with a negative sign, the adder outputs the difference between the two. The perceptual weighting filter 65 filters this difference so as to strengthen frequency components that are difficult to hear and weaken frequency components that are easy to hear. This process removes information that is redundant to human hearing and so reduces the amount of information. The signal whose frequency components have been adjusted by the perceptual weighting filter 65 is input to the error minimizer 66. The error minimizer 66 calculates the least square error between the synthesized speech 63 and the speech signal 54 and causes each codebook of the sound source part 52 to output another signal. This process is repeated, the codebook output signal that minimizes the error is taken as the target signal, and the code assigned to that signal becomes the output of the speech encoding apparatus. The error minimizer 66 uses a microprocessor (not shown) to calculate the least square error. This requires an autocorrelation calculation of the impulse response based on the frequency filter characteristic of the synthesis filter 56. Because this autocorrelation calculation involves a large amount of computation and tends to take a relatively long time, the following method is used to perform it efficiently and in a short time.
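The search loop performed by the error minimizer 66 can be pictured with a minimal sketch. It assumes a hypothetical list of candidate excitation vectors `codebook` and a synthesis filter described only by its impulse response `h`; the function names are illustrative, and the perceptual weighting filter 65 and the gain codebook 59 are omitted for brevity.

```python
def synthesize(excitation, h):
    """Filter an excitation vector with impulse response h (truncated convolution)."""
    n = len(excitation)
    return [
        sum(h[k] * excitation[t - k] for k in range(min(t + 1, len(h))))
        for t in range(n)
    ]


def search_codebook(target, codebook, h):
    """Return (index, squared_error) of the candidate closest to the target.

    Hypothetical helper: each candidate is synthesized and compared with the
    target by its squared error; the smallest error wins.
    """
    best_index, best_err = None, None
    for idx, candidate in enumerate(codebook):
        synth = synthesize(candidate, h)
        err = sum((x - y) ** 2 for x, y in zip(target, synth))
        if best_err is None or err < best_err:
            best_index, best_err = idx, err
    return best_index, best_err
```

Evaluating every candidate this way costs one filtering pass per candidate, which is why practical coders work from precomputed correlations of the impulse response instead.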
[0076]
FIG. 2 shows the autocorrelation matrix Φ (i, j) of the impulse response h (n) of the perceptual weighting synthesis filter. A region 11 is a region where a significant calculation result is set. The region 12 is a region that is a symmetric matrix of the region 11. The area 13 is an area in which virtual calculation results that are not used for subsequent calculations are set. That is, the range 14 that collectively shows the region 11 and the region 12 is an intermediate value necessary for the subsequent calculation, and the range 15 is a virtual value that is not used for the subsequent calculation.
[0077]
The values of VA to VF shown in FIG. 2 are specifically shown below.
VA = h (0)²        ...... (18)
[0078]
VB = h (1)² + h (0)²        ...... (19)
[0079]
VC = h (0)² + h (1)² + ... + h (n−1)²        ...... (20)

[0080]
VD = h (0) × h (1) (21)
[0081]
VE = h (0) × h (1) + h (1) × h (2) + ... + h (n−2) × h (n−1)        ...... (22)

[0082]
VF = h (0) × h (n−1) (23)
[0083]
A method for obtaining the autocorrelation matrix Φ (i, j) of the impulse response h (n) of the perceptual weighting synthesis filter will now be described. The impulse response forms a lower triangular Toeplitz convolution matrix with N rows and N columns, and the autocorrelation matrix Φ (i, j) is a symmetric matrix with N rows and N columns. Because the matrix is symmetric, only “n × (n + 1) / 2” elements are calculated instead of the square of the number of elements n when obtaining Φ (i, j). The elements that are not actually calculated are then interpolated by using the relationship that element Φ (i, j) and element Φ (j, i) are equal. That is, since the matrix Φ (i, j) is symmetric, all the elements of the region 11 and the region 12 can be obtained by calculating only the region 11. Each element is obtained by accumulating products in the direction of the arrows in FIG. 2, starting from the lower right. The loop processing is repeated once per arrow, with the value of h (n) to be multiplied shifted each time. The first round of loop processing calculates the N terms on the diagonal. The first term can be expressed by the following equation (24).
h (0)² (24)
[0084]
The second term can be expressed by the following equation (25).
h (0)² + h (1)² ...... (25)
[0085]
Hereinafter, the n-th term can be expressed as the following equation (26).
h (0)² + h (1)² + ... + h (n−1)² ...... (26)

[0086]
In the second round of loop processing, the accumulated h (i) is shifted by one and “39” terms are calculated along the arrow from the lower right to the upper left. The first term of the second round of loop processing can be expressed by the following equation (27).
h (0) × h (1) (27)
[0087]
The second term of the loop processing in the second round can be expressed by the following equation (28).
h (0) × h (1) + h (1) × h (2) (28)
[0088]
The n-1th term of the loop processing in the second round can be expressed by the following equation (29).
h (0) × h (1) + h (1) × h (2) + ... + h (n−2) × h (n−1) ...... (29)

[0089]
Further, the first term in the n-th loop processing can be expressed by the following equation (30).
h (0) × h (n−1) (30)
[0090]
After obtaining the elements of the region 11 of FIG. 2 by the above calculation, the elements of the region 11 are copied to the region 12 so as to be symmetrical with respect to the diagonal portion of the matrix included in the region 11. As a result, all the elements of the matrix Φ (i, j) can be obtained.
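The basic processing described above can be sketched as follows. The sketch assumes that the matrix is the product of the transposed lower triangular Toeplitz convolution matrix and the matrix itself, so that Φ (N−1, N−1) = h (0)², and it accumulates each diagonal from the lower right to the upper left. The function name and the choice of which triangle is computed first are illustrative assumptions, not taken from the figures.

```python
def autocorr_matrix(h):
    """Build the symmetric autocorrelation matrix of impulse response h.

    Only one triangle, n * (n + 1) / 2 running sums, is actually computed;
    the other triangle is filled in by mirroring, since phi[i][j] == phi[j][i].
    """
    n = len(h)
    phi = [[0] * n for _ in range(n)]
    for lag in range(n):              # lag 0 is the main diagonal (first round)
        acc = 0
        for k in range(n - lag):      # walk the diagonal from the lower right
            acc += h[k] * h[k + lag]  # running sum, cf. equations (24) to (30)
            i = n - 1 - lag - k
            j = n - 1 - k
            phi[i][j] = acc           # computed triangle
            phi[j][i] = acc           # mirrored copy
    return phi
```

For N = 40 this computes 820 running sums instead of 1600, at the cost of the per-diagonal loop lengths that differ from round to round.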
[0091]
The above is the basic processing for obtaining the matrix Φ (i, j). In order to shorten the calculation time, the following processing is performed.
[0092]
First, an autocorrelation matrix Φ (i, j) represented by a two-dimensional matrix of N rows and N columns is represented as a one-dimensional array oda [indk]. This one-dimensional array oda [indk] can be expressed by the following equation (31).
[Expression 12]

[0093]
  Here, specific values and ranges of the variables are shown, taking the recommendation GSM-AMR as a specific example. In the recommendation GSM-AMR, the speech signal is divided into frames with a fixed length of 20 milliseconds, each frame is further divided into four subframes of 5 milliseconds, and encoding is performed with the subframe as one unit. The sampling frequency of the speech signal is 8 kilohertz, which is common in speech signal processing. The number of samples in one subframe is the product of 5 milliseconds and 8 kilohertz, namely “40” samples. This number of samples “40” is the number of elements in each row and column of the N × N matrix for which the autocorrelation is obtained. That is, the matrix of N rows and N columns is a matrix of 40 rows and 40 columns, and the number of its elements is “1600”, the product of “40” and “40”. Since these 1600 elements are held in the one-dimensional array oda [indk], the value of indk, which is the index that numbers each element of the array, is 0 or more and less than 1600. The value of p in equation (31) is the quotient obtained when the index indk of the one-dimensional array oda [indk] is divided by the number of elements in a row, “40”. The value of m in equation (31) is the remainder of the same division.
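The arithmetic in this paragraph can be checked with a short sketch. The constant and function names are illustrative; only the quotient and remainder relationship for p and m is taken from the text, and the meaning of p and m inside equation (31) is not reproduced here.

```python
FRAME_MS = 20            # frame length in milliseconds
SUBFRAME_MS = 5          # subframe length in milliseconds
SAMPLE_RATE_KHZ = 8      # sampling frequency in kilohertz

N = SUBFRAME_MS * SAMPLE_RATE_KHZ              # samples per subframe
SUBFRAMES_PER_FRAME = FRAME_MS // SUBFRAME_MS  # subframes per frame
NUM_ELEMENTS = N * N                           # elements of the N x N matrix


def split_index(indk):
    """Return (p, m): quotient and remainder of indk divided by N."""
    if not 0 <= indk < NUM_ELEMENTS:
        raise IndexError("indk must be 0 or more and less than N * N")
    return divmod(indk, N)
```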
[0094]
When calculating the one-dimensional array oda [indk], the range of the impulse response array h (n) of the perceptual weighting synthesis filter is expanded, and the calculation is performed including values that are not used in the subsequent calculation. That is, if the array h (n) is not expanded, the calculation is performed “N−j” times, from h (0) × h (j) to h (N−1−j) × h (N−1); with the expansion, it is performed N times, from h (0) × h (j) to h (N−1) × h (N−1 + j). Specifically, the array h (N−1) of impulse responses of the perceptual weighting synthesis filter is expanded to h (2 × N−1). That is, the number of elements is expanded from N to “2 × N−1”. Although this expanded part is used in the calculation, the results calculated from it are not used, so it is sufficient to declare the area; the values of its elements need not be specified.
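The effect of the expansion on the loop length can be illustrated as follows. The sketch pads the array with zeros purely so that the result is reproducible; the text states that the values of the expanded part need not be specified, so the zero fill and the function name are assumptions.

```python
def lag_products(h, j):
    """Return (products, significant_count) for start offset j.

    The array is extended to 2 * N - 1 elements so that exactly N products
    h(k) * h(k + j) can be formed for every j without a bounds check.
    Only the first N - j products use original elements on both sides.
    """
    n = len(h)
    extended = list(h) + [0] * (n - 1)   # assumed fill value: 0
    products = [extended[k] * extended[k + j] for k in range(n)]
    return products, n - j
```

Keeping the trip count fixed at N regardless of j is what lets the same parallel loop body be reused for every offset.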
[0095]
FIG. 3 and FIG. 4 show processing for obtaining the autocorrelation matrix of the impulse response array h (n) of the perceptual weighting synthesis filter.
[0096]
First, the sign of each element of the impulse response array h (n) of the perceptual weighting synthesis filter is obtained and stored (step S51 in FIG. 3).
[0097]
FIG. 5 shows an area in which only the sign bit of the array h (n) of the impulse response of the perceptual weighting synthesis filter is extracted and stored. As shown in FIG. 5, only the sign bit of the impulse response array h (n) of the perceptual weighting synthesis filter is extracted and stored in order in the region sig_n [0] and the region sig_n [1]. sig_n [0] uses the lower 8 bits as a significant area. sig_n [1] uses all 32 bits as a significant area.
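A minimal sketch of this sign-bit storage is given below for 40 elements. Which element goes to which bit is not stated in the text, so the layout used here (elements 0 to 7 in the lower 8 bits of sig_n [0], elements 8 to 39 in sig_n [1], sign bit 1 meaning negative) is an assumption, as are the function names.

```python
def pack_signs(h):
    """Pack the sign bits of 40 elements into two words (assumed layout)."""
    if len(h) != 40:
        raise ValueError("this sketch assumes 40 elements")
    sig_n = [0, 0]
    for k, value in enumerate(h):
        bit = 1 if value < 0 else 0
        if k < 8:
            sig_n[0] |= bit << k          # lower 8 bits of sig_n[0]
        else:
            sig_n[1] |= bit << (k - 8)    # all 32 bits of sig_n[1]
    return sig_n


def sign_of(sig_n, k):
    """Recover the sign (+1 or -1) of element k from the packed words."""
    word, pos = (0, k) if k < 8 else (1, k - 8)
    return -1 if (sig_n[word] >> pos) & 1 else 1
```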
[0098]
When calculating each element of the impulse response array of the perceptual weighting synthesis filter, N elements are always calculated so as to protrude beyond the expanded portion of h (i). A region 13 in FIG. 2 is the protruding portion. This area 13 is not necessary for subsequent calculations. The first half of the process for obtaining the autocorrelation of the impulse response of the perceptual weighting synthesis filter is an integration process between the elements. This integration is performed by gradually increasing “i” and “j” from “h (i) × h (j)”. Since the number of elements of the impulse response array of the perceptual weighting synthesis filter is N, the range of the value of “i” is “0” or more and “N−1” or less. Further, “j” is not less than “i” and not more than “N−1”, but the calculation of “h (i) × h (j)” can be repeated N times regardless of the value of “i”. Furthermore, the range of the value of “j” with respect to the array of impulse responses h (i) of the perceptual weighting synthesis filter is expanded to “i” or more and “i + N−1” or less.
[0099]
  Returning to FIG. 3, the description will be continued. The indices “i” and “j” that number the elements of the impulse response array h (n) of the perceptual weighting synthesis filter are initialized to “0” (step S52).
[0100]
It is assumed that the impulse response array h (i) of the perceptual weighting synthesis filter is an array whose elements are 16 bits each and that it is stored contiguously in the storage area. It is further assumed that the registers, which are the storage areas used for arithmetic processing by the arithmetic unit of the microprocessor (not shown), are 128 bits wide. The arithmetic unit can execute a maximum of four operations in parallel in one processing cycle. Because a register is 128 bits wide, eight 16-bit data items can be read into it at once. Therefore, the impulse responses h (i) to h (i + 7) of the perceptual weighting synthesis filter are read in the first read into the arithmetic unit (step S53), and the impulse responses h (j) to h (j + 7) are read in the second read (step S54). Then, the following four arithmetic expressions (32) to (35) are executed in parallel in one processing cycle (step S55).
[0101]
h (i) × h (j) (32)
[0102]
h (i + 1) × h (j + 1) (33)
[0103]
h (i + 2) × h (j + 2) (34)
[0104]
h (i + 3) × h (j + 3) (35)
[0105]
  Since each calculation is “16-bit data × 16-bit data”, a storage area of up to 32 bits, that is, “16 bits + 16 bits”, is required to store each calculation result. The storage area for the four calculation results is therefore uniformly “32 bits × 4” wide. These “32 bits × 4” storage areas are secured consecutively in ten 128-bit-wide register files capable of high-speed data transfer to the registers of the arithmetic unit. After the four operations are performed, “4” is added to the value of “i” and “4” is added to the value of “j” to prepare for the next operation (step S56). Then the four arithmetic expressions (32) to (35) are again executed in parallel in one processing cycle (step S57). This process is repeated until it has been performed 10 times (step S59: N), and “4” is added to the value of “i” and to the value of “j” at each repetition (step S58). When the 10 repetitions are completed (step S59: Y), a total of 40 integration results have been obtained. These 40 data items are stored four at a time in the ten 128-bit register files. Addition processing is then performed using the integration results. For the addition processing, the index value that numbers the ten register files is first initialized to “0” (step S60 in FIG. 4).
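The integration (element-wise multiplication) loop of FIG. 3 can be modeled in software as below. The sketch treats each register file as a list of four products and uses plain Python integers rather than real 128-bit registers; the constant names and the function name are illustrative assumptions.

```python
A_BITS = 16      # bits per impulse response element
B_BITS = 128     # register width in bits
C_FILES = 10     # number of register files

LANES = B_BITS // (2 * A_BITS)   # 32-bit products per register: 4
FULL_VALUE = C_FILES * LANES     # products held when all files are full: 40


def integrate(h_ext, i, j):
    """Fill C_FILES register files with products h(i + k) * h(j + k).

    h_ext is the extended impulse response (2 * N - 1 elements for N = 40),
    so no bounds check is needed for any start offset j below N.
    """
    register_files = []
    for _ in range(C_FILES):
        register_files.append(
            [h_ext[i + k] * h_ext[j + k] for k in range(LANES)]
        )
        i += LANES               # steps S56 and S58: advance both indices by 4
        j += LANES
    return register_files
```

The extreme signed 16-bit product, (−32768) × (−32768) = 2³⁰, fits in a signed 32-bit lane, which is consistent with the “16 bits + 16 bits” sizing in the text.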
[0106]
FIG. 6 shows a register to be subjected to addition processing performed using the results obtained by performing the four parallel integrations performed so far and the flow of the addition processing. This addition process is also executed in parallel.
[0107]
Registers 21₁ to 21₃ in FIG. 6 are physically the same register but are given different reference symbols to represent the changes in their contents as the processing cycles progress. The same applies to registers 22₁ to 22₃ and registers 23₁ to 23₃. Each register is 128 bits wide and is composed of four 32-bit register elements. For example, register 21₁ is composed of register elements 31₁ to 31₄, register 22₁ is composed of register elements 32₁ to 32₄, and register 23₁ is composed of register elements 33₁ to 33₄. Likewise, although each register element is physically the same across processing cycles, it is given a different reference symbol in each processing cycle to represent the changes in its contents. For example, register elements 31₁, 31₅ and 31₉ are physically identical, and similarly register elements 32₁, 32₅ and 32₉ are physically identical. The same applies to the other register elements.
[0108]
  The flow of the addition processing will be described with reference to FIG. 6. In the first processing cycle of the addition, all the register elements of registers 21₁ to 21₃ are first initialized to “0” (step S60 in FIG. 4). Then the four integration results obtained by using the index value that numbers the register files storing the integration results are written to register 21₁ (step S61 in FIG. 4). Specifically, “h (i) × h (j)” is written to register element 31₄ and “h (i + 1) × h (j + 1)” is written to register element 31₃. Furthermore, “h (i + 2) × h (j + 2)” is written to register element 31₂ and “h (i + 3) × h (j + 3)” is written to register element 31₁. After this writing is completed, “1” is added to the index value of the register files storing the integration results (step S62 in FIG. 4). The addition results are written over the ten register files storing the integration results, in order. In this way, the forty 32-bit data items can be passed on to the subsequent calculation without any further memory access.
[0109]
  In the second processing cycle of the addition, addition is performed using the register contents set in the first processing cycle. The result of adding registers 21₁ and 22₁ is written to register 22₂. The four data items (128 bits in total) of the integration result that follows the one set in register 21₁ in the first processing cycle are written together to register 21₂. Register elements 31₁, 31₂ and 31₃ of register 21₁ are added, and the sum is set in register element 32₅ of register 22₂. In parallel with this, register elements 31₂ and 31₃ of register 21₁ are added, and the sum is set in register element 32₆ of register 22₂. In parallel with this, register element 31₃ of register 21₁ is set in register element 32₇ of register 22₂. In parallel with this, register element 31₄ of register 21₁ and register elements 32₁ and 32₄ of register 22₁ are added, and the sum is set in register element 32₈ of register 22₂. Further in parallel with these, the four data items of the integration result that follows the one set in register 21₁ in the previous cycle are set in register elements 31₅, 31₆, 31₇ and 31₈ of register 21₂ (step S63 in FIG. 4). These four integration results are obtained by using the current index value that numbers the register files. After this writing is completed, “1” is added to the index value of the register files storing the integration results (step S64 in FIG. 4).
[0110]
  In the third processing cycle of the addition, addition is performed using the register contents set in the second processing cycle. The result of adding registers 21₂ and 22₂ is written to register 22₃. The four data items (128 bits in total) of the integration result that follows the one set in register 21₂ in the second processing cycle are written together to register 21₃. Register elements 32₅ and 32₈ of register 22₂ are added, and the sum is set in register element 33₉ of register 23₃. In parallel with this, register elements 32₆ and 32₈ of register 22₂ are added, and the sum is set in register element 33₁₀ of register 23₃. In parallel with this, register elements 32₇ and 32₈ of register 22₂ are added, and the sum is set in register element 33₁₁ of register 23₃. In parallel with this, register element 32₈ of register 22₂ is set in register element 33₁₂ of register 23₃. In parallel with this, register elements 31₅, 31₆ and 31₇ of register 21₂ are added, and the sum is set in register element 32₉ of register 22₃. In parallel with this, register elements 31₆ and 31₇ of register 21₂ are added, and the sum is set in register element 32₁₀ of register 22₃. In parallel with this, register element 31₇ of register 21₂ is set in register element 32₁₁ of register 22₃. In parallel with this, register element 31₈ of register 21₂ and register elements 32₅ and 32₈ of register 22₂ are added, and the sum is set in register element 32₁₂ of register 22₃. Furthermore, in parallel with these, the four integration results in the storage area next to the integration result set in register 21₂ in the previous cycle are set in register elements 31₉, 31₁₀, 31₁₁ and 31₁₂ of register 21₃ (step S65 in FIG. 4). These four integration results are obtained by using the current index value that numbers the register files. After this writing is completed, “1” is added to the index value of the register files storing the integration results (step S66 in FIG. 4). Then the write destination among the 128-bit register files to which the value of register 23₃ is written is advanced by one register file.
[0111]
  This third processing cycle of the addition is repeated a further nine times, for a total of 10 times (step S67: N in FIG. 4). Addition processing is thus performed on all of the “10 pieces × 4 data” obtained by the integration processing. In this way, the integration results of the flowchart shown in FIG. 3 are added in the flowchart shown in FIG. 4. A total of 12 cycles of addition processing is performed, and the 40 required product-sum results are stored in the ten 128-bit register files (step S67: Y).
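The addition flow of FIG. 6 behaves like a three-stage pipeline that turns the 40 products into 40 running sums. The following sketch is one reading of the register description, with assumed names: `reg21` holds freshly loaded products, `reg22` holds the first-stage partial sums together with the carry from earlier groups, and the second stage produces the finished running sums. Lane order inside a register is simplified to ascending order.

```python
def pipelined_running_sums(groups):
    """Return (running_sum_groups, cycles) for a list of 4-product groups.

    Each loop iteration models one processing cycle in which three things
    happen in parallel, all reading the register contents of the previous
    cycle: stage 2 finishes a group, stage 1 starts the next one, and a new
    group of products is loaded.
    """
    pending = list(groups)
    reg21 = None     # loaded products (p0, p1, p2, p3)
    reg22 = None     # stage 1: (p1+p2+p3, p1+p2, p1, p0 + carry)
    out = []
    cycles = 0
    while pending or reg21 is not None or reg22 is not None:
        cycles += 1
        next22 = None
        if reg22 is not None:                        # stage 2 (register 23)
            s123, s12, s1, base = reg22
            out.append([base, base + s1, base + s12, base + s123])
        if reg21 is not None:                        # stage 1 (register 22)
            p0, p1, p2, p3 = reg21
            carry = 0 if reg22 is None else reg22[0] + reg22[3]
            next22 = (p1 + p2 + p3, p1 + p2, p1, p0 + carry)
        reg21 = pending.pop(0) if pending else None  # load (register 21)
        reg22 = next22
    return out, cycles
```

With ten groups the loop runs for 12 cycles, matching the count given in the text: one cycle to load the first group, one to start it, and ten to finish all groups.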
[0112]
Then, the signs stored in advance in the area sig_n [0] and the area sig_n [1] are attached to each of the 40 product-sum results (step S68).
[0113]
  The above process of “repeating the 4-parallel integration 10 times, performing the 4-parallel addition for 12 cycles, and attaching the sign data to the results” is repeated “40” times, which is the number of elements of the impulse response h (n) of the perceptual weighting synthesis filter (step S69: N). At each repetition, the index value “i” that numbers the elements of the impulse response h (n) of the perceptual weighting synthesis filter is set to “0”. At the same time, “1” is added to “j” so that the starting point of the index value “j” that numbers the elements of the impulse response h (n) advances by “1” at each repetition (step S70). By performing this repetition 40 times, all 1600 elements of the autocorrelation matrix Φ (i, j) of the impulse response of the perceptual weighting synthesis filter are obtained in the form of a one-dimensional array (step S69: Y).
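Putting the steps together, the whole procedure can be sketched as a fixed-length double loop over the extended array. The layout of the one-dimensional array used here (offset j selects the row, position k selects the column) is an assumption, since equation (31) is not reproduced in this text; the sign step S68 is omitted, and the function names are illustrative.

```python
def build_oda(h):
    """Return the N * N running sums as a one-dimensional list.

    For each start offset j, N products h(k) * h(k + j) are accumulated over
    the extended array, including the virtual ones that are never used.
    """
    n = len(h)
    extended = list(h) + [0] * (n - 1)   # assumed fill value for the extension
    oda = [0] * (n * n)
    for j in range(n):                   # step S70: j advances by 1 per pass
        acc = 0
        for k in range(n):               # always N terms, no bounds check
            acc += extended[k] * extended[k + j]
            oda[j * n + k] = acc
    return oda


def phi_from_oda(oda, n):
    """Select the significant sums and place them in the symmetric matrix."""
    phi = [[0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n - j):           # entries with k >= n - j are virtual
            r, c = n - 1 - j - k, n - 1 - k
            phi[r][c] = phi[c][r] = oda[j * n + k]
    return phi
```

For N = 40 the first function fills all 1600 entries, of which 820 are significant; the rest correspond to the region 13 of FIG. 2.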
[0114]
In this way, the autocorrelation values are calculated by the parallel calculation means of the arithmetic unit, and the autocorrelation calculation is repeated with as few breaks as possible. A break in the autocorrelation calculation occurs when, in loading the elements of the impulse response of the synthesis filter into the arithmetic unit, the area in which those elements are stored has to be obtained by calculation. Each element of the impulse response of the synthesis filter is therefore stored contiguously so that the storage area does not have to be obtained by calculation. However, when a calculation is repeated over a finite number of elements, an area partway through the storage area may still have to be obtained by calculation, so the frequency of this calculation is reduced. For this purpose, the impulse response of the synthesis filter is virtually increased by L-1 elements, and the autocorrelation calculation is performed as if those elements existed. The calculation for the added elements is extra work, but it is outweighed by the reduction, achieved through parallel calculation, in the number of autocorrelation calculations and storage area calculations, so the overall processing time is shortened.
[0115]
Of course, what is expressed as a frame in this specification includes not only frames but also subframes.
[0116]
[Effects of the Invention]
As described above, according to the first or second aspect of the invention, the autocorrelation calculation of the impulse response elements of the perceptual weighting synthesis filter obtained from the input speech is assigned to the parallel operation means capable of n parallel operations and is executed as continuously as possible. As a result, the processing capability of the parallel operation means is effectively utilized. Therefore, even a processor having a relatively low operating frequency can encode a speech signal in a relatively short time.
[0117]
According to the third aspect of the present invention, the parallel operation means performs the parallel operation with the maximum number that can perform the parallel operation in one processing cycle. This makes effective use of the processing capability of the computing unit. Therefore, even a processor having a relatively low operating frequency can encode a speech signal in a relatively short time.
[0118]
Furthermore, according to the invention described in claim 4 or claim 5, the autocorrelation calculation of the impulse response element of the auditory weighting synthesis filter obtained from the input speech is performed as much as possible in the parallel calculation processing capable of n parallel calculations. It is running continuously. As a result, the processing capability of the parallel processing is effectively utilized. Therefore, even a processor having a relatively low operating frequency can encode a speech signal in a relatively short time.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a main part of a speech encoding apparatus using a code-excited linear prediction technique according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an autocorrelation matrix of an impulse response of the perceptual weighting synthesis filter of the present embodiment.
FIG. 3 is a flowchart showing a flow of integration processing for obtaining an autocorrelation matrix of an impulse response array h (n) of the perceptual weighting synthesis filter of the present embodiment.
FIG. 4 is a flowchart showing a flow of an addition process for obtaining an autocorrelation matrix of an impulse response array h (n) of the perceptual weighting synthesis filter of the present embodiment.
FIG. 5 is an explanatory diagram showing an area for storing the sign bits of the impulse response array h (n) of the perceptual weighting synthesis filter of the present embodiment.
FIG. 6 is a block diagram showing the registers subjected to the parallel addition processing of the present embodiment and the flow of the addition processing.
FIG. 7 is a block diagram showing a main part of a speech encoding apparatus and speech decoding apparatus using code-excited linear prediction technology.
FIG. 8 is a block diagram showing a main part of a speech encoding apparatus using a code-excited linear prediction technique.
FIG. 9 is an explanatory diagram showing an estimated signal searched from a speech signal and a code book and a difference between the two signals as a vector signal.
[Explanation of symbols]
11 Region (region where significant calculation results are set)
12 Region (region that is symmetric to the region 11)
13 Region (region where virtual calculation results are set)
14 Range (intermediate values required for subsequent calculations)
15 Range (virtual values)

Claims

A voice input means for inputting voice uttered by the caller;
A dividing means for dividing the voice into frames of a predetermined unit time length, each frame having q subframes (q is a natural number) of L samples (L is a natural number), as determined for each speech encoding algorithm based on algebraic code-excited linear prediction;
A synthesis filter having frequency input / output characteristics based on a spectrum envelope representing a frequency intensity distribution for each frame of speech divided by the dividing means, and generating a synthesis filter constituting a speech synthesis model together with a sound source signal by the speech encoding algorithm Generating means;
The impulse response of the synthesis filter having L elements determined for each speech encoding algorithm, which is a response when an impulse signal that uniformly includes a wideband frequency component is input to the synthesis filter generated by the synthesis filter generation means Impulse response acquisition means for obtaining
Parallel computing means for taking the L impulse response elements of the synthesis filter obtained by the impulse response acquisition means as elements of a lower triangular Toeplitz convolution matrix, extracting n elements (n is a natural number smaller than L) of the impulse response of the synthesis filter by using the product of the lower triangular Toeplitz convolution matrix and its transpose matrix and the fact that this product is a symmetric matrix, and obtaining n autocorrelation values of the impulse response of the synthesis filter by parallel computation that processes n elements in parallel per processing cycle;
Repetitive means for obtaining autocorrelation values of impulse responses of L × L synthesis filters by repeating the parallel calculation of n elements by the parallel calculation means;
By repeating the parallel calculation for n elements by the repetition means, the impulse response elements of the synthesis filter to be calculated are virtually expanded to the number of elements exceeding L, and the autocorrelation calculation by the parallel calculation is performed. Arithmetic element expansion means;
Autocorrelation value selection means for selecting a result calculated using an original impulse response value as significant from the autocorrelation values obtained by the repetition means and the calculation element expansion means;
A cross-correlation calculating means for calculating a cross-correlation value between the impulse response of the synthesis filter obtained by the impulse response acquiring means and the sound;
An unvoiced sound analyzing means for obtaining information representing an unvoiced sound component, which is one of the components of the voice and is a non-periodic sound generated by the flow of air, by selecting, from among the significant autocorrelation values selected by the autocorrelation value selection means and the cross-correlation values obtained by the cross-correlation calculating means, a predetermined number of elements determined according to the L elements of the subframe set for each speech encoding algorithm;
Output means for outputting information representing the unvoiced sound component obtained by the unvoiced sound analysis means to the other party as a non-periodic component of the voice,
The parallel computing means includes
Initial setting means for initially setting “i” and “j” at “0” in the autocorrelation matrix Φ (i, j) as the lower triangular Toeplitz convolution matrix of the array of impulse responses h (n);
Read processing means which, when this operation is performed in an actual device environment in which the impulse response array h (i) is an array of elements of a bits each (a is a natural number), the bit width of the registers used for the arithmetic processing is b bits (b is a natural number), a total of c register files (c is a natural number) are arranged in the arithmetic unit for the parallel operation, and the arithmetic unit can execute a maximum of n operations in parallel in one processing cycle, first reads the impulse response array elements h (i) to h (i + (d-1)) and then reads the impulse response array elements h (j) to h (j + (d-1)), where d (d is a natural number), the quotient obtained by dividing b by a, is the number of a-bit data items that a register can read at once;
The following n arithmetic expressions are parallelized in one processing cycle, and each of the arithmetic means obtains an integration result;
h (i) × h (j)
h (i + 1) × h (j + 1)
h (i + 2) × h (j + 2)
......
h (i + (n−1)) × h (j + (n−1))
A calculation repeat control means for adding the value of “n” to the value of “i” and to the value of “j” and repeating the calculation by the calculation means until the number of integration results stored in the c register files of the computing unit reaches a register file full value, which is c multiplied by the quotient obtained by dividing b by 2a;
When the register file full value is reached by the arithmetic repeat control means, the register file storage control means for continuously storing all the integration results in the c register files,
The arithmetic element expanding means uses the property that the lower triangular Toeplitz convolution matrix is an N-row N-column symmetric matrix when the number of impulse response elements is N, calculates “n × (n + 1) / 2” elements instead of the square of the total number of elements n when calculating the autocorrelation matrix Φ (i, j), and interpolates the element parts that have not been calculated by using the relationship that the elements Φ (i, j) and Φ (j, i) are equal; a speech encoding apparatus characterized by the above.

A voice input means for inputting voice uttered by the caller;
A dividing means for dividing the voice into frames of a predetermined unit time length, each frame having q subframes (q is a natural number) of L samples (L is a natural number), as determined for each speech encoding algorithm based on algebraic code-excited linear prediction;
A synthesis filter having frequency input / output characteristics based on a spectrum envelope representing a frequency intensity distribution for each frame of speech divided by the dividing means, and generating a synthesis filter constituting a speech synthesis model together with a sound source signal by the speech encoding algorithm Generating means;
The impulse response of the synthesis filter having L elements determined for each speech encoding algorithm, which is a response when an impulse signal that uniformly includes a wideband frequency component is input to the synthesis filter generated by the synthesis filter generation means Impulse response acquisition means for obtaining
The impulse response of the L synthesis filters obtained by the impulse response acquisition means is used as an element of a lower triangular Toeplitz convolution matrix, and the product of the lower triangular Toeplitz convolution matrix and its transpose matrix, and the product becomes a symmetric matrix. By using the lower triangular Toeplitz convolution matrix and its transpose matrix, the n elements (n is a natural number) smaller than the L elements of the impulse response of the synthesis filter are extracted. Parallel computing means for obtaining autocorrelation values of impulse responses of n synthesis filters in parallel computation for executing n elements in parallel per processing cycle;
Repetitive means for obtaining autocorrelation values of impulse responses of L × L synthesis filters by repeating the parallel calculation of n elements by the parallel calculation means;
By repeating the parallel calculation for n elements by the repetition means, the impulse response elements of L-1 virtual synthesis filters are additionally expanded to the impulse response elements of the synthesis filter to be calculated, and the parallel calculation is performed. Computing element expansion means to be implemented;
Autocorrelation value selection means for selecting a result calculated using an original impulse response value as significant from the autocorrelation values obtained by the repetition means and the calculation element expansion means;
A cross-correlation calculating means for calculating a cross-correlation value between the impulse response of the synthesis filter obtained by the impulse response acquiring means and the sound;
According to the L elements of the subframes set for each of the speech encoding algorithms, among the significant autocorrelation value selected by the autocorrelation value selection unit and the crosscorrelation value obtained by the crosscorrelation calculation unit. An unvoiced sound analyzing means for selecting information representing an unvoiced sound component having no periodicity generated by the flow of air, which is one of the components of the sound, by selecting a predetermined number of elements determined by
Output means for outputting information representing the unvoiced sound component obtained by the unvoiced sound analysis means to the other party as a non-periodic component of the voice,
The parallel computing means includes
Initial setting means for initializing “i” and “j” to “0” in the autocorrelation matrix Φ (i, j) formed from the lower triangular Toeplitz convolution matrix of the impulse response array h (n);
Read processing means which, where in an actual device environment the impulse response array h (i) is an array whose elements are a bits each (a is a natural number), the register used for the arithmetic processing is b bits wide (b is a natural number), a total of c register files (c is a natural number) are arranged in the arithmetic unit for the parallel operation, and the arithmetic unit can execute a maximum of n operations in parallel in one processing cycle, so that one register can hold d items of a-bit data (d is a natural number equal to the quotient of b divided by a), first reads the impulse response array elements h (i) to h (i + (d-1)) and then reads the elements h (j) to h (j + (d-1));
Arithmetic means for executing the following n arithmetic expressions in parallel in one processing cycle, each yielding an integration result:
h (i) × h (j)
h (i + 1) × h (j + 1)
h (i + 2) × h (j + 2)
......
h (i + (n−1)) × h (j + (n−1))
Calculation repeat control means for adding the value of “n” to each of the values of “i” and “j” and repeating the calculation by the arithmetic means until the number of integration results stored in the c register files of the arithmetic unit reaches a register file full value, which is the quotient of b divided by 2a, multiplied by c;
Register file storage control means for consecutively storing all the integration results held in the c register files when the register file full value is reached by the calculation repeat control means,
In calculating the autocorrelation matrix Φ (i, j), the calculation element expansion means uses the property that the lower triangular Toeplitz convolution matrix is a symmetric matrix of N rows and N columns, where N is the number of impulse response elements, and calculates “n × (n + 1) / 2” elements instead of the square of the total number of elements n. A speech encoding apparatus characterized by interpolating the element parts that have not been calculated by using the relationship that the elements Φ (i, j) and Φ (j, i) are equal.
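
The role of the L-1 virtual elements in this claim can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: L is assumed to be a multiple of n, the function name and the `filler` parameter are hypothetical, and selecting significant results product by product is an assumed granularity. When “i” and “j” both advance by n and j = i + lag with lag at most L-1, the highest index read is 2L-2, so L-1 appended elements are exactly enough for every n-wide block to be read without a bounds check.

```python
def blocked_correlation(h, lag, n, filler=0):
    """Sum of h(m) * h(m + lag), evaluated in blocks of n products.

    ASSUMPTIONS: len(h) is a multiple of n; 'filler' stands in for the
    virtual impulse response elements, whose value is not specified here.
    """
    L = len(h)
    assert L % n == 0 and 0 <= lag < L
    padded = list(h) + [filler] * (L - 1)  # L-1 virtual elements appended
    total = 0
    for i in range(0, L, n):               # "i" and "j" both advance by n
        j = i + lag
        # One processing cycle: n products h(i + k) * h(j + k), k = 0 .. n-1
        products = [padded[i + k] * padded[j + k] for k in range(n)]
        # Keep as significant only products whose operands are original values
        total += sum(p for k, p in enumerate(products) if j + k < L)
    return total
```

Because the selection step discards every product that touches a virtual element, the result does not depend on the value chosen for `filler`.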

The speech encoding apparatus according to claim 1 or claim 2, wherein the parallel computing means performs the maximum number of operations that can be executed in parallel in one processing cycle.

On the computer,
A dividing process of dividing voice into frames, each of a predetermined unit time length and having q subframes (q is a natural number) of L samples each (L is a natural number), as determined for each speech encoding algorithm based on algebraic code excited linear prediction;
A synthesis filter generation process of generating, for each frame of the voice divided by the dividing process, a synthesis filter that has frequency input / output characteristics based on a spectrum envelope representing the frequency intensity distribution and that, together with a sound source signal, constitutes the speech synthesis model of the speech encoding algorithm;
An impulse response acquisition process of obtaining the impulse response of the synthesis filter, having L elements as determined for each speech encoding algorithm, the impulse response being the response when an impulse signal that uniformly includes wideband frequency components is input to the synthesis filter generated by the synthesis filter generation process;
A parallel operation process which uses the L impulse response elements of the synthesis filter obtained by the impulse response acquisition process as the elements of a lower triangular Toeplitz convolution matrix and, using the property that the product of the lower triangular Toeplitz convolution matrix and its transpose matrix is a symmetric matrix, extracts the autocorrelation operation of the lower triangular Toeplitz convolution matrix and its transpose matrix n elements at a time (n is a natural number smaller than the L elements of the impulse response of the synthesis filter), and obtains n autocorrelation values of the impulse response of the synthesis filter by a parallel operation that executes n elements in parallel per processing cycle;
An iterative process of obtaining L × L autocorrelation values of the impulse response of the synthesis filter by repeating the parallel operation on n elements by the parallel operation process;
A calculation element expansion process which, in the repetition of the parallel operation on n elements by the iterative process, virtually expands the impulse response elements of the synthesis filter to be calculated to a number of elements exceeding L and has the autocorrelation calculation performed by the parallel operation;
An autocorrelation value selection process of selecting, as significant, those of the autocorrelation values obtained by the iterative process and the calculation element expansion process that were calculated using original impulse response values;
A cross-correlation calculation process of calculating a cross-correlation value between the impulse response of the synthesis filter obtained by the impulse response acquisition process and the voice;
An unvoiced sound analysis process of selecting a predetermined number of elements, determined according to the L elements of the subframes set for each speech encoding algorithm, from among the significant autocorrelation values selected by the autocorrelation value selection process and the cross-correlation values obtained by the cross-correlation calculation process, and performing a comparison operation to obtain information representing an unvoiced sound component, which is one of the components of the voice and has no periodicity, being generated by the flow of air,
The parallel operation process comprises:
An initial setting process of initializing “i” and “j” to “0” in the autocorrelation matrix Φ (i, j) formed from the lower triangular Toeplitz convolution matrix of the impulse response array h (n);
A reading process which, where in an actual device environment the impulse response array h (i) is an array whose elements are a bits each (a is a natural number), the register used for the arithmetic processing is b bits wide (b is a natural number), a total of c register files (c is a natural number) are arranged in the arithmetic unit for the parallel operation, and the arithmetic unit can execute a maximum of n operations in parallel in one processing cycle, so that one register can hold d items of a-bit data (d is a natural number equal to the quotient of b divided by a), first reads the impulse response array elements h (i) to h (i + (d-1)) and then reads the elements h (j) to h (j + (d-1));
An arithmetic process of executing the following n arithmetic expressions in parallel in one processing cycle, each yielding an integration result:
h (i) × h (j)
h (i + 1) × h (j + 1)
h (i + 2) × h (j + 2)
......
h (i + (n−1)) × h (j + (n−1))
A calculation repetition control process of adding the value of “n” to each of the values of “i” and “j” and repeating the above calculation until the number of integration results stored in the c register files of the arithmetic unit reaches a register file full value, which is the quotient of b divided by 2a, multiplied by c;
A register file storage control process, executed when the register file full value is reached by the calculation repetition control process, of consecutively storing all the integration results held in the c register files,
In calculating the autocorrelation matrix Φ (i, j), the calculation element expansion process uses the property that the lower triangular Toeplitz convolution matrix is a symmetric matrix of N rows and N columns, where N is the number of elements of the impulse response, and calculates “n × (n + 1) / 2” elements instead of the square of the total number of elements n. A speech coding program characterized by interpolating the element parts that have not been calculated by using the relationship that the elements Φ (i, j) and Φ (j, i) are equal.
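
The register arithmetic recited in this claim can be checked with a short Python sketch. The concrete widths used in the usage note (a = 16, b = 64, c = 8, n = 4) are hypothetical values chosen for illustration, as are the function names. One plausible reading, stated here as an assumption, is that the product of two a-bit values occupies 2a bits, which is why a b-bit register holds b / 2a integration results while it holds b / a input elements.

```python
def register_capacity(a, b, c):
    """Return (d, full) for a-bit data, b-bit registers and c register files.

    d    : number of a-bit input elements one register can hold (b / a)
    full : register file full value, (b / 2a) * c integration results
    """
    d = b // a
    full = (b // (2 * a)) * c
    return d, full


def run_until_full(a, b, c, n, i=0, j=0):
    """Advance "i" and "j" by n per cycle until the register files are full.

    ASSUMPTION: each processing cycle contributes n integration results.
    """
    _, full = register_capacity(a, b, c)
    stored = 0
    while stored < full:
        stored += n   # n products computed in this processing cycle
        i += n
        j += n
    return i, j, stored
```

With the hypothetical values above, `register_capacity(16, 64, 8)` gives d = 4 and a full value of 16, and `run_until_full(16, 64, 8, 4)` stops after four cycles with i = j = 16, at which point the stored results would be written out together.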

On the computer,
A dividing process of dividing voice into frames, each of a predetermined unit time length and having q subframes (q is a natural number) of L samples each (L is a natural number), as determined for each speech encoding algorithm based on algebraic code excited linear prediction;
A synthesis filter generation process of generating, for each frame of the voice divided by the dividing process, a synthesis filter that has frequency input / output characteristics based on a spectrum envelope representing the frequency intensity distribution and that, together with a sound source signal, constitutes the speech synthesis model of the speech encoding algorithm;
An impulse response acquisition process of obtaining the impulse response of the synthesis filter, having L elements as determined for each speech encoding algorithm, the impulse response being the response when an impulse signal that uniformly includes wideband frequency components is input to the synthesis filter generated by the synthesis filter generation process;
A parallel operation process which uses the L impulse response elements of the synthesis filter obtained by the impulse response acquisition process as the elements of a lower triangular Toeplitz convolution matrix and, using the property that the product of the lower triangular Toeplitz convolution matrix and its transpose matrix is a symmetric matrix, extracts the autocorrelation operation of the lower triangular Toeplitz convolution matrix and its transpose matrix n elements at a time (n is a natural number smaller than the L elements of the impulse response of the synthesis filter), and obtains n autocorrelation values of the impulse response of the synthesis filter by a parallel operation that executes n elements in parallel per processing cycle;
An iterative process of obtaining L × L autocorrelation values of the impulse response of the synthesis filter by repeating the parallel operation on n elements by the parallel operation process;
A calculation element expansion process which, in the repetition of the parallel operation on n elements by the iterative process, additionally appends L-1 virtual impulse response elements of the synthesis filter to the impulse response elements of the synthesis filter to be calculated and has the parallel operation performed;
An autocorrelation value selection process of selecting, as significant, those of the autocorrelation values obtained by the iterative process and the calculation element expansion process that were calculated using original impulse response values;
A cross-correlation calculation process of calculating a cross-correlation value between the impulse response of the synthesis filter obtained by the impulse response acquisition process and the voice;
An unvoiced sound analysis process of selecting a predetermined number of elements, determined according to the L elements of the subframes set for each speech encoding algorithm, from among the significant autocorrelation values selected by the autocorrelation value selection process and the cross-correlation values obtained by the cross-correlation calculation process, and performing a comparison operation to obtain information representing an unvoiced sound component, which is one of the components of the voice and has no periodicity, being generated by the flow of air,
The parallel operation process comprises:
An initial setting process of initializing “i” and “j” to “0” in the autocorrelation matrix Φ (i, j) formed from the lower triangular Toeplitz convolution matrix of the impulse response array h (n);
A reading process which, where in an actual device environment the impulse response array h (i) is an array whose elements are a bits each (a is a natural number), the register used for the arithmetic processing is b bits wide (b is a natural number), a total of c register files (c is a natural number) are arranged in the arithmetic unit for the parallel operation, and the arithmetic unit can execute a maximum of n operations in parallel in one processing cycle, so that one register can hold d items of a-bit data (d is a natural number equal to the quotient of b divided by a), first reads the impulse response array elements h (i) to h (i + (d-1)) and then reads the elements h (j) to h (j + (d-1));
An arithmetic process of executing the following n arithmetic expressions in parallel in one processing cycle, each yielding an integration result:
h (i) × h (j)
h (i + 1) × h (j + 1)
h (i + 2) × h (j + 2)
......
h (i + (n−1)) × h (j + (n−1))
A calculation repetition control process of adding the value of “n” to each of the values of “i” and “j” and repeating the above calculation until the number of integration results stored in the c register files of the arithmetic unit reaches a register file full value, which is the quotient of b divided by 2a, multiplied by c;
A register file storage control process, executed when the register file full value is reached by the calculation repetition control process, of consecutively storing all the integration results held in the c register files,
In calculating the autocorrelation matrix Φ (i, j), the calculation element expansion process uses the property that the lower triangular Toeplitz convolution matrix is a symmetric matrix of N rows and N columns, where N is the number of elements of the impulse response, and calculates “n × (n + 1) / 2” elements instead of the square of the total number of elements n. A speech coding program characterized by interpolating the element parts that have not been calculated by using the relationship that the elements Φ (i, j) and Φ (j, i) are equal.