JP3661363B2

JP3661363B2 - Audio compression / decompression method and apparatus, and storage medium storing audio compression / decompression processing program

Info

Publication number: JP3661363B2
Application number: JP22351297A
Authority: JP
Inventors: 満広稲積
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1997-08-20
Filing date: 1997-08-20
Publication date: 2005-06-15
Anticipated expiration: 2017-08-20
Also published as: JPH1165599A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声信号を単純な処理で効率的に圧縮伸張処理する音声圧縮伸張方法および装置並びに音声圧縮伸張処理プログラムを記憶した記憶媒体に関する。
【０００２】
【従来の技術】
音声信号を圧縮伸張する際の符号化方法として、従来より様々な方法が提案されている。その１つとして、特開昭５９−１１６９７３（以下、第１の従来技術という）がある。
【０００３】
この第１の従来技術は、入力音声データを短時間毎に分割して短時間音声信号系列を求める手段、この短時間音声信号系列からスペクトル包絡パラメータを抽出するスペクトル包絡パラメータ抽出手段、このスペクトル包絡パラメータをもとにインパルス応答系列を計算するインパルス応答系列計算手段、このインパルス応答系列を用いて自己相関関数列を計算する手段、前記インパルス応答系列と短時間音声信号系列を用いて相互相関関数列を計算する手段、前記自己相関関数列と相互相関関数列を用いて駆動音源信号系列を計算して符号化する手段、スペクトル包絡符号と駆動音源信号とを組み合わせて出力する手段とを有し、さらに、前記短時間音声信号に対して予め定められた補正を加える目標信号計算手段を有している。
【０００４】
この第１の従来技術によれば、音声の符号化を行うに際して、効率的に駆動音源パルスの位置とゲインを決定することができ、また、計算量、使用メモリ量の削減にもある程度の効果は得られる。
【０００５】
しかし、この第１の従来技術は、女性の声のような音声信号を符号化したのち、音声合成を行う場合、高品質な音声合成を得るには、駆動音源パルスをたくさん抽出する必要があるため、圧縮率が悪くなるという問題点があった。
【０００６】
すなわち、女性の声は、男性の声に比べると複雑で、高精度な合成音を得るには、駆動音源パルスをたくさん抽出する必要があり、結局は、圧縮率が悪いものとなってしまう。
【０００７】
一方、高い圧縮率を得るための技術として、特開昭６３−３７３９９（以下、第２の従来技術という）、特開平３−４３００（以下、第３の従来技術という）がある。
【０００８】
第２の従来技術は、音声信号からピッチ推定を行い、過去のパルス列からの推定値と実際の信号との残差を求め、この残差により駆動音源パルスを計算しようとするものである。
【０００９】
また、第３の従来技術は、ピッチ推定を行い、その１ピッチ区間分の駆動音源（マルチパルス）を推定する。そして、そのマルチパルスのゲインと位相を補正することによって、他のピッチ区間を補正することにより他のピッチ区間を近似する。さらに、推定された値と実際の値との残差より、第２のマルチパルスを推定する。なお、マルチパルス信号の他に雑音コードブックを用いる場合もある。
【００１０】
【発明が解決しようとする課題】
前記した第２、第３の従来技術は、同じ波形を繰り返す周期を求め、１つ前の周期から次の周期を推定し、その推定した部分と現実の音声波形との差分を計算して、その差分により駆動音源を計算するため、高い圧縮率が実現できる。
【００１１】
しかし、ピッチを求めたり差分を求めたりする必要があるため計算量が多く、また、それらのデータを蓄えるために大きな容量のメモリが必要になるという問題点がある。
【００１２】
また、残差を求め、この残差により駆動音源パルスを計算するため、データの一部が失われた場合、失われたデータ部分がそれ以降の計算に大きな影響を与えることになり、高精度な音声合成が行えなくなるという大きな問題点がある。
【００１３】
このように、従来の技術は、それぞれにおいて種々の問題点がある。たとえば、第１の従来技術は、駆動音源パルスを求めるための基本的な技術ではあるが、合成音の品質を上げようとすると、多くの駆動音源パルスを立てる必要があり、女性の声のような音声データに対しては特に圧縮率が悪くなるという問題がある。また、第２の従来技術と、第３の従来技術は高圧縮率が得られるが、計算量が多く、使用メモリ量も多いという問題があり、さらに、差分情報を用いるためデータ欠落に弱いという問題がある。
【００１４】
最近では音声データを扱う携帯用の情報機器が広い分野で用いられるようになってきている。この種の携帯用情報機器は、ＣＰＵの計算速度やメモリ容量には大きな制約があるため、計算量や使用メモリ量が多いということは重大な問題である。また、差分情報を用いる方法は、データの欠落を考慮する必要のある情報機器においては製品の性能向上の面で問題が多く、携帯機器に限らず、コンピュータネットワーク上のリアルタイム伝送などにおいても、データの欠落が、伝送されるデータに大きな影響を与えることにもなる。
【００１５】
以上述べたように、従来のそれぞれの音声符号化方法は、処理が複雑であることが共通しており、ハードウエア化、並列処理による高速化が相対的に困難であるという問題点がある。特に、ピッチ周期を求める処理を含むものは、計算量が多く、また、誤りが発生した場合の影響が大きい。さらに、従来のスペクトル包絡パラメータによるインパルス応答と、駆動パルスを用いる方法は、パルスの前後に不連続を生じ、これが雑音となって現れるという問題点がある。
【００１６】
そこで、本発明は、処理内容が単純で、ハードウエア化、並列処理化を容易に可能とし、かつ、効率のよい符号化が可能で、比較的高い圧縮率での音声データ圧縮を可能とする音声圧縮伸張方法および装置並びに音声圧縮伸張処理プログラムを記憶した記憶媒体を提供することを目的とする。
【００１７】
本発明の音声圧縮伸張方法は、入力音声から所定区間の音声片を切り出し、所定の頻度で、前記切り出した所定区間の音声片からスペクトル包絡パラメータを抽出し、当該抽出したスペクトル包絡パラメータにより推定される時間的前方予測音声波形とそれと連続する時間的後方予測音声波形を用いて作成された第一の音声片を含む複数種類の音声片を参照し、前記複数種類の音声片と前記切り出した所定区間の音声片との類似性を比較して、前記複数種類の音声片から最も類似度の高い音声片を選択し、前記選択された音声片についてのデータを基に、前記切り出した所定区間の音声片を符号化して符号化データを作成する処理を含むことを特徴としている。
【００１８】
また、前記符号化データを作成したのち、当該符号化データを伸張し、前記伸張されたデータを前記切り出した所定区間の音声片から差し引いて残差を求め、前記残差の波形に対して、前記複数種類の音声片を参照し、前記複数種類の音声片と前記残差の波形との類似性を比較する処理を行って、符号化データを得るようにしている。
【００１９】
前記複数種類の音声片は、前記切り出した所定区間の音声片よりも時間的に後方の前記伸張の処理をされた音声波形を用いて作成された第二の音声片、雑音成分により作成された第三の音声片を有し、前記第一の音声片は、前記スペクトル包絡パラメータの抽出後にその内容が更新され、第二の音声片は、当該第二の音声片を用いて作成された符号化データが前記伸張の処理をされたのち、当該伸張の処理をされたデータに基づいてその内容が更新されるようにしている。
【００２０】
また、前記複数種類の音声片は、前記切り出した所定区間の音声片よりも時間的に長い区間を有し、前記切り出した所定区間の音声片との類似性を比較をする際は、各前記音声片の長さの範囲において前記切り出した所定区間の音声片との類似性が比較され、最も類似度の高い部分を有する音声片が選択されるようにしている。
【００２１】
また、前記符号化データは、前記最も類似度の高い部分を有する音声片の番号、当該音声片内のどの部分であるかを表す位置データ、振幅調整用のパラメータであり、さらに、場合に応じて、スペクトル包絡パラメータを加えたデータである。
【００２２】
また、本発明の音声圧縮伸張装置は、入力音声から所定区間の音声片を切り出す音声片切り出し部と、所定の頻度で、前記音声片切り出し部により切り出された所定区間の音声片からスペクトル包絡パラメータを抽出するスペクトル包絡パラメータ抽出部と、当該抽出したスペクトル包絡パラメータにより推定される時間的前方予測音声波形とそれと連続する時間的後方予測音声波形を用いて作成された第一の音声片を含む複数種類の音声片と、前記複数種類の音声片を参照し、前記複数種類の音声片と前記切り出された所定区間の音声片との類似性を比較して類似度を求める類似度判定部と、前記類似度判定部による類似度に基づいて、最も類似度の高い音声片を選択する音声片選択部と、前記音声片選択部により選択された音声片についてのデータを基に前記切り出された所定区間の音声片を符号化する符号化部と、を有することを特徴としている。
【００２３】
また、前記符号化部により符号化されたデータを伸張する伸張部と、前記伸張部により伸張されたデータあるいは前記スペクトル包絡パラメータを用いて対応する前記音声片の内容の更新を行う音声片更新部と、を有するようにしている。
【００２４】
また、前記伸張部により伸張されたデータを前記切り出された所定区間の音声片から差し引いて残差を求める残差生成部を有し、前記類似度判定部、前記音声片選択部、前記符号化部、前記伸張部、前記残差生成部は、処理順にループを形成し、前記残差生成部により生成された残差の波形に対して、前記複数種類の音声片を参照し、前記複数種類の音声片との類似性を比較する処理を行ったのち、符号化データを作成して出力するようにしている。
【００２５】
また、前記複数種類の音声片は、前記切り出された所定区間の音声片よりも時間的に後方の前記伸張の処理をされた音声波形を用いて作成された第二の音声片、雑音成分により作成された第三の音声片を有し、前記音声片更新部は、前記スペクトル包絡パラメータの抽出後に前記第一の音声片を更新し、前記第二の音声片を用いて作成された符号化データが前記伸張の処理をされたのち、当該伸張の処理をされたデータに基づいて第二の音声片を更新するようにしている。
【００２６】
また、前記複数種類の音声片は、前記切り出された所定区間の音声片よりも時間的に長い区間を有し、前記類似度判定部は、前記複数種類の音声片の長さの範囲において、類似性を比較して類似度を求め、前記音声片選択部は、最も類似度の高い部分を有する音声片を選択するようにしている。
【００２７】
また、前記符号化データは、前記最も高い部分を有する音声片の番号、前記最も高い部分を表す位置データ、振幅調整用のパラメータで表されるデータであり、さらに、場合に応じて、スペクトル包絡パラメータを有するようにしている。
【００２８】
また、本発明の記録媒体は入力音声から所定区間の音声片を切り出し、所定の頻度で、前記切り出した所定区間の音声片からスペクトル包絡パラメータを抽出し、当該抽出したスペクトル包絡パラメータにより推定される時間的前方予測音声波形とそれと連続する時間的後方予測音声波形を用いて作成された第一の音声片を含む複数種類の音声片を参照し、前記複数種類の音声片と前記所定区間の音声片との類似性を比較して、最も類似度の高い音声片を選択し、前記選択された音声片についてのデータを基に、前記所定区間の音声片を符号化して符号化データを作成する処理をコンピュータに実行させるための音声圧縮伸張処理プログラムを記憶した記憶媒体である。
【００２９】
このように、本発明では、複数種類のそれぞれの音声片と入力音声から切り出した所定区間の音声片（たとえば、４msec程度の長さの音声片）との類似性を比較し、最も類似度の高い音声片を選択し、その選択された音声片についてのデータを基に前記切り出した所定区間の音声片を符号化するという処理を基本処理として行うようにしている。これにより、符号化がきわめて単純な処理で可能となるため、ハードウエア化、並列処理化を行う際に有利なものとすることができる。
特に、スペクトル包絡パラメータにより推定される予測音声波形を用いる場合、従来では、時間的前方予測音声波形（インパルス応答）のみを用いることが一般的であるが、本発明は、スペクトル包絡パラメータにより推定される時間的前方予測音声波形とそれと連続する時間的後方予測音声波形を用いて音声片を作成するようにしている。
このように、前方予測音声波形に加えて、時間的に後方の後方予測音声波形を用いると、雑音の低減を図れる効果がある。すなわち、インパルス応答（前方予測音声波形）のみを用いた音声片とした場合、音声レベルが殆ど０の状態から急激に波形が立ち上がった音声片となってしまうため、その音声片を用いて圧縮伸張処理したとき、不連続点が生じることによってその部分が雑音となって現れるという問題点がある。これに対して、時間的に後方の後方予測音声波形を用いると不連続点を限りなく小さくすることができ、圧縮伸張音声の品質を大幅に改善できる。
【００３０】
また、符号化データを作成したのち、その符号化データの伸張処理、伸張されたデータを前記切り出した所定区間の音声片から差し引く残差生成処理、その残差波形に対して、再び、複数種類の音声片を参照し、類似性を求めるという処理を1回以上行って符号化データを得ることにより、より一層、高精度な符号化データを得ることができる。
【００３１】
また、前記複数種類の音声片は、前記切り出した所定区間の音声片よりも時間的に後方の前記伸張の処理をされた音声波形を用いて作成された第二の音声片、雑音成分により作成された第三の音声片を有することで、入力音声を符号化する際、効率よく、しかも高精度な符号化が可能となる。
【００３３】
また、前記第一の音声片は、前記スペクトル包絡パラメータの抽出後にその内容が更新され、第二の音声片は、当該第二の音声片を用いて作成された符号化データが前記伸張の処理をされたのち、当該伸張の処理をされたデータに基づいてその内容が更新されるようにしているので、従来のように、固定的な内容のコードブックとは異なり、切り出した所定区間の音声片に対して、常に、最適な音声片が格納されることになり、高品質な符号化が可能となる。
【００３４】
また、前記符号化されたデータは、類似部分音声片を有する音声片番号、その音声片内のどの部分であるかを表す位置データ、振幅調整用のパラメータで表されるデータに、場合によっては、スペクトル包絡パラメータをも加えたデータで表すことができる。したがって、符号化後のデータは数バイト程度のデータとなり、大幅なデータ圧縮が可能となる。なお、一般には、音声は急激に変化することは少ないので、処理対象音声片それぞれが４msec程度として考えた場合、スペクトル包絡パラメータの変化は緩やかであり、処理対象の音声片の１０個に１回程度の頻度でスペクトル包絡パラメータを抽出することで十分な精度が得られる、したがって、スペクトル包絡パラメータを加えたとしても大幅に圧縮されたデータとすることができる。
【００３５】
【発明の実施の形態】
以下、本発明の実施の形態について説明する。具体的な実施の形態を説明する前に、まず、本発明の実施の形態の基本的な処理内容について説明する。
【００３６】
図１は入力音声波形を示すもので、このような入力音声波形から、たとえば、４msec程度の音声片の切り出しを行う。この切り出された音声片（以下、処理対象音声片という）ｈ１を音声片表に格納されている音声片と比較し、最も類似度の高い音声片を音声片表の中から選択し、選択された音声片を用いて符号化データを作成する。なお、処理対象音声片を４msecとしたのは、この実施の形態において使用したシステムでは、４msec程度の長さで切り出すのが最もよい結果が得られるからである。つまり、処理対象音声片の長さが４msecよりも短くなると、音質的には向上するが、圧縮率の低下につながり、また、４msecよりも長くなると、圧縮率的には有利となるが、音質的な劣化につながるおそれがあるからである。
【００３７】
ところで、ここで言う音声片表というのは、図２に示すような複数の要素から作成された音声片（この例では、Ａ１〜Ａ４の４つの音声片）を有するもので、これらの音声片の作成方法については後に説明する。なお、音声片表には常に最新の音声片が格納されるものであり、図２に示す音声片表は、或る時刻における音声片表の内容を示すものである。
【００３８】
今、この図２に示す音声片表が最新の内容であるとすれば、図１において、切り出された４msec程度の処理対象音声片ｈ１が、音声片表の中のどの音声片のどの部分に最も類似しているかを判断する。この場合、処理対象音声片ｈ１は、音声片表の音声片Ａ２の位置ｐ１からの部分が最も類似していると判定される。なお、この最も類似している部分を、類似部分と呼ぶことにする。
【００３９】
これにより、処理対象音声片ｈ１の符号化データは、音声片表の音声片番号Ａ２、位置ｐ１、音声レベルを合わせるための倍率によって表すことができる。
【００４０】
すなわち、音声片表の音声片番号は、この場合、Ａ１〜Ａ４の４つが存在するため、２ビットであらわすことができ、位置ｐ１は、それぞれの音声片の長さを１６msecとすれば１２８サンプリング点（サンプリング周波数が８ｋＨｚであるとする）であるため、７ビットで表すことができる。また、音声レベルの高さを合わせるために、たとえば、１２８段階で調整するとすれば、やはり７ビットで表すことができる。したがって、これらを合計すると、１６ビット、つまり、２バイトのデータとして表現できる。
【００４１】
これに対して、処理対象音声片ｈ１は、各サンプリング点それぞれに２バイト程度のデータ量があるとすれば、サンプリング点の数が３２個であると、６４バイトのデータ量が存在することになる。したがって、符号化後のデータ量は、元のデータに対して、１／３２となる。
【００４２】
また、スペクトル包絡パラメータを使用する場合は、そのデータとして、4.5 バイト程度必要である。ただし、一般には、音声は急激に変化することは少ないので、処理対象音声片それぞれが４msec程度として考えた場合、スペクトル包絡パラメータの変化は緩やかであり、処理対象音声片の１０個に１回程度の頻度でスペクトル包絡パラメータを抽出することで十分な精度が得られる、したがって、スペクトル包絡パラメータを加えたとしても、その符号化データは元のデータに対して大幅に圧縮されたデータとすることができる。
【００４３】
このように、本発明では、処理そのものは単純であり、しかも効率のよい音声データの圧縮が可能となる。
【００４４】
次に本発明の具体的な実施の形態について説明する。
【００４５】
図３は本発明の実施の形態の処理手順を説明するフロ−チャ−トである。図３において、まず、入力音声から４msec程度の処理対象音声片ｈ１を切り出す（ステップｓ１）。この処理は、前述の図１により説明した処理である。そして、スペクトル包絡パラメータを抽出するか否かを判断し（ステップｓ２）、スペクトル包絡パラメータを必要とする場合は、スペクトル包絡パラメータの抽出を行う（ステップｓ３）。なお、前述したように、音声は急激に変化することは少ないので、切り出される処理対象音声片それぞれが４msec程度として考えた場合、スペクトル包絡パラメータの変化は緩やかである。したがって、処理対象音声片の１０個に１回程度の頻度でスペクトル包絡パラメータを抽出することで十分な精度が得られる。
【００４６】
そして、次のステップｓ４において、その時点における音声片表を参照して、最も類似度の高い類似部分を有する音声片を選択する。たとえば、或る時点における処理対象音声片ｈ１に対して、その時点の音声片表の内容が図２に示す内容であったとすると、処理対象音声片ｈ１は、音声片表の音声片Ａ２の位置ｐ１からの部分が最も類似していると判定され、その音声片Ａ２が類似部分を有する音声片として選択される。
【００４７】
次に、選択された音声片Ａ２についてのデータ（音声片番号、位置、音声レベルを合わせるための倍率）などに基づいて符号化処理を行う（ステップｓ５）。
【００４８】
そして、圧縮処理が終了であるか否かを判断して（ステップｓ６）、圧縮処理が終了であれば、ステップｓ５にて符号化処理した符号化データを出力し（ステップｓ７）、入力音声についてすべての圧縮処理が終了か否かを判断して（ステップｓ８）、終了であれば処理を終了とし、まだ、終了していなければ、ステップｓ１に戻る。
【００４９】
一方、ステップｓ６において、圧縮処理終了でなければ、伸張処理（ステップｓ９）、残差生成処理（ステップｓ１０）を行ったのち、ステップｓ４に処理が戻り、ステップｓ４からステップｓ１０で形成されるループ処理を行う。以下、このループ処理について説明する。
【００５０】
前述したように、たとえば、処理対象音声片ｈ１に対して音声片表の音声片Ａ２の位置ｐ１からの部分が最も類似していると判定され、その類似部分を有する音声片Ａ２が選択されたとする。そして、選択された音声片Ａ２についてのデータ（音声片番号、位置、音声レベルを合わせるための倍率）などに基づいて符号化処理を行う。この段階で圧縮処理を終了としないで、同じ処理を何回か繰り返す。つまり、ステップｓ５において符号化されたあと、符号化されたデータを、一旦、伸張処理し（ステップｓ９）、その後、残差生成処理を行う（ステップｓ１０）。
【００５１】
この残差生成処理というのは、符号化されて伸張された音声データを、元の入力音声（この場合、処理対象音声片ｈ１）から差し引いて、その差分を取る処理である。つまり、図４に示すように、処理対象音声片ｈ１から伸張処理された音声データＨ１を引いて、その残差ｄ１を求める。そして、求められた残差ｄ１について、その時点における音声片表を参照して、最も類似度の高い部分（類似部分）を有する音声片を選択するという処理を行う。このような処理を1回以上行うことにより、より一層、高精度な圧縮データが得られるが、２回程度でも十分な精度が得られる。
【００５２】
ところで、ステップｓ９にて行われる伸張処理は、図５のフロ−チャ−トに示されるような処理手順にて行われる。
【００５３】
すなわち、符号化されたデータを入力し（ステップｓ１１）、スペクトル包絡パラメータの更新か否かを判断する（ステップｓ１２）。つまり、スペクトル包絡パラメータが抽出されている場合は、これまでのスペクトル包絡パラメータの値を新たなスペクトル包絡パラメータの値に更新する（ステップｓ１３）。
【００５４】
次に、その時点における音声片表を参照して、符号化データに基づいて最も類似度の高い部分（類似部分）を有する音声片を選択する（ステップｓ１４）。そして、選択された音声片データに基づいて伸張データを作成する（ステップｓ１５）。そして、処理が終了したか否かを判断する（ステップｓ１６）。処理終了でなければ、ステップｓ１５にて伸張処理されたデータを用いて、それまでの音声片表の内容を、この新たな音声片によって更新する（ステップｓ１７）。
【００５５】
そして、さらに符号化データが存在すれば、その符号化データに対して、同様の処理が行われる。
【００５６】
なお、この伸張処理は、図３の処理手順の一つとしてだけ用いられるのではなく、伸張処理単独でも用いられる。たとえば、符号化されたデータが所定のメモリに蓄えられている場合、その符号化されたデータを伸張処理する場合にも用いられる。
【００５７】
このようにして伸張処理が終了すると、図３のフローチャートにおいては、残差生成を行う（ステップｓ１０）。つまり、前述したように、図４に示すように、音声片ｈ１から伸張処理された音声データＨ１を引いて、その残差ｄ１を求める。そして、求められた残差ｄ１について、その時点における音声片表（伸張処理後に新たに更新された音声片表）を参照して、最も類似度の高い部分（類似部分）を有する音声片を選択するという処理を行う。このような処理を1回以上行うことにより、より一層、高精度な圧縮データが得られるが、前述の如く、２回程度でも十分な精度が得られる。
【００５８】
ところで、以上の処理で用いられる音声片表は、少なくとも以下に示す要素により作成された音声片を含むものである。
【００５９】
（１）現在、切り出された処理対象音声片に対し、すでに圧縮伸張処理された音声データ（処理対象音声片に対し、時間的に後方の圧縮伸張処理された音声データ）を用いる。なお、ここでは、すでに過ぎ去った時間を時間的に後方といい、これから先の時間を時間的に前方という表現を用いる。
【００６０】
たとえば、入力音声が図６（ａ）であるとし、ある時刻ｔ１までの入力音声がすでに圧縮伸張処理され、その圧縮伸張処理された音声波形が図６（ｂ）のようであったとする。そして、現在、処理対象音声片がｈ１であったとすると、その処理対象音声片ｈ１に対しては、図６（ｂ）に示す圧縮伸張された音声波形の所定部分（処理対象音声片ｈ１に対する直前の圧縮伸張された音声波形）を音声片として用いる。これは、図２に示す音声片表においては、たとえば、Ａ２の音声片に相当する。なお、その音声片の時間的な長さは、１６msec程度とする。
【００６１】
（２）処理対象音声片の近傍のスペクトル包絡パラメータより推定される時間的前方予測音声波形およびそれと連続する時間的後方予測音声波形を用いる。
【００６２】
前にも述べたように、スペクトル包絡パラメータは、切り出された音声片ごとに送る必要はない。これは、音声は急激には変化することは殆どないと考えられるためであり、たとえば、数個から十数個の処理対象音声片に対して１回というような割合でスペクトル包絡パラメータを送ればよい。そういう意味で、ここでは、処理対象音声片の“近傍”のスペクトル包絡パラメータという表現を用いている。
【００６３】
なお、この現在処理対象音声片の近傍のスペクトル包絡パラメータより推定される時間的前方予測音声波形およびそれと連続する時間的後方予測音声波形というのは、図７に示すように、インパルス応答（前方予測音声波形）ｘ１に加えて、時間的に後方の後方予測音声波形ｘ２を指している。
【００６４】
このように、インパルス応答（前方予測音声波形）に加えて、時間的に後方の後方予測音声波形を用いると、雑音の低減を図れる効果がある。すなわち、インパルス応答（前方予測音声波形）のみを用いた音声片とした場合、音声レベルが殆ど０の状態から急激に波形が立ち上がった音声片となってしまうため、その音声片を用いて圧縮伸張処理したとき、不連続点が生じることによってその部分が雑音となって現れるという問題点がある。これに対して、時間的に後方の後方予測音声波形を用いると不連続点を限りなく小さくすることができ、圧縮伸張音声の品質を大幅に改善できる。
【００６５】
（３）雑音波形を用いる。
【００６６】
この雑音波形は乱数で与えられたものでもよく、また、実際の入力音声中からサンプル化されたものを用いてもよい。
【００６７】
以上のように、本発明で使用する音声片表の内容としては、（１）〜（３）で説明した音声片を少なくとも含むものとする。そして、これら各音声片は１６msec程度の長さの音声片として、たとえば、図２に示すような状態で保持され、常に、最新のデータが蓄えられる。
【００６８】
図８は本発明の音声圧縮伸張装置の構成を示すブロック図である。図８において、音声入力部１から入力された音声は、音声切り出し部２によって、前述したように、たとえば、４msec程度の処理対象音声片として切り出される。この切り出された処理対象音声片は、類似度判定部３によって、音声片表４内の幾つかの音声片Ａ１，Ａ２，・・・，Ａｎと比較され類似度を得る。そして、音声片選択部５によって最も類似度の高い部分（類似部分）を有する音声片が選択される。
【００６９】
符号化部６は、選択された音声片についてのデータ（音声片番号、位置、音声レベルを合わせるための倍率）などに基づいて符号化処理を行う。なお、この段階で符号化処理を終了とすれば、その符号化データを符号化データ出力部７から出力する。また、このとき、スペクトル包絡パラメータを用いる場合は、スペクトル包絡パラメータ抽出部８によって抽出されたスペクトル包絡パラメータを加えた符号化処理を行う。
【００７０】
一方、符号化処理終了でなければ、符号化部６で符号化された符号化データを伸張部９によって伸張処理し、残差生成部１０にて残差生成処理を行う。この伸張処理と残差生成処理は図３におけるフローチャートのステップｓ９とステップｓ１０の処理である。
【００７１】
この残差生成処理というのは、前述したように、符号化されて伸張された音声データを元の入力音声（この場合、処理対象音声片ｈ１）から差し引いて、その差分を取る処理である。つまり、図４に示すように、音声片ｈ１から伸張処理された音声データＨ１を引いて、その残差ｄ１を求めるものである。そして、求められた残差について、その時点における音声片表を参照して、最も類似度の高い類似部分音声片を選択するという処理を行う。
【００７２】
なお、前記伸張部９にて行われる伸張処理は、図５のフロ−チャ−トに示されるような処理手順にて行われる。そして、伸張処理された音声データを用いて、音声片更新部１１が音声片表４の内容の更新を行う。また、この音声片更新部１１は、スペクトル包絡パラメータ抽出部８からスペクトル包絡パラメータが抽出された場合は、そのスペクトル包絡パラメータにより推定される時間的前方予測音声波形およびそれと連続する時間的な後方予測音声波形をも更新する。このようにして、音声片表４の内容は常に最新の音声片が格納されることになる。
【００７３】
このような構成の音声圧縮伸張装置の全体的な動作については、図３のフローチャートで説明したので、ここではその動作についての説明は省略する。
【００７４】
なお、本発明は前述の実施の形態に限定されるものではなく、本発明の要旨を逸脱しない範囲で種々変形実施可能となるものである。たとえば、切り出される処理対象音声片は、前述の実施の形態では、４msecとしたが、これは、前述の実施の形態において使用したシステムでは、４msecとすることで最もよい結果が得られたからである。しかし、使用するシステムなどによっては、この数値は異なる場合もあるので、これに限定されるものではなく、本発明が適用されるシステムに応じて最適な時間を設定することができる。また、図２で示した音声片表の内容は一例であって、これに限られるものではない。
【００７５】
また、以上説明した本発明の音声圧縮伸張処理を行う処理プログラムは、フロッピィディスク、光ディスク、ハードディスクなどの記憶媒体に記憶させて置くことが出来、本発明は、これらの記憶媒体をも含むものであり、また、ネットワークからデータを得る方式でもよい。
【００７６】
以上説明したように、本発明によれば、音声片表内のそれぞれの音声片と入力音声から切り出した所定区間の音声片との類似性を比較し、最も類似度の高い音声片を選択し、その選択された音声片についてのデータを基に前記切り出した所定区間の音声片を符号化する処理を基本処理として行うようにしている。これにより、符号化がきわめて単純な処理で可能となる。
【００７７】
また、符号化データを作成したのち、その符号化データの伸張処理を行い、伸張されたデータを前記処理対象音声片から差し引いて得られた残差波形に対して、再び、音声片表を参照し、類似性を求めるという処理を複数回行って符号化データを得ることにより、より一層、高品質な符号化データを得ることができる。
【００７８】
また、音声片表に格納される音声片は、処理対象音声片よりも時間的に後方のすでに圧縮伸張処理された音声波形を用いて作成された音声片、スペクトル包絡パラメータにより推定される時間的前方予測音声波形と時間的後方予測音声波形を用いて作成された音声片、雑音成分により作成された音声片を少なくとも有することで、入力音声を符号化する際、効率よく、しかも高品質な符号化が可能となる。特に、スペクトル包絡パラメータにより推定される予測音声波形により音声片を作成する場合、本発明では、スペクトル包絡パラメータにより推定される時間的前方予測音声波形に加えて、時間的に後方の後方予測音声波形を用いているので、雑音の低減が図れ、音声の品質を大幅に改善できる。
【００７９】
また、それぞれの音声片は、符号化されたデータの伸張処理後あるいはスペクトル包絡パラメータの抽出後にその内容が更新されるようにしているので、従来のように、固定的な内容のコードブックとは異なり、処理対象音声片に対して、常に、最適な音声片が格納されることになり、高品質な符号化が可能となる。
【００８０】
また、前記符号化されたデータは、類似部分音声片を有する音声片番号、その音声片内のどの部分であるかを表す位置データ、振幅調整用のパラメータで表されるデータに、場合によっては、スペクトル包絡パラメータをも加えたデータで表すことができ、大幅なデータ圧縮が可能となる。
【００８１】
このように、本発明は、処理内容が単純でしかも効率よく高品質な音声圧縮伸張が可能となり、ハードウエア化や並列処理化を行う際にきわめて有利なものとすることができる。
【図面の簡単な説明】
【図１】本発明の実施の形態を説明するために入力音声を所定の区間切り出した例を示す図。
【図２】本発明の実施の形態における音声片表の一例を示す図。
【図３】本発明の実施の形態の処理手順を説明するフローチャート。
【図４】本発明の実施の形態における残差成分を求める処理を説明する図。
【図５】本発明の実施の形態における伸張処理手順を説明するフローチャート。
【図６】本発明の実施の形態における音声片表内の音声片を伸張処理後の音声波形より作成する例を説明する図。
【図７】本発明の実施の形態における音声片表内の音声片をスペクトル包絡パラメータより推定される時間的前方予測音声波形と時間的後方予測音声波形より作成する例を説明する図。
【図８】本発明の実施の形態における音声圧縮伸張装置の構成を示すブロック図。
【符号の説明】
１音声入力部
２音声片切り出し部
３類似度判定部
４音声片表
５音声片選択部
６符号化部
７符号化データ出力部
８スペクトル包絡パラメータ抽出部
９伸張部
１０残差生成部
１１音声片更新部
ｈ１処理対象音声片
Ａ１，Ａ２，Ａ３，Ａ４音声片表内に格納された音声片
ｐ１音声片における類似部分音声の位置

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an audio compression / decompression method and apparatus for efficiently compressing / decompressing an audio signal by simple processing, and a storage medium storing an audio compression / decompression processing program.
[0002]
[Prior art]
Conventionally, various methods have been proposed as an encoding method for compressing and expanding an audio signal. One of them is JP-A-59-116973 (hereinafter referred to as the first prior art).
[0003]
This first prior art has means for dividing input speech data into short time intervals to obtain a short-time speech signal sequence; spectral envelope parameter extraction means for extracting spectral envelope parameters from the short-time speech signal sequence; impulse response sequence calculation means for calculating an impulse response sequence based on the spectral envelope parameters; means for calculating an autocorrelation function sequence using the impulse response sequence; means for calculating a cross-correlation function sequence using the impulse response sequence and the short-time speech signal sequence; means for calculating and encoding a driving sound source signal sequence using the autocorrelation function sequence and the cross-correlation function sequence; and means for combining and outputting a spectral envelope code and a driving sound source signal. It further has target signal calculation means for applying a predetermined correction to the short-time speech signal.
[0004]
According to the first prior art, the position and gain of the driving sound source pulses can be determined efficiently when encoding speech, and some reduction in the amount of calculation and the amount of memory used is also obtained.
[0005]
However, in the first prior art, when speech synthesis is performed after encoding a speech signal such as a female voice, it is necessary to extract a large number of driving sound source pulses in order to obtain high-quality speech synthesis. For this reason, there is a problem that the compression rate is deteriorated.
[0006]
That is, a female voice is more complex than a male voice, and in order to obtain a highly accurate synthesized sound, it is necessary to extract a large number of driving sound source pulses, which ultimately results in a poor compression rate.
[0007]
On the other hand, as techniques for obtaining a high compression rate, there are JP-A-63-37399 (hereinafter referred to as the second prior art) and JP-A-3-4300 (hereinafter referred to as the third prior art).
[0008]
In the second conventional technique, pitch estimation is performed from a speech signal, a residual between an estimated value from a past pulse train and an actual signal is obtained, and a drive sound source pulse is calculated from the residual.
[0009]
In the third prior art, pitch estimation is performed, and a driving sound source (multi-pulse) for one pitch section is estimated. The other pitch sections are then approximated by correcting the gain and phase of that multi-pulse. Further, a second multi-pulse is estimated from the residual between the estimated value and the actual value. In addition to the multi-pulse signal, a noise codebook may be used.
[0010]
[Problems to be solved by the invention]
The second and third prior arts described above find the period at which the same waveform repeats, estimate the next period from the previous one, calculate the difference between the estimated part and the actual speech waveform, and calculate the driving sound source from that difference, so a high compression rate can be realized.
[0011]
However, since it is necessary to obtain the pitch or the difference, there is a problem that the calculation amount is large, and a memory having a large capacity is required to store the data.
[0012]
In addition, since the residual is obtained and the driving sound source pulse is calculated from this residual, if part of the data is lost, the lost part greatly affects all subsequent calculations, and high-accuracy speech synthesis can no longer be performed, which is a major problem.
[0013]
As described above, each of the conventional techniques has its own problems. For example, the first prior art is a basic technique for obtaining driving sound source pulses, but improving the quality of the synthesized sound requires placing many driving sound source pulses, so the compression rate becomes particularly poor for speech data such as a female voice. The second and third prior arts achieve a high compression rate, but they require a large amount of calculation and memory, and because they use difference information they are also vulnerable to data loss.
[0014]
Recently, portable information devices that handle audio data have come into use in a wide range of fields. Because the CPU speed and memory capacity of this type of device are severely limited, a large amount of calculation and memory usage is a serious problem. In addition, methods that use difference information cause many problems for product performance in information devices that must allow for data loss. This is not limited to portable devices: in real-time transmission over a computer network as well, data loss greatly affects the transmitted data.
[0015]
As described above, the conventional speech coding methods share the problem that their processing is complicated, which makes hardware implementation and speed-up by parallel processing relatively difficult. In particular, methods that include a process for obtaining the pitch period require a large amount of calculation and are strongly affected when an error occurs. Further, the conventional method that uses an impulse response derived from spectral envelope parameters together with drive pulses produces discontinuities before and after each pulse, which appear as noise.
[0016]
Therefore, an object of the present invention is to provide an audio compression / decompression method and apparatus, and a storage medium storing an audio compression / decompression processing program, whose processing is simple, which can easily be implemented in hardware and parallelized, and which enable efficient encoding and audio data compression at a relatively high compression rate.
[0017]
The speech compression / decompression method of the present invention is characterized by including processing that cuts out a speech piece of a predetermined section from input speech; extracts, at a predetermined frequency, spectral envelope parameters from the cut-out speech piece; refers to a plurality of types of speech pieces including a first speech piece created using a temporally forward predicted speech waveform estimated from the extracted spectral envelope parameters and a temporally backward predicted speech waveform continuous with it; compares the similarity between the plurality of types of speech pieces and the cut-out speech piece; selects the speech piece with the highest similarity from the plurality of types of speech pieces; and encodes the cut-out speech piece based on data about the selected speech piece to create encoded data.
[0018]
Further, after the encoded data is created, the encoded data is decompressed, the decompressed data is subtracted from the cut-out speech piece of the predetermined section to obtain a residual, and, for the waveform of the residual, the plurality of types of speech pieces are referred to and their similarity to the residual waveform is compared, whereby encoded data is obtained.
[0019]
The plurality of types of speech pieces include a second speech piece created using the decompressed speech waveform temporally behind the cut-out speech piece of the predetermined section, and a third speech piece created from a noise component. The content of the first speech piece is updated after the spectral envelope parameters are extracted, and the content of the second speech piece is updated, after the encoded data created using the second speech piece has been decompressed, based on the decompressed data.
[0020]
In addition, the plurality of types of speech pieces each span a longer time than the cut-out speech piece of the predetermined section. When similarity is compared, the cut-out speech piece is compared against each speech piece over the range of that speech piece's length, and the speech piece containing the part with the highest similarity is selected.
[0021]
The encoded data consists of the number of the speech piece containing the part with the highest similarity, position data indicating which part of that speech piece it is, and an amplitude adjustment parameter, with the spectral envelope parameters added where necessary.
[0022]
The speech compression / decompression apparatus of the present invention is characterized by having: a speech piece cutout unit that cuts out a speech piece of a predetermined section from input speech; a spectral envelope parameter extraction unit that extracts, at a predetermined frequency, spectral envelope parameters from the speech piece cut out by the speech piece cutout unit; a plurality of types of speech pieces including a first speech piece created using a temporally forward predicted speech waveform estimated from the extracted spectral envelope parameters and a temporally backward predicted speech waveform continuous with it; a similarity determination unit that refers to the plurality of types of speech pieces and compares them with the cut-out speech piece to obtain a similarity; a speech piece selection unit that selects the speech piece with the highest similarity based on the similarity obtained by the similarity determination unit; and an encoding unit that encodes the cut-out speech piece based on data about the speech piece selected by the speech piece selection unit.
[0023]
The apparatus further has a decompression unit that decompresses the data encoded by the encoding unit, and a speech piece update unit that updates the content of the corresponding speech piece using the data decompressed by the decompression unit or the spectral envelope parameters.
[0024]
The apparatus also has a residual generation unit that obtains a residual by subtracting the data decompressed by the decompression unit from the cut-out speech piece of the predetermined section. The similarity determination unit, the speech piece selection unit, the encoding unit, the decompression unit, and the residual generation unit form a loop in processing order; for the residual waveform generated by the residual generation unit, the plurality of types of speech pieces are referred to and compared for similarity, after which encoded data is created and output.
[0025]
Further, the plurality of types of speech pieces include a second speech piece created using the decompressed speech waveform temporally behind the cut-out speech piece of the predetermined section, and a third speech piece created from a noise component. The speech piece update unit updates the first speech piece after the spectral envelope parameters are extracted, and, after the encoded data created using the second speech piece has been decompressed, updates the second speech piece based on the decompressed data.
[0026]
Further, the plurality of types of speech pieces each span a longer time than the cut-out speech piece of the predetermined section; the similarity determination unit compares similarity over the range of the length of each of the plurality of types of speech pieces to obtain the similarity, and the speech piece selection unit selects the speech piece containing the part with the highest similarity.
[0027]
Further, the encoded data is represented by the number of the speech piece containing the part with the highest similarity, position data indicating that part, and an amplitude adjustment parameter, and additionally includes the spectral envelope parameters where necessary.
[0028]
Further, the recording medium of the present invention is a storage medium storing an audio compression / decompression processing program for causing a computer to execute processing that cuts out a speech piece of a predetermined section from input speech; extracts, at a predetermined frequency, spectral envelope parameters from the cut-out speech piece; refers to a plurality of types of speech pieces including a first speech piece created using a temporally forward predicted speech waveform estimated from the extracted spectral envelope parameters and a temporally backward predicted speech waveform continuous with it; compares the similarity between the plurality of types of speech pieces and the speech piece of the predetermined section; selects the speech piece with the highest similarity; and encodes the speech piece of the predetermined section based on data about the selected speech piece to create encoded data.
[0029]
As described above, the basic processing of the present invention is to compare the similarity between each of a plurality of types of speech pieces and a speech piece of a predetermined section cut out from the input speech (for example, a speech piece about 4 msec long), select the speech piece with the highest similarity, and encode the cut-out speech piece based on data about the selected speech piece. As a result, encoding can be performed with extremely simple processing, which is advantageous for hardware implementation and parallel processing.
In particular, when a predicted speech waveform estimated from spectral envelope parameters is used, it has been common to use only the temporally forward predicted speech waveform (impulse response). In the present invention, by contrast, a speech piece is created using the temporally forward predicted speech waveform estimated from the spectral envelope parameters and a temporally backward predicted speech waveform continuous with it.
Using the temporally backward predicted speech waveform in addition to the forward predicted speech waveform in this way has the effect of reducing noise. That is, a speech piece that uses only the impulse response (forward predicted speech waveform) rises abruptly from a state where the speech level is almost 0, so when compression / decompression is performed with that speech piece, a discontinuity arises and appears as noise. In contrast, when the temporally backward predicted speech waveform is also used, the discontinuity can be made extremely small, and the quality of the compressed and decompressed speech can be greatly improved.
[0030]
Also, after the encoded data is created, the following processing is performed one or more times to obtain the encoded data: the encoded data is decompressed, the decompressed data is subtracted from the cut-out speech piece of the predetermined section to generate a residual, and the plurality of types of speech pieces are referred to again to find the similarity for the residual waveform. This yields encoded data of even higher accuracy.
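The encode, decompress, subtract, and search-again loop described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the names `best_match`, `decode`, and `encode_multistage` are hypothetical, the similarity measure (squared error removed by an optimally scaled candidate) is an assumption because the text only speaks of "similarity", and the table is held fixed between passes for simplicity.

```python
def best_match(target, table):
    """Return (piece_index, position, gain) of the most similar part.

    Assumed similarity score: <t,c>^2 / <c,c>, i.e. the squared error that
    a scaled copy of candidate c removes from target t.
    """
    n = len(target)
    best, best_score = (0, 0, 0.0), -1.0
    for idx, piece in enumerate(table):
        for pos in range(len(piece) - n + 1):
            cand = piece[pos:pos + n]
            energy = sum(c * c for c in cand)
            if energy == 0:
                continue  # an all-zero candidate cannot match anything
            corr = sum(t * c for t, c in zip(target, cand))
            score = corr * corr / energy
            if score > best_score:
                best_score = score
                best = (idx, pos, corr / energy)
    return best


def decode(codes, table, n):
    """Decompress: sum the scaled table parts named by each code."""
    out = [0.0] * n
    for idx, pos, gain in codes:
        for i in range(n):
            out[i] += gain * table[idx][pos + i]
    return out


def encode_multistage(segment, table, stages=2):
    """Encode a segment, then re-encode the residual (stages - 1) times."""
    residual = list(segment)
    codes = []
    for _ in range(stages):
        code = best_match(residual, table)
        codes.append(code)
        approx = decode([code], table, len(segment))
        # Residual generation: subtract the decompressed data.
        residual = [r - a for r, a in zip(residual, approx)]
    return codes, residual


# Toy table with two short pieces (real pieces would be far longer).
table = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
codes, residual = encode_multistage([3.0, 2.0, 0.0], table, stages=2)
print(codes, residual)
```

With the optimal gain used here, each extra pass can only leave the residual energy unchanged or lower it, which is consistent with the claim that repeating the process gives higher accuracy at the cost of one more code per pass.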
[0031]
Further, because the plurality of types of speech pieces include a second speech piece created using the decompressed speech waveform temporally behind the cut-out speech piece of the predetermined section and a third speech piece created from a noise component, the input speech can be encoded efficiently and with high accuracy.
[0033]
In addition, the content of the first speech piece is updated after the spectral envelope parameters are extracted, and the content of the second speech piece is updated, after the encoded data created using the second speech piece has been decompressed, based on the decompressed data. Unlike a conventional codebook with fixed contents, the speech pieces best suited to the cut-out speech piece of the predetermined section are therefore always stored, and high-quality encoding is possible.
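One plausible way to keep the second speech piece current is a sliding window over the most recently decompressed samples. The sketch below assumes that mechanism; the text only says the piece is updated from the decompressed data. The length of 128 samples corresponds to the 16 msec piece length at 8 kHz given in paragraph [0040], and `update_history` is a hypothetical name.

```python
from collections import deque

HISTORY_LEN = 128  # 16 msec at 8 kHz, per paragraph [0040]

# The second speech piece: the most recent decompressed samples.
history = deque([0.0] * HISTORY_LEN, maxlen=HISTORY_LEN)


def update_history(history, decoded_segment):
    """Append newly decompressed samples; the oldest ones fall off."""
    history.extend(decoded_segment)


update_history(history, [1.0] * 32)   # one 4 msec piece of 32 samples
print(len(history), history[0], history[-1])
```

Because the encoder updates this piece from its own decompressed output rather than from the original input, a decoder applying the same update to the same codes would hold an identical table.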
[0034]
In addition, the encoded data can be represented by the number of the speech piece containing the similar part, position data indicating which part of that speech piece it is, and an amplitude adjustment parameter, with the spectral envelope parameters added in some cases. The encoded data is therefore only a few bytes, and significant data compression is possible. In general, speech rarely changes abruptly, so when each processing target speech piece is about 4 msec long, the spectral envelope parameters change slowly, and extracting them about once every 10 processing target speech pieces gives sufficient accuracy. Therefore, even with the spectral envelope parameters added, the data remains greatly compressed.
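The size of one code can be illustrated with the field widths given later in paragraphs [0040] to [0042]: 2 bits for the speech piece number, 7 bits for the position, 7 bits for the amplitude step, and about 4.5 bytes for the spectral envelope parameters. The sketch below packs those fields into one 16-bit word and works out the resulting ratios. The field order and the names `pack_code` and `unpack_code` are assumptions, and the figures cover a single search pass only.

```python
# Field widths follow paragraph [0040]; the field order is an assumption.
ENTRY_BITS, POS_BITS, GAIN_BITS = 2, 7, 7


def pack_code(entry, position, gain_step):
    """Pack (piece number, position, gain step) into one 16-bit word."""
    if not 0 <= entry < (1 << ENTRY_BITS):
        raise ValueError("entry out of range")
    if not 0 <= position < (1 << POS_BITS):
        raise ValueError("position out of range")
    if not 0 <= gain_step < (1 << GAIN_BITS):
        raise ValueError("gain_step out of range")
    return (entry << (POS_BITS + GAIN_BITS)) | (position << GAIN_BITS) | gain_step


def unpack_code(word):
    """Inverse of pack_code."""
    gain_step = word & ((1 << GAIN_BITS) - 1)
    position = (word >> GAIN_BITS) & ((1 << POS_BITS) - 1)
    entry = word >> (POS_BITS + GAIN_BITS)
    return entry, position, gain_step


RAW_BYTES = 32 * 2        # 32 samples x 2 bytes per 4 msec piece ([0041])
CODE_BYTES = 2            # 2 + 7 + 7 = 16 bits
ENVELOPE_BYTES = 4.5      # spectral envelope parameters ([0042])
ENVELOPE_EVERY = 10       # sent about once per 10 pieces

# Average bytes per piece when the envelope cost is spread over 10 pieces.
avg_bytes = CODE_BYTES + ENVELOPE_BYTES / ENVELOPE_EVERY

word = pack_code(1, 3, 64)
print(hex(word), unpack_code(word))
print(RAW_BYTES / CODE_BYTES, round(RAW_BYTES / avg_bytes, 1))
```

Under these assumptions the code alone gives the 1/32 ratio stated in the description, and amortizing the envelope parameters over 10 pieces lowers it only modestly. Each additional residual pass would add another 2-byte code per piece.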
[0035]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below. Before describing a specific embodiment, first, the basic processing contents of the embodiment of the present invention will be described.
[0036]
FIG. 1 shows an input speech waveform. A speech piece of, for example, about 4 msec is cut out from such an input speech waveform. The cut-out speech piece (hereinafter referred to as the processing target speech piece) h1 is compared with the speech segments stored in the speech segment table, the speech segment with the highest similarity is selected from the table, and encoded data is created using the selected speech segment. The processing target speech piece is set to 4 msec because, in the system used in this embodiment, the best result was obtained by cutting out pieces of about 4 msec. That is, if the processing target speech piece is shorter than 4 msec, the sound quality improves but the compression rate falls; if it is longer than 4 msec, the compression rate improves but the sound quality may deteriorate.
[0037]
The speech segment table mentioned here holds speech segments created from a plurality of elements, as shown in FIG. 2 (in this example, four speech segments A1 to A4). How they are created will be described later. Note that the latest speech segments are always stored in the speech segment table; FIG. 2 shows the contents of the table at a certain point in time.
[0038]
Now, assuming that the speech segment table shown in FIG. 2 holds the latest contents, it is determined which part of which speech segment in the table is most similar to the processing target speech piece h1 of about 4 msec cut out in FIG. 1. In this case, it is determined that the processing target speech piece h1 is most similar to the portion starting at position p1 of speech segment A2 in the speech segment table. This most similar portion is called the similar part.
[0039]
As a result, the encoded data of the processing target speech piece h1 can be represented by the speech segment number A2 in the speech segment table, the position p1, and the magnification for matching the speech level.
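The search described above can be sketched as follows. This is a minimal illustration, not the method of the specification: the specification does not define the similarity measure, so the sketch assumes a least-squares gain and takes the smallest squared error as "most similar". The function name `best_match` and the list-based table layout are hypothetical.

```python
from typing import Sequence, Tuple


def best_match(target: Sequence[float],
               table: Sequence[Sequence[float]]) -> Tuple[int, int, float]:
    """Return (segment index, start position, gain) of the most similar part.

    Assumption: similarity is measured as the squared error that remains
    after scaling the table window by its least-squares gain.
    """
    n = len(target)
    best = (0, 0, 0.0)
    best_err = float("inf")
    for seg_idx, seg in enumerate(table):
        # Slide a window of the target's length over the longer table segment.
        for pos in range(len(seg) - n + 1):
            window = seg[pos:pos + n]
            energy = sum(w * w for w in window)
            if energy == 0.0:
                continue  # a silent window cannot be scaled to match anything
            corr = sum(t * w for t, w in zip(target, window))
            gain = corr / energy  # magnification that matches the level
            err = sum((t - gain * w) ** 2 for t, w in zip(target, window))
            if err < best_err:
                best_err = err
                best = (seg_idx, pos, gain)
    return best
```

With 128-sample table segments and a 32-sample target, the start position ranges from 0 to 96, so it fits in the 7-bit position field discussed in the following paragraphs.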
[0040]
That is, since there are four speech segment numbers A1 to A4 in this case, the number can be represented by 2 bits. If each speech segment is 16 msec long, it contains 128 sampling points (assuming a sampling frequency of 8 kHz), so the position p1 can be represented by 7 bits. If the speech level is adjusted in, for example, 128 steps, the magnification can also be represented by 7 bits. Adding these up, the encoded data can be expressed as 16 bits, that is, 2 bytes.
[0041]
On the other hand, if the processing target speech piece h1 has about 2 bytes of data at each sampling point and contains 32 sampling points, it amounts to 64 bytes of data. Therefore, the amount of data after encoding is 1/32 of the original data.
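The bit budget above (2 + 7 + 7 = 16 bits) can be illustrated with a small packing sketch. The field order within the 16-bit word and the function names `pack_code` and `unpack_code` are assumptions made for illustration; the specification only gives the field widths.

```python
from typing import Tuple


def pack_code(segment: int, position: int, level: int) -> int:
    """Pack a 2-bit segment number, 7-bit position and 7-bit level into 16 bits.

    Assumed layout (not given in the specification):
    bits 15-14 segment, bits 13-7 position, bits 6-0 level.
    """
    if not (0 <= segment < 4 and 0 <= position < 128 and 0 <= level < 128):
        raise ValueError("field out of range")
    return (segment << 14) | (position << 7) | level


def unpack_code(word: int) -> Tuple[int, int, int]:
    """Inverse of pack_code."""
    return (word >> 14) & 0x3, (word >> 7) & 0x7F, word & 0x7F


RAW_BYTES = 32 * 2      # 32 sampling points at about 2 bytes each
CODE_BYTES = 16 // 8    # one packed 16-bit code
RATIO = RAW_BYTES // CODE_BYTES  # 32, i.e. the encoded data is 1/32 of the original
```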
[0042]
If spectrum envelope parameters are used, about 4.5 bytes are required for them. However, speech generally does not change rapidly, so when each processing target speech piece is about 4 msec long, the spectrum envelope parameter changes gradually, and sufficient accuracy can be obtained by extracting it about once every 10 processing target speech pieces. Therefore, even when the spectrum envelope parameter is added, the encoded data remains significantly compressed relative to the original data.
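As a rough check of this claim, the envelope cost can be spread over the pieces that share it. The sketch below assumes one 2-byte code per piece (a single search pass) and uses the figures quoted above; the function name is hypothetical.

```python
def bytes_per_piece(code_bytes: float,
                    envelope_bytes: float,
                    pieces_per_envelope: int) -> float:
    """Average encoded size per piece when one envelope serves several pieces."""
    return code_bytes + envelope_bytes / pieces_per_envelope


RAW_BYTES = 32 * 2                            # 64 bytes per 4 msec piece
per_piece = bytes_per_piece(2.0, 4.5, 10)     # 2 + 0.45 bytes
ratio = RAW_BYTES / per_piece                 # roughly 26
```

Under these assumptions the encoded data is still roughly 1/26 of the original. Each additional residual pass would add another code per piece and lower the ratio accordingly.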
[0043]
As described above, in the present invention, the processing itself is simple, and the audio data can be efficiently compressed.
[0044]
Next, specific embodiments of the present invention will be described.
[0045]
FIG. 3 is a flowchart for explaining the processing procedure of the embodiment of the present invention. In FIG. 3, first, a processing target speech piece h1 of about 4 msec is cut out from the input speech (step s1). This is the process described with reference to FIG. 1. Then, it is determined whether or not a spectrum envelope parameter is to be extracted (step s2). If a spectrum envelope parameter is required, it is extracted (step s3). As described above, speech is unlikely to change abruptly, so when each processing target speech piece is about 4 msec long, the spectrum envelope parameter changes gradually. Therefore, sufficient accuracy can be obtained by extracting the spectrum envelope parameter about once every 10 processing target speech pieces.
[0046]
Then, in the next step s4, the speech segment table at that time is referred to, and the speech segment having the similar part with the highest similarity is selected. For example, if the content of the speech segment table at the time the processing target speech piece h1 is handled is as shown in FIG. 2, it is determined that the processing target speech piece h1 is most similar to the portion starting at position p1 of speech segment A2 in the table, and speech segment A2 is selected as the speech segment having the similar part.
[0047]
Next, an encoding process is performed based on the data (speech piece number, position, magnification for matching the speech level) of the selected speech piece A2 (step s5).
[0048]
Then, it is determined whether or not the compression process has been completed (step s6). If it has been completed, the encoded data encoded in step s5 is output (step s7), and it is determined whether or not compression of all the input speech has been completed (step s8). If so, the process ends. If not, the process returns to step s1.
[0049]
On the other hand, if the compression process is not finished in step s6, the decompression process (step s9) and the residual generation process (step s10) are performed, the process returns to step s4, and the loop formed by steps s4 to s10 is executed. This loop process is described below.
[0050]
As described above, suppose, for example, that the portion starting at position p1 of speech segment A2 in the speech segment table is determined to be most similar to the processing target speech piece h1, and speech segment A2 having that similar part is selected. Encoding is then performed based on the data of the selected speech segment A2 (speech segment number, position, magnification for matching the speech level). The compression process is not terminated at this stage; the same process is repeated several times. That is, after encoding in step s5, the encoded data is first decompressed (step s9), and then the residual generation process is performed (step s10).
[0051]
This residual generation process subtracts the encoded and then expanded speech data from the original input speech (in this case, the processing target speech piece h1) to obtain the difference. That is, as shown in FIG. 4, the decompressed speech data H1 is subtracted from the processing target speech piece h1 to obtain the residual d1. Then, for the obtained residual d1, the speech segment table at that time is referred to and the speech segment having the portion with the highest similarity (similar part) is selected. Performing this process one or more times yields highly accurate compressed data, and sufficient accuracy can be obtained with only about two repetitions.
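The loop of steps s4 to s10 can be sketched as repeated matching against the residual. This is an illustrative simplification under stated assumptions: the table is held fixed during the passes (the specification updates it after decompression), the gain is left unquantized, and the squared-error similarity and the names `best_match` and `encode_piece` are hypothetical.

```python
from typing import List, Sequence, Tuple

Code = Tuple[int, int, float]  # (segment index, position, gain)


def best_match(target: Sequence[float],
               table: Sequence[Sequence[float]]) -> Code:
    """Least-squares match of target against every window in the table."""
    n = len(target)
    best, best_err = (0, 0, 0.0), float("inf")
    for seg_idx, seg in enumerate(table):
        for pos in range(len(seg) - n + 1):
            window = seg[pos:pos + n]
            energy = sum(w * w for w in window)
            if energy == 0.0:
                continue
            gain = sum(t * w for t, w in zip(target, window)) / energy
            err = sum((t - gain * w) ** 2 for t, w in zip(target, window))
            if err < best_err:
                best, best_err = (seg_idx, pos, gain), err
    return best


def encode_piece(target: Sequence[float],
                 table: Sequence[Sequence[float]],
                 passes: int = 2) -> Tuple[List[Code], List[float]]:
    """Encode a piece in several passes, each pass coding the previous residual."""
    codes: List[Code] = []
    residual = list(target)
    for _ in range(passes):
        seg, pos, gain = best_match(residual, table)
        codes.append((seg, pos, gain))
        # "Decompress" this pass and subtract it to form the next residual.
        approx = [gain * s for s in table[seg][pos:pos + len(residual)]]
        residual = [r - a for r, a in zip(residual, approx)]
    return codes, residual
```

Because each pass uses a least-squares gain, the residual energy cannot increase from one pass to the next in this sketch.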
[0052]
The decompression process performed in step s9 follows the processing procedure shown in the flowchart of FIG. 5.
[0053]
That is, the encoded data is input (step s11), and it is determined whether or not the spectrum envelope parameter is updated (step s12). That is, when the spectrum envelope parameter is extracted, the value of the spectrum envelope parameter so far is updated to the value of the new spectrum envelope parameter (step s13).
[0054]
Next, referring to the speech segment table at that time, the speech segment having the portion with the highest similarity (similar portion) is selected based on the encoded data (step s14). Then, decompressed data is created based on the selected audio piece data (step s15). Then, it is determined whether or not the process is completed (step s16). If the processing is not finished, the contents of the speech segment table up to that point are updated with the new speech segment using the data expanded in step s15 (step s17).
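Steps s14 and s15 can be sketched as follows, assuming each code is a (segment index, position, gain) triple and that the contributions of several passes are summed. Both the tuple layout and the name `decode_piece` are assumptions for illustration; the table update of step s17 is omitted here.

```python
from typing import List, Sequence, Tuple

Code = Tuple[int, int, float]  # (segment index, position, gain)


def decode_piece(codes: Sequence[Code],
                 table: Sequence[Sequence[float]],
                 piece_len: int) -> List[float]:
    """Rebuild one piece by summing the scaled table windows named by the codes."""
    out = [0.0] * piece_len
    for seg, pos, gain in codes:
        window = table[seg][pos:pos + piece_len]
        for i, sample in enumerate(window):
            out[i] += gain * sample
    return out
```

The decoder needs no search: it only looks up the windows the encoder chose, which is why the encoder and decoder must hold identical table contents at every step.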
[0055]
If there is more encoded data, the same processing is performed on the encoded data.
[0056]
Note that this decompression process is not only used as one of the processing procedures of FIG. 3, but is also used in the decompression process alone. For example, when encoded data is stored in a predetermined memory, it is also used when the encoded data is decompressed.
[0057]
When the decompression process ends in this way, residual generation is performed in the flowchart of FIG. 3 (step s10). As described above and shown in FIG. 4, the decompressed speech data H1 is subtracted from the speech piece h1 to obtain the residual d1. Then, for the obtained residual d1, the speech segment table at that time (the table newly updated after the decompression process) is referred to, and the speech segment having the portion with the highest similarity (similar part) is selected. Performing this process one or more times yields highly accurate compressed data, and, as described above, sufficient accuracy can be obtained with only about two repetitions.
[0058]
By the way, the speech segment table used in the above processing includes at least speech segments created by the following elements.
[0059]
(1) For the processing target speech piece currently cut out, speech data that has already undergone compression / expansion processing (speech data that is temporally behind the processing target speech piece and has been compressed and expanded) is used. Here, time that has already passed is referred to as "backward in time" and future time as "forward in time".
[0060]
For example, suppose the input speech is as shown in FIG. 6A, the input speech up to a certain time t1 has already been compressed and expanded, and the compressed and expanded speech waveform is as shown in FIG. 6B. If the current processing target speech piece is h1, a predetermined portion of the compressed and expanded speech waveform shown in FIG. 6B (the compressed and expanded waveform immediately preceding the processing target speech piece h1) is used as a speech segment for h1. This corresponds, for example, to speech segment A2 in the speech segment table shown in FIG. 2. The time length of the speech segment is about 16 msec.
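One simple way to keep such a segment current is a fixed-length buffer of the most recently decoded samples. The sketch below assumes 128 samples (16 msec at 8 kHz) and a zero-filled initial state; the class name `HistorySegment` is hypothetical.

```python
from collections import deque
from typing import Iterable, List


class HistorySegment:
    """Holds the most recent decoded samples as a table segment."""

    def __init__(self, length: int = 128) -> None:
        # Assumption: the segment starts as silence before any speech is decoded.
        self._buf = deque([0.0] * length, maxlen=length)

    def push(self, decoded: Iterable[float]) -> None:
        """Append newly decoded samples; the oldest samples fall off the front."""
        self._buf.extend(decoded)

    def samples(self) -> List[float]:
        """Current segment contents, oldest sample first."""
        return list(self._buf)
```

Because the buffer is fed only with decoded output, a decoder running the same update reproduces the same segment without any extra side information.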
[0061]
(2) A temporal forward prediction speech waveform estimated from a spectral envelope parameter in the vicinity of the processing target speech segment and a temporal backward prediction speech waveform continuous therewith are used.
[0062]
As previously mentioned, the spectral envelope parameters do not need to be sent for every cut-out speech piece, because speech rarely changes abruptly. It is sufficient, for example, to send the spectral envelope parameter once for every several to a dozen or so processing target speech pieces. It is in this sense that the expression spectral envelope parameters "in the vicinity" of the processing target speech piece is used here.
[0063]
Note that the temporal forward prediction speech waveform estimated from the spectral envelope parameters in the vicinity of the current processing target speech piece, and the temporal backward prediction speech waveform continuous therewith, are as shown in FIG. 7: in addition to the impulse response (forward predicted speech waveform) x1, there is a backward predicted speech waveform x2 that lies temporally behind it.
[0064]
Using the temporally backward predicted speech waveform in addition to the impulse response (forward predicted speech waveform) has the effect of reducing noise. If a speech segment consisted of the impulse response alone, its waveform would rise suddenly from a speech level of almost 0; when such a segment is used in processing, a discontinuity appears and is heard as noise. When the temporally backward predicted speech waveform is also used, the discontinuity is reduced as far as possible, and the quality of the compressed and expanded speech is greatly improved.
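The shape of such a segment can be illustrated with linear prediction coefficients. The sketch below is an assumption-laden illustration, not the procedure of the specification: it takes the spectral envelope as an all-pole filter 1/A(z) with A(z) = 1 + a1 z^-1 + ... + ap z^-p, computes the impulse response forward in time, and extends it backward by reusing the same coefficients in a time-reversed recursion. The function name `predicted_segment` is hypothetical.

```python
from typing import List, Sequence


def predicted_segment(a: Sequence[float], back: int, fwd: int) -> List[float]:
    """Return `back` backward-predicted samples followed by `fwd` impulse-response samples.

    a holds the assumed predictor coefficients a1..ap of A(z); fwd must be >= 1.
    """
    p = len(a)
    # Forward part: impulse response h[0] = 1, h[n] = -sum_k a_k * h[n - k].
    h = [1.0]
    for n in range(1, fwd):
        h.append(-sum(a[k - 1] * h[n - k] for k in range(1, min(n, p) + 1)))
    # Backward part: x[n] = -sum_k a_k * x[n + k], computed one step at a time
    # toward the past. Index 0 of `x` is always the earliest sample so far.
    x = list(h)
    for _ in range(back):
        earlier = -sum(a[k - 1] * x[k - 1] for k in range(1, min(p, len(x)) + 1))
        x.insert(0, earlier)
    return x
```

For a single coefficient a1 = -0.5, the result with two backward and three forward samples is 0.25, 0.5, 1.0, 0.5, 0.25: the waveform now ramps up to the impulse instead of jumping from zero.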
[0065]
(3) Use a noise waveform.
[0066]
This noise waveform may be given by random numbers, or may be sampled from actual input speech.
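For the random-number case, a seeded generator gives a reproducible noise segment. The uniform distribution, the amplitude range, the 128-sample length and the name `noise_segment` are all assumptions for illustration.

```python
import random
from typing import List


def noise_segment(length: int = 128, seed: int = 0,
                  amplitude: float = 1.0) -> List[float]:
    """Generate a noise segment from a seeded pseudo-random generator."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [rng.uniform(-amplitude, amplitude) for _ in range(length)]
```

A fixed seed matters in practice: the decoder looks up the same table positions as the encoder, so both sides presumably need identical noise samples.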
[0067]
As described above, the content of the speech segment table used in the present invention includes at least the speech segment described in (1) to (3). Each of these voice pieces is held as a voice piece having a length of about 16 msec in a state as shown in FIG. 2, for example, and the latest data is always stored.
[0068]
FIG. 8 is a block diagram showing the configuration of the audio compression / decompression apparatus of the present invention. In FIG. 8, the voice input from the voice input unit 1 is cut out by the voice cutout unit 2 as a processing target voice piece of about 4 msec, for example, as described above. The extracted processing target speech piece is compared with several speech pieces A1, A2,..., An in the speech piece table 4 by the similarity determination unit 3 to obtain a similarity. Then, the speech segment having the highest similarity (similar portion) is selected by the speech segment selection unit 5.
[0069]
The encoding unit 6 performs an encoding process based on data about the selected speech segment (speech segment number, position, magnification for matching the speech level) and the like. If the encoding process is terminated at this stage, the encoded data is output from the encoded data output unit 7. At this time, when the spectrum envelope parameter is used, an encoding process is performed by adding the spectrum envelope parameter extracted by the spectrum envelope parameter extraction unit 8.
[0070]
On the other hand, if the encoding process is not completed, the encoded data produced by the encoding unit 6 is expanded by the expansion unit 9, and the residual generation process is performed by the residual generation unit 10. This expansion processing and residual generation processing correspond to steps s9 and s10 in the flowchart of FIG. 3.
[0071]
As described above, the residual generation process is a process of subtracting the encoded and expanded audio data from the original input audio (in this case, the processing target audio piece h1) and taking the difference. That is, as shown in FIG. 4, the decompressed voice data H1 is subtracted from the voice piece h1, and the residual d1 is obtained. Then, with respect to the obtained residual, a process of selecting a similar partial speech segment having the highest degree of similarity is performed with reference to the speech segment table at that time.
[0072]
The decompression process performed in the expansion unit 9 follows the processing procedure shown in the flowchart of FIG. 5. The speech segment update unit 11 then updates the contents of the speech segment table 4 using the decompressed speech data. In addition, when a spectral envelope parameter is extracted by the spectral envelope parameter extraction unit 8, the speech segment update unit 11 also updates the temporal forward prediction speech waveform estimated from that parameter and the temporal backward prediction speech waveform continuous therewith. In this way, the latest speech segments are always stored as the content of the speech segment table 4.
[0073]
Since the overall operation of the audio compression / decompression apparatus having such a configuration has been described with reference to the flowchart of FIG. 3, description of the operation is omitted here.
[0074]
The present invention is not limited to the above-described embodiment, and various modifications can be made without departing from the gist of the present invention. For example, the processing target speech piece to be cut out is 4 msec in the above-described embodiment because the best result was obtained with 4 msec in the system used in that embodiment. However, this value may differ depending on the system used, and an optimal time can be set according to the system to which the present invention is applied. Further, the contents of the speech segment table shown in FIG. 2 are merely an example, and the present invention is not limited to them.
[0075]
The processing program for performing the audio compression / decompression processing of the present invention described above can be stored in a storage medium such as a floppy disk, an optical disk, or a hard disk, and the present invention includes such storage media. The program may also be obtained from a network.
[0076]
As described above, according to the present invention, the similarity between each voice piece in the voice piece table and the voice piece in a predetermined section cut out from the input voice is compared, and the voice piece with the highest similarity is selected. The process of encoding the cut out predetermined section based on the data about the selected speech piece is performed as a basic process. As a result, encoding can be performed with a very simple process.
[0077]
Also, after the encoded data is created, it is decompressed, and for the residual waveform obtained by subtracting the decompressed data from the processing target speech piece, the speech segment table is referred to again and the process of obtaining the similarity is performed a plurality of times. Encoded data obtained in this way has even higher quality.
[0078]
The speech segments stored in the speech segment table include at least a speech segment created using a speech waveform that is temporally behind the processing target speech piece and has already been compressed and decompressed, a speech segment created using the temporal forward predicted speech waveform estimated from the spectral envelope parameter together with the temporal backward predicted speech waveform, and a speech segment created from noise components. With these, efficient and high-quality encoding of the input speech can be realized. In particular, when a speech segment is created from a predicted speech waveform estimated from a spectrum envelope parameter, the present invention uses a temporally backward predicted speech waveform in addition to the temporal forward predicted speech waveform, so that noise is reduced and the speech quality is greatly improved.
[0079]
In addition, the content of each speech segment is updated after the encoded data is decompressed or after the spectral envelope parameters are extracted. Unlike the conventional use of a codebook with fixed contents, the optimum speech segments for the processing target speech piece are therefore always stored, and high-quality encoding is possible.
[0080]
In addition, the encoded data can be expressed by the number of the speech segment containing the similar part, position data indicating where in that speech segment the similar part lies, a parameter for amplitude adjustment and, where necessary, a spectral envelope parameter, so that a large degree of data compression is possible.
[0081]
As described above, according to the present invention, high-quality audio compression / decompression can be performed efficiently with simple processing, which is extremely advantageous when the processing is implemented in hardware or executed in parallel.
[Brief description of the drawings]
FIG. 1 is a diagram showing an example in which input speech is cut out in a predetermined section in order to explain an embodiment of the present invention.
FIG. 2 is a diagram showing an example of an audio fragment table according to the embodiment of the present invention.
FIG. 3 is a flowchart for explaining a processing procedure according to the embodiment of the present invention.
FIG. 4 is a diagram illustrating processing for obtaining a residual component in the embodiment of the present invention.
FIG. 5 is a flowchart illustrating a decompression processing procedure according to the embodiment of the present invention.
FIG. 6 is a diagram for explaining an example of creating a speech segment in a speech segment table from an expanded speech waveform in the embodiment of the present invention.
FIG. 7 is a diagram illustrating an example in which a speech segment in a speech segment table according to the embodiment of the present invention is created from a temporally forward predicted speech waveform estimated from a spectral envelope parameter and a temporally backward predicted speech waveform.
FIG. 8 is a block diagram showing the configuration of the audio compression / decompression apparatus according to the embodiment of the present invention.
[Explanation of symbols]
1 Voice input unit
2 Voice cutout unit
3 Similarity determination unit
4 Speech segment table
5 Speech segment selection unit
6 Encoding unit
7 Encoded data output unit
8 Spectrum envelope parameter extraction unit
9 Expansion unit
10 Residual generation unit
11 Speech segment update unit
h1 Processing target speech piece
A1, A2, A3, A4 Speech segments stored in the speech segment table
p1 Position of the similar part in the speech segment

Claims

Cut out a segment of speech from the input speech
Extracting spectral envelope parameters from the segmented speech segment at a predetermined frequency,
With reference to a plurality of types of speech pieces including a first speech piece created using a temporal forward prediction speech waveform estimated from the extracted spectral envelope parameter and a temporal backward prediction speech waveform continuous therewith, the similarity between the plurality of types of speech pieces and the cut-out speech piece of the predetermined section is compared, the speech piece with the highest similarity is selected from the plurality of types of speech pieces, and the cut-out speech piece of the predetermined section is encoded based on the data about the selected speech piece to generate encoded data. An audio compression / decompression method characterized by including this processing.

After the encoded data is created, the encoded data is expanded, and the expanded data is subtracted from the cut-out speech piece of the predetermined section to obtain a residual. The encoded data is obtained by referring to the plurality of types of speech pieces for the residual waveform and performing a process of comparing the similarity between the plurality of types of speech pieces and the residual waveform. Audio compression encoding method.

The plurality of types of speech pieces include a second speech piece created using the expanded speech waveform that lies temporally behind the cut-out speech piece of the predetermined section, and a third speech piece created using a noise component. The content of the first speech piece is updated after extraction of the spectral envelope parameter, and the content of the second speech piece is updated, after encoded data created using the second speech piece has been subjected to the expansion process, based on the expanded data. 3. The audio compression / decompression method according to claim 2.

The plurality of types of speech pieces each have a section that is longer in time than the cut-out speech piece of the predetermined section, and when the similarity with the cut-out speech piece of the predetermined section is compared, the comparison is made over the range of that length and the speech piece having the portion with the highest similarity is selected. 4. The audio compression / decompression method described.

The encoded data is data expressed by the number of the speech piece having the portion with the highest similarity, position data indicating where in that speech piece the portion lies, a parameter for amplitude adjustment and, depending on the case, a spectrum envelope parameter added thereto. 5. The audio compression / decompression method according to claim 4.

An audio segment extractor for extracting an audio segment of a predetermined section from the input speech;
A spectral envelope parameter extracting unit that extracts a spectral envelope parameter from a voice segment of a predetermined section cut out by the voice piece cutting unit at a predetermined frequency;
A plurality of types of speech pieces including a first speech segment created using a temporal forward prediction speech waveform estimated by the extracted spectral envelope parameter and a temporal backward prediction speech waveform continuous therewith,
A similarity determination unit that refers to the plurality of types of speech pieces and compares the similarities between the plurality of types of speech pieces and the extracted speech piece of the predetermined section;
An audio piece selection unit that selects an audio piece with the highest similarity based on the similarity by the similarity determination unit;
An encoding unit that encodes the speech segment of the predetermined section extracted based on the data about the speech segment selected by the speech segment selection unit;
An audio compression / decompression apparatus comprising:

A decompression unit for decompressing the data encoded by the encoding unit;
A speech segment update unit that updates the content of the corresponding speech segment using the data expanded by the expansion unit or the spectrum envelope parameter;
7. The audio compression / decompression apparatus according to claim 6, further comprising:

A residual generation unit that subtracts the data expanded by the expansion unit from the clipped audio segment of the predetermined section to obtain a residual;
The similarity determination unit, the speech segment selection unit, the encoding unit, the decompression unit, and the residual generation unit form a loop in the order of processing,
For the residual waveform generated by the residual generation unit, the plurality of types of speech pieces are referred to, a process of comparing the similarity with the plurality of types of speech pieces is performed, and encoded data is then created and output. 8. The audio compression / decompression apparatus according to claim 7.

The plurality of types of speech pieces include a second speech piece created using the expanded speech waveform that lies temporally behind the cut-out speech piece of the predetermined section, and a third speech piece created using a noise component,
The speech segment update unit updates the first speech piece after extraction of the spectral envelope parameter and, after encoded data created using the second speech piece has been subjected to the expansion process, updates the second speech piece based on the expanded data. 9. The voice compression / decompression apparatus according to claim 6.

The plurality of types of speech pieces have a section that is longer in time than the cut-out speech section of the predetermined section,
The similarity determination unit calculates a similarity by comparing similarities in a range of lengths of the plurality of types of speech pieces;
The voice piece selection unit selects the voice piece having the portion with the highest degree of similarity. 10. The voice compression / decompression apparatus according to claim 6.

The encoded data is data represented by the number of the speech piece having the portion with the highest similarity, position data representing that portion, and a parameter for amplitude adjustment. 11. The audio compression / decompression apparatus according to claim 10, further comprising:

Cut out a segment of speech from the input speech
Extracting spectral envelope parameters from the segmented speech segment at a predetermined frequency,
With reference to a plurality of types of speech pieces including a first speech piece created using a temporal forward prediction speech waveform estimated from the extracted spectral envelope parameter and a temporal backward prediction speech waveform continuous therewith, the similarity between the plurality of types of speech pieces and the cut-out speech piece of the predetermined section is compared, the speech piece having the highest similarity is selected, and the speech piece of the predetermined section is encoded based on the data about the selected speech piece to generate encoded data. A storage medium storing an audio compression / decompression processing program for causing a computer to execute this processing.