JPH04213500A

JPH04213500A - Method and device for encoding voice

Info

Publication number: JPH04213500A
Application number: JP1124191A
Authority: JP
Inventors: Lionel Berel Wolovitz
Original assignee: Psion PLC
Current assignee: Psion Holdings Ltd
Priority date: 1990-02-01
Filing date: 1991-01-31
Publication date: 1992-08-04
Also published as: GB9002282D0; EP0440335A2; EP0440335A3

Abstract

PURPOSE: To encode voice in a form suitable for storage in a computer's storage device. CONSTITUTION: The device for encoding voice comprises a sampling circuit 9 that samples the voice and generates digital data representing the sampled signal. An encoder 11 connected to the circuit 9 encodes the digital data by linear predictive coding and generates variables representing the sound signal. The variables are transformed by a converter 12, and the roots of the sum and difference polynomials are determined by a circuit 13 to generate line spectral data, which are stored. In one example, the line spectral data are quantized non-uniformly in the frequency domain, and the roots of the polynomials are found only by evaluating the polynomials at the quantized frequencies.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]
[0001] The present invention relates to data compression and, more particularly, to encoding audio in a form suitable for storage in a computer's memory.

BACKGROUND OF THE INVENTION
[0002] It is desirable to make audio recording easy on personal computers, particularly portable handheld or laptop machines. For example, recorded audio may be used to annotate documents in a word processor running on the machine, or to provide instruction messages in connection with a diary managed by the machine.

[0003] It is known to equip personal computers with a DSP (digital signal processor) that processes audio, such as the human voice, for storage by the computer. However, known devices need a high data rate to store audio of acceptable quality, so in practice no more than a few seconds of audio can be stored in the several megabytes of RAM typically available, and even the tens or hundreds of megabytes available on mass storage such as a hard disk are rapidly used up. Data compression is used to overcome this problem, but the known algorithms that achieve the very high compression ratios desired here are very computationally intensive. Because personal computers are limited both in storage capacity and in processing power, it has not previously been possible to record long passages of audio on this type of machine.

SUMMARY OF THE INVENTION
[0004] According to a first aspect of the present invention, a method of encoding audio for storage in a computer comprises: sampling an audio signal; encoding the sampled audio data by linear predictive coding, thereby generating variables representing the audio data; transforming the variables to obtain sum and difference polynomials; determining the roots of the polynomials, thereby generating line spectral data; and storing the line spectral data in the computer.

[0005] The inventors have developed an encoding scheme that combines several features which together give much better performance than known devices.
This raises the compression ratio to about 100:1, and the algorithm is computationally efficient enough to run on a relatively low-power handheld personal computer using a common portable DSP.

[0006] Linear predictive coding is a well-known method for the analysis and synthesis of speech, described for example by B. S. Atal et al., "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave", Journal of the Acoustical Society of America, vol. 50, no. 2 (Part 2), pp. 637-655, 1971. The use of line spectrum pairs is described by P. Kabal et al., "The Computation of Line Spectral Frequencies Using Chebyshev Polynomials", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 6, December 1986, and by George S. King et al., IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP, no. 4, pp. 568-571, April 1987. However, the root-finding algorithms used to derive the line spectrum pairs are computationally intensive, so it has not been considered practical to carry out this type of method on relatively low-power computers, and the compression ratios obtained without line spectrum pairs have not been high enough.

[0007] Preferably, the stored line spectrum pairs are
quantized in the frequency domain using non-uniform quantization, and the roots of the polynomials are found only by evaluating the polynomials at the quantization frequencies. It has been found that both the compression ratio and the computational efficiency of the algorithm are improved by quantizing the line spectrum pairs in the frequency domain with a suitable quantization of the frequency range, the line spectrum pairs being found by evaluating the relevant function only at the quantized frequencies.

[0008] The sampled audio data are preferably divided into frames of predetermined duration, and successive frames are compared; if a new frame differs from the preceding frame by less than a predetermined degree, the new frame is encoded as an instruction to repeat the preceding frame.

[0009] The compression ratio is thus further improved by storing a new frame only when it differs from the preceding frame by more than the predetermined degree. When the frames are the same, the new frame is encoded as a single bit in a predetermined field, which is recognized on decoding as an instruction to repeat the preceding frame.

[0010] As discussed in more detail below,
it has been found that, with suitable criteria for comparing frames, the number of stored frames can be reduced by about 50% without significant loss of quality. The method preferably further comprises displaying a graphical representation of the audio signal and editing the stored parameterized audio signal in response to operations performed by the user on that graphical representation.

[0011] Using the compression methods of the first aspect of the invention, audio can now be stored and edited on a computer in the same way that text is stored and edited with a word processor. However,
a problem arises in providing a suitable user interface. In personal computers incorporating simple digital samplers, it is known to display the sampled data as a plot of amplitude against time. Such displays are of some use for relatively uniform signals such as music, but are known to be much less helpful for complex signals such as speech. The inventor has found that, when the signal is stored in compressed, parameterized form, a display can be generated from the stored variables that greatly facilitates the editing of complex signals such as speech. The user can thus perform many of the operations commonly carried out in a text processor, such as cutting, pasting, copying and deleting.

[0012] The method preferably also comprises selecting a point in the audio signal for editing by moving a cursor relative to the graphical representation of the signal, while simultaneously playing back the part of the signal marked by the cursor.

[0013] According to a second aspect of the invention, there is provided an apparatus for encoding audio, comprising: means for sampling an audio signal, thereby generating digital data representing the audio signal; an encoder connected to the sampling means and arranged to encode the digital data by linear predictive coding, thereby generating variables representing the audio signal; means for transforming the variables and computing the corresponding sum and difference polynomials; means for determining the roots of the sum and difference polynomials, thereby generating a series of line spectral data; and storage means for storing the line spectral data.

[0014] According to a third aspect of the invention, there is provided a computer comprising means for encoding and decoding audio, the encoding and decoding means comprising: means for sampling an audio signal, thereby generating digital data representing the audio signal; an encoder connected to the sampling means and arranged to encode the digital data by linear predictive coding, thereby generating variables representing the audio signal; means for transforming the variables and computing the corresponding sum and difference polynomials; means for determining the roots of the sum and difference polynomials, thereby generating a series of line spectral data;
means for storing the line spectral data; means for retrieving the line spectral data from the storing means; means for determining the sum and difference polynomials corresponding to the retrieved line spectral data; means for transforming the sum and difference polynomials to produce variables representing the audio signal; a decoder arranged to decode the variables, thereby generating digital data representing the audio signal; and means for generating and outputting an analog signal corresponding to the digital data.

DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0015] A laptop personal computer 1 comprises a microphone 2 connected to a digital signal processor DSP via an analog-to-digital converter ADC. Where appropriate, data output by the DSP are written, under control of the main CPU 3, either to a mass storage device 4 (in this example, a battery-backed RAM cartridge) or to the main RAM 15 of the computer 1.

[0016] The analog-to-digital converter samples the audio received by the microphone and produces a data stream at 64 kilobits per second. In this example the data stream is divided into frames of 25 milliseconds duration, and a number of variables is calculated for each frame. These variables represent the audio in the frame as an excitation frequency together with a number of predictor coefficients defining a discrete, time-varying linear filter whose transfer function models the action of the vocal tract on the excitation produced by the voice source. If a frame consists of P samples, with $a_k$ a general predictor coefficient and $S_n$ a general sample value, then the prediction error $E_n$
is the difference between the audio sample $S_n$ and its predicted value, given by Equation 1; $E_n$ is therefore given by Equation 2.

[0017] The predictor coefficients are chosen to minimize the average squared prediction error $\langle E_n^2 \rangle_{AV}$. As described in considerable detail in the paper by B. S. Atal et al. cited above, applying this condition to a set of sampled data yields a set of simultaneous linear equations whose solution gives the predictor coefficients.

[0018] To compress the data further, the predictor coefficients are not stored directly but are used to form a polynomial that is decomposed into sum and difference polynomials. The roots of the sum and difference polynomials form the set of line spectra that is output by the DSP for storage by the computer and later used in the complementary synthesis process. An algorithm for finding the roots, and a complementary algorithm for deriving the predictor coefficients from a stored set of line spectra, are given at the end of the specification. The roots are found by evaluating the polynomials at a number of frequencies. Rather than searching a continuous frequency range, the function is evaluated only at a predetermined set of non-uniformly spaced frequencies. In this example, logarithmically spaced frequencies are used for quantization. These frequencies, listed in Table 1, are chosen to place more quantization levels in the region to which the ear is most sensitive.

[0019] Next, as shown in FIG. 2, the frequencies f1 and f2 between which the polynomial changes sign bracket the position of a root. Of the two frequencies, the one at which the polynomial P(f) is closer to zero is selected to represent the root. For clarity, FIG. 2 shows a simplified function in place of the actual polynomial.

[0020] As a result of this processing, each frame is represented by a series of first, second, ..., n-th roots, each root being an index from 1 to 42 that identifies the corresponding frequency quantization level. It has been found that the compression ratio can be increased still further by comparing successive frames and, where a new frame differs from the preceding frame by less than a predetermined degree, encoding it as an instruction to repeat the preceding frame. The comparison is made between corresponding roots of the frames: the first root of one frame is compared with the first root of the other, the second root with the second root, and so on, and the difference between corresponding roots is determined in quantization units. It has been found that a frame can be replaced by a "repeat" instruction without significant deterioration in playback quality if none of the roots compared in this way differ by more than 3. If the roots are allowed to differ by 5, audible degradation occurs. With a limit of 3,
approximately 50% of the frames are repeated. Tighter criteria can be used to improve playback quality further. For example, if a frame is repeated only when none of its roots differs by more than 1 from the corresponding root of the preceding frame, fewer frames are encoded as repeats, but playback quality improves correspondingly. Since higher-frequency roots can be allowed to vary more than lower-frequency roots, in the preferred embodiment the criterion for a repeated frame is that the first four roots differ by no more than one quantization unit and the remaining roots by no more than two.

[0021] In total, each frame output by the DSP consists of
a first variable indicating whether the frame is voiced or unvoiced and a second variable setting the pitch period of the excitation, together with the set of line spectra calculated from the predictor coefficients $a_1 \ldots a_p$. The pitch period is determined by any one of a number of common pitch determination methods, such as that described by B. Gold et al., "Parallel Processing Techniques for Estimating Pitch Periods of Speech in the Time Domain", Journal of the Acoustical Society of America, vol. 46, pp. 442-448, August 1969.

[0022] Once a set of line spectra has been stored, the audio can be played back at any time by reading the stored set out of storage and applying a complementary algorithm that calculates the predictor coefficients from the set of line spectra. A suitable algorithm is listed in Appendix 2. The resulting data are passed to a digital-to-analog converter, an audio amplifier and a loudspeaker (not shown).

[0023] The audio data held in the storage device can be edited by the user. A window 7 on the computer display 6 shows a selected portion of the parameterized audio data. Using a cursor 8 controlled by an input device such as a mouse or digitizing pad, the user can manipulate the displayed data to cut, paste, copy or delete, and input from the input device changes the stored variables correspondingly.

[0024] Whenever the cursor marking the edit point moves relative to the display, the frames at the edit point under the cursor are played back in succession, at a rate determined by the speed and direction of the cursor movement. The user can therefore select edit points by ear, without proceeding by trial and error, i.e. without selecting an edit point and then playing back the selected part as a separate operation. Because the data are parameterized, with the pitch encoded independently of the other variables, changes in playback speed do not change the pitch of the reproduced audio. Therefore,
the cursor can be moved at various speeds without loss of intelligibility.

[0025] In another editing mode, the cursor is used to mark an anchor point, and the audio data are played forward from that anchor point. As soon as the cursor marking the anchor point is moved, playback restarts from the new anchor point. How the display is derived from the stored data will vary with the requirements of each application and the constraints of the particular display device. In this example, each point plotted on the display represents the average amplitude of four consecutive frames.

Appendix 1

/* Extracts of code to do LSP root finding from predictor coefficients,
   and to convert back from roots to predictor coefficients.
   Written by Lionel Wolovitz, August 1988. */

double Eval(x, poly)    /* Evaluate polynomial (5 adds, 4 subs, 4 muls) */
double x, *poly;
{
    /* Clenshaw-style recurrence: returns
       T5(x) + poly[1]T4(x) + poly[2]T3(x) + poly[3]T2(x) + poly[4]T1(x) + poly[5],
       where Tk is the Chebyshev polynomial of degree k */
    double temp0, temp1, temp2;

    temp1 = 2.0 * x + poly[1];
    temp2 = 2.0 * x * temp1 - 1.0 + poly[2];
    temp0 = 2.0 * x * temp2 - temp1 + poly[3];
    temp2 = 2.0 * x * temp0 - temp2 + poly[4];
    return x * temp2 - temp0 + poly[5];
}

FindRoots(q, p, x, table)    /* Find roots of sum and difference polynomials. */
double *q;        /* sum poly */
double *p;        /* diff poly */
double *x;        /* roots (in order, i.e. sum and diff interleaved) */
double *table;    /* quantisation table */
{
    double *r, *t, temp, prev;
    ...
}

Appendix 2

PredFromRoots(x, a)    /* Compute predictor coefficients from roots of sum and diff polys */
double *x, *a;
{
    double q[10], p[10];
    double temp1, temp2, temp3;
    ...
    /* calculate predictor coefficients from sum and difference polynomials */
    q[5] = 2.0 * q[5] + q[4];
    q[4] += q[3];
    q[3] += q[2];
    q[2] += q[1];
    q[1] += 1.0;
    p[4] -= p[3];
    p[3] -= p[2];
    p[2] -= p[1];
    ...
}
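Appendix 2 breaks off after the first few lines of PredFromRoots. The following is a separate sketch of the same complementary step used for playback, not a reconstruction of the missing code. It assumes an even order p = 2m, roots given as x = cos ω in two arrays (Appendix 1 instead interleaves sum and difference roots in one array), and the sign convention $A(z) = 1 - \sum_k a_k z^{-k}$. Each root contributes a factor $1 - 2x z^{-1} + z^{-2}$, the fixed factors $1 \pm z^{-1}$ are restored, and $A(z) = (P(z) + Q(z))/2$. `pred_from_roots`, `mul_quad` and `MAX_ORDER` are hypothetical names.

```c
#include <assert.h>

#define MAX_ORDER 20 /* hypothetical upper bound on the predictor order */

/* Multiplies poly[0..deg] (coefficients of z^0, z^-1, ...) in place by
   (1 + b z^-1 + z^-2); the array must hold at least deg + 3 entries. */
static void mul_quad(double *poly, int deg, double b)
{
    poly[deg + 1] = 0.0;
    poly[deg + 2] = 0.0;
    for (int k = deg + 2; k >= 0; k--) {  /* high to low: inputs still unmodified */
        double v = poly[k];
        if (k >= 1)
            v += b * poly[k - 1];
        if (k >= 2)
            v += poly[k - 2];
        poly[k] = v;
    }
}

/* x_sum[0..m-1] and x_diff[0..m-1] are the roots, as x = cos(w), of the
   sum and difference polynomials with their fixed roots at z = -1 and
   z = +1 removed. Writes the predictor a[1..2m]; a[0] is unused. */
void pred_from_roots(const double *x_sum, const double *x_diff, int m, double *a)
{
    double g[MAX_ORDER + 1] = {1.0}, h[MAX_ORDER + 1] = {1.0};
    int p = 2 * m;

    assert(m >= 1 && p <= MAX_ORDER);
    for (int i = 0; i < m; i++) {            /* G(z), H(z): products of quadratics */
        mul_quad(g, 2 * i, -2.0 * x_sum[i]);
        mul_quad(h, 2 * i, -2.0 * x_diff[i]);
    }
    /* P(z) = (1 + z^-1)G(z), Q(z) = (1 - z^-1)H(z), A(z) = (P(z) + Q(z)) / 2 */
    for (int k = 1; k <= p; k++) {
        double A_k = 0.5 * (g[k] + g[k - 1] + h[k] - h[k - 1]);
        a[k] = -A_k;                         /* A(z) = 1 - sum a_k z^-k */
    }
}
```

For example, the sum root x = 0.625 and the difference root x = -0.125 give back a[1] = 0.5 and a[2] = -0.25.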

[Brief Description of the Drawings]

[Figure 1] Configuration diagram of a computer used in the present invention

[Figure 2] Graph showing the roots of a polynomial

[Figure 3] Block diagram of the encoder

[Figure 4] Block diagram of the decoder

[Figure 5] Diagram showing a display device used for editing audio signals

Claims

[Claims]

1. A method of encoding audio for storage in a computer, comprising: sampling an audio signal; encoding the sampled audio data by linear predictive coding, thereby generating variables representing the audio data; transforming the variables to obtain sum and difference polynomials; determining the roots of the polynomials, thereby generating a series of line spectral data; and storing the line spectral data in a computer.
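Claim 1 relies on linear predictive coding, and the description says only that minimizing the average squared prediction error yields a set of simultaneous linear equations for the predictor coefficients; it does not say how they are solved. Purely as an illustration, the sketch below assumes the autocorrelation method with the Levinson-Durbin recursion (one common choice, not necessarily the author's), with the predictor $\hat{S}_n = \sum_{k=1}^{p} a_k S_{n-k}$ and error $E_n = S_n - \hat{S}_n$ taken as the intended forms of Equations 1 and 2. `autocorr`, `levinson` and `MAX_ORDER` are hypothetical names.

```c
#include <assert.h>

#define MAX_ORDER 20 /* hypothetical upper bound on the predictor order p */

/* Autocorrelation r[0..p] of one frame s[0..n-1] (no windowing applied). */
void autocorr(const double *s, int n, int p, double *r)
{
    for (int k = 0; k <= p; k++) {
        double acc = 0.0;
        for (int i = k; i < n; i++)
            acc += s[i] * s[i - k];
        r[k] = acc;
    }
}

/* Levinson-Durbin recursion: solves sum_{k=1..p} a[k] r[|i-k|] = r[i],
   i = 1..p, for the predictor a[1..p] (a[0] unused). Returns the final
   prediction-error energy, or -1.0 for a silent frame (r[0] <= 0). */
double levinson(const double *r, int p, double *a)
{
    double tmp[MAX_ORDER + 1];
    double err = r[0];

    assert(p >= 1 && p <= MAX_ORDER);
    if (err <= 0.0)
        return -1.0;

    for (int i = 1; i <= p; i++) {
        double acc = r[i];
        for (int k = 1; k < i; k++)
            acc -= a[k] * r[i - k];
        double refl = acc / err;            /* i-th reflection coefficient */

        for (int k = 1; k < i; k++)         /* update the lower-order terms */
            tmp[k] = a[k] - refl * a[i - k];
        for (int k = 1; k < i; k++)
            a[k] = tmp[k];
        a[i] = refl;
        err *= 1.0 - refl * refl;           /* error energy for order i */
    }
    return err;
}
```

For instance, the autocorrelation values r = (1, 0.5, 0.25) of a first-order process yield a[1] = 0.5, a[2] = 0 and a residual energy of 0.75. A practical encoder would normally window each 25 ms frame before computing r, which this sketch omits.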

2. A method according to claim 1, characterized in that the stored line spectral data are quantized in the frequency domain with non-uniform quantization, and the roots of the polynomials are determined only by evaluating the polynomials at the quantization frequencies.
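Claim 2 limits the root search to the quantization frequencies, and the description adds that where the polynomial changes sign between neighbouring frequencies f1 and f2 (FIG. 2), the one giving the value closer to zero represents the root. The sketch below assumes, beyond what the text says, that the quantization table stores x = cos(2πf/fs) for each frequency in increasing order of f, and that each polynomial is in the Chebyshev form evaluated by Eval in Appendix 1; the hypothetical `cheb_eval` generalizes that recurrence to order m. Table 1 is not reproduced, so any grid used here is made up, and the returned indices are 0-based rather than the 1 to 42 used in the description.

```c
#include <assert.h>

/* Same recurrence as Eval in Appendix 1, generalized to order m >= 1:
   C(x) = T_m(x) + c[1]T_{m-1}(x) + ... + c[m-1]T_1(x) + c[m]; c[0] unused. */
static double cheb_eval(double x, const double *c, int m)
{
    double b1 = 1.0, b2 = 0.0;              /* leading coefficient is 1 */
    for (int k = 1; k < m; k++) {
        double b0 = 2.0 * x * b1 - b2 + c[k];
        b2 = b1;
        b1 = b0;
    }
    return x * b1 - b2 + c[m];
}

/* Evaluates C at each grid point table[0..n-1] and, wherever the sign
   changes between neighbours, records the index of the neighbour with the
   smaller |C|. Returns the number of roots stored in idx (<= max_roots). */
int find_roots_on_grid(const double *c, int m, const double *table, int n,
                       int *idx, int max_roots)
{
    int found = 0;
    double prev;

    assert(m >= 1 && n >= 1);
    prev = cheb_eval(table[0], c, m);
    for (int i = 1; i < n && found < max_roots; i++) {
        double cur = cheb_eval(table[i], c, m);
        if ((prev < 0.0) != (cur < 0.0)) {   /* sign change brackets a root */
            double abs_prev = prev < 0.0 ? -prev : prev;
            double abs_cur = cur < 0.0 ? -cur : cur;
            idx[found++] = (abs_prev <= abs_cur) ? i - 1 : i;
        }
        prev = cur;
    }
    return found;
}
```

For example, with C(x) = x - 0.3 and the grid x = 0.9, 0.5, 0.32, 0.1, -0.5, the sign change falls between 0.32 and 0.1, and index 2 (x = 0.32) is returned because |C| is smaller there.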

3. A method according to claim 1 or 2, characterized in that the sampled audio data are divided into frames of predetermined duration, successive frames are compared, and, if a new frame differs from the preceding frame by less than a predetermined degree, the new frame is encoded as an instruction to repeat the preceding frame.
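Claim 3 replaces a frame that is close to its predecessor with an instruction to repeat it, and the description says such a frame is encoded as a single bit that the decoder recognizes. The record layout below (a `repeat` flag plus ten root indices; `CodedFrame`, `expand_frames` and `N_ROOTS` are hypothetical) is not the patent's storage format; it only shows how a decoder could expand such a stream back into one set of roots per frame.

```c
#include <assert.h>
#include <string.h>

#define N_ROOTS 10  /* assumed number of roots per frame; not fixed by the text */

typedef struct {
    int repeat;          /* 1 = "repeat previous frame" (the single flag bit) */
    int roots[N_ROOTS];  /* quantized root indices, meaningful only if !repeat */
} CodedFrame;

/* Expands n coded frames into out[0..n-1]. Returns 0 on success, or -1 if
   the stream starts with a repeat (there is no previous frame to copy). */
int expand_frames(const CodedFrame *in, int n, int out[][N_ROOTS])
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (in[i].repeat) {
            if (i == 0)
                return -1;
            memcpy(out[i], out[i - 1], sizeof out[i]);   /* reuse last roots */
        } else {
            memcpy(out[i], in[i].roots, sizeof out[i]);  /* stored frame */
        }
    }
    return 0;
}
```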

4. A method according to claim 3, characterized in that the frames are compared by determining the difference, in quantization units, between corresponding roots in successive line spectral data, and a frame is repeated if the difference does not exceed a predetermined limit.

5. A method characterized in that different predetermined limits are set for different portions of each successive line spectral data, a lower predetermined limit being set for lower-frequency terms.
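Claims 4 and 5 compare corresponding roots of successive frames in quantization units, with a tighter limit for lower-frequency terms. The description's preferred limits are at most 1 unit for the first four roots and at most 2 for the rest, while a single limit of 3 lets about half the frames be repeated. A minimal sketch of the test, with the limits passed as parameters (`is_repeat` is a hypothetical name):

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if `cur` may be coded as a repeat of `prev`: for each pair of
   corresponding quantized roots, |cur[i] - prev[i]| must not exceed
   low_limit for the first n_low roots or high_limit for the others. */
int is_repeat(const int *prev, const int *cur, int n_roots,
              int n_low, int low_limit, int high_limit)
{
    assert(n_roots >= 0 && n_low >= 0);
    for (int i = 0; i < n_roots; i++) {
        int limit = (i < n_low) ? low_limit : high_limit;
        if (abs(cur[i] - prev[i]) > limit)
            return 0;   /* too different: store the new frame */
    }
    return 1;           /* within limits: emit a repeat instruction */
}
```

With `n_low = 4, low_limit = 1, high_limit = 2` this follows the preferred embodiment; passing the same value (for example 3) for both limits gives the single-threshold criterion.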

6. A method of editing stored audio, comprising: encoding and storing an audio signal by a method according to any one of the preceding claims; displaying a graphical representation of the audio signal; and editing the stored parameterized audio signal in response to operations performed by the user on the graphical representation of the signal.

7. A method characterized in that it further comprises moving a cursor relative to the graphical representation of the signal to select a point within the audio signal for editing, and simultaneously reproducing the portion of the signal indicated by the cursor.

8. A method characterized in that the step of graphically displaying the audio signal comprises plotting, as a function of time, a variable representing the signal averaged over a plurality of frames.
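Claim 8 plots a variable averaged over several frames against time, and in the embodiment each plotted point is the average of four consecutive frames. The text does not say which stored variable is plotted, so this sketch simply assumes one amplitude-like value per frame (`amp`, hypothetical) and averages it over consecutive groups; `display_points` is likewise a hypothetical name.

```c
#include <assert.h>

/* Averages per-frame values amp[0..n_frames-1] over consecutive groups of
   `group` frames (4 in the embodiment); a shorter final group is averaged
   over the frames it contains. Returns the number of points written. */
int display_points(const double *amp, int n_frames, int group, double *out)
{
    int n_out = 0;

    assert(group > 0);
    for (int start = 0; start < n_frames; start += group) {
        int end = start + group < n_frames ? start + group : n_frames;
        double sum = 0.0;
        for (int i = start; i < end; i++)
            sum += amp[i];
        out[n_out++] = sum / (end - start);
    }
    return n_out;
}
```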

9. Apparatus for encoding audio, comprising: means (2, 9) for sampling an audio signal, thereby generating digital data representing the audio signal; an encoder (10, 11) connected to the sampling means and arranged to encode the digital data by linear predictive coding, thereby generating variables representing the audio signal; means (12) for transforming the variables and calculating the corresponding sum and difference polynomials; means (13) for determining the roots of the sum and difference polynomials, thereby generating a series of line spectral data; and storage means (14) for storing the line spectral data.
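Means (12) of claim 9 form the sum and difference polynomials. Using the standard construction from the line-spectrum-pair literature (an assumption; the patent does not spell it out), with $A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$ these are $P(z) = A(z) + z^{-(p+1)}A(z^{-1})$ and $Q(z) = A(z) - z^{-(p+1)}A(z^{-1})$. For even p they have fixed roots at z = -1 and z = +1, and once those are removed each reduces to a cosine series in x = cos ω. The output layout c[1..m] mirrors the poly[1..5] argument of Eval in Appendix 1, which appears to serve order 10; that correspondence is an inference. `lsp_polys` and `MAX_ORDER` are hypothetical names.

```c
#include <assert.h>

#define MAX_ORDER 20 /* hypothetical upper bound on the predictor order */

/* For an even-order predictor a[1..p] (a[0] unused), forms P(z) and Q(z),
   removes their fixed roots at z = -1 and z = +1, and writes coefficients
   c[1..m], m = p/2, of the cosine series
   C(x) = T_m(x) + c[1]T_{m-1}(x) + ... + c[m-1]T_1(x) + c[m]
   for the sum (sum_c) and difference (diff_c) polynomials. */
void lsp_polys(const double *a, int p, double *sum_c, double *diff_c)
{
    double alpha[MAX_ORDER + 1], g[MAX_ORDER + 1], h[MAX_ORDER + 1];
    int m = p / 2;

    assert(p >= 2 && p % 2 == 0 && p <= MAX_ORDER);

    alpha[0] = 1.0;                    /* coefficients of A(z) */
    for (int k = 1; k <= p; k++)
        alpha[k] = -a[k];

    g[0] = h[0] = 1.0;
    for (int k = 1; k <= m; k++) {     /* symmetric, so half the terms suffice */
        double pk = alpha[k] + alpha[p + 1 - k];
        double qk = alpha[k] - alpha[p + 1 - k];
        g[k] = pk - g[k - 1];          /* P(z) / (1 + z^-1) */
        h[k] = qk + h[k - 1];          /* Q(z) / (1 - z^-1) */
    }

    for (int k = 1; k < m; k++) {      /* pairs of terms become 2cos((m-k)w) */
        sum_c[k] = g[k];
        diff_c[k] = h[k];
    }
    sum_c[m] = 0.5 * g[m];             /* the middle term appears only once */
    diff_c[m] = 0.5 * h[m];
}
```

For the second-order predictor a[1] = 0.5, a[2] = -0.25 it yields sum_c[1] = -0.625 and diff_c[1] = 0.125, i.e. single roots at x = 0.625 and x = -0.125.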

10. Apparatus according to claim 9, further comprising means (19) for comparing successive frames of audio data of predetermined duration, and means responsive to the comparing means, when it determines that the current frame differs from the previous frame by less than a predetermined degree, for generating and storing an instruction to repeat the previous frame.

11. Apparatus according to claim 10, characterized in that the comparing means (19) comprise means for reading the series of line spectral data corresponding to each frame and determining the differences between corresponding terms in the series, and means for comparing each difference with a predetermined limit (δ), the frames being determined to differ by less than the predetermined degree if none of the differences exceeds the predetermined limit.

12. Apparatus characterized in that the means (19) for determining and comparing differences use first and second different predetermined limits (δ...), the predetermined limit for low-frequency terms being smaller than the predetermined limit for high-frequency terms.

13. Apparatus according to any one of claims 9 to 12, characterized in that the means (13) for determining the roots of the sum and difference polynomials evaluate the polynomials only at non-uniform quantization frequencies, thereby determining their roots, and the means (14) for storing the line spectra store the data quantized at the non-uniform quantization frequencies in the frequency domain.

14. Apparatus according to any one of claims 9 to 13, characterized in that it comprises means (6) for graphically displaying a representation of the audio signal, and editing means for editing the stored quantized audio signal in response to operations performed by the user on the graphical display.

15. A computer having encoding and decoding means, the encoding and decoding means comprising: means (2, 9) for sampling an audio signal, thereby generating digital data representing the audio signal; an encoder (10, 11) connected to the sampling means and arranged to encode the digital data by linear predictive coding, thereby generating variables representing the audio signal; means (12) for determining the roots of the sum and difference polynomials, thereby generating a series of line spectral data; means (14) for storing the line spectral data; means (16) for retrieving the line spectral data from the storage means; means (16) for determining the sum and difference polynomials corresponding to the retrieved line spectral data; means (17) for transforming the sum and difference polynomials to generate variables representing the audio signal; a decoder (18) arranged to decode the variables, thereby generating digital data representing the audio signal; and means (20) for generating and outputting an analog signal corresponding to the digital data.

16. The computer of claim 15, wherein the computer is a handheld or laptop computer.

17. A handheld or laptop computer, characterized in that it comprises a device according to any one of claims 9 to 14.