JP3095133B2

JP3095133B2 - Acoustic signal coding method

Info

Publication number: JP3095133B2
Application number: JP09040404A
Authority: JP
Inventors: 仲大室; 一則間野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-02-25
Filing date: 1997-02-25
Publication date: 2000-10-03
Anticipated expiration: 2017-02-25
Also published as: JPH10242867A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a high-efficiency speech coding method that digitally encodes the signal sequence of an acoustic signal, such as speech or music, with a small amount of information. It does so by predictive coding, in which the acoustic signal is synthesized by driving a filter representing its spectral envelope characteristics with an excitation vector.

[0002]

2. Description of the Related Art High-efficiency speech coding methods are used in digital mobile communication to make efficient use of radio waves, and in speech or music storage services to make efficient use of communication lines and storage media. A currently proposed approach to high-efficiency speech coding divides the original speech into fixed-length sections of about 5 to 50 ms called frames (or subframes). The speech of each frame is separated into two kinds of information, namely the characteristics of a linear filter representing the envelope of the frequency spectrum and an excitation signal for driving that filter, and each is encoded. In this approach, a known way of encoding the excitation signal is to separate it into a periodic component, considered to correspond to the pitch period (fundamental frequency) of the speech, and the remaining component, and to encode them separately. One example of such excitation coding is Code-Excited Linear Prediction (CELP). Details of this technique are described in M. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates", IEEE Proc. ICASSP-85, pp. 937-940, 1985.

[0003] FIG. 8 shows a configuration example of the above coding method. For the speech x input to the input terminal 1-0, the linear prediction analysis section 1-1 calculates linear prediction parameters a representing the frequency spectrum envelope characteristics of the input speech. The linear prediction parameters a are encoded by the linear prediction parameter encoding section 1-2 and sent to the linear prediction parameter decoding section 1-3. When the distortion calculation uses spectral information of the input speech, for example to take auditory characteristics into account, the linear prediction parameters a are also sent to the distortion calculation section 1-6. The linear prediction parameter decoding section 1-3 reproduces the synthesis filter coefficients â from the received code and sends them to the synthesis filter 1-5. When auditory characteristics are taken into account in the distortion calculation, the decoded linear prediction parameters â are sometimes used in the distortion calculation section 1-6 instead of the linear prediction parameters a before quantization. Details of linear prediction analysis and examples of encoding the linear prediction parameters are described in, for example, Sadaoki Furui, "Digital Speech Processing" (Tokai University Press). The linear prediction analysis section 1-1, the linear prediction parameter encoding section 1-2, the linear prediction parameter decoding section 1-3, and the synthesis filter 1-5 may be replaced with nonlinear counterparts.

[0004] The excitation vector generation section 1-4 generates an excitation vector candidate c with a length of one frame and sends it to the synthesis filter 1-5. FIG. 9 shows a configuration example of the excitation vector generation section 1-4. The adaptive codebook 2-1 takes the immediately preceding excitation vector c(t-1) stored in a buffer (the already quantized excitation vector of the preceding one to several frames), cuts out a segment whose length corresponds to a certain period, and repeats that segment until it reaches the frame length. It thereby outputs a time-series vector candidate v_a corresponding to the periodic component of the speech. The "certain period" is selected so that the distortion d in the distortion calculation section 1-6 becomes small; the selected period usually corresponds to the pitch period of the speech. The fixed codebook 2-2 outputs a time-series code vector candidate v_r with a length of one frame corresponding to the aperiodic component of the speech. The fixed codebook 2-2 stores, independently of the input speech, a number of candidate vectors specified in advance according to the number of bits available for coding. The time-series vector candidates output from the adaptive codebook 2-1 and the fixed codebook 2-2 are multiplied in the multiplication sections 2-4 and 2-5 by the weights g_a and g_r, respectively, produced by the weight codebook 2-3, and the products are added in the addition section 2-6 to form the excitation vector candidate c. In the configuration of FIG. 9, the adaptive codebook 2-1 may be omitted so that only the fixed codebook 2-2 is used. When encoding signals with little pitch periodicity, such as consonants or background noise, the adaptive codebook 2-1 is often omitted to save bits.
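
The excitation construction of FIG. 9 can be sketched in a few lines. The Python below is only an illustration under stated assumptions: the function names `adaptive_vector` and `excitation_candidate`, the toy buffer, and the gain values are hypothetical and do not come from the patent, and the period is taken to be an integer number of samples.

```python
def adaptive_vector(past_excitation, lag, frame_len):
    """Cut the last `lag` samples from the past excitation and repeat them to frame length."""
    if not 0 < lag <= len(past_excitation):
        raise ValueError("lag must be between 1 and the buffer length")
    segment = past_excitation[-lag:]
    return [segment[n % lag] for n in range(frame_len)]


def excitation_candidate(v_a, v_r, g_a, g_r):
    """Weighted sum c = g_a * v_a + g_r * v_r (multipliers 2-4, 2-5 and adder 2-6)."""
    return [g_a * a + g_r * r for a, r in zip(v_a, v_r)]


# Toy data (assumed values, for illustration only).
past = [0.0, 1.0, 2.0, 3.0]                        # already quantized excitation buffer
v_a = adaptive_vector(past, lag=3, frame_len=8)    # periodic component candidate
v_r = [1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0]    # one fixed-codebook entry
c = excitation_candidate(v_a, v_r, g_a=0.5, g_r=2.0)
print(v_a)
print(c)
```

Dropping the adaptive codebook, as the text mentions for signals with little periodicity, corresponds to calling `excitation_candidate` with `g_a = 0.0`.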

[0005] Returning to FIG. 8, the synthesis filter 1-5 is a linear filter whose coefficients are the output of the linear prediction parameter decoding section 1-3. It takes the excitation vector candidate c as input and outputs a reproduced speech candidate y. The order of the synthesis filter 1-5, that is, the order of the linear prediction analysis, is generally about 10 to 16. As already mentioned, the synthesis filter 1-5 may be a nonlinear filter.
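
As a minimal sketch of the linear case, the synthesis filter can be written as an all-pole recursion. The sign convention A(z) = 1 + sum_k a_k z^-k, the zero initial state, and the name `synthesize` are assumptions made for this illustration; the text does not fix them.

```python
def synthesize(c, a_hat, state=None):
    """All-pole filter 1/A(z), with A(z) = 1 + sum_k a_hat[k-1] * z^-k (assumed convention)."""
    order = len(a_hat)
    # mem holds y[n-1], y[n-2], ... ; zero state unless the caller supplies one.
    mem = list(state) if state is not None else [0.0] * order
    y = []
    for sample in c:
        out = sample - sum(a_k * m for a_k, m in zip(a_hat, mem))
        y.append(out)
        mem = ([out] + mem)[:order]
    return y


# First-order toy filter (a real coder would use order 10 to 16).
y = synthesize([1.0, 0.0, 0.0, 0.0], a_hat=[-0.5])
print(y)
```

Passing the final `mem` of one subframe as `state` for the next would carry the filter memory across subframes; that bookkeeping is left out here.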

[0006] The distortion calculation section 1-6 calculates the distortion d between the reproduced speech candidate y output from the synthesis filter 1-5 and the input speech x. This calculation often takes into account the synthesis filter coefficients â or the unquantized linear prediction coefficients a, for example for perceptual weighting. FIG. 11 shows a configuration example that calculates the distortion with perceptual weighting. The perceptual weighting is implemented as a perceptual weighting filter that uses the unquantized linear prediction parameters a or the quantized synthesis filter coefficients â. The reproduced speech candidate y output from the synthesis filter 4-1 is passed through the perceptual weighting filter 4-2, and the distortion d is calculated between it and the input speech, which is likewise passed through the perceptual weighting filter 4-3. Because the perceptual weighting filters 4-2 and 4-3 normally use the same filter coefficients, it would be equivalent to place a single filter after the distance calculation section 4-4. To reduce the amount of processing, however, the filter is often split into two and placed before the distance calculation section 4-4, as shown in FIG. 11.

[0007] To describe this synthesis weighting calculation section 1-7 further: the input time-series speech vector x passes through the perceptual weighting filter 4-3 to become the target speech x_w, which is sent to the distance calculation section 4-4. Meanwhile, the excitation vector candidate c passes through the synthesis filter 4-1 and the perceptual weighting filter 4-2 to become the perceptually weighted reproduced speech candidate vector y_w, which is also sent to the distance calculation section 4-4. The distance calculation section 4-4 measures the distance between the target speech vector x_w and the reproduced speech candidate vector y_w. A distance measure such as

$$d = \lVert x_w - y_w \rVert^2 \qquad (1)$$

may be used. The excitation vector that minimizes this distortion measure is selected. When the excitation vector generation of FIG. 9 is used, a period code, a fixed code, and a weight code are determined. The perceptual weighting filters 4-2 and 4-3 are used to calculate the distortion in a way that reduces the perceived noise of the reproduced speech by exploiting human auditory characteristics; they are not strictly necessary.
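
The selection rule based on distance measure (1) can be illustrated as follows. This is a sketch with toy vectors; the helper names `squared_distance` and `select_candidate` are hypothetical, and the candidates are assumed to be already synthesized and weighted.

```python
def squared_distance(x_w, y_w):
    """Distance measure (1): d = ||x_w - y_w||^2."""
    return sum((x - y) ** 2 for x, y in zip(x_w, y_w))


def select_candidate(x_w, candidates):
    """Return (index, distortion) of the weighted candidate closest to the target x_w."""
    distortions = [squared_distance(x_w, y_w) for y_w in candidates]
    best = min(range(len(candidates)), key=distortions.__getitem__)
    return best, distortions[best]


x_w = [1.0, 2.0, 0.0]                       # toy target vector
candidates = [[0.0, 0.0, 0.0],              # toy weighted reproduced-speech candidates
              [1.0, 1.0, 0.0],
              [2.0, 2.0, 2.0]]
best, d = select_candidate(x_w, candidates)
print(best, d)
```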

[0008] Here the input time-series speech vector x may be the input speech signal itself, but in general it is often a time-series signal from which the influence of the previous subframe has been subtracted. When the excitation vector generation of FIG. 9 is used, selecting the single optimum combination from all possible combinations of period code, fixed code, and weight code is difficult because of the amount of computation. Instead, the codes are often determined sequentially, for example in the order period code, fixed code, weight code, or searched sequentially while narrowing down the candidates along the way, finally settling on a suboptimal combination. In such a sequential search, the synthesized component due to an already selected code vector (for example, the adaptive code vector) is often subtracted from the input speech, and only the vector component still to be determined (for example, only the fixed code vector) is input as the excitation vector candidate c for the distortion calculation.
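
One step of such a sequential search, removing the contribution of an already chosen vector from the target, might look like the sketch below. The least-squares gain used here is an assumption made for illustration (the text takes the weights from a weight codebook), and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def least_squares_gain(target, contribution):
    """Gain g minimizing ||target - g * contribution||^2 (assumed stand-in for the weight code)."""
    energy = dot(contribution, contribution)
    return dot(target, contribution) / energy if energy > 0.0 else 0.0


def remove_contribution(target, contribution, gain):
    """New target for the next search stage: target - gain * contribution."""
    return [t - gain * s for t, s in zip(target, contribution)]


x_w = [2.0, 4.0, 1.0]      # toy target
y_a = [1.0, 2.0, 0.0]      # toy synthesized adaptive-codebook vector
g_a = least_squares_gain(x_w, y_a)
x_r = remove_contribution(x_w, y_a, g_a)   # target left for the fixed-codebook search
print(g_a, x_r)
```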

[0009] In FIG. 8, the codebook search control section 1-8 selects the excitation code that minimizes the distortion d between each reproduced speech candidate y and the input speech x, and thereby determines the excitation vector for the frame. In the configuration of FIG. 9, consisting of the adaptive codebook 2-1, the fixed codebook 2-2, and the weight codebook 2-3, a period code, a fixed code, and a weight code are selected and together form the excitation code.

[0010] The excitation code (period code, noise code, weight code) determined by the codebook search control section 1-8 and the linear prediction parameter code output from the linear prediction parameter encoding section 1-2 are sent to the code transmission section 1-9. Depending on the application, they are either stored in a storage device or sent to the receiving side over a communication channel. FIG. 10 shows a configuration example of the decoding method corresponding to the above coding method. Of the codes received at the input terminal 3-0 from a transmission channel or storage medium, the linear prediction parameter code is decoded into synthesis filter coefficients by the linear prediction parameter decoding section 3-2 and sent to the synthesis filter 3-4 and, if necessary, to the post-processing section 3-5. The excitation code is sent to the excitation vector generation section 3-3, which generates the excitation vector corresponding to the code. The configuration of the excitation vector generation section 3-3 corresponds to that of the excitation vector generation section 1-4 of the coding method shown in FIG. 8. The synthesis filter 3-4 takes the excitation vector as input and reproduces the speech. The post-processing section 3-5 performs processing (also called postfiltering) that perceptually reduces the noisiness of the reproduced speech, but it is often omitted, for example to reduce the amount of processing.

[0011]

PROBLEMS TO BE SOLVED BY THE INVENTION A problem with the CELP method is that the distortion calculation used to select the excitation vector candidate requires a very large amount of computation. Algebraic Code-Excited Linear Prediction (ACELP) has been proposed to address this. Instead of storing the fixed codebook as vector patterns of frame length, ACELP forms the fixed code vector by placing a few pulses of height 1 at suitable positions in the frame, for example four pulses in a frame or subframe of 40 samples. By adopting this excitation structure and carefully ordering the operations in the distortion calculation, the amount of computation can be greatly reduced compared with conventional methods. Details of ACELP are described in, for example, R. Salami, C. Laflamme, and J-P. Adoul, "8 kbit/s ACELP Coding of Speech with 10 ms Speech-Frame: a Candidate for CCITT Standardization", IEEE Proc. ICASSP-94, pp. II-97. Based on a similar concept, the present inventors have already filed "Acoustic signal encoding method and acoustic signal decoding method" (Japanese Patent Application No. 7-150550) as a method with higher quality and lower computational cost. In that method, instead of pulses of height 1, pulse patterns that span two to several adjacent samples and carry height information are placed in the frame as the fixed code vector, achieving both lower computational cost and high quality.
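
To make the pulse-based fixed code vector concrete, the sketch below builds a 40-sample vector from four unit pulses and filters it by adding shifted copies of an impulse response. The pulse positions are arbitrary toy values and the function names are hypothetical; the position constraints and signs of an actual ACELP codebook are not described in this text and are not modelled.

```python
def pulse_vector(positions, frame_len=40):
    """Fixed code vector with a height-1 pulse at each given sample position."""
    v = [0.0] * frame_len
    for p in positions:
        if not 0 <= p < frame_len:
            raise ValueError("pulse position outside the frame")
        v[p] += 1.0
    return v


def filter_pulses(positions, h, frame_len=40):
    """FIR response to the pulse vector, computed as a sum of shifted copies of h.

    Only len(positions) * len(h) additions are needed, which is why sparse
    pulse excitations keep the distortion calculation cheap.
    """
    y = [0.0] * frame_len
    for p in positions:
        for k, h_k in enumerate(h):
            if p + k >= frame_len:
                break
            y[p + k] += h_k
    return y


v_r = pulse_vector([3, 12, 25, 38])            # four pulses in a 40-sample subframe
y = filter_pulses([3, 12, 25, 38], [1.0, 0.5, 0.25])
print([n for n, s in enumerate(v_r) if s != 0.0])
```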

[0012] In these methods, however, the synthesis filter, the perceptual weighting filter, or their combination is often represented in the distortion calculation by its impulse response, that is, as an FIR filter. As the frame or subframe becomes longer, the number of FIR taps needed to obtain results equivalent to an IIR filter grows. Not only does the amount of computation become larger than in the conventional method, but a very large amount of memory is also needed to store intermediate results of the distortion calculation. It is therefore difficult to apply these methods as they are to low-bit-rate speech coding, in which subframes are generally made longer.

[0013] In the configuration of FIG. 11, the operation of passing the excitation vector candidate c through the synthesis filter 4-1 and the perceptual weighting filter 4-2 can be executed quickly by combining the two filters into a single perceptually weighted synthesis filter with equivalent characteristics. Such a single equivalent filter can be expressed, for example, as an FIR filter whose coefficients are the impulse response from the input of the synthesis filter 4-1 to the output of the perceptual weighting filter 4-2.
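
The equivalence between the filter cascade and a single FIR filter built from its impulse response can be checked numerically. The sketch below assumes zero initial filter state and uses toy first-order coefficients; the pole-zero form chosen for the weighting filter and all names are assumptions for illustration, since the text does not specify the weighting filter.

```python
def iir(b, a, x):
    """y[n] = sum_i b[i] x[n-i] - sum_{k>=1} a[k] y[n-k], with a[0] == 1 and zero initial state."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y


def fir(h, x):
    """Causal FIR filtering with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(min(len(h), n + 1)))
            for n in range(len(x))]


def weighted_synthesis(x, a_poly, w_b, w_a):
    """Synthesis filter 1/A(z) followed by a weighting filter W(z) = w_b(z) / w_a(z)."""
    return iir(w_b, w_a, iir([1.0], a_poly, x))


N = 8                                   # toy subframe length
a_poly = [1.0, -0.9]                    # toy A(z), leading 1 included
w_b, w_a = [1.0, -0.81], [1.0, -0.54]   # toy weighting filter (assumed form)

impulse = [1.0] + [0.0] * (N - 1)
h = weighted_synthesis(impulse, a_poly, w_b, w_a)   # N-tap equivalent FIR filter

c = [1.0, 0.0, -1.0, 0.0, 0.5, 0.0, 0.0, 0.0]       # toy excitation candidate
direct = weighted_synthesis(c, a_poly, w_b, w_a)
via_fir = fir(h, c)
print(max(abs(p - q) for p, q in zip(direct, via_fir)))
```

With as many taps as there are samples in the subframe the two outputs agree to rounding error, which is also why the tap count, and with it the cost, grows with the subframe length.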

[0014] FIG. 12 shows a configuration that achieves an even faster distortion calculation on top of the single equivalent filter described above. The perceptually weighted synthesis filter is replaced by a perceptually weighted synthesis approximation filter 5-2 whose characteristics do not strictly match it. The approximation is obtained, for example, by truncating the FIR representation at a finite number of taps, by approximating it with an AR filter with a small number of taps, or by reducing the number of FIR taps below the number needed to obtain results equivalent to the IIR filter. This reduces the amount of computation and memory in the synthesis distortion calculation. With the configuration of FIG. 12, however, when the difference between the characteristics of the approximation filter 5-2 and those of the original synthesis filter 4-1 and perceptual weighting filter 4-2 becomes large, the approximation error prevents a suitable excitation code from being selected, leading to marked degradation of the reproduced speech. In practice it was therefore impossible to use long subframes, that is, to lower the bit rate.

[0015] SUMMARY OF THE INVENTION An object of the present invention is to provide a method for digitally encoding acoustic signals such as speech or music that yields high-quality reproduced sound at a low bit rate, with a memory footprint and an amount of computation small enough for an inexpensive processor.

[0016]

MEANS FOR SOLVING THE PROBLEMS In the present invention, an approximation filter simplified so that the distortion can be calculated quickly, for example by truncating the taps of an FIR synthesis filter, is used for the synthesis distortion calculation. The approximation error that results from using this approximation filter is added to the input speech, which is then used as the target vector in the codebook search.

[0017] With this configuration, the effect of the approximation cancels out in the distortion calculation, and a high-quality, low-bit-rate coding method is realized with very little memory and processing even when the subframe is long.

[0018]

DESCRIPTION OF THE PREFERRED EMBODIMENTS FIG. 1 shows the configuration on which the embodiment of the present invention is based. The input speech x from the input terminal 6-0 passes through the inverse filter 6-3 of the synthesis filter (the synthesis inverse filter), which uses the quantized (decoded) synthesis filter coefficients â, and is converted into the ideal (unquantized) excitation vector r. Here r is the ideal excitation vector for which the distortion from the input speech x becomes zero when r is passed through the synthesis filter 4-1 of FIG. 11, which takes the excitation vector candidate c as input. The ideal excitation vector r passes through the perceptually weighted synthesis approximation filter 6-4, which has the same characteristics as the perceptually weighted synthesis approximation filter 5-2, to become the modified target speech vector x'_w. An approximation error similar to that produced by filter 5-2 is thereby added to x'_w. The distance calculation section 4-4 calculates the distance between the perceptually weighted reproduced speech candidate y'_w, which is the output of filter 5-2 and contains the approximation error, and the modified target speech vector x'_w. In this distance calculation, the approximation error produced by filter 5-2 cancels against the approximation error added by filter 6-4, so the distortion d (distance) can be calculated with high accuracy.
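
The cancellation can be observed on a toy example. The sketch below compares the FIG. 12 arrangement (truncated filter on the candidate only) with the FIG. 1 arrangement (the same truncated filter also applied to the ideal excitation r). It makes several simplifying assumptions that are not from the patent: a first-order synthesis filter, zero initial state, no perceptual weighting, and an input that one candidate reproduces exactly. All names are hypothetical.

```python
def fir(h, x):
    """Causal FIR filtering with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(min(len(h), n + 1)))
            for n in range(len(x))]


def synthesis(c, a_hat):
    """All-pole synthesis 1/A(z), A(z) = 1 + sum_k a_hat[k-1] z^-k, zero initial state."""
    y = []
    for n, sample in enumerate(c):
        feedback = sum(a_hat[k - 1] * y[n - k]
                       for k in range(1, len(a_hat) + 1) if n - k >= 0)
        y.append(sample - feedback)
    return y


def inverse_synthesis(x, a_hat):
    """Synthesis inverse filter A(z): converts speech x into the ideal excitation r."""
    return fir([1.0] + list(a_hat), x)


def distance(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v))


N, TAPS = 16, 3                        # toy subframe length and truncated tap count
a_hat = [-0.9]                         # toy synthesis filter coefficients
h_full = synthesis([1.0] + [0.0] * (N - 1), a_hat)
h_short = h_full[:TAPS]                # approximation filter (stands in for 5-2 and 6-4)

c_true = [0.0] * N
c_true[2], c_true[9] = 1.0, -1.0
x = synthesis(c_true, a_hat)           # input speech that c_true reproduces exactly

# FIG. 12: approximate filter on the candidate, exact target.
d_fig12 = distance(x, fir(h_short, c_true))

# FIG. 1: the same approximate filter is applied to the ideal excitation r.
r = inverse_synthesis(x, a_hat)
x_mod = fir(h_short, r)                # modified target x'_w
d_fig1 = distance(x_mod, fir(h_short, c_true))

print(d_fig12, d_fig1)
```

In this toy case the correct candidate is penalized by the truncation under FIG. 12, whereas under FIG. 1 its distortion is zero up to rounding, because both vectors entering the distance carry the same approximation error.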

[0019] FIG. 2 shows the method of FIG. 1 with the synthesis approximation filters 5-2 and 6-4 implemented concretely as finite-tap-length FIR filters 7-2 and 7-4. If the number of taps equals the subframe length in samples, the coding result matches that of the conventional method without approximation, but the amount of computation is large. If, on the other hand, the number of taps is set to 1, so that no past sample values are used (this is sometimes called 0 taps), the method measures the distortion between the excitation vector candidate c and the ideal excitation vector r at the excitation level. The amount of computation then becomes extremely small, but sufficient coding quality is not obtained. The number of taps is chosen between 1 and the subframe length (the number of samples in the subframe), balancing coding quality against computation. With the method of the present invention, it has been confirmed that, for a subframe of, for example, 80 samples, reducing the number of taps to about 2 to 6 causes almost no degradation in either the signal-to-noise ratio (SNR) or the perceptual quality of actual coded speech. This is because the approximation error produced by the finite-tap-length FIR perceptually weighted synthesis filter 7-2 is also added to the target speech x by the finite-tap-length FIR perceptually weighted synthesis filter 7-4.
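
The two limiting tap counts can be written out explicitly. The derivation below is a sketch under assumptions not stated in the text: filters 7-2 and 7-4 share the taps $\tilde h_0, \dots, \tilde h_{L-1}$, the subframe has $N$ samples, and samples before the start of the subframe are treated as zero.

```latex
\begin{aligned}
x'_w[n] &= \sum_{k=0}^{L-1} \tilde h_k \, r[n-k], \qquad
y'_w[n] = \sum_{k=0}^{L-1} \tilde h_k \, c[n-k], \\
d &= \sum_{n=0}^{N-1} \bigl( x'_w[n] - y'_w[n] \bigr)^2
   = \sum_{n=0}^{N-1} \Bigl( \sum_{k=0}^{L-1} \tilde h_k \,\bigl( r[n-k] - c[n-k] \bigr) \Bigr)^2 . \\
L = 1:\quad d &= \tilde h_0^{\,2} \sum_{n=0}^{N-1} \bigl( r[n] - c[n] \bigr)^2
   = \tilde h_0^{\,2} \, \lVert r - c \rVert^2 .
\end{aligned}
```

For $L = 1$ the distortion is therefore just a scaled excitation-domain error, as the paragraph states, while for $L = N$ the truncation removes nothing within the subframe, so $x'_w$ and $y'_w$ coincide with the exactly filtered vectors.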

[0020] FIG. 3 shows a configuration example of the excitation vector generation section 1-4 in which the fixed code vector candidate v_r is pitch-periodized before use. The configuration of FIG. 3 is also used in the ACELP method and in "Acoustic signal encoding method and acoustic signal decoding method" (Japanese Patent Application No. 7-150550). The pitch periodization section 8-7 receives the same period code that is input to the adaptive codebook and periodizes the output v_r of the fixed codebook 2-2 with the period corresponding to that code. In practice, the periodization is often performed by applying to the fixed code vector v_r a comb filter whose tap position corresponds to the period code. The tap position may be an integer sample position, or a comb filter with a non-integer sample position may be realized by upsampling.
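
A comb filter of this kind can be sketched as a one-tap recursion. The recursive form p[n] = v[n] + beta * p[n - lag], the gain `beta`, and the restriction to integer lags are assumptions for illustration; the text only says that a comb filter with a tap at the pitch period is applied.

```python
def pitch_periodize(v, lag, beta=1.0):
    """Comb filter p[n] = v[n] + beta * p[n - lag] applied within one subframe (assumed form)."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    p = []
    for n, sample in enumerate(v):
        feedback = beta * p[n - lag] if n >= lag else 0.0
        p.append(sample + feedback)
    return p


v_r = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # toy fixed code vector
p = pitch_periodize(v_r, lag=3, beta=0.5)
print(p)
```

A lag longer than the subframe leaves the vector unchanged, since no sample has a predecessor one period back inside the subframe.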

[0021] In the configuration of FIG. 3, the following procedure is normally used. When the adaptive codebook 8-1 is searched, the optimum period code (or several period code candidates giving small distortion) is found as if the fixed codebook 2-2 did not exist. When the fixed codebook 2-2 is searched, the adaptive codebook component y_a obtained by synthesizing the adaptive code vector is first removed from the input speech x to give the input x_r, and the fixed code that minimizes the distortion between x_r and the component y_rp obtained by synthesizing the fixed code vector v_r is searched for. FIG. 4 shows a configuration example of the fixed code vector synthesis distortion calculation for this procedure. Because the pitch periodization section 8-7 of FIG. 3 can be interchanged with the multiplication section 2-5, a pitch periodization section 8-4 can be placed between the multiplication section 2-5 and the synthesis filter 4-1, as shown in FIG. 4. The fixed code vector v_r is sent to the multiplication section 2-5, which multiplies v_r by the weight g_r to produce the excitation vector candidate c_r and sends it to the pitch periodization section 8-4. After pitch periodization, c_r passes through the synthesis filter 4-1 to become the reproduced speech candidate y_rp, which passes through the perceptual weighting filter 4-2 and is sent to the distance calculation section 4-4. If the pitch periodization section 8-4, the synthesis filter 4-1, and the perceptual weighting filter 4-2 are represented by a single filter with the combined characteristics of the three, the computation required for the search can be reduced. However, when a filter with the combined characteristics of 8-4, 4-1, and 4-2 is expressed as an FIR filter, it has large coefficients near the tap position of the period considered to correspond to the pitch period, unlike an FIR filter with the characteristics of the synthesis filter 4-1 and the perceptual weighting filter 4-2 alone. The filter coefficients therefore cannot be truncated at a small number of taps to speed up the search further, as in the configuration of FIG. 2.

【００２２】[0022] FIG. 5 shows an embodiment of the present invention that solves this problem and calculates the distortion at high speed even when pitch periodization is used. In the configuration example of FIG. 5, as in the configuration example of FIG. 1, the filter combining the characteristics of the synthesis filter 4-1 and the perceptual weighting filter 4-2 in FIG. 4 is replaced with a perceptually weighted synthesis approximation filter 5-2. As in the configuration example of FIG. 1, the input x_r is passed through a synthesis inverse filter 6-3 and then through a perceptually weighted synthesis approximation filter 6-4 having the same characteristic as the filter 5-2, so that the distortion caused by the approximation is cancelled against the input side. In this configuration example, however, an inverse filter 10-4 of the pitch periodization filter 8-4 in FIG. 4 (a filter that removes the pitch periodicity) is additionally inserted on the input side of the speech x. In this configuration, if the perceptually weighted synthesis approximation filters 5-2 and 6-4 are replaced with finite-tap-length FIR perceptually weighted synthesis filters, as in the configuration example of FIG. 2, the codebook can be searched very quickly. As in the configuration example of FIG. 2, the tap length of the FIR filter is chosen between one tap, which uses no past sample values (sometimes called zero taps), and the subframe length, considering the balance between coding quality and the amount of computation. With the method of the present invention, it has been confirmed that, for a subframe of 80 samples, reducing the number of taps to about 2 to 6 causes almost no degradation in either the signal-to-noise ratio (SNR) or the perceptual quality of actual coded speech. In the configuration example of FIG. 5, when the synthesis inverse filter 6-3, the pitch periodization inverse filter 10-4, and the perceptually weighted synthesis approximation filter 6-4 are all linear filters, their order may be exchanged.
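As a rough illustration of why moving the pitch periodicity to the input side works, the following Python sketch pairs a pitch periodization filter with its inverse and shows that the inverse removes the periodicity again. It assumes a single-tap filter 1/(1 - g z^-T) with zero initial state; the function names, lag, and gain are hypothetical, and the filter actually used by the coder may differ.

```python
def pitch_periodize(c, lag, gain):
    """Pitch periodization filter 1/(1 - g z^-T): y[n] = c[n] + g * y[n - T]."""
    y = list(c)
    for n in range(lag, len(y)):
        y[n] += gain * y[n - lag]
    return y


def pitch_inverse(x, lag, gain):
    """Inverse filter 1 - g z^-T: r[n] = x[n] - g * x[n - T]."""
    return [x[n] - (gain * x[n - lag] if n >= lag else 0.0)
            for n in range(len(x))]


# An arbitrary sparse excitation vector (illustrative values only).
c = [1.0, 0.0, -0.5, 0.0, 0.0, 0.25, 0.0, 0.0]
periodized = pitch_periodize(c, lag=3, gain=0.5)
restored = pitch_inverse(periodized, lag=3, gain=0.5)
```

Because the inverse filter cancels the periodization exactly, comparing the un-periodized candidate against the inverse-filtered input involves only the short synthesis and weighting response, which is what makes truncation to a few taps possible.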

【００２３】[0023] FIG. 6 shows a configuration example that exploits the advantage that, in the method of the present invention, truncating the FIR filter to a finite length degrades the quality of the coded sound very little, so that the distortion calculation is performed efficiently and very high-speed speech coding is realized. The finite-tap-length FIR perceptually weighted synthesis filter coefficient calculation unit 11-1 calculates, from the synthesis filter coefficients â and the unquantized linear prediction parameters a, the filter coefficients obtained when a perceptually weighted synthesis filter, which combines the characteristics of the synthesis filter and the perceptual weighting filter, is realized in FIR form, and outputs the coefficients β obtained by truncating these filter coefficients to a finite tap length. The impulse response matrix generation unit 11-2 generates a triangular matrix whose elements are the FIR filter coefficients, as shown in equation (2), where N is the number of samples in a subframe. Because the coefficients β in equation (2) are truncated to a finite length, if the truncation order is k, for example, β_k through β_N-1 are 0 and the matrix takes the form of equation (3).
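Equations (2) and (3) are not reproduced in this text. A plausible reconstruction, assuming the usual lower-triangular Toeplitz convolution matrix with β_0 on the diagonal (the orientation is an assumption, since the text only says "triangular"), is:

$$H = \begin{pmatrix} \beta_0 & 0 & \cdots & 0 \\ \beta_1 & \beta_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{N-1} & \beta_{N-2} & \cdots & \beta_0 \end{pmatrix} \tag{2}$$

With β_k through β_N-1 equal to 0, only the diagonal and the k-1 sub-diagonals below it remain, shown here for k = 3:

$$H = \begin{pmatrix} \beta_0 & 0 & 0 & \cdots & 0 \\ \beta_1 & \beta_0 & 0 & \cdots & 0 \\ \beta_2 & \beta_1 & \beta_0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & \beta_2 & \beta_1 & \beta_0 \end{pmatrix} \tag{3}$$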

【００２４】[0024] The zero elements of this matrix need not be stored in memory. The correlation matrix generation unit 11-3 calculates H^t H from the impulse response matrix H (A^t denotes the transpose of a matrix A). Because the coefficients β_k through β_N-1 are 0, an N×N matrix calculation is unnecessary and H^t H can be obtained with a k×k matrix calculation. For example, since the quality of the coded sound hardly degrades even when k is set to a value from 2 to 6, a 5×5 matrix calculation, say, is a marked reduction in computation compared with the 80×80 matrix calculation needed when N = 80. The input speech x_r, from which the adaptive codebook component has been removed, passes through the synthesis inverse filter 6-3 and the pitch periodization inverse filter 10-4 and is input to the convolution unit 11-6. The convolution unit 11-6 passes the output r_p of the pitch periodization inverse filter 10-4 through the FIR filter with coefficients β to obtain the target speech x'_rp, which includes the tap truncation distortion, and then calculates x'_rp^t H from x'_rp and the matrix H by a time-reversed convolution or a matrix operation. Here too, the calculation is very fast if the truncation order k is small. The convolution unit 11-6 may instead use another approach: calculating r_p^t (H^t H) by a matrix operation from the output H^t H of the correlation matrix calculation unit 11-3 and the output r_p of the pitch inverse periodization filter 11-5. The values of x'_rp^t H and r_p^t (H^t H) are identical. The final distance measure calculation unit 11-7 calculates, from the fixed codebook component c_r of the driving excitation vector candidate, H^t H, and x'_rp^t H (or r_p^t H^t H), the distance measure

$$d' = \frac{\left(x'^{\,t}_{rp} H c_r\right)^2}{c_r^{\,t} H^{t} H c_r} \tag{4}$$

d' is sent to the codebook search control unit, and the code that maximizes the distance measure d' (equivalent to minimizing the distortion measure d) is selected.
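The distance measure of equation (4) can be sketched in a few lines of Python. This is a minimal illustration that assumes H is the lower-triangular Toeplitz matrix built from the truncated coefficients β and uses dense list-based matrices for clarity rather than speed; the helper names and the numeric values are hypothetical.

```python
def conv_matrix(beta, n):
    """Matrix H with H[i][j] = beta[i - j] for 0 <= i - j < len(beta), else 0."""
    k = len(beta)
    return [[beta[i - j] if 0 <= i - j < k else 0.0 for j in range(n)]
            for i in range(n)]


def matvec(m, v):
    """Matrix times column vector: M v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vecmat(v, m):
    """Row vector times matrix: v^t M."""
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]


def distance_measure(r_p, c_r, beta):
    """d' = (x'_rp^t H c_r)^2 / (c_r^t H^t H c_r), with x'_rp = H r_p."""
    h = conv_matrix(beta, len(c_r))
    x_target = matvec(h, r_p)      # target speech x'_rp (FIR filtering of r_p)
    xh = vecmat(x_target, h)       # x'_rp^t H (time-reversed convolution)
    hc = matvec(h, c_r)            # H c_r
    numerator = sum(a * b for a, b in zip(xh, c_r)) ** 2
    denominator = sum(v * v for v in hc)   # c_r^t H^t H c_r
    return numerator / denominator


beta = [1.0, 0.5]                  # assumed 2-tap truncated response
r_p = [1.0, 2.0, 0.0, -1.0]        # assumed inverse-filtered input
candidates = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 2.0, 0.0, -1.0]]
best = max(candidates, key=lambda c: distance_measure(r_p, c, beta))
```

A practical coder would precompute x'_rp^t H and H^t H once per subframe and exploit the band structure of H; the sketch recomputes them per candidate only to keep the code short.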

【００２５】[0025] In the above description, the synthesis approximation filter need not have a perceptual weighting characteristic. In the claims, "frame" may mean either a frame or a subframe obtained by dividing a frame.

【００２６】[0026]

[Effects of the Invention] The following experiment was conducted to confirm the effects of the present invention. A 4.6 kbit/s Dual-Pulse CS-CELP coder was constructed. The frame length is 20 ms and the subframe length is 10 ms (80 samples); LPC quantization is performed once per frame and everything else once per subframe. The bit allocation per frame is 22 bits for the LSP, 8 × 2 bits for the adaptive code, 20 × 2 bits for the Dual-Pulse code, and 7 × 2 bits for the gain code, for a total of 92 bits (4.6 kbit/s). Three Dual-Pulse sets are placed per subframe, with 11 bits assigned to position, 6 bits to pattern, and 3 bits to sign.
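The bit budget stated above can be checked with a short calculation. The following Python sketch only restates the figures given in the text; the variable names are illustrative.

```python
FRAME_MS = 20
SUBFRAMES_PER_FRAME = 2   # 20 ms frame / 10 ms subframe

bits_per_frame = {
    "LSP": 22,
    "adaptive code": 8 * SUBFRAMES_PER_FRAME,
    "Dual-Pulse code": 20 * SUBFRAMES_PER_FRAME,
    "gain code": 7 * SUBFRAMES_PER_FRAME,
}

total_bits = sum(bits_per_frame.values())
bit_rate = total_bits * 1000 // FRAME_MS   # bits per second

# Per-subframe Dual-Pulse code: position + pattern + sign bits.
dual_pulse_bits = 11 + 6 + 3

print(total_bits, bit_rate, dual_pulse_bits)
```

The totals agree with the text: 92 bits per 20 ms frame corresponds to 4600 bit/s, and the position, pattern, and sign bits add up to the 20-bit Dual-Pulse code per subframe.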

【００２７】[0027] Real speech data was input to this coder to examine the performance of the method of the present invention. The speech data was sampled at 8 kHz and filtered to the ITU-T G.712 band. FIG. 7 shows the relationship between the truncation order and the WSNR when the taps of the FIR filter are truncated to a finite length. Because the WSNR is measured between the final synthesized speech and the input speech, it is the same measure regardless of the number of taps retained. Method (1) in the figure is the case where the target speech for distortion minimization is obtained by the conventional method and only the taps of the filter used for the codebook search are truncated. In this case, the quality degrades rapidly at 20 taps or fewer. Method (2) is the case where the method of the present invention shown in FIG. 2, which does not use the pitch periodization inverse filter, is applied. With this method, the WSNR hardly changes down to about 2 taps. Method (3) is the case where the method of the present invention shown in FIG. 6, which uses the pitch periodization inverse filter, is applied. Because the 4.6 kbit/s Dual-Pulse CS-CELP pitch-periodizes the Dual Pulse and uses it as the driving excitation, method (3) enables very high-speed coding. Compared with method (2), the quality is lower by about 0.3 dB overall, but, as with method (2), the WSNR does not drop much as the number of taps is reduced.

【００２８】[0028] Perceptually as well, using about 6 taps gives almost no audible degradation compared with using all taps, and method (3) sounds only slightly degraded relative to method (2). As described above, it has been confirmed that, according to the present invention, the quality degradation is very small even when the filter is truncated to a very small number of taps to achieve a high-speed codebook search, that is, high-speed speech coding.

[Brief description of the drawings]

FIG. 1 is a diagram showing the functional configuration of a method, on which the present invention is premised, for calculating the distance between a perceptually weighted reproduced speech candidate that includes an approximation error and a modified target speech that likewise includes an approximation error.

FIG. 2 is a functional configuration diagram showing an example in which, in the method shown in FIG. 1, the perceptually weighted synthesis approximation filter is expressed as a finite-tap-length FIR filter.

FIG. 3 is a diagram showing a functional configuration example of a driving excitation vector generation unit in which fixed code vector candidates are pitch-periodized before use.

FIG. 4 is a diagram showing a functional configuration example of a fixed code vector synthesis distortion calculation method when the configuration of FIG. 3 is used.

FIG. 5 is a diagram showing the functional configuration of a distortion calculation method in which the present invention is applied to the case with pitch periodization shown in FIG. 3 and a pitch periodization inverse filter is inserted on the input side.

FIG. 6 is a diagram showing a functional configuration example of a method according to the present invention that truncates the FIR filter to a finite length, performs the distortion calculation efficiently, and realizes very high-speed speech coding.

FIG. 7 is a graph showing the relationship between the truncation order of the FIR filter taps and the WSNR when the present invention is applied to actual speech coding.

FIG. 8 is a diagram showing a functional configuration example of Code-Excited Linear Prediction (CELP) coding of speech.

FIG. 9 is a diagram showing a functional configuration example of the driving excitation vector generation unit in FIG. 8.

FIG. 10 is a diagram showing a functional configuration example of a decoding method corresponding to Code-Excited Linear Prediction (CELP) coding of speech.

FIG. 11 is a diagram showing a functional configuration example for calculating distortion with perceptual weighting taken into account.

FIG. 12 is a diagram showing an example of a conventional high-speed distortion calculation method, in which an approximation filter of the perceptually weighted synthesis filter is used for the synthesis distortion calculation.

Continuation of the front page
(56) References: JP-A-8-248996 (JP, A); JP-T-7-506202 (JP, A); Miki et al., "Basic algorithm of PSI-CELP speech coding," NTT R&D, Vol. 43, No. 4, pp. 363-372 (1994)
(58) Fields investigated (Int. Cl.⁷, DB name): G10L 11/00 - 21/06; H03M 7/30; H03M 7/42; JICST file (JOIS)

Claims

(57) [Claims]

1. An acoustic signal encoding method that uses an adaptive codebook in which adaptive codebook vectors are recorded and a fixed codebook in which fixed codebook vectors are recorded, and that selects the fixed codebook vector maximizing a distance measure between a driving excitation vector based on a fixed codebook vector candidate extracted from the fixed codebook and an input acoustic signal from which the adaptive codebook component has been removed, the method comprising: a step of calculating linear prediction parameters from the input acoustic signal; a step of calculating synthesis filter coefficients by quantizing the linear prediction parameters; a step of approximating the synthesis filter coefficients by a finite-length impulse response; a step of generating an impulse response matrix expressed as a triangular matrix having the impulse response as its elements; a step of calculating a correlation matrix consisting of the product of the impulse response matrix and the transpose of the impulse response matrix; a step of passing the input acoustic signal from which the adaptive codebook component has been removed through a synthesis inverse filter having the inverse filter characteristic of the synthesis filter coefficients, thereby converting it into an ideal driving excitation vector; a convolution step of obtaining a target speech vector by multiplying the ideal driving excitation vector by the impulse response matrix and of multiplying the target speech vector by the impulse response matrix; and a step of calculating the distance measure by dividing the product of the target speech vector multiplied by the impulse response matrix and the fixed codebook vector candidate by the product of the correlation matrix and the transposed vector of the fixed codebook vector candidate.

2. The acoustic signal encoding method according to claim 1, wherein the tap length of the synthesis filter is set to 2 taps or more and 6 taps or less.

3. The acoustic signal encoding method according to claim 1 or claim 2, further comprising: a step of obtaining the driving excitation vector by periodizing, with a periodization filter, the fixed codebook vector candidate extracted from the fixed codebook at a period corresponding to the periodic code input to the adaptive codebook; and a step of passing the input speech from which the adaptive codebook component has been removed, the ideal driving excitation vector, or the target speech vector through a periodization inverse filter having the inverse characteristic of the periodization filter.

4. The acoustic signal encoding method according to any one of claims 1 to 3, further comprising a step of calculating, from the synthesis filter coefficients and the linear prediction parameters, perceptually weighted synthesis filter coefficients truncated at the finite length, wherein the perceptually weighted synthesis filter coefficients are used as the synthesis filter coefficients.

5. The acoustic signal encoding method according to any one of the preceding claims, wherein the correlation matrix, once calculated, is expanded and stored in a memory, and the distance measure calculation is performed by referring to the values of the correlation matrix stored in the memory.