JP3292711B2 - Voice encoding / decoding method and apparatus

Info

Publication number: JP3292711B2
Application number: JP22380499A
Authority: JP
Inventors: 誠司佐々木
Original assignee: 株式会社ワイ・アール・ピー高機能移動体通信研究所
Priority date: 1999-08-06
Filing date: 1999-08-06
Publication date: 2002-06-17
Anticipated expiration: 2019-08-06
Also published as: JP2001051698A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech encoding/decoding method and apparatus that encode and decode speech signals at a low bit rate.

【０００２】[0002]

2. Description of the Related Art

The 2.4 kbps LPC (Linear Predictive Coding) method and the 2.4 kbps MELP (Mixed Excitation Linear Prediction) method are known as conventional low-bit-rate speech coding methods. Both are U.S. Federal Government standard speech coding methods. The former is standardized as FS-1015 (FS stands for Federal Standard), and the latter was newly selected and standardized in 1996 as an improved-quality successor to FS-1015.

【０００３】[0003]

The following references describe the 2.4 kbps LPC method and the 2.4 kbps MELP method.

[1] Federal Standard 1015, "Analog to Digital Conversion of Voice by 2400 Bit/Second Linear Predictive Coding," November 28, 1984.
[2] Federal Information Processing Standards Publication, "Analog to Digital Conversion of Voice by 2400 Bit/Second Mixed Excitation Linear Prediction (MELP)," Draft, May 28, 1998.
[3] L. Supplee, R. Cohn, J. Collura, and A. McCree, "MELP: The New Federal Standard at 2400 bps," Proc. ICASSP, pp. 1591-1594, 1997.
[4] A. McCree and T. Barnwell III, "A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 242-250, July 1995.
[5] D. Thomson and D. Prezas, "Selective Modeling of the LPC Residual During Unvoiced Frames: White Noise or Pulse Excitation," Proc. ICASSP, pp. 3087-3090, 1986.

【０００４】[0004]

First, the principle of the 2.4 kbps LPC method is described with reference to FIGS. 10 and 11 (see reference [1] for details of the processing). FIG. 10 is a block diagram showing the configuration of the LPC speech encoder. The framer (11) is a buffer that stores input speech samples (a1) that have been band-limited to 100-3600 Hz, sampled at 8 kHz, and quantized with at least 12-bit precision. For each speech coding frame (22.5 ms) it takes in the speech samples (180 samples) and outputs them to the speech coding section as (b1). The processing executed for each speech coding frame is described below.

【０００５】[0005]

The pre-emphasis unit (12) applies high-frequency emphasis to (b1) and outputs the emphasized signal (c1). The linear prediction analyzer (13) performs linear prediction analysis on (c1) using the Durbin-Levinson method and outputs 10th-order reflection coefficients (d1), which represent the spectral envelope. Quantizer 1 (14) scalar-quantizes (d1) order by order and outputs the result (e1), 41 bits in total, to the error correction coding/bit packing unit (19). Table 1 shows the bit allocation for the reflection coefficient of each order. The RMS (root mean square) calculator (15) computes the RMS value, which is the level information of the emphasized signal (c1), and outputs it as (f1). Quantizer 2 (16) quantizes (f1) with 5 bits and outputs the result (g1) to the error correction coding/bit packing unit (19).
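The analysis chain of this paragraph (pre-emphasis, Durbin-Levinson analysis, RMS) could be sketched as follows. This is a minimal illustration and not the FS-1015 implementation: the pre-emphasis coefficient `mu`, the plain autocorrelation estimate without windowing, the sign convention of the reflection coefficients, and all function names are assumptions introduced here.

```python
import math

def pre_emphasize(x, mu=0.9375):
    """y[n] = x[n] - mu * x[n-1]; mu is an assumed coefficient."""
    return [x[n] - (mu * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + sum_i a[i] z^-i.

    Returns (a, k): predictor coefficients and reflection coefficients.
    Here k[i-1] equals a[i] at recursion step i; other sign conventions exist.
    Assumes r[0] > 0.
    """
    a = [1.0] + [0.0] * order
    k = []
    err = r[0]  # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        ki = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + ki * prev[i - j]
        a[i] = ki
        k.append(ki)
        err *= 1.0 - ki * ki
    return a, k

def rms(x):
    """Root mean square level of a frame."""
    return math.sqrt(sum(v * v for v in x) / len(x))
```

For a 180-sample frame `b1`, one would compute `c1 = pre_emphasize(b1)`, then `levinson_durbin(autocorrelation(c1, 10), 10)` for the 10 reflection coefficients and `rms(c1)` for the level.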

【０００６】[0006]

The pitch detector/acoustic classifier (17) receives the output (b1) of the framer (11) and extracts the pitch period (in the range of 20 to 156 samples, corresponding to 51 to 400 Hz) and the acoustic classification (voiced, unvoiced, or transition), outputting them as (h1) and (i1), respectively. Quantizer 3 (18) quantizes (h1) and (i1) jointly with 7 bits and outputs the result (j1) to the error correction coding/bit packing unit (19). The quantization, that is, the assignment of pitch and acoustic classification to the 7-bit code (128 codewords), is as follows. The all-zero codeword and the codewords in which exactly one of the 7 bits is 1 are assigned to unvoiced. The all-one codeword and the codewords in which exactly one of the 7 bits is 0 are assigned to transition. The remaining codewords are assigned to pitch period values for voiced frames. The error correction coding/bit packing unit (19) packs the quantized information (e1), (g1), and (j1) into 54 bits per frame to form a speech coding information frame and outputs 54 bits per frame as (k1). In the case of radio communication, the speech information bit stream (k1) is transmitted to the receiving side through a modulator and a radio.
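The codeword assignment described above depends only on the number of 1 bits in the 7-bit code. A minimal sketch of the receiving-side classification, with hypothetical function names, might look like this:

```python
def classify_lpc_codeword(code):
    """Classify a 7-bit pitch/classification codeword by its Hamming weight."""
    weight = bin(code & 0x7F).count("1")
    if weight <= 1:      # all zeros, or exactly one bit set
        return "unvoiced"
    if weight >= 6:      # all ones, or exactly one bit cleared
        return "transition"
    return "voiced"      # remaining codewords carry a pitch period

def lpc_codeword_counts():
    """Count how many of the 128 codewords fall into each class."""
    counts = {"unvoiced": 0, "transition": 0, "voiced": 0}
    for code in range(128):
        counts[classify_lpc_codeword(code)] += 1
    return counts
```

One consequence of this grouping is that the all-zero and all-one codewords keep their class under any single bit error. Whether the encoder transmits only those two words for unvoiced and transition frames is not stated here.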

【０００７】[0007]

Table 1 shows the bit allocation per frame. As the table shows, when the acoustic classification of a frame is not voiced (that is, unvoiced or transition), the error correction coding/bit packing unit (19) sends an error correction code (20 bits) instead of the 5th- to 10th-order reflection coefficients. The information protected in the unvoiced or transition case is the upper 4 bits of the RMS information and the 1st- to 4th-order reflection coefficient information. In addition, one synchronization bit is added to each frame.
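The frame figures quoted so far for a voiced LPC frame are mutually consistent, which a short check makes explicit. The dictionary keys are labels chosen here; the numbers are the ones stated in the text.

```python
FRAME_MS = 22.5          # speech coding frame length
SAMPLE_RATE_HZ = 8000    # sampling rate

# Bits per voiced frame as stated in the description.
voiced_frame_bits = {
    "reflection_coefficients": 41,
    "rms": 5,
    "pitch_and_classification": 7,
    "sync": 1,
}

samples_per_frame = round(SAMPLE_RATE_HZ * FRAME_MS / 1000)
total_bits = sum(voiced_frame_bits.values())
bit_rate_bps = total_bits * 1000 / FRAME_MS
```

This gives 180 samples and 54 bits per frame, and 54 bits every 22.5 ms corresponds to 2400 bit/s.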

【０００８】[0008]

【表１】 [Table 1]

【０００９】[0009]

Next, the configuration of the LPC speech decoder is described with reference to FIG. 11. The bit separation/error correction decoder (21) separates the 54-bit speech information bit stream (a2) received for each frame into the individual parameters and, when the frame is unvoiced or transition, applies error correction decoding to the protected bits. It then outputs the resulting pitch/acoustic classification bits (b2), 10th-order reflection coefficient bits (e2), and RMS bits (g2). The pitch/acoustic classification decoder (22) decodes (b2) and outputs the pitch period (c2) and the acoustic classification (d2). The reflection coefficient decoder (23) decodes (e2) and outputs the 10th-order reflection coefficients (f2). The RMS decoder (24) decodes (g2) and outputs the RMS information (h2). To improve the quality of the reproduced speech, the parameter interpolator (25) interpolates the parameters (c2), (d2), (f2), and (h2) and outputs the results (i2), (j2), (o2), and (r2), respectively.

【００１０】[0010]

The excitation signal (m2) is then generated as follows. When the interpolated acoustic classification (j2) indicates voiced, the acoustic classification switch (28) selects the pulse excitation (k2), which the pulse excitation generator (26) generates in synchronization with the pitch period (i2). When (j2) indicates unvoiced, the switch selects the white noise (l2) generated by the noise generator (27). When (j2) indicates transition, the switch selects the pulse excitation (k2) for the voiced portion of the frame and the white noise (pseudo-random excitation) (l2) for the unvoiced portion. The boundary between the voiced and unvoiced portions within the frame is determined by the parameter interpolator (25). The pitch period information (i2) used to generate the pulse excitation (k2) in this case is taken from the adjacent voiced frame. The output of the acoustic classification switch (28) is the excitation signal (m2).
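The switching between pulse and noise excitation could be sketched as below. This is an illustration under stated assumptions: unit-amplitude pulses, uniformly distributed noise, a per-sample `voiced_mask` standing in for the boundary chosen by the parameter interpolator, and no pulse phase continuity across frames. None of these details are taken from the standard.

```python
import random

def pulse_train(n_samples, pitch_period):
    """Unit pulses spaced one pitch period apart, starting at sample 0."""
    exc = [0.0] * n_samples
    for n in range(0, n_samples, pitch_period):
        exc[n] = 1.0
    return exc

def white_noise(n_samples, rng):
    """Pseudo-random excitation, uniform in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def lpc_excitation(voiced_mask, pitch_period, rng):
    """Select pulses where the mask is True and noise where it is False.

    A voiced frame has an all-True mask, an unvoiced frame an all-False
    mask, and a transition frame a mask that changes at the boundary.
    """
    n = len(voiced_mask)
    pulses = pulse_train(n, pitch_period)
    noise = white_noise(n, rng)
    return [pulses[i] if voiced_mask[i] else noise[i] for i in range(n)]
```

For example, `lpc_excitation([n >= 90 for n in range(180)], 60, random.Random(0))` yields noise in the first half of the frame and pitch pulses in the second half.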

【００１１】[0011]

The LPC synthesis filter (30) is an all-pole filter whose coefficients are the linear prediction coefficients (p2). It imposes the spectral envelope on the excitation signal (m2) and outputs the resulting signal (n2). The linear prediction coefficients (p2), which carry the spectral envelope information, are computed from the reflection coefficients (o2) by the linear prediction coefficient calculator (29). The LPC synthesis filter (30) is configured as a 10th-order all-pole filter using 10th-order linear prediction coefficients (p2) for voiced speech and as a 4th-order all-pole filter using 4th-order linear prediction coefficients (p2) for unvoiced speech. The gain adjuster (31) adjusts the gain of the output (n2) of the LPC synthesis filter (30) using the RMS information (r2) and outputs (q2). Finally, the de-emphasis unit (32) applies to (q2) the inverse of the processing of the pre-emphasis unit (12) and outputs the reproduced speech (s2).
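The synthesis path of this paragraph (reflection coefficients to predictor coefficients, all-pole filtering, gain adjustment, de-emphasis) could be sketched as follows. The step-up recursion and direct-form filter are standard techniques. The coefficient `mu`, the sign convention A(z) = 1 + sum a[i] z^-i, the whole-frame RMS scaling, and the function names are assumptions made for this sketch, and filter state is not carried between frames.

```python
import math

def reflection_to_lpc(k):
    """Step-up recursion: reflection coefficients to A(z) coefficients."""
    a = [1.0]
    for ki in k:
        prev = a + [0.0]
        m = len(prev) - 1
        a = [prev[j] + ki * prev[m - j] for j in range(m + 1)]
    return a

def all_pole_filter(a, x):
    """y[n] = x[n] - sum_{i>=1} a[i] * y[n-i], zero initial state."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for i in range(1, min(len(a), n + 1)):
            acc -= a[i] * y[n - i]
        y.append(acc)
    return y

def scale_to_rms(x, target_rms):
    """Scale a frame so that its RMS equals target_rms."""
    current = math.sqrt(sum(v * v for v in x) / len(x))
    return [v * target_rms / current for v in x] if current > 0 else list(x)

def de_emphasize(x, mu=0.9375):
    """Inverse of y[n] = x[n] - mu * x[n-1]; mu is an assumed coefficient."""
    return all_pole_filter([1.0, -mu], x)

def synthesize_frame(excitation, k, voiced, target_rms, mu=0.9375):
    """10th-order synthesis for voiced frames, 4th-order for unvoiced."""
    order = 10 if voiced else 4
    a = reflection_to_lpc(k[:order])
    shaped = all_pole_filter(a, excitation)
    return de_emphasize(scale_to_rms(shaped, target_rms), mu)
```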

【００１２】[0012]

The LPC method has the following problems (reference [4]).

Problem A: The LPC method switches among voiced, unvoiced, and transition for each frame over the entire frequency band. When the excitation of natural speech is observed in narrow frequency bands, however, some bands have a voiced character and others an unvoiced character. In a frame that the LPC method classifies as voiced, components that should be driven by noise are therefore driven by pulses, which produces a buzzing sound. This is most noticeable at higher frequencies.

Problem B: In a transition from unvoiced to voiced, the excitation may contain aperiodic pulses, but the transition frame of the LPC method cannot represent an aperiodic pulse excitation. Tonal noise results.

Because of this buzz and tonal noise, speech reproduced by the LPC method has a mechanical quality that is hard to listen to.

【００１３】[0013]

Next, the MELP method, which addresses these problems of the LPC method to improve speech quality, is described (references [2]-[4]). First, how the MELP method improves quality is explained with reference to FIG. 12. As shown in FIG. 12(a), when natural speech is divided into bands along the frequency axis, there are bands in which the periodic pulse component dominates (voiced, shown in white) and bands in which the noise component dominates (unvoiced, shown in black). As noted above, the main reason the LPC vocoder sounds mechanical is that, as shown in FIG. 12(b), it represents the excitation over the entire frequency band by a periodic pulse component in voiced frames and by a noise component in unvoiced frames (in transition frames, the frame is divided in time into voiced and unvoiced portions). To solve this, the MELP method applies a mixed excitation by switching between voiced and unvoiced in each of five frequency bands (subbands) within one frame, as shown in FIG. 12(c). This solves Problem A of the LPC method and reduces buzz in the reproduced speech. To solve Problem B, the MELP method extracts and transmits aperiodic pulse information and generates an aperiodic pulse excitation at the decoder. In addition, to improve the quality of the reproduced speech, it employs an adaptive spectral enhancement filter, a pulse dispersion filter, and harmonic amplitude information. Table 2 summarizes the effect of each technique used in the MELP method.

【００１４】[0014]

【表２】 [Table 2]

【００１５】[0015]

Next, the configuration of the 2.4 kbps MELP method is described with reference to FIGS. 13 and 14 (see reference [2] for details of the processing). FIG. 13 is a block diagram showing the configuration of the MELP speech encoder. The framer (41) is a buffer that stores input speech samples (a3) that have been band-limited to 100-3800 Hz, sampled at 8 kHz, and quantized with at least 12-bit precision. For each speech coding frame (22.5 ms) it takes in the speech samples (180 samples) and outputs them to the speech coding section as (b3). The processing executed for each speech coding frame is described below.

【００１６】[0016]

The gain calculator (42) computes the logarithm of the RMS value, which is the level information of (b3), and outputs the result (c3). This is done for the first half and the second half of the frame, so two log-RMS values per frame are output as (c3). Quantizer 1 (43) linearly quantizes (c3) with 3 bits for the first-half value and 5 bits for the second-half value and outputs the result (d3) to the error correction coding/bit packing unit (70). The linear prediction analyzer (44) performs linear prediction analysis on (b3) using the Durbin-Levinson method and outputs 10th-order linear prediction coefficients (e3), which represent the spectral envelope. The LSF coefficient calculator (45) converts the 10th-order linear prediction coefficients (e3) into 10th-order LSF (Line Spectrum Frequencies) coefficients (f3). LSF coefficients are feature parameters equivalent to linear prediction coefficients, but they have better quantization and interpolation properties and are therefore used in most recent speech coding methods. Quantizer 2 (46) quantizes the 10th-order LSF coefficients (f3) with 25 bits by four-stage multistage vector quantization and outputs (g3) to the error correction coding/bit packing unit (70).
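The gain path (log RMS of each half frame, then linear quantization with 3 and 5 bits) could be sketched as follows. The decibel scale and the 10 to 77 dB quantizer range are assumptions chosen for illustration, since the description above gives only the bit counts. The function names are hypothetical.

```python
import math

def log_rms_db(x, floor=1e-9):
    """20*log10(RMS) of a block, computed from the mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    return 10.0 * math.log10(max(mean_sq, floor))

def uniform_quantize(value, lo, hi, bits):
    """Linear quantizer: map [lo, hi] onto indices 0 .. 2**bits - 1."""
    top = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * top)

def uniform_dequantize(index, lo, hi, bits):
    """Reconstruction value for a quantizer index."""
    top = (1 << bits) - 1
    return lo + index * (hi - lo) / top

def encode_gain(frame, lo_db=10.0, hi_db=77.0):
    """Return (3-bit index for the first half, 5-bit index for the second)."""
    half = len(frame) // 2
    g1 = log_rms_db(frame[:half])
    g2 = log_rms_db(frame[half:])
    return (uniform_quantize(g1, lo_db, hi_db, 3),
            uniform_quantize(g2, lo_db, hi_db, 5))
```

With these assumed ranges, the 5-bit value has a step of about 2.2 dB and the 3-bit value a step of about 9.6 dB.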

【００１７】[0017]

The pitch detector (54) first determines an integer pitch period from the signal components of the framer (41) output (b3) at or below 1 kHz. It then determines a fractional pitch period from this integer pitch period and the signal (q3), obtained by band-limiting (b3) to 500 Hz or below with the LPF (low-pass filter) (55), and outputs it as (r3). The pitch period is given as the lag that maximizes the normalized autocorrelation function, and the maximum value (o3) of the normalized autocorrelation function at that lag is also output. The magnitude of this maximum indicates how strongly periodic the input signal (b3) is and is used by the aperiodic flag generator (56), described later. The maximum value (o3) is also corrected by the correlation function corrector (53), described later, and then used for the full-band voiced/unvoiced decision in the error correction coding/bit packing unit (70). There, the frame is judged unvoiced if the corrected maximum (n3) is less than or equal to a threshold (= 0.6) and voiced otherwise. Quantizer 3 (57) takes the fractional pitch period (r3) from the pitch detector (54), converts it to the logarithmic domain, linearly quantizes it with 99 levels, and outputs the result (s3) to the error correction coding/bit packing unit (70).
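The integer pitch search, the full-band voicing decision, and the logarithmic pitch quantization could be sketched as below. This is a simplified illustration: the low-pass filtering and the fractional refinement are omitted, and the lag range of 20 to 160 samples and the function names are assumptions not given in the text above. The 0.6 threshold and the 99 levels are the values stated in the description.

```python
import math

def normalized_autocorrelation(x, lag):
    """Correlation between x[i] and x[i+lag], normalized to [-1, 1]."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    e0 = sum(x[i] * x[i] for i in range(n))
    e1 = sum(x[i + lag] * x[i + lag] for i in range(n))
    den = math.sqrt(e0 * e1)
    return num / den if den > 0 else 0.0

def find_integer_pitch(x, min_lag=20, max_lag=160):
    """Return (lag, peak) where lag maximizes the normalized autocorrelation."""
    best = max(range(min_lag, max_lag + 1),
               key=lambda lag: normalized_autocorrelation(x, lag))
    return best, normalized_autocorrelation(x, best)

def is_voiced(corrected_peak, threshold=0.6):
    """Unvoiced if the corrected peak is <= threshold, voiced otherwise."""
    return corrected_peak > threshold

def quantize_log_pitch(pitch, min_pitch=20.0, max_pitch=160.0, levels=99):
    """Linear quantization of log(pitch) into `levels` indices."""
    lo, hi = math.log10(min_pitch), math.log10(max_pitch)
    pos = (math.log10(pitch) - lo) / (hi - lo)
    return round(min(max(pos, 0.0), 1.0) * (levels - 1))
```

A plain maximum over all lags can lock onto a multiple of the true period, which is one reason practical coders add further checks. The sketch does not attempt this.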

【００１８】[0018]

The four BPFs (band-pass filters) (58), (59), (60), and (61) band-limit the output (b3) of the framer (41) to 500-1000 Hz, 1000-2000 Hz, 2000-3000 Hz, and 3000-4000 Hz, respectively, and output (t3), (u3), (v3), and (w3). The four autocorrelation calculators (62), (63), (64), and (65) compute the normalized autocorrelation function of (t3), (u3), (v3), and (w3), respectively, at the lag corresponding to the fractional pitch period (r3) and output the results as (x3), (y3), (z3), and (a4). Next, the four voiced/unvoiced flag generators (66), (67), (68), and (69) judge (x3), (y3), (z3), and (a4), respectively, to be unvoiced if the value is less than or equal to a threshold (= 0.6) and voiced otherwise, and output 1-bit voiced/unvoiced flags (b4), (c4), (d4), and (e4) to the correlation function corrector (53). These per-band voiced/unvoiced flags are used at the decoder to generate the mixed excitation. The aperiodic flag generator (56) receives the maximum value (o3) of the normalized autocorrelation function, sets the aperiodic flag to ON if (o3) is smaller than a threshold (= 0.5) and to OFF otherwise, and outputs the 1-bit aperiodic flag (p3) to the error correction coding/bit packing unit (70). The aperiodic flag (p3) is used at the decoder to generate aperiodic pulses that represent the excitation of transitions and plosives.
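The two threshold decisions of this paragraph are simple enough to state directly in code. The sketch below takes the band correlations as already computed and uses hypothetical function names. Only the thresholds 0.6 and 0.5 and the direction of each comparison come from the text.

```python
BAND_VOICING_THRESHOLD = 0.6
APERIODIC_THRESHOLD = 0.5

def band_voicing_flags(band_correlations):
    """1 (voiced) if the band correlation exceeds 0.6, else 0 (unvoiced).

    band_correlations is ordered as the four upper bands:
    500-1000, 1000-2000, 2000-3000, 3000-4000 Hz.
    """
    return [1 if c > BAND_VOICING_THRESHOLD else 0 for c in band_correlations]

def aperiodic_flag(peak_correlation):
    """True (ON) if the uncorrected correlation peak is below 0.5."""
    return peak_correlation < APERIODIC_THRESHOLD
```

Note that the aperiodic flag is derived from the uncorrected peak (o3), whereas the full-band voicing decision uses the corrected value (n3).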

【００１９】[0019]

LPC analysis filter 1 (51) is an all-zero filter whose coefficients are the 10th-order linear prediction coefficients (e3). It removes the spectral envelope from the input speech (b3) and outputs the resulting residual signal (l3). The peakiness calculator (56) receives the residual signal (l3), computes the peakiness value, and outputs it as (m3). The peakiness value is a parameter indicating how likely it is that the signal contains pulse-like components (spikes) with peaks. Following reference [5], it is given by the following equation.

【数１】 [Equation 1]

Here, N is the number of samples in one frame and e_n is the residual signal. Because the numerator of equation (1) is more sensitive to large values than the denominator, p becomes large when the residual signal contains large spikes. Accordingly, the larger the peakiness value, the more likely it is that the frame is a voiced frame with jitter, which is common in transitions, or a plosive frame (such frames contain spikes (sharp peaks) in places, while the rest of the signal resembles white noise).
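Assuming equation (1) has the usual form of the peakiness measure, that is, the RMS of the residual divided by its mean absolute value, $p = \sqrt{\tfrac{1}{N}\sum_{n=1}^{N} e_n^2} \,/\, \bigl(\tfrac{1}{N}\sum_{n=1}^{N} |e_n|\bigr)$, the computation could be sketched as follows. This form matches the remark that the numerator is more sensitive to large values than the denominator, but it is a reconstruction and not a transcription of the equation.

```python
import math

def peakiness(e):
    """Ratio of RMS to mean absolute value of a residual frame (assumed form)."""
    n = len(e)
    rms = math.sqrt(sum(v * v for v in e) / n)
    mean_abs = sum(abs(v) for v in e) / n
    return rms / mean_abs if mean_abs > 0 else 0.0
```

Under this form a constant-magnitude signal gives p = 1, Gaussian noise gives about 1.25 (the square root of pi/2), and a single spike in an otherwise silent 180-sample frame gives the square root of 180, about 13.4.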

【００２０】[0020]

The correlation function corrector (53) corrects the maximum value (o3) of the normalized autocorrelation function and the voiced/unvoiced flags (b4) and (c4) according to the peakiness value (m3). If the peakiness value (m3) is greater than 1.34, the maximum value (o3) is set to 1.0 (indicating voiced). If the peakiness value (m3) is greater than 1.6, the maximum value (o3) is set to 1.0 (indicating voiced) and, in addition, the voiced/unvoiced flags (b4) and (c4) are set to voiced. The flags (d4) and (e4) are also input to the correlation function corrector (53) but are output without correction. The corrected maximum value of the normalized autocorrelation function is output as (n3), and the corrected flags (b4) and (c4) together with (d4) and (e4) are output as the per-band voicing information (f4).
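The correction rule can be written as a small function. The thresholds 1.34 and 1.6 are those stated above. The function name and the ordering of the flag list as (b4, c4, d4, e4) are conventions chosen for this sketch.

```python
def correct_correlation(peakiness_value, peak_correlation, band_flags):
    """Apply the peakiness-based correction.

    band_flags is [b4, c4, d4, e4] with 1 = voiced, 0 = unvoiced.
    Returns (n3, f4): the corrected peak and the per-band voicing flags.
    """
    flags = list(band_flags)
    if peakiness_value > 1.34:
        peak_correlation = 1.0      # force the full-band decision to voiced
    if peakiness_value > 1.6:
        flags[0] = 1                # b4 (500-1000 Hz) set to voiced
        flags[1] = 1                # c4 (1000-2000 Hz) set to voiced
    return peak_correlation, flags  # d4 and e4 pass through unchanged
```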

【００２１】[0021]

As described above, a voiced frame with jitter or a plosive frame contains spikes (sharp peaks) in places while the rest of the signal resembles white noise, so its normalized autocorrelation function is likely to be smaller than 0.5 (in which case the aperiodic flag is set to ON). If such a frame is detected from its peakiness value and the normalized autocorrelation function is corrected to 1.0, the frame is judged voiced in the subsequent full-band voiced/unvoiced decision in the error correction coding/bit packing unit (70), and aperiodic pulses are used as the excitation at decoding. This improves the quality of voiced frames with jitter and of plosive frames.

【００２２】[0022]

Next, the detection of harmonic information is described. The linear prediction coefficient calculator (47) converts the quantized LSF coefficients (g3) output by quantizer 2 (46) into linear prediction coefficients and outputs the quantized linear prediction coefficients (h3). LPC analysis filter 2 (48) removes the spectral envelope component from the input signal (b3) using (h3) as its coefficients and outputs the residual signal (i3). The harmonic detector (49) extracts the amplitudes of the 10th-order harmonics (harmonic components of the fundamental pitch frequency) in (i3) and outputs the result (j3). Quantizer 4 (50) vector-quantizes (j3) with 8 bits and outputs the index (k3) to the error correction coding/bit packing unit (70). The harmonic amplitude information corresponds to the spectral envelope information remaining in the residual signal (i3). Sending it therefore allows the spectrum of the input signal to be represented more accurately at decoding, which improves the quality of nasals, speaker recognizability, and the quality of vowels in the presence of wideband noise (Table 2).
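How the harmonic detector measures the amplitudes is not detailed above. One simple possibility, shown here purely as an illustration, is to evaluate the discrete-time Fourier transform of the residual at the first 10 multiples of the pitch frequency. The function name and the 2/N amplitude scaling are choices made for this sketch, and a practical coder would more likely use an FFT with a peak search around each harmonic.

```python
import cmath
import math

def harmonic_amplitudes(residual, pitch_period, n_harmonics=10):
    """Amplitude of the residual at harmonics 1..n_harmonics of the pitch.

    pitch_period is in samples and may be fractional. The 2/N scaling
    makes a unit-amplitude cosine at a harmonic read as 1.0.
    """
    n_samples = len(residual)
    amps = []
    for h in range(1, n_harmonics + 1):
        w = 2.0 * math.pi * h / pitch_period  # radians per sample
        s = sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(residual))
        amps.append(2.0 * abs(s) / n_samples)
    return amps
```

For a residual consisting only of the third harmonic of a 40-sample pitch period, observed over a whole number of periods, the third entry is 1.0 and the others are numerically zero.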

【００２３】[0023]

As described above, the error correction coding/bit packing unit (70) sets the frame as unvoiced if the corrected maximum value (n3) of the normalized autocorrelation function is less than or equal to the threshold (= 0.6) and as voiced otherwise. It composes the speech information bit stream with the bit allocation shown in Table 3 and outputs 54 bits per frame as (g4). In the case of radio communication, the speech information bit stream (g4) is transmitted to the receiving side through a modulator and a radio. In Table 3, the pitch and overall voiced/unvoiced information is quantized with 7 bits, as follows. Of the 7-bit code (128 codewords), the codeword in which all 7 bits are 0 (or 1) and the codewords in which exactly one of the 7 bits is 1 are assigned to unvoiced, and the codewords in which exactly two of the 7 bits are 1 are assigned to erasure. The remaining codewords are assigned to pitch period information (the output (s3) of quantizer 3 (57)) for voiced frames. For the per-band voicing, (b4), (c4), (d4), and (e4) are each assigned 1 when voiced and 0 when unvoiced and are transmitted with 4 bits. As the table also shows, if the frame is unvoiced, then instead of sending the harmonic amplitudes (k3), the per-band voicing (f4), and the aperiodic flag (p3), error correction is applied to the perceptually important bits and the error correction code (13 bits) is sent. In addition, one synchronization bit is added to each frame.
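The codeword assignment and the bit counts quoted in the description can be cross-checked with a short sketch. It assumes that only the all-zero codeword and the weight-1 codewords mean unvoiced, so that the all-one codeword is left for pitch. That reading is an assumption, chosen because it leaves exactly 99 pitch codewords, matching the 99 quantizer levels. The dictionary keys are labels introduced here.

```python
def classify_melp_codeword(code):
    """Classify a 7-bit pitch/voicing codeword by its Hamming weight."""
    weight = bin(code & 0x7F).count("1")
    if weight <= 1:
        return "unvoiced"
    if weight == 2:
        return "erasure"
    return "pitch"

def melp_codeword_counts():
    """Count how many of the 128 codewords fall into each class."""
    counts = {"unvoiced": 0, "erasure": 0, "pitch": 0}
    for code in range(128):
        counts[classify_melp_codeword(code)] += 1
    return counts

# Bits per frame, using the figures stated in the description.
VOICED_BITS = {"gain": 3 + 5, "lsf": 25, "pitch_voicing": 7,
               "harmonic_amplitudes": 8, "band_voicing": 4,
               "aperiodic_flag": 1, "sync": 1}
UNVOICED_BITS = {"gain": 3 + 5, "lsf": 25, "pitch_voicing": 7,
                 "error_correction": 13, "sync": 1}
```

Both allocations sum to 54 bits, and the 13 error correction bits of an unvoiced frame occupy exactly the 8 + 4 + 1 bits that a voiced frame spends on harmonic amplitudes, band voicing, and the aperiodic flag.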

【００２４】[0024]

【表３】 [Table 3]

【００２５】[0025]

Next, the configuration of the MELP speech decoder is described with reference to FIG. 14. The bit separation/error correction decoder (81) extracts the pitch and overall voiced/unvoiced information from the 54-bit speech information bit stream (a5) received for each frame and, if it indicates an unvoiced frame, applies error correction decoding to the protected bits. If the pitch and overall voiced/unvoiced information indicates erasure, each parameter is replaced with that of the previous frame. The decoder then outputs, as the separated information bits of each parameter, the pitch and overall voiced/unvoiced information (b5), the aperiodic flag (d5), the harmonic amplitude index (e5), the per-band voicing (g5), the LSF parameter index (j5), and the gain information (m5). Here, the per-band voicing (g5) is a 5-bit flag indicating the voicing of each subband (0-500 Hz, 500-1000 Hz, 1000-2000 Hz, 2000-3000 Hz, 3000-4000 Hz). For the 0-500 Hz subband, the overall voiced/unvoiced information extracted from the pitch and overall voiced/unvoiced information is used.

【００２６】ピッチ復号器(82)は、ピッチ、全体の有声
／無声情報が有声を示す場合にはピッチ周期を復号し、
無声を示す場合はピッチ周期として50.0をセットして復
号されたピッチ周期(c5)を出力する。ジッタ設定器(10
2)は、非周期フラグ(d5)を入力し、非周期フラグがＯＮ
を示すならばジッタ値を0.25、ＯＦＦを示すならばジッ
タ値を０にセットし、(g6)を出力する。ここで、上記の
有声／無声情報が無声を示す場合は、ジッタ値(g6)は0.
25にセットされる。ハーモニックス復号器(83)は、ハー
モニックス振幅のインデックス(e5)から10次のハーモニ
ックス振幅(f5)を復号し出力する。パルス音源用フィル
タ係数計算器(84)は、帯域毎の有声性(g5)を入力し、有
声を示しているサブバンドのゲインを1.0、無声を示し
ているサブバンドのゲインを０にするようなＦＩＲフィ
ルタの係数(h5)を計算し、出力する。また、雑音音源用
フィルタ係数計算器(85)は帯域毎の有声性(g5)を入力
し、有声を示しているサブバンドのゲインを０、無声を
示しているサブバンドのゲインを1.0にするようなＦＩ
Ｒフィルタの係数(i5)を計算し、出力する。ＬＳＦ復号
器(87)は、ＬＳＦパラメータインデックス(j5)から10次
のＬＳＦ係数(k5)を復号し出力する。傾斜補正係数計算
器(86)は、10次のＬＳＦ係数(k5)から傾斜補正係数(l5)
を計算する。ゲイン復号器(88)はゲイン情報(m5)を復号
してゲイン情報(n5)を出力する。The pitch decoder (82) decodes the pitch period when the pitch and overall voiced/unvoiced information indicates voiced, sets the pitch period to 50.0 when it indicates unvoiced, and outputs the decoded pitch period (c5). The jitter setter (102) receives the aperiodic flag (d5), sets the jitter value to 0.25 if the flag is ON and to 0 if it is OFF, and outputs (g6). When the voiced/unvoiced information indicates unvoiced, the jitter value (g6) is set to 0.25. The harmonics decoder (83) decodes the 10th-order harmonic amplitudes (f5) from the harmonic amplitude index (e5) and outputs them. The pulse sound source filter coefficient calculator (84) receives the per-band voicedness (g5) and calculates and outputs the coefficients (h5) of an FIR filter whose gain is 1.0 in the subbands marked voiced and 0 in the subbands marked unvoiced. The noise sound source filter coefficient calculator (85) likewise receives the per-band voicedness (g5) and calculates and outputs the coefficients (i5) of an FIR filter whose gain is 0 in the subbands marked voiced and 1.0 in the subbands marked unvoiced. The LSF decoder (87) decodes the 10th-order LSF coefficients (k5) from the LSF parameter index (j5) and outputs them. The tilt correction coefficient calculator (86) calculates the tilt correction coefficient (l5) from the 10th-order LSF coefficients (k5). The gain decoder (88) decodes the gain information (m5) and outputs the gain information (n5).

【００２７】パラメータ補間器(89)は、各パラメータ(c
5)、(g6)、(f5)、(h5)、(i5)、(l5)、(k5)および(n5)に
ついて、それぞれピッチ周期に同期して線形補間し、(o
5)、(p5)、(r5)、(s5)、(t5)、(u5)、(v5)および(w5)を
出力する。ここでの線形補間処理は、次式により実施さ
れる。補間後のパラメータ＝現フレームのパラメータ×int ＋
前フレームのパラメータ×(1.0−int) ここで、現フレームのパラメータは(c5)、(g6)、(f5)、
(h5)、(i5)、(l5)、(k5)および(n5)のそれぞれに対応
し、補間後のパラメータは(o5)、(p5)、(r5)、(s5)、(t
5)、(u5)、(v5)および(w5)のそれぞれに対応する。前フ
レームのパラメータは、前フレームにおける(c5)、(g
6)、(f5)、(h5)、(i5)、(l5)、(k5)および(n5)を保持し
ておくことにより与えられる。intは補間係数であり、
次式で求める。 int＝to／180 ここで、180は音声復号フレーム長（22.5ms）当たりの
サンプル数、toは復号フレームにおける１ピッチ周期の
開始点であり、１ピッチ周期分の再生音声が復号される
毎にそのピッチ周期が加算されることにより更新され
る。toが180を超えるとそのフレームの復号処理が終了
したことになり、toから180が減算される。The parameter interpolator (89) linearly interpolates each of the parameters (c5), (g6), (f5), (h5), (i5), (l5), (k5) and (n5) in synchronization with the pitch period and outputs (o5), (p5), (r5), (s5), (t5), (u5), (v5) and (w5), respectively. The linear interpolation is performed according to the following equation:
interpolated parameter = current-frame parameter × int + previous-frame parameter × (1.0 − int)
Here, the current-frame parameter corresponds to each of (c5), (g6), (f5), (h5), (i5), (l5), (k5) and (n5), and the interpolated parameter corresponds to each of (o5), (p5), (r5), (s5), (t5), (u5), (v5) and (w5). The previous-frame parameters are obtained by retaining (c5), (g6), (f5), (h5), (i5), (l5), (k5) and (n5) from the previous frame. int is the interpolation coefficient, given by the following equation:
int = to / 180
Here, 180 is the number of samples per speech decoding frame (22.5 ms), and to is the start point of one pitch period within the decoded frame; it is updated by adding the pitch period each time one pitch period of reproduced speech is decoded. When to exceeds 180, decoding of that frame is complete, and 180 is subtracted from to.
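
The pitch-synchronous interpolation above can be sketched as follows. The function names are hypothetical; rounding the interpolated pitch to the nearest integer and ending the frame as soon as the start point reaches 180 are assumptions of this sketch, since the text only says the frame ends when to exceeds 180.

```python
FRAME_LEN = 180  # samples per 22.5 ms decoding frame


def interpolate(prev: float, cur: float, t0: int) -> float:
    """Linear interpolation with the coefficient int = t0 / 180."""
    w = t0 / FRAME_LEN
    return cur * w + prev * (1.0 - w)


def walk_frame(t0: int, prev_pitch: float, cur_pitch: float):
    """Return the start point of each pitch period in one frame and the carried-over t0."""
    starts = []
    while t0 < FRAME_LEN:
        starts.append(t0)
        pitch = interpolate(prev_pitch, cur_pitch, t0)
        t0 += max(1, round(pitch))  # advance by one (integer) pitch period
    return starts, t0 - FRAME_LEN   # subtract 180 for the next frame
```

For example, with a constant pitch period of 60 samples and to = 0, the pitch periods start at samples 0, 60 and 120, and to carries over as 0 into the next frame.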

【００２８】ピッチ周期計算器(90)は、補間されたピッ
チ周期(o5)およびジッタ値(p5)を入力し、ピッチ周期(q
5)を次式により計算する。ピッチ周期(q5)＝ピッチ周期(o5)×（1.0−ジッタ値(p
5)×乱数値）ここで、乱数値は-1.0〜1.0の範囲の値をとる。上式よ
り無声または非周期的フレームではジッタ値が0.25にセ
ットされているのでジッタが付加され、周期的フレーム
ではジッタ値が０にセットされているのでジッタは付加
されない。但し、ジッタ値はピッチ毎に補間処理されて
いるので、0〜0.25の範囲をとるため中間的なピッチ区
間も存在する。このように非周期フラグに基づき非周期
ピッチ（ジッタが付加されたピッチ）を発生すること
は、表２のに示したように過渡部、破裂音で生じる不
規則な（非周期的な）声門パルスを表現することによ
り、トーン的雑音を低減する効果がある。The pitch period calculator (90) receives the interpolated pitch period (o5) and jitter value (p5) and calculates the pitch period (q5) by the following equation:
pitch period (q5) = pitch period (o5) × (1.0 − jitter value (p5) × random value)
Here, the random value lies in the range -1.0 to 1.0. According to this equation, jitter is added in unvoiced or aperiodic frames, where the jitter value is set to 0.25, and no jitter is added in periodic frames, where the jitter value is set to 0. However, because the jitter value is interpolated for each pitch period, it can take any value from 0 to 0.25, so intermediate pitch intervals also exist. Generating an aperiodic pitch (a pitch with jitter added) based on the aperiodic flag in this way represents the irregular (aperiodic) glottal pulses that occur in transients and plosives, as shown in Table 2, and thereby reduces tonal noise.
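
The jitter equation above can be sketched as follows. The function name is hypothetical, and drawing the random value from a uniform distribution is an assumption; the text gives only its range.

```python
import random


def jittered_pitch(pitch: float, jitter: float, rng: random.Random) -> float:
    """pitch x (1.0 - jitter x r), with r drawn from [-1.0, 1.0]."""
    r = rng.uniform(-1.0, 1.0)  # assumed uniform; only the range is specified
    return pitch * (1.0 - jitter * r)
```

With a jitter value of 0 the pitch period is returned unchanged; with 0.25 it varies by at most 25% in either direction.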

【００２９】ピッチ周期(q5)は整数値に変換された後、
１ピッチ波形復号器(101)に入力される。１ピッチ波形
復号器(101)は、ピッチ周期(q5)毎の再生音声(f6)を復
号し出力する。従って、このブロックに含まれる全ての
ブロックはピッチ周期(q5)を入力し、それに同期して動
作する。パルス音源発生器(91)は、補間されたハーモニ
ックス振幅(r5)を入力し、そのハーモニックス情報が付
加された単一パルスを有するパルス音源(x5)を発生す
る。このパルス音源(x5)はピッチ周期(q5)に１パルス発
生される。パルスフィルタ(92)は、補間されたパルスフ
ィルタ用係数(s5)を係数とするＦＩＲフィルタであり、
パルス音源(x5)に対し有声のサブバンドのみを有効にす
るようにフィルタリングし、(y5)を出力する。雑音発生
器(94)は、白色雑音(a6)を発生する。雑音フィルタ(93)
は、補間された雑音フィルタ用係数(t5)を係数とするＦ
ＩＲフィルタであり、雑音音源(a6)に対し無声のサブバ
ンドのみを有効にするようにフィルタリングし、(z5)を
出力する。混合音源発生器(95)は(y5)および(z5)を加算
し、混合音源(b6)を発生する。この混合音源は、表２の
に示したように周波数帯毎に有声／無声音源を切り替
えることによりbuzz音を低減する効果がある。After being converted to an integer value, the pitch period (q5) is input to the one-pitch waveform decoder (101). The one-pitch waveform decoder (101) decodes and outputs the reproduced speech (f6) for each pitch period (q5). Accordingly, every block contained in this decoder receives the pitch period (q5) and operates in synchronization with it. The pulse sound source generator (91) receives the interpolated harmonic amplitudes (r5) and generates a pulse sound source (x5) consisting of a single pulse to which the harmonic information has been added. One pulse of this pulse sound source (x5) is generated per pitch period (q5). The pulse filter (92) is an FIR filter whose coefficients are the interpolated pulse filter coefficients (s5); it filters the pulse sound source (x5) so that only the voiced subbands are passed, and outputs (y5). The noise generator (94) generates white noise (a6). The noise filter (93) is an FIR filter whose coefficients are the interpolated noise filter coefficients (t5); it filters the noise sound source (a6) so that only the unvoiced subbands are passed, and outputs (z5). The mixed sound source generator (95) adds (y5) and (z5) to generate the mixed sound source (b6). As shown in Table 2, switching between the voiced and unvoiced sound sources for each frequency band in this way reduces buzz.
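
The band-wise mixing of the pulse and noise sound sources can be illustrated with the sketch below. It is not the FIR filtering described above: as a simplifying assumption it selects DFT bins instead, so that each subband takes its content from the pulse when voiced and from the noise when unvoiced. The names, the 8 kHz sampling rate, and the omission of the harmonic amplitudes are all assumptions of the sketch.

```python
import cmath
import random

FS = 8000.0  # assumed sampling rate (180 samples per 22.5 ms)
BAND_EDGES = [0.0, 500.0, 1000.0, 2000.0, 3000.0, 4000.0]


def dft(x, inverse=False):
    """Naive DFT / inverse DFT; adequate for one pitch period."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t, v in enumerate(x):
            acc += v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def band_of(freq_hz):
    """Index of the subband containing freq_hz."""
    for b in range(len(BAND_EDGES) - 1):
        if freq_hz < BAND_EDGES[b + 1]:
            return b
    return len(BAND_EDGES) - 2


def mix_one_period(pulse, noise, voiced_flags):
    """Take voiced subbands from the pulse and unvoiced subbands from the noise."""
    n = len(pulse)
    p_spec, n_spec = dft(pulse), dft(noise)
    mixed = []
    for k in range(n):
        freq = min(k, n - k) * FS / n  # fold negative-frequency bins
        voiced = voiced_flags[band_of(freq)]
        mixed.append(p_spec[k] if voiced else n_spec[k])
    return [v.real for v in dft(mixed, inverse=True)]
```

With all five flags voiced the output is the pulse itself, and with all five unvoiced it is the noise, which matches the complementary gains of 1.0 and 0 described for the two filters.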

【００３０】線形予測係数計算器(98)は補間された10次
のＬＳＦ係数(v5)から線形予測係数(h6)を計算する。適
応スペクトルエンハンスメントフィルタ(96)は、線形予
測係数(h6)に帯域幅拡張処理を施したものを係数とする
適応極／零フィルタであり、表２のに示した通り、ホ
ルマントの共振を鋭くし、自然音声のホルマントに対す
る近似度を改善することにより再生音声の自然性を向上
させる。さらに、補間された傾斜補正係数(u5)を用いて
スペクトルの傾きを補正して音のこもりを低減し、その
結果である音源信号(c6)を出力する。ＬＰＣ合成フィル
タ(97)は、線形予測係数(h6)を係数として用いる全極型
フィルタであり、音源信号(c6)に対しスペクトル包絡情
報を付加して、その結果である信号(d6)を出力する。ゲ
イン調整器(99)は(d6)に対しゲイン情報(w5)を用いてゲ
イン調整を行い、(e6)を出力する。パルス拡散フィルタ
(100)は、自然音声の声門パルス波形に対するパルス音
源波形の近似度を改善するためのフィルタであり、(e6)
をフィルタリングして自然性が改善された再生音声(f6)
を出力する。このパルス拡散フィルタの効果は表２の
に示す通りである。以上により、ＭＥＬＰ方式では、Ｌ
ＰＣ方式に比べ、同ビットレート（２．４kbps）におい
て自然性、了解性の高い再生音声を提供することができ
る。The linear prediction coefficient calculator (98) calculates the linear prediction coefficients (h6) from the interpolated 10th-order LSF coefficients (v5). The adaptive spectral enhancement filter (96) is an adaptive pole/zero filter whose coefficients are obtained by applying bandwidth expansion to the linear prediction coefficients (h6). As shown in Table 2, it sharpens the formant resonances and brings them closer to the formants of natural speech, which improves the naturalness of the reproduced speech. It also corrects the spectral tilt using the interpolated tilt correction coefficient (u5) to reduce muffling, and outputs the resulting sound source signal (c6). The LPC synthesis filter (97) is an all-pole filter that uses the linear prediction coefficients (h6) as its coefficients; it adds the spectral envelope information to the sound source signal (c6) and outputs the resulting signal (d6). The gain adjuster (99) adjusts the gain of (d6) using the gain information (w5) and outputs (e6). The pulse dispersion filter (100) improves how closely the pulse sound source waveform approximates the glottal pulse waveform of natural speech; it filters (e6) and outputs the reproduced speech (f6) with improved naturalness. The effect of this pulse dispersion filter is as shown in Table 2. As described above, the MELP scheme can provide reproduced speech that is more natural and more intelligible than that of the LPC scheme at the same bit rate (2.4 kbps).
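
The bandwidth expansion and pole/zero filtering mentioned above can be sketched as follows. The sketch assumes the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and the common expansion a_i × γ^i; the expansion factors 0.5 and 0.8 are hypothetical defaults, not values taken from this document, and the tilt correction stage is omitted.

```python
def bandwidth_expand(lpc, gamma):
    """Scale a_i by gamma**i, which moves the roots of A(z) toward the origin."""
    return [a * gamma ** (i + 1) for i, a in enumerate(lpc)]


def pole_zero_filter(x, num, den):
    """Filter x with H(z) = (1 + sum num_i z^-i) / (1 + sum den_i z^-i)."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i, b in enumerate(num):   # zero (feed-forward) part
            if n - 1 - i >= 0:
                acc += b * x[n - 1 - i]
        for i, a in enumerate(den):   # pole (feedback) part
            if n - 1 - i >= 0:
                acc -= a * y[n - 1 - i]
        y.append(acc)
    return y


def enhance(x, lpc, alpha=0.5, beta=0.8):
    """Pole/zero filter A(z/alpha) / A(z/beta); alpha and beta are assumed values."""
    return pole_zero_filter(x, bandwidth_expand(lpc, alpha), bandwidth_expand(lpc, beta))
```

When the numerator and denominator use the same expansion factor the filter cancels and the input passes through unchanged, which is a convenient sanity check.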

【００３１】[0031]

【発明が解決しようとする課題】移動体通信の爆発的普
及により、ユーザ収容数の増大が必要となっており、周
波数資源の更なる有効利用が課題となっている。音声符
号化方式の更なる低ビットレート化は、この課題を解決
するための必須の技術課題の１つである。そこで、本発
明は、２．４kbps ＭＥＬＰ方式より少ないビット数で
上述のＬＰＣ方式の問題点Ａを解決することのできる音
声符号化復号方法および装置を提供することを目的とし
ている。すなわち、ＭＥＬＰ方式のように帯域毎の有声
性情報を全て伝送する必要なしに、それと同様の効果が
得られる音声符号化復号方法および装置を提供すること
を目的としている。With the explosive spread of mobile communication, the number of users that can be accommodated must increase, and making still more effective use of frequency resources has become an issue. Further lowering the bit rate of speech coding schemes is one of the essential technical problems to be solved in addressing this issue. An object of the present invention is therefore to provide a speech encoding/decoding method and apparatus that can solve problem A of the LPC scheme described above with fewer bits than the 2.4 kbps MELP scheme; that is, a method and apparatus that obtain the same effect as the MELP scheme without having to transmit all of the per-band voicedness information.

【００３２】[0032]

【課題を解決するための手段】上記目的を達成するため
に、本発明の音声復号方法は、線形予測分析・合成方式
の音声符号化器によって音声信号が符号化処理された出
力である音声情報ビット列から音声信号を再生する音声
復号方法であって、前記音声情報ビット列に含まれるス
ペクトル包絡情報、有声／無声識別情報、ピッチ周期情
報およびゲイン情報を分離、復号し、前記スペクトル包
絡情報からスペクトル包絡振幅を求め、周波数軸上で分
割された帯域のうち、前記スペクトル包絡振幅の値が最
大となる帯域を決定し、該決定された帯域と前記有声／
無声識別情報に基づいて、各帯域毎に前記ピッチ周期情
報に基づき発生されるピッチパルスと白色雑音を混合す
る際の混合比を決定し、該決定された混合比に基づいて
各帯域毎の混合信号を作成した後、全帯域の混合信号を
加算して混合音源信号を作成し、該混合音源信号に対し
前記スペクトル包絡情報および前記ゲイン情報を付加し
て再生音声を生成することを特徴とするものである。こ
れにより、付加的な情報ビットを伝送することなく、上
述したＬＰＣ方式の問題点Ａを解決することができる。To achieve the above object, a speech decoding method of the present invention reproduces a speech signal from a speech information bit string output by a linear prediction analysis/synthesis speech encoder that has encoded the speech signal. The method separates and decodes the spectral envelope information, voiced/unvoiced identification information, pitch period information and gain information contained in the speech information bit string; obtains the spectral envelope amplitude from the spectral envelope information; determines, among the bands into which the frequency axis is divided, the band in which the spectral envelope amplitude is largest; determines for each band, based on the determined band and the voiced/unvoiced identification information, the mixing ratio at which a pitch pulse generated from the pitch period information and white noise are mixed; creates a mixed signal for each band according to the determined mixing ratio and adds the mixed signals of all bands to create a mixed sound source signal; and generates the reproduced speech by adding the spectral envelope information and the gain information to the mixed sound source signal. This solves problem A of the LPC scheme described above without transmitting any additional information bits.
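
The step of finding the band with the largest spectral envelope amplitude can be sketched as follows. Several details are assumptions made only for illustration: the envelope is taken as |1/A(e^jω)| with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, each band is represented by its mean amplitude, the sampling rate is 8 kHz, and the band edges are supplied by the caller.

```python
import cmath
import math


def lpc_envelope(lpc, n_points=128):
    """Sample |1/A(e^{jw})| at n_points frequencies from 0 up to (not including) pi."""
    env = []
    for k in range(n_points):
        w = math.pi * k / n_points
        a = 1 + sum(c * cmath.exp(-1j * w * (i + 1)) for i, c in enumerate(lpc))
        env.append(1.0 / abs(a))
    return env


def peak_band(env, edges_hz, fs=8000.0):
    """Index of the band whose mean envelope amplitude is largest."""
    n = len(env)
    means = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        k0 = int(lo / (fs / 2) * n)
        k1 = int(hi / (fs / 2) * n)
        means.append(sum(env[k0:k1]) / (k1 - k0))
    return max(range(len(means)), key=means.__getitem__)
```

The decoder can run this on the decoded coefficients alone, which is why the band decision needs no extra transmitted bits.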

【００３３】また、本発明の他の音声復号方法は、線形
予測分析・合成方式の音声符号化器によって音声信号を
符号化処理した音声情報ビット列であって、スペクトル
包絡情報、低周波数帯域の有声／無声識別情報、高周波
数帯域の有声／無声識別情報、ピッチ周期情報およびゲ
イン情報が符号化された音声情報ビット列から音声信号
を再生する音声復号方法であって、前記音声情報ビット
列に含まれるスペクトル包絡情報、低周波数帯域の有声
／無声識別情報、高周波数帯域の有声／無声識別情報、
ピッチ周期情報およびゲイン情報を分離、復号し、前記
低周波数帯域の有声／無声識別情報に基づいて、前記ピ
ッチ周期情報に基づき発生されるピッチパルスと白色雑
音を混合する際の混合比を決定して低周波数帯域の混合
信号を作成し、前記スペクトル包絡情報からスペクトル
包絡振幅を求め、周波数軸上で高周波数帯域を分割した
帯域のうち、該スペクトル包絡振幅の値が最大となる帯
域を決定し、該決定された帯域と前記高周波数帯域の有
声／無声識別情報に基づいて、前記高周波数帯域を分割
した各帯域毎に、前記ピッチパルスと白色雑音を混合す
る際の混合比を決定して混合信号を作成した後、該高周
波数帯域の全ての帯域の混合信号を加算して高周波数帯
域の混合信号を作成し、前記低周波数帯域の混合信号と
前記高周波数帯域の混合信号を加算して混合音源信号を
作成し、該混合音源信号に対し前記スペクトル包絡情報
および前記ゲイン情報を付加して再生音声を生成するこ
とを特徴とするものである。これにより、上述したＬＰ
Ｃ方式の問題点Ａを解決するとともに、再生音声品質を
さらに向上させることができる。Another speech decoding method of the present invention reproduces a speech signal from a speech information bit string produced by a linear prediction analysis/synthesis speech encoder, in which spectral envelope information, voiced/unvoiced identification information for a low frequency band, voiced/unvoiced identification information for a high frequency band, pitch period information and gain information are encoded. The method separates and decodes these items of information from the speech information bit string; determines, based on the low-band voiced/unvoiced identification information, the mixing ratio at which a pitch pulse generated from the pitch period information and white noise are mixed, and creates a low-band mixed signal; obtains the spectral envelope amplitude from the spectral envelope information and determines, among the bands into which the high frequency band is divided on the frequency axis, the band in which the spectral envelope amplitude is largest; determines, based on the determined band and the high-band voiced/unvoiced identification information, the mixing ratio of the pitch pulse and white noise for each of those bands and creates a mixed signal for each; adds the mixed signals of all bands of the high frequency band to create a high-band mixed signal; adds the low-band mixed signal and the high-band mixed signal to create a mixed sound source signal; and generates the reproduced speech by adding the spectral envelope information and the gain information to the mixed sound source signal. This solves problem A of the LPC scheme described above and further improves the quality of the reproduced speech.

【００３４】さらに、本発明のさらに他の音声復号方法
は、線形予測分析・合成方式の音声符号化器によって音
声信号を符号化処理した音声情報ビット列であって、ス
ペクトル包絡情報、低周波数帯域の有声／無声識別情
報、高周波数帯域の有声／無声識別情報、ピッチ周期情
報およびゲイン情報が符号化された音声情報ビット列か
ら音声信号を再生する音声復号方法であって、前記音声
情報ビット列に含まれるスペクトル包絡情報、低周波数
帯域の有声／無声識別情報、高周波数帯域の有声／無声
識別情報、ピッチ周期情報およびゲイン情報を分離、復
号し、前記低周波数帯域の有声／無声識別情報に基づい
て、ピッチ周期に同期して線形補間された前記ピッチ周
期情報に基づき発生されるピッチパルスと白色雑音を混
合する際の混合比を決定し、前記スペクトル包絡情報か
らスペクトル包絡振幅を求め、周波数軸上で高周波数帯
域を分割した帯域のうち、該スペクトル包絡振幅の値が
最大となる帯域を決定し、該決定された帯域と前記高周
波数帯域の有声／無声識別情報に基づいて、前記高周波
数帯域を分割した各帯域毎に前記ピッチパルスと白色雑
音を混合する際の混合比を決定し、前記スペクトル包絡
情報、前記ピッチ周期情報、前記ゲイン情報、前記低周
波数帯域の混合比および前記高周波数帯域を分割した各
帯域の混合比をピッチ周期に同期して線形補間し、補間
後の前記低周波数帯域の混合比を用いて、前記ピッチパ
ルスと白色雑音を混合して低周波数帯域の混合信号を作
成し、補間後の前記高周波数帯域を分割した各帯域の混
合比を用いて、前記ピッチパルスと前記白色雑音を混合
して前記高周波数帯域を分割した各帯域毎に混合信号を
作成した後、該高周波数帯域の全ての帯域の混合信号を
加算して高周波数帯域の混合信号を作成し、前記低周波
数帯域の混合信号と前記高周波数帯域の混合信号を加算
して混合音源信号を作成し、該混合音源信号に対し補間
後の前記スペクトル包絡情報および補間後の前記ゲイン
情報を付加して再生音声を生成することを特徴とするも
のである。これにより、上述したＬＰＣ方式の問題点Ａ
を解決するとともに、再生音声品質をさらに向上させる
ことができる。Still another speech decoding method of the present invention likewise reproduces a speech signal from a speech information bit string produced by a linear prediction analysis/synthesis speech encoder, in which spectral envelope information, voiced/unvoiced identification information for a low frequency band, voiced/unvoiced identification information for a high frequency band, pitch period information and gain information are encoded. The method separates and decodes these items of information from the speech information bit string; determines, based on the low-band voiced/unvoiced identification information, the mixing ratio at which white noise is mixed with a pitch pulse generated from the pitch period information linearly interpolated in synchronization with the pitch period; obtains the spectral envelope amplitude from the spectral envelope information and determines, among the bands into which the high frequency band is divided on the frequency axis, the band in which the spectral envelope amplitude is largest; determines, based on the determined band and the high-band voiced/unvoiced identification information, the mixing ratio of the pitch pulse and white noise for each of those bands; linearly interpolates the spectral envelope information, the pitch period information, the gain information, the low-band mixing ratio and the mixing ratio of each band of the divided high frequency band in synchronization with the pitch period; mixes the pitch pulse and white noise using the interpolated low-band mixing ratio to create a low-band mixed signal; mixes the pitch pulse and the white noise using the interpolated mixing ratio of each band of the divided high frequency band to create a mixed signal for each of those bands, and adds the mixed signals of all bands of the high frequency band to create a high-band mixed signal; adds the low-band mixed signal and the high-band mixed signal to create a mixed sound source signal; and generates the reproduced speech by adding the interpolated spectral envelope information and the interpolated gain information to the mixed sound source signal. This solves problem A of the LPC scheme described above and further improves the quality of the reproduced speech.

【００３５】さらにまた、前記高周波数帯域は３つの帯
域に分割されており、前記高周波数帯域の有声／無声識
別情報が有声を示している場合における前記高周波数帯
域を分割した各帯域の混合比は、最低周波数の帯域また
は２番目に低い周波数の帯域のスペクトル包絡振幅の値
が最大であるときには、分割された帯域の周波数が高く
なるに従ってピッチパルスの割合を単調に減少させ、最
高周波数の帯域のスペクトル包絡振幅の値が最大である
ときには、２番目に低い周波数の帯域のピッチパルスの
割合を最低周波数の帯域のピッチパルスの割合より減少
させ、最高周波数の帯域のピッチパルスの割合を２番目
に低い周波数の帯域よりも所定量増加させるようになさ
れているものである。さらにまた、前記高周波数帯域は
３つの帯域に分割されており、前記高周波数帯域の有声
／無声識別情報が有声を示している場合における前記高
周波数帯域を分割した各帯域の混合比は、３つの帯域の
うちスペクトル包絡振幅の値が最大となる帯域の有声強
度を、その他の帯域のスペクトル包絡振幅の値が最大と
なる時の同帯域の有声強度よりも大きく設定するように
されているものである。さらにまた、前記高周波数帯域
は３つの帯域に分割されており、前記高周波数帯域の有
声／無声識別情報が無声を示している場合における前記
高周波数帯域を分割した各帯域の混合比は、３つの帯域
のうちスペクトル包絡振幅の値が最大となる帯域の有声
強度を、その他の帯域のスペクトル包絡振幅の値が最大
となる時の同帯域の有声強度よりも小さく設定するよう
にされているものである。Furthermore, the high frequency band is divided into three bands. When the high-band voiced/unvoiced identification information indicates voiced, the mixing ratio of each of these bands is set as follows: when the spectral envelope amplitude is largest in the lowest-frequency band or the second-lowest-frequency band, the proportion of the pitch pulse decreases monotonically as the band frequency increases; when the spectral envelope amplitude is largest in the highest-frequency band, the proportion of the pitch pulse in the second-lowest-frequency band is made smaller than that in the lowest-frequency band, and the proportion in the highest-frequency band is made larger than that in the second-lowest-frequency band by a predetermined amount. Furthermore, with the high frequency band divided into three bands and the high-band voiced/unvoiced identification information indicating voiced, the mixing ratios are set so that the voiced intensity of the band, among the three, in which the spectral envelope amplitude is largest is greater than the voiced intensity of that same band when the spectral envelope amplitude is largest in another band. Furthermore, with the high frequency band divided into three bands and the high-band voiced/unvoiced identification information indicating unvoiced, the mixing ratios are set so that the voiced intensity of the band, among the three, in which the spectral envelope amplitude is largest is smaller than the voiced intensity of that same band when the spectral envelope amplitude is largest in another band.
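
The ordering rules above can be illustrated with a small lookup table. Every number below is a placeholder chosen only to satisfy the stated rules, not a value from this document; the table layout, the function name, and the assumption that the pulse and noise proportions sum to 1 are likewise hypothetical.

```python
# Pulse proportion per high-frequency sub-band (lowest, middle, highest), keyed by
# the sub-band whose spectral envelope amplitude is largest.
# All numbers are illustrative placeholders, not values from the patent.
VOICED = {
    0: (0.80, 0.50, 0.30),  # peak in lowest band: monotonically decreasing
    1: (0.70, 0.60, 0.30),  # peak in middle band: monotonically decreasing
    2: (0.70, 0.40, 0.50),  # peak in highest band: middle reduced, highest raised
}
UNVOICED = {
    0: (0.10, 0.30, 0.30),
    1: (0.30, 0.10, 0.30),
    2: (0.30, 0.30, 0.10),
}


def mixing_ratios(hf_voiced: bool, peak: int):
    """Return (pulse, noise) proportions for the three high-frequency sub-bands."""
    pulse = (VOICED if hf_voiced else UNVOICED)[peak]
    noise = tuple(1.0 - p for p in pulse)  # assumes the two proportions sum to 1
    return pulse, noise
```

In the voiced table each diagonal entry is the largest in its column, and in the unvoiced table the smallest, which is how the second and third rules are expressed.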

【００３６】さらにまた、本発明の音声符号化装置は、
所定のサンプル周波数で標本化され、量子化された音声
サンプルを入力し、予め定められた時間長の音声符号化
フレーム毎に所定数の音声サンプルを出力するフレーム
化器と、該１フレーム分の音声サンプルのレベル情報で
あるＲＭＳ値の対数を計算し、その結果である対数ＲＭ
Ｓ値を出力するゲイン計算器と、該対数ＲＭＳ値を線形
量子化し、その結果である量子化後の対数ＲＭＳ値を出
力する第１の量子化器と、前記１フレーム分の音声サン
プルに対し線形予測分析を行い、スペクトル包絡情報で
ある所定次数の線形予測係数を出力する線形予測分析器
と、該線形予測係数をＬＳＦ（Line Spectrum Frequenc
ies）係数に変換して出力するＬＳＦ係数計算器と、該
ＬＳＦ係数を量子化し、その結果であるＬＳＦパラメー
タインデックスを出力する第２の量子化器と、前記１フ
レーム分の音声サンプルを所定のカットオフ周波数でフ
ィルタリングし入力信号の低周波数成分を出力するロー
パスフィルタと、該入力信号の低周波数成分から正規化
自己相関関数計算に基づきピッチ周期を抽出し、ピッチ
周期および正規化自己相関関数の最大値を出力するピッ
チ検出器と、該ピッチ周期を対数変換した後、第１の所
定のレベル数で線形量子化し、その結果であるピッチ周
期インデックスを出力する第３の量子化器と、前記正規
化自己相関関数の最大値を入力し、所定の閾値より小さ
ければ非周期フラグをＯＮにセット、そうでなければＯ
ＦＦにセットして、該非周期フラグを出力する非周期フ
ラグ発生器と、前記線形予測係数を係数として用いて前
記１フレーム分の音声サンプルからスペクトル包絡情報
を除去し、その結果である残差信号を出力するＬＰＣ分
析フィルタと、該残差信号を入力し、ピーキネス値を計
算して出力するピーキネス計算器と、該ピーキネス計算
器の値により、前記正規化自己相関関数の最大値の値を
補正して補正された正規化自己相関関数の最大値を出力
する相関関数補正器と、該補正された正規化自己相関関
数の最大値が所定の閾値以下であれば無声、そうでなけ
れば有声と判定し、その結果である有声／無声フラグを
出力する第１の有声／無声判定器と、前記非周期フラグ
が非周期を示しているフレームの前記ピッチ周期につい
て、第２の所定のレベル数で不均一量子化し、非周期的
ピッチインデックスを出力する非周期ピッチインデック
ス生成器と、前記有声／無声フラグ、前記非周期フラ
グ、前記ピッチ周期インデックスおよび前記非周期的ピ
ッチインデックスを入力し、これらを所定のビット数で
符号化した周期／非周期ピッチ・有声／無声情報コード
を出力する周期／非周期ピッチおよび有声／無声情報コ
ード生成器と、前記１フレーム分の音声サンプルを所定
のカットオフ周波数でフィルタリングし入力信号の高周
波数成分を出力するハイパスフィルタと、該入力信号の
高周波数成分から、前記ピッチ周期で与えられる遅延時
間における正規化自己相関関数を計算し出力する相関関
数計算器と、該正規化自己相関関数の最大値が所定の閾
値以下であれば無声、そうでなければ有声と判定し、そ
の結果である高域有声／無声フラグを出力する第２の有
声／無声判定器と、前記量子化後の対数ＲＭＳ値、前記
ＬＳＦパラメータインデックス、前記周期／非周期ピッ
チ・有声／無声情報コードおよび前記高域有声／無声フ
ラグを入力し、１フレーム毎にビットパッキングを行い
音声情報ビット列を出力するビットパッキング器とを備
えたものである。Furthermore, the speech encoding apparatus of the present invention comprises:
a framer that receives speech samples sampled at a predetermined sampling frequency and quantized, and outputs a predetermined number of speech samples for each speech coding frame of a predetermined time length;
a gain calculator that calculates the logarithm of the RMS value, which is the level information of the one frame of speech samples, and outputs the resulting logarithmic RMS value;
a first quantizer that linearly quantizes the logarithmic RMS value and outputs the resulting quantized logarithmic RMS value;
a linear prediction analyzer that performs linear prediction analysis on the one frame of speech samples and outputs linear prediction coefficients of a predetermined order, which are spectral envelope information;
an LSF coefficient calculator that converts the linear prediction coefficients into LSF (Line Spectrum Frequencies) coefficients and outputs them;
a second quantizer that quantizes the LSF coefficients and outputs the resulting LSF parameter index;
a low-pass filter that filters the one frame of speech samples at a predetermined cutoff frequency and outputs the low-frequency component of the input signal;
a pitch detector that extracts the pitch period from the low-frequency component of the input signal based on a normalized autocorrelation function calculation, and outputs the pitch period and the maximum value of the normalized autocorrelation function;
a third quantizer that logarithmically converts the pitch period, linearly quantizes it with a first predetermined number of levels, and outputs the resulting pitch period index;
an aperiodic flag generator that receives the maximum value of the normalized autocorrelation function, sets the aperiodic flag to ON if that value is smaller than a predetermined threshold and to OFF otherwise, and outputs the aperiodic flag;
an LPC analysis filter that uses the linear prediction coefficients as its coefficients to remove the spectral envelope information from the one frame of speech samples, and outputs the resulting residual signal;
a peakiness calculator that receives the residual signal, calculates a peakiness value and outputs it;
a correlation function corrector that corrects the maximum value of the normalized autocorrelation function according to the value from the peakiness calculator and outputs the corrected maximum value of the normalized autocorrelation function;
a first voiced/unvoiced determiner that determines unvoiced if the corrected maximum value of the normalized autocorrelation function is equal to or smaller than a predetermined threshold and voiced otherwise, and outputs the resulting voiced/unvoiced flag;
an aperiodic pitch index generator that non-uniformly quantizes, with a second predetermined number of levels, the pitch period of frames whose aperiodic flag indicates aperiodic, and outputs an aperiodic pitch index;
a periodic/aperiodic pitch and voiced/unvoiced information code generator that receives the voiced/unvoiced flag, the aperiodic flag, the pitch period index and the aperiodic pitch index, and outputs a periodic/aperiodic pitch and voiced/unvoiced information code in which these are encoded with a predetermined number of bits;
a high-pass filter that filters the one frame of speech samples at a predetermined cutoff frequency and outputs the high-frequency component of the input signal;
a correlation function calculator that calculates, from the high-frequency component of the input signal, the normalized autocorrelation function at the delay time given by the pitch period, and outputs it;
a second voiced/unvoiced determiner that determines unvoiced if the maximum value of that normalized autocorrelation function is equal to or smaller than a predetermined threshold and voiced otherwise, and outputs the resulting high-band voiced/unvoiced flag; and
a bit packer that receives the quantized logarithmic RMS value, the LSF parameter index, the periodic/aperiodic pitch and voiced/unvoiced information code and the high-band voiced/unvoiced flag, performs bit packing for each frame, and outputs the speech information bit string.
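
The voiced/unvoiced decision from a normalized autocorrelation value can be sketched as follows. The normalization shown is one common definition and may differ from the one used in this document; the threshold of 0.5 is a hypothetical stand-in for the "predetermined threshold", and the high-pass filtering that precedes the second determiner is omitted.

```python
import math


def normalized_autocorr(x, lag):
    """Normalized autocorrelation of x at a single lag (1 <= lag < len(x))."""
    if not 1 <= lag < len(x):
        raise ValueError("lag must satisfy 1 <= lag < len(x)")
    a, b = x[:-lag], x[lag:]
    num = sum(u * v for u, v in zip(a, b))
    den = math.sqrt(sum(u * u for u in a) * sum(v * v for v in b))
    return num / den if den > 0.0 else 0.0


def is_voiced(x, pitch_lag, threshold=0.5):
    """Unvoiced if the correlation is at or below the threshold, voiced otherwise."""
    return normalized_autocorr(x, pitch_lag) > threshold
```

A signal that repeats exactly at the pitch lag gives a correlation close to 1 and is judged voiced, while a signal that flips sign at that lag gives -1 and is judged unvoiced.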

【００３７】さらにまた、本発明の音声復号装置は、上
述の音声符号化装置により符号化された１フレーム毎の
音声情報ビット列を復号する音声復号装置であって、前
記音声情報ビット列を各パラメータ毎に分離し、周期／
非周期ピッチ・有声／無声情報コード、量子化後の対数
ＲＭＳ値、ＬＳＦパラメータインデックスおよび高域有
声／無声フラグを出力するビット分離器と、前記周期／
非周期ピッチ・有声／無声情報コードを入力し、現フレ
ームの状態が無声ならば、ピッチ周期を所定の値にセッ
ト、有声／無声フラグを０にセットして出力し、周期的
および非周期的の場合は、ピッチ周期を符号化の規則に
基づき復号処理して出力し、有声／無声フラグを１にセ
ットして出力する有声／無声情報・ピッチ周期復号器
と、前記周期／非周期ピッチ・有声／無声情報を入力
し、現フレームが無声または非周期的を示す場合は、ジ
ッタ値を所定の値にセットして出力し、周期的を示す場
合は、ジッタ値を０にセットして出力するジッタ設定器
と、前記ＬＳＦパラメータインデックスから前記所定の
次数のＬＳＦ係数を復号し出力するＬＳＦ復号器と、該
ＬＳＦ係数から傾斜補正係数を計算し出力する傾斜補正
係数計算器と、前記量子化後の対数ＲＭＳ値を復号し、
ゲイン情報を出力するゲイン復号器と、前記ＬＳＦ係数
を線形予測係数に変換し出力する第１の線形予測係数計
算器と、前記線形予測係数からスペクトル包絡振幅を計
算し出力するスペクトル包絡振幅計算器と、前記有声／
無声フラグ、前記高域有声／無声フラグ、前記スペクト
ル包絡振幅を入力し、周波数軸上で分割された帯域（以
下、「サブバンド」という）毎のパルス音源と雑音音源
の混合比情報を決定し出力するパルス音源／雑音音源混
合比計算器と、前記ピッチ周期、前記混合比情報、前記
ジッタ値、前記ＬＳＦ係数、前記傾斜補正係数および前
記ゲイン情報をそれぞれピッチ周期に同期して線形補間
し、補間後のピッチ周期、補間後の混合比情報、補間後
のジッタ値、補間後のＬＳＦ係数、補間後の傾斜補正係
数および補間後のゲイン情報を出力するパラメータ補間
器と、該補間後のピッチ周期および補間後のジッタ値を
入力し、補間後のピッチ周期にジッタを付加した後、整
数値に変換されたピッチ周期（以下、「整数ピッチ周
期」という）を出力するピッチ周期計算器と、該整数ピ
ッチ周期に同期して該整数ピッチ周期分の再生音声を復
号し出力する１ピッチ波形復号器とを備え、該１ピッチ
波形復号器は、前記整数ピッチ周期期間内に単一パルス
信号を出力する単一パルス発生器と、前記整数ピッチ周
期の長さを持つ白色雑音を出力する雑音発生器と、前記
補間された混合比情報に基づき、各サブバンド毎に前記
単一パルス信号と該白色雑音とを混合した後、それらを
合成して混合音源信号を出力する混合音源発生器と、前
記補間後のＬＳＦ係数から線形予測係数を計算する第２
の線形予測係数計算器と、前記線形予測係数に帯域幅拡
張処理を施したものを係数とする適応極／零フィルタと
前記補間後の傾斜補正係数を係数とするスペクトル傾斜
補正フィルタの従属接続であり、該混合音源信号をフィ
ルタリングしてスペクトルが改善された音源信号を出力
する適応スペクトルエンハンスメントフィルタと、前記
線形予測係数を係数として用いる全極型フィルタであ
り、該スペクトルが改善された音源信号に対してスペク
トル包絡情報を付加して、スペクトル包絡情報が付加さ
れた信号を出力するＬＰＣ合成フィルタと、該スペクト
ル包絡情報が付加された信号に対し、前記ゲイン情報を
用いてゲイン調整を行い、再生音声信号を出力するゲイ
ン調整器と、該再生音声信号に対し、パルス拡散処理を
施し、パルス拡散処理された再生音声信号を出力するパ
ルス拡散フィルタとを備えたものである。Furthermore, a speech decoding apparatus according to the present invention decodes, frame by frame, the speech information bit string encoded by the above-described speech encoding apparatus, and comprises: a bit separator that separates the speech information bit string into its parameters and outputs a periodic/aperiodic pitch and voiced/unvoiced information code, a quantized logarithmic RMS value, an LSF parameter index, and a high-band voiced/unvoiced flag; a voiced/unvoiced information and pitch period decoder that receives the periodic/aperiodic pitch and voiced/unvoiced information code and, if the current frame is unvoiced, sets the pitch period to a predetermined value and outputs a voiced/unvoiced flag set to 0, and, if the current frame is periodic or aperiodic, decodes the pitch period according to the coding rule and outputs it together with a voiced/unvoiced flag set to 1; a jitter setting unit that receives the periodic/aperiodic pitch and voiced/unvoiced information code and outputs a jitter value set to a predetermined value if the current frame is unvoiced or aperiodic, and a jitter value set to 0 if the current frame is periodic; an LSF decoder that decodes LSF coefficients of the predetermined order from the LSF parameter index and outputs them; a tilt correction coefficient calculator that calculates a tilt correction coefficient from the LSF coefficients and outputs it; a gain decoder that decodes the quantized logarithmic RMS value and outputs gain information; a first linear prediction coefficient calculator that converts the LSF coefficients into linear prediction coefficients and outputs them; a spectral envelope amplitude calculator that calculates spectral envelope amplitudes from the linear prediction coefficients and outputs them; a pulse sound source/noise sound source mixing ratio calculator that receives the voiced/unvoiced flag, the high-band voiced/unvoiced flag, and the spectral envelope amplitudes, and determines and outputs mixing ratio information for the pulse sound source and the noise sound source in each band into which the frequency axis is divided (hereinafter "subband"); a parameter interpolator that linearly interpolates the pitch period, the mixing ratio information, the jitter value, the LSF coefficients, the tilt correction coefficient, and the gain information in synchronization with the pitch period, and outputs the interpolated pitch period, mixing ratio information, jitter value, LSF coefficients, tilt correction coefficient, and gain information; a pitch period calculator that receives the interpolated pitch period and the interpolated jitter value, adds jitter to the interpolated pitch period, and outputs the pitch period converted to an integer value (hereinafter "integer pitch period"); and a one-pitch waveform decoder that decodes and outputs reproduced speech for each integer pitch period in synchronization with the integer pitch period. The one-pitch waveform decoder comprises: a single pulse generator that outputs a single pulse signal within the duration of the integer pitch period; a noise generator that outputs white noise having the length of the integer pitch period; a mixed sound source generator that, based on the interpolated mixing ratio information, mixes the single pulse signal and the white noise and combines the results to output a mixed sound source signal; a second linear prediction coefficient calculator that calculates linear prediction coefficients from the interpolated LSF coefficients; an adaptive spectral enhancement filter that filters the mixed sound source signal with an adaptive pole/zero filter, whose coefficients are obtained by applying bandwidth expansion to the linear prediction coefficients, and a spectral tilt correction filter, whose coefficient is the interpolated tilt correction coefficient, and outputs a sound source signal with an improved spectrum; an LPC synthesis filter, which is an all-pole filter using the linear prediction coefficients as its coefficients, that adds spectral envelope information to the sound source signal with the improved spectrum and outputs the result; a gain adjuster that adjusts the gain of the signal carrying the spectral envelope information using the gain information and outputs a reproduced speech signal; and a pulse dispersion filter that applies pulse dispersion processing to the reproduced speech signal and outputs the processed reproduced speech signal.

【００３８】[0038]

【発明の実施の形態】本発明の音声符号化復号方法およ
び装置の一実施の形態について、図１〜９を用いて詳し
く説明する。なお、以下では、具体的な数値を用いて説
明するが、本発明は以下の説明に用いた数値以外の数値
を用いても実施することができる点に注意されたい。図
１は、本発明の音声符号化復号方法に用いられる音声符
号化器の一構成例のブロック図である。この図におい
て、フレーム化器(111)は、100-3800Hzで帯域制限され
た後、８kHzで標本化され、少なくとも１２ビットの精
度で量子化された入力音声サンプル(a7)を蓄えるバッフ
ァであり、１音声符号化フレーム（20ms）毎に音声サン
プル（160サンプル）を取り込み、音声符号化処理部へ
(b7)として出力する。以下では１音声符号化フレーム毎
に実行される処理について説明する。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS One embodiment of the speech encoding/decoding method and apparatus according to the present invention will be described in detail with reference to FIGS. 1 to 9. Specific numerical values are used in the following description, but note that the present invention can also be implemented with values other than those used here. FIG. 1 is a block diagram of a configuration example of a speech encoder used in the speech encoding/decoding method of the present invention. In this figure, the framer (111) is a buffer that stores input speech samples (a7) that have been band-limited to 100-3800 Hz, sampled at 8 kHz, and quantized with at least 12-bit precision. For each speech coding frame (20 ms) it takes in the speech samples of that frame (160 samples) and outputs them to the speech encoding section as (b7). The processing performed for each speech coding frame is described below.

【００３９】ゲイン計算器(112)は入力音声(b7)のレベ
ル情報であるＲＭＳ値の対数を計算し、その結果である
(c7)を出力する。第１の量子化器（「量子化器１」とい
う）(113)は(c7)を５ビットで線形量子化し、その結果
である(d7)をビットパッキング器(125)へ出力する。線
形予測分析器(114)は、(b7)をDurbin-Levinson法を用い
て線形予測分析し、スペクトル包絡情報である10次の線
形予測係数(e7)を出力する。ＬＳＦ係数計算器(115)
は、10次の線形予測係数(e7)を10次のＬＳＦ（Line Spe
ctrum Frequencies）係数(f7)に変換する。The gain calculator (112) calculates the logarithm of the RMS value, which is the level information of the input speech (b7), and outputs the result as (c7). The first quantizer ("quantizer 1") (113) linearly quantizes (c7) with 5 bits and outputs the result (d7) to the bit packing unit (125). The linear prediction analyzer (114) performs linear prediction analysis on (b7) using the Durbin-Levinson method and outputs 10th-order linear prediction coefficients (e7), which are the spectral envelope information. The LSF coefficient calculator (115) converts the 10th-order linear prediction coefficients (e7) into 10th-order LSF (Line Spectrum Frequencies) coefficients (f7).

【００４０】第２の量子化器（量子化器２）(116)は、
段数３の多段ベクトル量子化を用いた無記憶ベクトル量
子化と予測（記憶）ベクトル量子化を切り替えて使用す
る構成とし、これにより10次のＬＳＦ係数(f7)を19ビッ
トで量子化し、その結果であるＬＳＦパラメータインデ
ックス(g7)をビットパッキング器(125)へ出力する。例
えば、量子化器２(116)は、入力される10次のＬＳＦ係
数(f7)を７，６，５ビットの３段無記憶ベクトル量子化
器と同じく７，６，５ビットの３段予測ベクトル量子化
器に入力してそれぞれ量子化し、入力(f7)との距離計算
の結果によりいずれかの出力を選択して、何れを選択し
たのかを示す切替えビット（１ビット）とともに出力す
る。なお、このような量子化器は、T.Eriksson, J.Lind
en and J.Skoglund, " EXPLOITING INTERFRAME CORRELA
TION IN SPECTRAL QUANTIZATIONA STUDY OF DIFFERENT
MEMORY VQ SCHEMES ", Proc. ICASSP , pp.765-768, 19
95 に記載されている。The second quantizer (quantizer 2) (116) switches between memoryless vector quantization and predictive (memory) vector quantization, each using three-stage multistage vector quantization. It quantizes the 10th-order LSF coefficients (f7) with 19 bits and outputs the resulting LSF parameter index (g7) to the bit packing unit (125). For example, quantizer 2 (116) feeds the input 10th-order LSF coefficients (f7) both to a three-stage memoryless vector quantizer with 7, 6, and 5 bits and to a three-stage predictive vector quantizer, likewise with 7, 6, and 5 bits, and quantizes them with each. It then selects one of the two outputs according to the result of a distance calculation against the input (f7), and outputs it together with a switch bit (1 bit) indicating which one was selected. Such a quantizer is described in T. Eriksson, J. Linden and J. Skoglund, "EXPLOITING INTERFRAME CORRELATION IN SPECTRAL QUANTIZATION: A STUDY OF DIFFERENT MEMORY VQ SCHEMES", Proc. ICASSP, pp. 765-768, 1995.
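The selection step and the bit budget described above can be sketched as follows. This is only an illustration: the squared Euclidean distance, the polarity of the switch bit, and the function names are assumptions, and the three-stage codebook searches themselves are not reproduced here.

```python
# Sketch of the switched quantizer selection; the codebook search is omitted.
STAGE_BITS = (7, 6, 5)   # bits per stage, as stated in the text
SWITCH_BITS = 1          # indicates which quantizer was selected
TOTAL_LSF_BITS = sum(STAGE_BITS) + SWITCH_BITS


def squared_distance(a, b):
    """Squared Euclidean distance (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_quantizer(lsf, memoryless_out, predictive_out):
    """Pick the candidate closer to the input LSF vector.

    Returns (switch_bit, chosen_vector); 0 = memoryless, 1 = predictive
    (the bit polarity is an assumption).
    """
    d_memoryless = squared_distance(lsf, memoryless_out)
    d_predictive = squared_distance(lsf, predictive_out)
    if d_predictive < d_memoryless:
        return 1, predictive_out
    return 0, memoryless_out
```

With 7 + 6 + 5 stage bits and one switch bit, `TOTAL_LSF_BITS` comes to the 19 bits quoted in the text.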

【００４１】ＬＰＦ（ローパスフィルタ）(120)は入力
音声(b7)をカットオフ周波数1000Hzでフィルタリング
し、1000Hz以下の成分(k7)を出力する。ピッチ検出器(1
21)は、(k7)からピッチ周期を求め、(m7)として出力す
る。ピッチ周期は正規化自己相関関数が最大となる遅延
量として与えられるが、この時の正規化自己相関関数の
最大値(l7)も出力される。正規化自己相関関数の最大値
の大きさは、入力信号(b7)の周期性の強さを表す情報で
あり、非周期フラグ発生器(122)（後で説明する）で用
いられる。また、正規化自己相関関数の最大値(l7)は、
相関関数補正器(119)（後で説明する）で補正された
後、第１の有声／無声判定器（有声／無声判定器１）(1
26)における有声／無声判定に用いられる。そこでは、
補正後の正規化自己相関関数の最大値(j7)が閾値（例え
ば、0.6）以下であれば無声、そうでなければ有声と判
定され、その結果である有声／無声フラグ(s7)が出力さ
れる。この有声／無声フラグ(s7)は低周波数帯域の有声
／無声識別情報に相当する。The LPF (low-pass filter) (120) filters the input speech (b7) with a cutoff frequency of 1000 Hz and outputs the component at or below 1000 Hz (k7). The pitch detector (121) obtains the pitch period from (k7) and outputs it as (m7). The pitch period is given as the lag at which the normalized autocorrelation function is maximized, and the maximum value (l7) of the normalized autocorrelation function at that lag is also output. The magnitude of this maximum indicates how strongly periodic the input signal (b7) is, and it is used by the aperiodic flag generator (122) (described later). The maximum value (l7) is also corrected by the correlation function corrector (119) (described later) and then used for the voiced/unvoiced decision in the first voiced/unvoiced determiner (voiced/unvoiced determiner 1) (126). There, the frame is judged unvoiced if the corrected maximum value (j7) is at or below a threshold (for example, 0.6) and voiced otherwise, and the resulting voiced/unvoiced flag (s7) is output. This voiced/unvoiced flag (s7) corresponds to the voiced/unvoiced identification information of the low frequency band.
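The pitch search and the threshold decision can be sketched as below. The lag range of 20 to 160 samples is taken from the pitch range given later in the text; the exact normalization of the autocorrelation, the analysis window length, and the function names are assumptions, and the 1000 Hz low-pass filtering is left out.

```python
import math


def normalized_autocorr(x, lag):
    """Normalized autocorrelation of x at the given lag (assumed normalization)."""
    n = len(x) - lag
    a, b = x[:n], x[lag:lag + n]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0


def detect_pitch(x, min_lag=20, max_lag=160):
    """Return (pitch_period, peak): the lag with the largest normalized autocorrelation."""
    best_lag, best_peak = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        r = normalized_autocorr(x, lag)
        if r > best_peak:
            best_lag, best_peak = lag, r
    return best_lag, best_peak


def is_voiced(corrected_peak, threshold=0.6):
    """Unvoiced if the corrected peak is at or below the threshold, voiced otherwise."""
    return corrected_peak > threshold
```

The window passed to `detect_pitch` must be longer than `max_lag`; how much speech the actual pitch detector looks at is not stated in this section.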

【００４２】第３の量子化器（量子化器３）(123)はピ
ッチ周期(m7)を入力し対数変換した後、99レベルで線形
量子化し、その結果であるピッチインデックス(o7)を周
期／非周期ピッチおよび有声／無声情報コード発生器(1
27)へ出力する。図３に量子化器３(123)への入力である
ピッチ周期（20〜160サンプルの範囲をとる）とその出
力であるインデックスの値（０〜98の範囲をとる）の関
係を示す。非周期フラグ発生器(122)は、正規化自己相
関関数の最大値(l7)を入力し、閾値（例えば、0.5）よ
り小さければ非周期フラグをＯＮにセット、そうでなけ
ればＯＦＦにセットして、非周期フラグ（１ビット）(n
7)を非周期ピッチインデックス生成器(124)および、周
期／非周期ピッチおよび有声／無声情報コード生成器(1
27)へ出力する。ここで、非周期フラグ(n7)がＯＮであ
れば、現フレームが非周期性をもつ音源であることを意
味する。The third quantizer (quantizer 3) (123) receives the pitch period (m7), converts it to the logarithmic domain, linearly quantizes it with 99 levels, and outputs the resulting pitch index (o7) to the periodic/aperiodic pitch and voiced/unvoiced information code generator (127). FIG. 3 shows the relationship between the pitch period input to quantizer 3 (123) (in the range of 20 to 160 samples) and the index value it outputs (in the range of 0 to 98). The aperiodic flag generator (122) receives the maximum value (l7) of the normalized autocorrelation function, sets the aperiodic flag to ON if that value is smaller than a threshold (for example, 0.5) and to OFF otherwise, and outputs the aperiodic flag (1 bit) (n7) to the aperiodic pitch index generator (124) and to the periodic/aperiodic pitch and voiced/unvoiced information code generator (127). An aperiodic flag (n7) that is ON means that the current frame has an aperiodic sound source.
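A minimal sketch of a 99-level uniform quantizer in the log domain, together with the aperiodic flag test, is given below. The mapping of FIG. 3 is not reproduced in this section, so the rounding rule and the natural-log scale used here are assumptions; only the 20 to 160 sample range, the 99 levels, and the 0.5 threshold come from the text.

```python
import math

PITCH_MIN, PITCH_MAX, LEVELS = 20, 160, 99
# Uniform step size in the log domain (assumed: natural log, endpoints on levels).
STEP = (math.log(PITCH_MAX) - math.log(PITCH_MIN)) / (LEVELS - 1)


def quantize_pitch(pitch):
    """Map a pitch period in samples to an index in 0..98."""
    clipped = min(max(pitch, PITCH_MIN), PITCH_MAX)
    return round((math.log(clipped) - math.log(PITCH_MIN)) / STEP)


def dequantize_pitch(index):
    """Map an index in 0..98 back to a pitch period in samples."""
    return PITCH_MIN * math.exp(index * STEP)


def aperiodic_flag(peak, threshold=0.5):
    """ON (True) if the autocorrelation peak is smaller than the threshold."""
    return peak < threshold
```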

【００４３】ＬＰＣ分析フィルタ(117)は10次の線形予
測係数(e7)を係数として用いる全零型フィルタであり、
入力音声（b7）からスペクトル包絡情報を除去し、その
結果である残差信号(h7)を出力する。ピーキネス計算器
(118)は、残差信号(h7)を入力し、ピーキネス値を計算
して、(i7)として出力する。ピーキネス値はＭＥＬＰ方
式で説明したのと同様の方法を用いて計算する。相関関
数補正器(119)は、ピーキネス値(i7)が所定の値（例え
ば、1.34）より大きければ、正規化自己相関関数の最大
値(l7)を1.0（有声を示す）にセットし、(j7)を出力す
る。また、前記所定の値以下のときには、前記(l7)を(j
7)としてそのまま出力する。The LPC analysis filter (117) is an all-zero filter that uses the 10th-order linear prediction coefficients (e7) as its coefficients. It removes the spectral envelope information from the input speech (b7) and outputs the resulting residual signal (h7). The peakiness calculator (118) receives the residual signal (h7), calculates the peakiness value, and outputs it as (i7). The peakiness value is calculated in the same way as described for the MELP method. If the peakiness value (i7) is larger than a predetermined value (for example, 1.34), the correlation function corrector (119) sets the maximum value (l7) of the normalized autocorrelation function to 1.0 (indicating voiced) and outputs it as (j7). If the peakiness value is at or below the predetermined value, (l7) is output unchanged as (j7).
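The correction rule can be sketched as follows. The MELP description referred to above lies outside this section, so the peakiness definition used here (RMS of the residual divided by its mean absolute value, the form commonly given for MELP) is an assumption, as are the function names.

```python
import math


def peakiness(residual):
    """Ratio of RMS to mean absolute value of the residual (assumed MELP-style definition)."""
    n = len(residual)
    rms = math.sqrt(sum(e * e for e in residual) / n)
    mean_abs = sum(abs(e) for e in residual) / n
    return rms / mean_abs if mean_abs > 0.0 else 0.0


def correct_correlation(peak_corr, peakiness_value, threshold=1.34):
    """Force the autocorrelation peak to 1.0 when the residual is spiky."""
    return 1.0 if peakiness_value > threshold else peak_corr
```

A residual made of a single spike in an otherwise silent frame gives a large peakiness value, whereas a constant-magnitude signal gives exactly 1.0, which is why a threshold slightly above 1 separates the two cases.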

【００４４】上に述べたピーキネス値の計算および相関
関数補正処理は、ジッタを有する有声フレーム、または
破裂音フレームを検出し、正規化自己相関関数の最大値
を1.0（有声を示す値）に補正するための処理である。
ジッタを有する有声フレーム、または破裂音フレームで
は、部分的にスパイク（鋭いピーク）を持つが、その他
の部分は、白色雑音に近い性質の信号になっているた
め、補正される前の正規化自己相関関数は0.5より小さ
くなる可能性が大きい（つまり、非周期フラグがＯＮに
セットされている可能性が大きい）。一方、ピーキネス
値は大きくなる。従って、ピーキネス値によりジッタを
有する有声フレーム、または破裂音フレームを検出して
正規化自己相関関数を1.0に補正すると、その後の有声
／無声判定器１(126)における有声／無声判定において
有声と判定され、復号の際に非周期パルスが音源として
用いられることになるため、ジッタを有する有声フレー
ム、または破裂音フレームの音質は改善される。The peakiness calculation and correlation function correction described above detect voiced frames with jitter and plosive frames, and correct the maximum value of the normalized autocorrelation function to 1.0 (the value indicating voiced). A voiced frame with jitter or a plosive frame contains isolated spikes (sharp peaks), but the rest of the signal resembles white noise, so the normalized autocorrelation function before correction is likely to be smaller than 0.5 (that is, the aperiodic flag is likely to be set to ON). The peakiness value, on the other hand, becomes large. Therefore, when such a frame is detected from its peakiness value and the normalized autocorrelation function is corrected to 1.0, the subsequent decision in voiced/unvoiced determiner 1 (126) classifies the frame as voiced, and aperiodic pulses are used as the sound source during decoding. This improves the sound quality of voiced frames with jitter and of plosive frames.

【００４５】次に、非周期ピッチインデックス生成器(1
24)および周期／非周期ピッチおよび有声／無声情報コ
ード生成器(127)について説明する。これらを用いて周
期／非周期識別情報を伝送し、後述する復号器において
周期パルス／非周期パルスを切り替えることにより、前
述したＬＰＣ方式の問題点Ｂであるトーン的雑音を低減
することができる。Next, the aperiodic pitch index generator (124) and the periodic/aperiodic pitch and voiced/unvoiced information code generator (127) will be described. These are used to transmit periodic/aperiodic identification information so that the decoder described later can switch between periodic and aperiodic pulses, which reduces the tonal noise identified above as problem B of the LPC method.

【００４６】非周期ピッチインデックス生成器(124)
は、非周期フレームにおけるピッチ周期(m7)を28レベル
で不均一量子化し、インデックス(p7)を出力する。この
不均一量子化の処理内容について説明する。まず、有声
／無声フラグ(s7)が有声、かつ、非周期フラグ(n7)がＯ
Ｎになっているフレーム（過渡部でのジッタを有する有
声フレーム、または、破裂音フレームに対応する）に対
し、ピッチ周期の度数を調べた結果を図４に、その累積
度数を図５に示す。これらは男女各４名（６音声サンプ
ル／各１名）で構成される合計112.12[s]（5606フレー
ム）の音声データについて測定した結果である。上記の
条件（有声／無声フラグ(s7)が有声、かつ、非周期フラ
グ(n7)がＯＮ）を満たすフレームは、5606フレーム中42
5フレーム存在した。図４および５より、その条件を満
たすフレーム（以後、非周期フレームと記す）における
ピッチ周期の分布はおよそ25〜100に集中していること
が分かる。よって、度数（出現頻度）に基づく不均一量
子化を行えば、すなわち、度数が大きなピッチ周期ほど
細かく、それが小さいピッチ周期ほど荒く量子化すれば
高能率に伝送することができる。The aperiodic pitch index generator (124) non-uniformly quantizes the pitch period (m7) of an aperiodic frame with 28 levels and outputs the index (p7). This non-uniform quantization is described below. First, FIG. 4 shows the frequency distribution of the pitch period for frames in which the voiced/unvoiced flag (s7) indicates voiced and the aperiodic flag (n7) is ON (these correspond to voiced frames with jitter in transient portions, or to plosive frames), and FIG. 5 shows the cumulative frequency. These were measured on speech data totaling 112.12 s (5606 frames) from four male and four female speakers (six speech samples per speaker). Of the 5606 frames, 425 satisfied the above condition (voiced/unvoiced flag (s7) voiced and aperiodic flag (n7) ON). FIGS. 4 and 5 show that the pitch periods of the frames satisfying the condition (hereinafter "aperiodic frames") are concentrated at roughly 25 to 100. Efficient transmission is therefore possible with non-uniform quantization based on frequency of occurrence, that is, by quantizing frequently occurring pitch periods finely and rarely occurring pitch periods coarsely.

【００４７】また、後述するように、復号器側では、非
周期フレームのピッチ周期は次式により計算される。非周期フレームのピッチ周期＝伝送されたピッチ周期×
（1.0＋0.25×乱数値）上式で、伝送されたピッチ周期とは、非周期ピッチイン
デックス生成器(124)の出力であるインデックスにより
伝送されるピッチ周期であり、これに（1.0＋0.25×乱
数値）を乗算することによりピッチ周期毎にジッタが付
加される。したがって、ピッチ周期が大きいほど、ジッ
タの量も大きくなるため、荒い量子化が許される。As described later, the decoder calculates the pitch period of an aperiodic frame as: pitch period of aperiodic frame = transmitted pitch period × (1.0 + 0.25 × random value). Here, the transmitted pitch period is the pitch period conveyed by the index output from the aperiodic pitch index generator (124); multiplying it by (1.0 + 0.25 × random value) adds jitter to each pitch period. The larger the pitch period, the larger the amount of jitter, so coarser quantization is acceptable.
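The formula above can be written out directly. The text does not state the range of the random value in this section; the sketch assumes it is uniform on [-1, 1], and the function name and the `rng` parameter are illustrative.

```python
import random


def jittered_pitch(transmitted_pitch, jitter=0.25, rng=random):
    """Pitch period of an aperiodic frame: transmitted pitch x (1.0 + jitter x random value)."""
    u = rng.uniform(-1.0, 1.0)  # assumed range of the random value
    return transmitted_pitch * (1.0 + jitter * u)
```

Under this assumption a transmitted pitch of 40 samples is spread over 30 to 50 samples and a pitch of 120 over 90 to 150, which illustrates why a coarser quantization step is tolerable at long pitch periods.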

【００４８】上記の考えに基づいた非周期フレームのピ
ッチ周期に対する量子化テーブルの例を表４に示す。同
表では、入力ピッチ周期が20〜24の範囲を１レベル、25
〜50の範囲を13レベル（２ステップ幅）、51〜95の範囲
を９レベル（５ステップ幅）、96〜135の範囲を４レベ
ル（10ステップ幅）、136〜160の範囲を１レベルで量子
化し、インデックス（非周期０〜27）を出力する。通常
のピッチ周期の量子化は、６４レベル以上必要であるの
に対し、この非周期フレームのピッチ周期の量子化で
は、度数、復号方法を考慮することにより、２８レベル
で量子化することが可能となる。Table 4 shows an example of a quantization table for the pitch period of aperiodic frames based on the above idea. In this table, input pitch periods of 20 to 24 are quantized to 1 level, 25 to 50 to 13 levels (step width 2), 51 to 95 to 9 levels (step width 5), 96 to 135 to 4 levels (step width 10), and 136 to 160 to 1 level, and an index (aperiodic 0 to 27) is output. Whereas ordinary pitch period quantization requires 64 levels or more, the pitch period of aperiodic frames can be quantized with 28 levels by taking the frequency of occurrence and the decoding method into account.
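Table 4 itself is not reproduced in this text, but its cell boundaries follow from the ranges and step widths listed above. The sketch below rebuilds them for integer pitch periods; using the midpoint of each cell as the decoded value is an assumption, as are the names.

```python
# (low, high, step) for each range described in the text; the two end
# ranges are single cells, so their step simply spans the whole range.
RANGES = [(20, 24, 5), (25, 50, 2), (51, 95, 5), (96, 135, 10), (136, 160, 25)]


def build_table():
    """Return the list of (low, high) cells, one per quantization level."""
    cells = []
    for low, high, step in RANGES:
        for start in range(low, high + 1, step):
            cells.append((start, min(start + step - 1, high)))
    return cells


TABLE = build_table()


def quantize_aperiodic_pitch(pitch):
    """Map an integer pitch period (clipped to 20..160) to an index in 0..27."""
    pitch = min(max(pitch, 20), 160)
    for index, (low, high) in enumerate(TABLE):
        if low <= pitch <= high:
            return index
    raise ValueError("pitch must be an integer number of samples")


def representative(index):
    """Decoded pitch for an index (assumed: midpoint of the cell)."""
    low, high = TABLE[index]
    return (low + high) / 2
```

The level counts 1 + 13 + 9 + 4 + 1 add up to the 28 levels stated in the text.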

【００４９】[0049]

【表４】 [Table 4]

【００５０】周期／非周期ピッチおよび有声／無声情報
コード゛生成器(127)は、有声／無声フラグ(s7)、非周
期フラグ(n7)、ピッチインデックス(o7)、非周期的ピッ
チインデックス(p7)を入力し、７ビット（128レベル）
の周期／非周期ピッチ・有声／無声情報コード(t7)を出
力する。ここでの処理について以下に述べる。有声／無
声フラグ(s7)が無声を示す場合は、７ビットの符号（12
8種類の符号語を持つ）うち、７ビットが全て０の符号
語を割り当てる。同フラグが有声を示す場合は、残りの
符号語（127種類）を非周期フラグ(n7)に基づき、ピッ
チインデックス(o7)または非周期ピッチインデックス(p
7)に割り当てる。非周期フラグ(n7)がＯＮの時は、非周
期ピッチインデックス(p7)（非周期０〜27）を７ビット
中１ビットあるいは２ビットが１となる符号語（計28種
類）に割り当てる。その他の符号語（99種類）は周期的
なピッチインデックス(o7)（周期０〜98）に割り当て
る。The periodic/aperiodic pitch and voiced/unvoiced information code generator (127) receives the voiced/unvoiced flag (s7), the aperiodic flag (n7), the pitch index (o7), and the aperiodic pitch index (p7), and outputs a 7-bit (128-level) periodic/aperiodic pitch and voiced/unvoiced information code (t7). The processing is as follows. When the voiced/unvoiced flag (s7) indicates unvoiced, the codeword whose 7 bits are all 0 is assigned from the 7-bit code (which has 128 codewords). When the flag indicates voiced, the remaining 127 codewords are assigned to the pitch index (o7) or the aperiodic pitch index (p7) according to the aperiodic flag (n7). When the aperiodic flag (n7) is ON, the aperiodic pitch index (p7) (aperiodic 0 to 27) is assigned to the codewords in which 1 or 2 of the 7 bits are 1 (28 codewords in total). The other 99 codewords are assigned to the periodic pitch index (o7) (periodic 0 to 98).

【００５１】以上に基づく周期／非周期ピッチ・有声／
無声情報コードの生成テーブルを表５に示す。通常、伝
送誤りにより有声／無声情報に誤りが発生し、無声フレ
ームが誤って有声フレームとして復号された場合、周期
的音源が使用されるため再生音声の品質は著しく劣下す
る。上述のように、非周期ピッチインデックス(p7)（非
周期０〜27）を７ビット中１ビットあるいは２ビットが
１となる符号語（計28種類）に割り当てることにより、
無声の符号語（0x0)が伝送誤りにより１または２ビット
誤ったとしても、非周期的なピッチパルスにより音源信
号が作られるため、伝送誤りによる影響を軽減すること
が出来る。なお、無声の符号語にオール１(0x7F)を割り
当て、非周期ピッチインデックスに７ビット中１ビット
または２ビットが０となる符号語を割当てるようにして
もよい。前述したＭＥＬＰ方式では非周期フラグの伝送
に１ビット使用していたが、この方法を用いることによ
り、それが不要となり、伝送ビット数の削減が可能とな
る。Table 5 shows the generation table for the periodic/aperiodic pitch and voiced/unvoiced information code based on the above. Ordinarily, when a transmission error corrupts the voiced/unvoiced information and an unvoiced frame is mistakenly decoded as a voiced frame, a periodic sound source is used and the quality of the reproduced speech degrades markedly. By assigning the aperiodic pitch index (p7) (aperiodic 0 to 27) to the codewords in which 1 or 2 of the 7 bits are 1 (28 codewords in total), as described above, the sound source signal is generated from aperiodic pitch pulses even if the unvoiced codeword (0x0) suffers a 1-bit or 2-bit transmission error, which reduces the effect of the error. Alternatively, the all-ones codeword (0x7F) may be assigned to unvoiced, and the codewords in which 1 or 2 of the 7 bits are 0 may be assigned to the aperiodic pitch index. The MELP method described above uses 1 bit to transmit the aperiodic flag; with this method that bit becomes unnecessary and the number of transmitted bits can be reduced.
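The codeword partition described above can be checked with a short sketch. Table 5 is not reproduced in this text, so the ordering of codewords within each group (ascending numeric order here) and all names are assumptions; only the partition by number of 1 bits comes from the description.

```python
UNVOICED_WORD = 0x00


def build_codebook():
    """Split the 127 non-zero 7-bit codewords by their number of 1 bits."""
    weights = {w: bin(w).count("1") for w in range(1, 128)}
    aperiodic = [w for w in range(1, 128) if weights[w] <= 2]
    periodic = [w for w in range(1, 128) if weights[w] >= 3]
    return aperiodic, periodic


APERIODIC_WORDS, PERIODIC_WORDS = build_codebook()


def encode(voiced, aperiodic, pitch_index, aperiodic_index):
    """Return the 7-bit code for one frame (index order within a group is assumed)."""
    if not voiced:
        return UNVOICED_WORD
    if aperiodic:
        return APERIODIC_WORDS[aperiodic_index]
    return PERIODIC_WORDS[pitch_index]
```

There are 7 codewords with one 1 bit and 21 with two, giving the 28 aperiodic codewords, and 128 - 1 - 28 = 99 codewords remain for the periodic pitch index. Every 1-bit or 2-bit corruption of `0x00` therefore lands in the aperiodic group.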

【００５２】[0052]

【表５】 [Table 5]

【００５３】ＨＰＦ（ハイパスフィルタ）(128)は入力
音声(b7)をカットオフ周波数1000Hzでフィルタリング
し、高周波数成分（1000Hz以上の成分）(u7)を出力す
る。相関関数計算器(129)は、(u7)に対してピッチ周期
（m7）で与えられる遅延量における正規化自己相関関数
（v7）を計算し出力する。第２の有声／無声判定器（有
声／無声判定器２）(130)は、正規化自己相関関数(v7)
が閾値（例えば、0.5）以下であれば無声、そうでなけ
れば有声と判定し、その結果である高域有声／無声フラ
グ(w7)を出力する。この高域有声／無声フラグは高周波
数帯域の有声／無声識別情報に相当する。The HPF (high-pass filter) (128) filters the input speech (b7) with a cutoff frequency of 1000 Hz and outputs the high-frequency component (the component at or above 1000 Hz) (u7). The correlation function calculator (129) calculates and outputs the normalized autocorrelation function (v7) of (u7) at the lag given by the pitch period (m7). The second voiced/unvoiced determiner (voiced/unvoiced determiner 2) (130) judges the band unvoiced if the normalized autocorrelation function (v7) is at or below a threshold (for example, 0.5) and voiced otherwise, and outputs the resulting high-band voiced/unvoiced flag (w7). This high-band voiced/unvoiced flag corresponds to the voiced/unvoiced identification information of the high frequency band.

【００５４】ビットパッキング器(125)は、量子化され
たＲＭＳ値（ゲイン情報）(d7)、ＬＳＦパラメータイン
デックス(g7)、周期／非周期ピッチ・有声／無声情報コ
ード(t7)および高域有声／無声フラグ(w7)を入力して、
１フレーム（20ms）当たり32ビットの音声情報ビット列
(q7)を出力する（表６）。これにより、ここに示した実
施の形態では、音声符号化速度1.6kbpsが実現できる。
なお、本実施の形態では、前述したＭＥＬＰ方式のよう
にハーモニック振幅情報は伝送していない。この理由は
次の通りである。音声符号化フレーム長を20msと短くし
ているため（ＭＥＬＰ方式では22.5ms）、ＬＳＦパラメ
ータを抽出する周期が短くなり、スペクトル表現の正確
さが向上する。従ってハーモニック振幅情報は必要とし
ない。また、ここでは、ＨＰＦ(128)、相関関数計算器
(129)および有声／無声判定器２(130)を設けて高域有声
／無声フラグ(w7)を送信するようにしているが、後述す
るように、この前記高域有声／無声フラグ(w7)は必ずし
も送る必要はない。The bit packing unit (125) receives the quantized RMS value (gain information) (d7), the LSF parameter index (g7), the periodic/aperiodic pitch and voiced/unvoiced information code (t7), and the high-band voiced/unvoiced flag (w7), and outputs a speech information bit string (q7) of 32 bits per frame (20 ms) (Table 6). The embodiment shown here thus achieves a speech coding rate of 1.6 kbps. Unlike the MELP method described above, this embodiment does not transmit harmonic amplitude information, for the following reason. Because the speech coding frame length is shortened to 20 ms (22.5 ms in the MELP method), the LSF parameters are extracted at shorter intervals and the spectrum is represented more accurately, so harmonic amplitude information is not needed. Although the HPF (128), the correlation function calculator (129), and voiced/unvoiced determiner 2 (130) are provided here so that the high-band voiced/unvoiced flag (w7) is transmitted, this flag does not necessarily have to be sent, as described later.
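The frame budget can be illustrated with a small packing sketch. The field widths (5, 19, 7, and 1 bits) come from the preceding paragraphs; Table 6 is not reproduced in this text, so the field order within the 32-bit word and all names are assumptions.

```python
# (name, width in bits); the order within the frame is an assumption.
FIELDS = (("gain", 5), ("lsf", 19), ("pitch_voicing", 7), ("high_band_voiced", 1))
FRAME_BITS = sum(width for _, width in FIELDS)
FRAMES_PER_SECOND = 50  # one frame every 20 ms
BIT_RATE_BPS = FRAME_BITS * FRAMES_PER_SECOND


def pack_frame(values):
    """Pack the four parameters into one 32-bit integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_frame(word):
    """Inverse of pack_frame: split a 32-bit integer back into its fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

The widths sum to 32 bits, and 32 bits every 20 ms gives the 1600 bit/s rate stated above.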

【００５５】[0055]

【表６】 [Table 6]

【００５６】次に、上述の音声符号化器により符号化さ
れた音声情報ビット列を復号し音声信号を再生する、本
発明の音声復号方法が適用された音声復号器の一実施の
形態について、図２を参照して説明する。図２におい
て、ビット分離器(131)は１フレーム毎に受信した32ビ
ットの音声情報ビット列(a8)を各パラメータ毎に分離
し、周期／非周期ピッチ・有声／無声情報コード(b8)、
高域有声／無声フラグ(f8)、ゲイン情報(m8)およびＬＳ
Ｆパラメータインデックス(h8)を出力する。有声／無声
情報・ピッチ周期復号器(132)は周期／非周期ピッチ・
有声／無声情報コード(b8)を入力し、前記表５に示した
テーブルに基づき、無声／周期的／非周期的のうちどれ
であるかを求め、無声ならば、ピッチ周期(c8)を所定の
値（例えば、50）にセット、有声／無声フラグ(d8)を０
にセットして出力する。周期的および非周期的の場合
は、ピッチ周期(c8)を復号処理（非周期的の場合は前記
表４を用いる）して出力し、有声／無声フラグ(d8)を1.
0にセットして出力する。Next, an embodiment of a speech decoder to which the speech decoding method of the present invention is applied, and which decodes the speech information bit string encoded by the above speech encoder and reproduces the speech signal, will be described with reference to FIG. 2. In FIG. 2, the bit separator (131) separates the 32-bit speech information bit string (a8) received for each frame into its parameters and outputs the periodic/aperiodic pitch and voiced/unvoiced information code (b8), the high-band voiced/unvoiced flag (f8), the gain information (m8), and the LSF parameter index (h8). The voiced/unvoiced information and pitch period decoder (132) receives the periodic/aperiodic pitch and voiced/unvoiced information code (b8) and determines from the table shown in Table 5 whether the frame is unvoiced, periodic, or aperiodic. If it is unvoiced, the decoder sets the pitch period (c8) to a predetermined value (for example, 50) and the voiced/unvoiced flag (d8) to 0, and outputs them. If it is periodic or aperiodic, the decoder decodes the pitch period (c8) (using Table 4 in the aperiodic case) and outputs it, and outputs the voiced/unvoiced flag (d8) set to 1.0.

【００５７】ジッタ設定器(133)は、周期／非周期ピッ
チ・有声／無声情報コード(b8)を入力し、表５に基づ
き、無声／周期的／非周期的のうちどれであるかを求
め、無声または非周期的を示す場合は、ジッタ値(e8)を
所定の値（例えば、0.25）にセットして出力する。周期
的を示す場合は、ジッタ値(e8)を０にセットして出力す
る。ＬＳＦ復号器(138)は、ＬＳＦパラメータインデッ
クス(h8)から10次のＬＳＦ係数(i8)を復号し出力する。
傾斜補正係数計算器(137)は、10次のＬＳＦ係数(i8)か
ら傾斜補正係数(j8)を計算する。ゲイン復号器(139)は
ゲイン情報(m8)を復号し、ゲイン情報(n8)を出力する。
第１の線形予測係数計算器（線形予測係数計算器１）(1
36)は、ＬＳＦ係数(i8)を線形予測係数に変換し、線形
予測係数(k8)を出力する。スペクトル包絡振幅計算器(1
35)は、線形予測係数(k8)からスペクトル包絡振幅(l8)
を計算する。ここで、有声／無声フラグ(d8)、高域有声
／無声フラグ(f8)は、それぞれ、低周波数帯域の有声／
無声識別情報、高周波数帯域の有声／無声識別情報に相
当する。The jitter setting unit (133) receives the periodic/aperiodic pitch and voiced/unvoiced information code (b8) and determines from Table 5 whether the frame is unvoiced, periodic, or aperiodic. If it is unvoiced or aperiodic, the jitter value (e8) is set to a predetermined value (for example, 0.25) and output; if it is periodic, the jitter value (e8) is set to 0 and output. The LSF decoder (138) decodes the 10th-order LSF coefficients (i8) from the LSF parameter index (h8) and outputs them. The tilt correction coefficient calculator (137) calculates the tilt correction coefficient (j8) from the 10th-order LSF coefficients (i8). The gain decoder (139) decodes the gain information (m8) and outputs the gain information (n8). The first linear prediction coefficient calculator (linear prediction coefficient calculator 1) (136) converts the LSF coefficients (i8) into linear prediction coefficients and outputs them as (k8). The spectral envelope amplitude calculator (135) calculates the spectral envelope amplitude (l8) from the linear prediction coefficients (k8). Here, the voiced/unvoiced flag (d8) and the high-band voiced/unvoiced flag (f8) correspond to the voiced/unvoiced identification information of the low frequency band and of the high frequency band, respectively.
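The decoder-side classification of the 7-bit code, and the flag and jitter values derived from it, can be sketched as follows. Classifying by the number of 1 bits follows the assignment rule described for the encoder; Table 5 itself is not reproduced in this text, pitch decoding is omitted, and the function names are assumptions.

```python
def classify(code):
    """Classify a 7-bit code as 'unvoiced', 'aperiodic', or 'periodic' by its number of 1 bits."""
    weight = bin(code & 0x7F).count("1")
    if weight == 0:
        return "unvoiced"
    if weight <= 2:
        return "aperiodic"
    return "periodic"


def decode_voicing(code, jitter=0.25):
    """Return (state, voiced_flag, jitter_value) for one frame."""
    state = classify(code)
    voiced_flag = 0.0 if state == "unvoiced" else 1.0
    jitter_value = 0.0 if state == "periodic" else jitter
    return state, voiced_flag, jitter_value
```

For example, `decode_voicing(0x03)` yields an aperiodic frame with the voiced flag at 1.0 and a jitter value of 0.25, so a frame whose unvoiced codeword was hit by two bit errors is still synthesized from jittered pulses rather than from a strictly periodic sound source.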

【００５８】次に、パルス音源／雑音音源混合比計算器
(134)の構成について説明する。図６はパルス音源／雑
音音源混合比計算器(134)の構成を示す図であり、図２
における、有声／無声フラグ(d8)、スペクトル包絡振幅
(l8)、および、高域有声／無声フラグ(f8)を入力し、各
帯域（サブバンド）の混合比(g8)を決定し出力する。本
発明の音声復号方法のこの実施の形態においては、周波
数軸上で４つの帯域に分割して、それぞれの帯域毎にパ
ルス音源と雑音音源の混合比を決定して混合信号を生成
し、これらの信号を加算した混合音源信号を用いてい
る。この４つの帯域としては、サブバンド１（0〜1000H
z）、サブバンド２（1000〜2000Hz）、サブバンド３（2
000〜3000Hz）、およびサブバンド４（3000〜4000Hz）
を設定する。サブバンド１は低周波数帯域、サブバンド
２，３，４は高周波数の各帯域に対応する。Next, the configuration of the pulse sound source/noise sound source mixing ratio calculator (134) will be described. FIG. 6 shows its configuration. It receives the voiced/unvoiced flag (d8), the spectral envelope amplitude (l8), and the high-band voiced/unvoiced flag (f8) of FIG. 2, and determines and outputs the mixing ratio (g8) of each band (subband). In this embodiment of the speech decoding method of the present invention, the frequency axis is divided into four bands, the mixing ratio of the pulse sound source and the noise sound source is determined for each band to generate a mixed signal, and the mixed sound source signal obtained by summing these signals is used. The four bands are subband 1 (0 to 1000 Hz), subband 2 (1000 to 2000 Hz), subband 3 (2000 to 3000 Hz), and subband 4 (3000 to 4000 Hz). Subband 1 corresponds to the low frequency band, and subbands 2, 3, and 4 correspond to the high frequency bands.

【００５９】図６のサブバンド１有声強度設定器(160)
は、有声／無声フラグ(d8)を入力し、サブバンド１の有
声強度(a10)を設定する。ここでは、有声／無声フラグ
(d8)が1.0であれば有声強度(a10)を1.0、有声／無声フ
ラグ(d8)が０であれば有声強度(a10)を０と設定する。
サブバンド２，３，４平均振幅計算器(161)は、スペク
トル包絡振幅(l8)を入力してサブバンド２，３，４にお
けるスペクトル包絡振幅の平均値を計算し、それぞれ(b
10)、(c10)および(d10)として出力する。サブバンド選
択器(162)は、(b10)、(c10)および(d10)を入力し、スペ
クトル包絡振幅の平均値が最大となるサブバンド番号(e
10)を出力する。サブバンド２，３，４有声強度テーブ
ル(有声用)(163)は、３つの３次元ベクトル（(f101)、
(f102)、(f103)）を記憶しており、それぞれの３次元ベ
クトルは、有声フレーム時のサブバンド２，３，４の有
声強度から構成されている。第１の切替え器（切替え器
１）(165)は、サブバンド番号(e10)に応じて３つの３次
元ベクトルから１つのベクトル(h10)を選択し出力す
る。The subband 1 voiced strength setting unit (160) in FIG. 6 receives the voiced/unvoiced flag (d8) and sets the voiced strength (a10) of subband 1: (a10) is set to 1.0 if the voiced/unvoiced flag (d8) is 1.0 and to 0 if it is 0. The subband 2, 3, 4 average amplitude calculator (161) receives the spectral envelope amplitude (l8), calculates the average spectral envelope amplitude in subbands 2, 3, and 4, and outputs the results as (b10), (c10), and (d10), respectively. The subband selector (162) receives (b10), (c10), and (d10) and outputs the number (e10) of the subband whose average spectral envelope amplitude is largest. The subband 2, 3, 4 voiced strength table (for voiced) (163) stores three three-dimensional vectors ((f101), (f102), (f103)), each consisting of the voiced strengths of subbands 2, 3, and 4 for a voiced frame. The first switch (switch 1) (165) selects one vector (h10) from the three three-dimensional vectors according to the subband number (e10) and outputs it.

【００６０】サブバンド２，３，４有声強度テーブル
(無声用)(164)は、同様に３つの３次元ベクトル（(g10
1)、(g102)、(g103)）を記憶しており、それぞれの３次
元ベクトルは、無声フレーム時のサブバンド２，３，４
の有声強度から構成されている。第２の切替え器（切替
え器２）(166)はサブバンド番号(e10)に応じて３つの３
次元ベクトルから１つのベクトル(i10)を選択し出力す
る。第３の切替え器（切替え器３）(167)は高域有声／
無声フラグ(f8)を入力し、それが有声を示す場合は(h1
0)を、無声を示す場合は(i10)を選択し、(j10)として出
力する。なお、前述したように高域有声／無声フラグ(w
7)が送られない場合には、高域有声／無声フラグ(f8)の
代わりに有声／無声フラグ(d8)を使用すればよい。The subband 2, 3, 4 voiced strength table (for unvoiced) (164) similarly stores three three-dimensional vectors ((g101), (g102), (g103)), each consisting of the voiced strengths of subbands 2, 3, and 4 for an unvoiced frame. The second switch (switch 2) (166) selects one vector (i10) from the three three-dimensional vectors according to the subband number (e10) and outputs it. The third switch (switch 3) (167) receives the high-band voiced/unvoiced flag (f8), selects (h10) if the flag indicates voiced and (i10) if it indicates unvoiced, and outputs the selection as (j10). As mentioned above, when the high-band voiced/unvoiced flag (w7) is not transmitted, the voiced/unvoiced flag (d8) may be used in place of the high-band voiced/unvoiced flag (f8).

【００６１】混合比計算器(168)はサブバンド１の有声
強度(a10)、サブバンド２，３，４の有声強度(j10)を入
力し、各サブバンドの混合比(g8)を出力する。この混合
比(g8)は、各サブバンドでのパルス音源の割合を示すsb
1_p、sb2_p、sb3_p、sb4_pと、雑音音源の割合を示すsb
1_n、sb2_n、sb3_n、sb4_nにより構成される（ここで、
sbx_yにおいてxはサブバンド番号を示し、yがpの時はパ
ルス音源、yがnの時は雑音音源を示す）。sb1_p、sb2_
p、sb3_p、sb4_pとしては、サブバンド１の有声強度(a1
0)、サブバンド２，３，４の有声強度(j10)の値をそれ
ぞれそのまま使用する。sbx_n（x=1,...4）について
は、sbx_n＝（1.0−sbx_p）（x=1,..,4）と設定する。The mixing ratio calculator (168) receives the voiced strength (a10) of subband 1 and the voiced strengths (j10) of subbands 2, 3, and 4, and outputs the mixing ratio (g8) of each subband. The mixing ratio (g8) consists of sb1_p, sb2_p, sb3_p, and sb4_p, which give the proportion of the pulse sound source in each subband, and sb1_n, sb2_n, sb3_n, and sb4_n, which give the proportion of the noise sound source (in sbx_y, x is the subband number, y = p denotes the pulse sound source, and y = n denotes the noise sound source). The voiced strength (a10) of subband 1 and the voiced strengths (j10) of subbands 2, 3, and 4 are used directly as sb1_p, sb2_p, sb3_p, and sb4_p. The noise proportions are set as sbx_n = 1.0 - sbx_p (x = 1, ..., 4).
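The table lookup and the rule sbx_n = 1.0 - sbx_p can be sketched as below. The actual contents of the voiced and unvoiced strength tables are not given in this section, so the numbers in `VOICED_TABLE` and `UNVOICED_TABLE` are placeholders chosen only for illustration, and the function and variable names are assumptions.

```python
# Placeholder voiced strengths for subbands 2, 3, 4, keyed by the subband whose
# average spectral envelope amplitude is largest. NOT the patent's values.
VOICED_TABLE = {2: (0.8, 0.6, 0.4), 3: (0.7, 0.7, 0.4), 4: (0.6, 0.5, 0.6)}
UNVOICED_TABLE = {2: (0.3, 0.2, 0.1), 3: (0.2, 0.3, 0.1), 4: (0.2, 0.2, 0.3)}


def mixing_ratios(voiced_flag, high_band_voiced, band_means):
    """Return (pulse, noise) proportions for subbands 1..4.

    band_means holds the average spectral envelope amplitude of
    subbands 2, 3, and 4, in that order.
    """
    strongest = 2 + max(range(3), key=lambda i: band_means[i])
    table = VOICED_TABLE if high_band_voiced else UNVOICED_TABLE
    pulse = (1.0 if voiced_flag else 0.0,) + table[strongest]
    noise = tuple(1.0 - p for p in pulse)
    return pulse, noise
```

When the high-band flag is not transmitted, the same function can be called with `high_band_voiced` set from the voiced/unvoiced flag, as the text allows.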

[0062] Next, the method used to determine the subband 2, 3, 4 voiced strength table (for voiced frames) (163) is described. The table values were determined from the voiced strength measurements for subbands 2, 3, and 4 in voiced frames, shown in FIG. 8. The measurement procedure for that figure is as follows. For each frame (20 ms) of the input speech, the average spectral envelope amplitude in each of subbands 2, 3, and 4 is calculated, and the frames are classified into three frame groups: the frames in which the average is largest in subband 2 (denoted fg_sb2), the frames in which it is largest in subband 3 (denoted fg_sb3), and the frames in which it is largest in subband 4 (denoted fg_sb4). Next, each speech frame belonging to frame group fg_sb2 was split into the subband signals corresponding to subbands 2, 3, and 4, the normalized autocorrelation function at the pitch period was computed for each subband signal, and its average was taken for each subband.
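A minimal sketch of the two measurement steps, written in Python for illustration. The specification does not give the exact form of the normalized autocorrelation, so the common definition used here (cross-energy divided by the geometric mean of the two segment energies) is an assumption, as are all function names.

```python
import math

def dominant_subband(avg_amplitude):
    """Return the subband number whose average spectral envelope amplitude is largest.

    avg_amplitude maps subband number (2, 3, 4) to its average amplitude;
    the result decides the frame group (fg_sb2, fg_sb3 or fg_sb4).
    """
    return max(avg_amplitude, key=avg_amplitude.get)

def normalized_autocorrelation(x, lag):
    """Normalized autocorrelation of the sample list x at the given lag (assumed definition)."""
    n = len(x) - lag
    if lag <= 0 or n <= 0:
        raise ValueError("lag must be positive and shorter than the signal")
    head, tail = x[:n], x[lag:]
    num = sum(a * b for a, b in zip(head, tail))
    den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
    return num / den if den > 0.0 else 0.0
```

A subband signal that repeats exactly at the pitch period yields a value of 1.0, and the per-subband voiced strength of a frame group is the mean of these values over the frames in the group.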

[0063] The horizontal axis of FIG. 8 is the subband number. The normalized autocorrelation function is a parameter that indicates how strongly periodic, that is, how strongly voiced, the input signal is, so it serves as the voiced strength. The vertical axis of FIG. 8 is the voiced strength (normalized autocorrelation) of each subband signal. The curve marked ◆ shows the results measured for fg_sb2. Likewise, the curve marked ● shows the results measured for frame group fg_sb3, and the curve marked ○ shows the results measured for frame group fg_sb4. The input speech used for these measurements consists of speech from a speech database CD-ROM and speech recorded from FM broadcasts.

[0064] FIG. 8 shows the following trends.
First, in frames where the average spectral envelope amplitude is largest in subband 2 or 3 (◆ and ●), the voiced strength decreases monotonically as the subband frequency increases.
Second, in frames where the average spectral envelope amplitude is largest in subband 4 (○), the voiced strength does not decrease monotonically as the subband frequency increases; the voiced strength of subband 4 is comparatively high, and the voiced strengths of subbands 2 and 3 are lower than in the frames where the average is largest in subband 2 or 3 (◆ and ●).
Third, in frames where the average spectral envelope amplitude is largest in subband 2 (◆), the voiced strength of subband 2 is higher than the voiced strength of subband 2 in the ● and ○ frames. Likewise, in frames where the average is largest in subband 3 (●), the voiced strength of subband 3 is higher than the voiced strength of subband 3 in the ◆ and ○ frames, and in frames where the average is largest in subband 4 (○), the voiced strength of subband 4 is higher than the voiced strength of subband 4 in the ◆ and ● frames.

[0065] Accordingly, the subband 2, 3, 4 voiced strength table (for voiced frames) (163) in FIG. 6 stores three three-dimensional vectors: the voiced strength values of the ◆ curve as (f101), those of the ● curve as (f102), and those of the ○ curve as (f103). Selecting among these vectors according to the subband number (e10), which identifies the subband with the largest average spectral envelope amplitude, sets voiced strengths that suit the spectral envelope amplitude. Table 7 shows the contents of the subband 2, 3, 4 voiced strength table (for voiced frames) (163).

【表７】 [Table 7]

[0066] The subband 2, 3, 4 voiced strength table (for unvoiced frames) (164) was determined from the voiced strength measurements for subbands 2, 3, and 4 in unvoiced frames, shown in FIG. 9. The measurement procedure for that figure and the way the table contents were determined are exactly the same as for the voiced frames described above. FIG. 9 shows the following trend. In frames where the average spectral envelope amplitude is largest in subband 2 (◆), the voiced strength of subband 2 is lower than the voiced strength of subband 2 in the ● and ○ frames. Likewise, in frames where the average is largest in subband 3 (●), the voiced strength of subband 3 is lower than the voiced strength of subband 3 in the ◆ and ○ frames, and in frames where the average is largest in subband 4 (○), the voiced strength of subband 4 is lower than the voiced strength of subband 4 in the ◆ and ● frames. Table 8 shows the contents of this table.

【表８】 [Table 8]
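The table lookup described for tables (163) and (164) can be sketched as below. This is a hedged illustration in Python: the function name is hypothetical, the use of the high-band voiced/unvoiced flag to choose between the two tables is an assumption about how the switches in FIG. 6 are wired, and the numbers are placeholders chosen only to follow the trends stated for FIG. 8 and FIG. 9. They are NOT the contents of Table 7 or Table 8.

```python
def select_voiced_strengths(high_band_voiced, max_subband, voiced_table, unvoiced_table):
    """Pick the (subband 2, subband 3, subband 4) voiced strength vector.

    high_band_voiced -- True if the high band is flagged voiced (assumed selector)
    max_subband      -- 2, 3 or 4: subband with the largest average envelope amplitude
    """
    table = voiced_table if high_band_voiced else unvoiced_table
    return table[max_subband]

# PLACEHOLDER values for illustration only (not taken from Table 7 / Table 8).
EXAMPLE_VOICED = {2: (0.9, 0.6, 0.4), 3: (0.8, 0.7, 0.4), 4: (0.6, 0.5, 0.7)}
EXAMPLE_UNVOICED = {2: (0.2, 0.4, 0.4), 3: (0.4, 0.2, 0.4), 4: (0.4, 0.4, 0.2)}
```

With these placeholders, `select_voiced_strengths(True, 4, EXAMPLE_VOICED, EXAMPLE_UNVOICED)` returns a vector in which subband 4 is stronger than subband 3, matching the non-monotonic trend described for the ○ frames.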

[0067] Returning to FIG. 2, the parameter interpolator (140) linearly interpolates, in synchronization with the pitch period, each of the following parameters: the pitch period (c8), jitter value (e8), mixing ratio (g8), tilt correction coefficient (j8), LSF coefficients (i8), and gain information (n8). It outputs the interpolated pitch period (o8), jitter value (p8), mixing ratio (r8), tilt correction coefficient (s8), LSF coefficients (t8), and gain information (u8). The linear interpolation is performed according to the following equation.

  interpolated parameter = current-frame parameter × int + previous-frame parameter × (1.0 - int)

Here, the current-frame parameter is each of (c8), (e8), (g8), (j8), (i8), and (n8), and the interpolated parameter is the corresponding one of (o8), (p8), (r8), (s8), (t8), and (u8). The previous-frame parameters are obtained by holding the values of (c8), (e8), (g8), (j8), (i8), and (n8) from the previous frame. int is the interpolation coefficient and is given by the following equation.

  int = to / 160.0

Here, 160.0 is the number of samples per speech decoding frame (20 ms), and to is the starting sample position of one pitch period within the decoding frame; it is updated by adding the pitch period each time one pitch period of reproduced speech has been decoded. When to exceeds 160, decoding of that frame is complete, and 160 is subtracted from to. If the interpolation coefficient int is fixed at 1.0, the pitch-synchronous linear interpolation is not performed.
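The interpolation and the update of to can be sketched as follows. This is an illustrative Python sketch, not the patented implementation; the function names and the `(new_to, frame_done)` return convention are assumptions.

```python
FRAME_LEN = 160  # samples per 20 ms decoding frame, as stated in the text

def interpolation_coefficient(to):
    """int = to / 160.0, where to is the start sample of the current pitch period."""
    return to / 160.0

def interpolate(current, previous, coeff):
    """interpolated = current-frame value * int + previous-frame value * (1.0 - int)."""
    return current * coeff + previous * (1.0 - coeff)

def advance(to, pitch_period):
    """Add one decoded pitch period to to.

    Returns (new_to, frame_done). When to exceeds the frame length the frame
    is finished and 160 is subtracted, following the description above.
    """
    to += pitch_period
    if to > FRAME_LEN:
        return to - FRAME_LEN, True
    return to, False
```

Early in a frame (small to) the result stays close to the previous frame's value, and it approaches the current frame's value as to approaches 160; passing a coefficient of 1.0 returns the current-frame value unchanged, which corresponds to disabling the interpolation.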

[0068] The pitch period calculator (141) receives the interpolated pitch period (o8) and jitter value (p8) and calculates the pitch period (q8) by the following equation.

  pitch period (q8) = pitch period (o8) × (1.0 - jitter value (p8) × random value)

Here, the random value lies in the range -1.0 to 1.0. This pitch period (q8) has a fractional part, so it is rounded to the nearest integer. The pitch period (q8) converted to an integer is referred to below as the integer pitch period (q8). According to the equation above, jitter is added in unvoiced or aperiodic frames, where the jitter value is set to a predetermined value (0.25 here), and no jitter is added in fully periodic frames, where the jitter value is set to 0. However, because the jitter value is interpolated for each pitch period, it can take any value from 0 to 0.25, so there are also pitch intervals to which an intermediate amount of jitter is added. As noted in the description of the MELP system, generating aperiodic pitch (pitch with jitter added) in this way represents the irregular (aperiodic) glottal pulses that occur in transitions and plosives, which reduces tonal noise.
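A small Python sketch of the jittered pitch calculation follows. The function name and the injectable random source are assumptions made for illustration; rounding half up is implemented with `floor(x + 0.5)` because Python's built-in `round` rounds halves to the nearest even integer.

```python
import math
import random

def jittered_pitch(pitch, jitter, rng=random):
    """Return the integer pitch period for one pitch interval.

    pitch  -- interpolated pitch period, in the role of (o8)
    jitter -- interpolated jitter value, in the role of (p8), between 0 and 0.25
    rng    -- object with a uniform(a, b) method (the random module by default)
    """
    r = rng.uniform(-1.0, 1.0)              # random value in -1.0 .. 1.0
    value = pitch * (1.0 - jitter * r)      # q8 = o8 * (1.0 - p8 * random)
    return int(math.floor(value + 0.5))     # round half up to an integer
```

With a jitter value of 0 the pitch period is returned unchanged (apart from rounding), and with 0.25 a pitch period of 50 samples can come out anywhere from 38 to 63 samples.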

[0069] The one-pitch waveform decoder (150) decodes and outputs the reproduced speech (b9) one integer pitch period (q8) at a time. Accordingly, every block contained in this block receives the integer pitch period (q8) and operates in synchronization with it. The pulse generator (142) outputs a single pulse signal (v8) within each integer pitch period (q8). The noise generator (143) outputs white noise (w8) whose length equals the integer pitch period (q8). The mixed sound source generator (144) mixes the single pulse signal (v8) and the white noise (w8) according to the interpolated mixing ratio (r8) of each subband and outputs the mixed sound source signal (x8).
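As a rough Python sketch of the two generators, the functions below produce one pitch period of each source. The pulse position (first sample), the unit pulse amplitude, and the Gaussian noise distribution are all assumptions; the text only states that there is a single pulse per period and that the noise is white and one period long.

```python
import random

def single_pulse(period):
    """One pitch period containing a single pulse (assumed: unit height, first sample)."""
    pulse = [0.0] * period
    pulse[0] = 1.0
    return pulse

def white_noise(period, rng=random):
    """One pitch period of white noise (assumed: zero-mean, unit-variance Gaussian)."""
    return [rng.gauss(0.0, 1.0) for _ in range(period)]
```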

[0070] FIG. 7 shows the configuration of the mixed sound source generator (144). First, the generation of the subband 1 mixed signal (q11) is described. LPF (low-pass filter) 1 (170) band-limits the single pulse signal (v8) to 0 to 1 kHz and outputs (a11). LPF 2 (171) band-limits the white noise (w8) to 0 to 1 kHz and outputs (b11). Multiplier 1 (178) and multiplier 2 (179) multiply (a11) and (b11) by sb1_p and sb1_n, respectively, which are contained in the mixing ratio information (r8), and output (i11) and (j11). Adder 1 (186) adds (i11) and (j11) and outputs the subband 1 mixed signal (q11). The subband 2 mixed signal (r11) is produced in the same way using BPF (band-pass filter) 1 (172), BPF 2 (173), multiplier 3 (180), multiplier 4 (181), and adder 2 (189). The subband 3 mixed signal (s11) is likewise produced using BPF 3 (174), BPF 4 (175), multiplier 5 (182), multiplier 6 (183), and adder 3 (190). The subband 4 mixed signal (t11) is likewise produced using HPF (high-pass filter) 1 (176), HPF 2 (177), multiplier 7 (184), multiplier 8 (185), and adder 4 (191). Adder 5 (192) adds the subband mixed signals (q11), (r11), (s11), and (t11) to synthesize the mixed sound source signal (x8).
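The filter, multiply, and add structure of FIG. 7 can be sketched in Python as below. The band filters are passed in as callables because their designs are not given in this passage; the function name and that interface are assumptions, and one callable per subband is applied to both the pulse and the noise, standing in for the paired filters (for example LPF 1 and LPF 2).

```python
def mix_sources(pulse, noise, sb_p, sb_n, band_filters):
    """Return the mixed source signal for one pitch period.

    pulse, noise -- equal-length sample lists, in the roles of (v8) and (w8)
    sb_p, sb_n   -- pulse and noise proportions for subbands 1 to 4
    band_filters -- four callables, each mapping a sample list to its
                    band-limited version (hypothetical stand-ins for the
                    LPF / BPF / HPF pairs)
    """
    mixed = [0.0] * len(pulse)
    for p_ratio, n_ratio, band in zip(sb_p, sb_n, band_filters):
        p_band = band(pulse)   # band-limited pulse, e.g. (a11) for subband 1
        n_band = band(noise)   # band-limited noise, e.g. (b11) for subband 1
        for i in range(len(mixed)):
            # multipliers followed by the per-band adder and the final adder
            mixed[i] += p_ratio * p_band[i] + n_ratio * n_band[i]
    return mixed
```

Keeping the filters as parameters makes the mixing logic testable on its own; in a real decoder each callable would be a fixed filter covering one of the four subbands.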

[0071] In FIG. 2, the second linear prediction coefficient calculator (linear prediction coefficient calculator 2) (147) converts the interpolated LSF coefficients (t8) into linear prediction coefficients and outputs the linear prediction coefficients (b10). The adaptive spectral enhancement filter (145) is a cascade of an adaptive pole/zero filter, whose coefficients are obtained by applying bandwidth expansion to the linear prediction coefficients (b10), and a spectral tilt correction filter, whose coefficient is the interpolated tilt correction coefficient (s8). As shown in Table 2, the adaptive pole/zero filter sharpens the formant resonances and brings them closer to the formants of natural speech, which improves the naturalness of the reproduced speech. In addition, the spectral tilt correction filter, whose coefficient is the interpolated tilt correction coefficient (s8), corrects the spectral tilt and reduces muffling of the sound. The mixed sound source signal (x8) is filtered by this adaptive spectral enhancement filter (145), and the result (y8) is output.
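A hedged Python sketch of such a cascade is given below. The passage does not state the filter equations, so the following are assumptions: the prediction polynomial is A(z) = 1 + a1 z^-1 + ... + ap z^-p, the pole/zero filter is A(z/alpha) / A(z/beta), the tilt filter is 1 + tilt z^-1, and the default alpha and beta are illustrative values only. Filter state is not carried across calls.

```python
def bandwidth_expand(lpc, gamma):
    """Bandwidth expansion: scale the i-th coefficient (i = 1..p) by gamma**i."""
    return [a * gamma ** (i + 1) for i, a in enumerate(lpc)]

def fir(x, coeffs):
    """y[n] = x[n] + sum_i coeffs[i-1] * x[n-i], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i, c in enumerate(coeffs, start=1):
            if n - i >= 0:
                acc += c * x[n - i]
        y.append(acc)
    return y

def iir(x, coeffs):
    """y[n] = x[n] - sum_i coeffs[i-1] * y[n-i], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i, c in enumerate(coeffs, start=1):
            if n - i >= 0:
                acc -= c * y[n - i]
        y.append(acc)
    return y

def spectral_enhancement(x, lpc, tilt, alpha=0.5, beta=0.8):
    """Pole/zero filter A(z/alpha)/A(z/beta) followed by tilt filter 1 + tilt*z^-1."""
    zeros = bandwidth_expand(lpc, alpha)
    poles = bandwidth_expand(lpc, beta)
    shaped = iir(fir(x, zeros), poles)
    return fir(shaped, [tilt])
```

When alpha equals beta the pole/zero section cancels and the signal passes through unchanged, which is a convenient sanity check for an implementation.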

[0072] The LPC synthesis filter (146) is an all-pole filter whose coefficients are the linear prediction coefficients (b10); it adds spectral envelope information to the sound source signal (y8) and outputs the resulting signal (z8). The gain adjuster (148) adjusts the gain of (z8) using the interpolated gain information (u8) and outputs (a9). The pulse dispersion filter (149) applies pulse dispersion to bring the pulse source waveform closer to the glottal pulse waveform of natural speech; it filters (a9) and outputs the reproduced speech (b9) with improved naturalness. The effect of this pulse dispersion filter is as shown in Table 2.
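The all-pole synthesis step can be sketched in Python as follows. The sign convention (synthesis filter 1/A(z) with A(z) = 1 + a1 z^-1 + ... + ap z^-p) and the plain scalar multiplication used for the gain are assumptions; the passage does not specify how the gain information is converted to a scale factor, and filter state is not carried between pitch periods here.

```python
def lpc_synthesis(excitation, lpc):
    """All-pole filter: s[n] = e[n] - sum_i lpc[i-1] * s[n-i], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(lpc, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out

def adjust_gain(signal, gain):
    """Scale the synthesized signal by a linear gain factor (assumed form)."""
    return [gain * s for s in signal]
```

For example, a unit impulse passed through a first-order filter with coefficient -0.5 decays as 1.0, 0.5, 0.25, 0.125, which is the impulse response of 1 / (1 - 0.5 z^-1).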

[0073] In the description above, the subband whose average spectral envelope amplitude is largest is determined, and the mixing ratio is determined accordingly. However, the average spectral envelope amplitude need not be the criterion, and another value may be used instead. The speech encoding device and speech decoding device of the present invention described above can readily be implemented with a DSP (digital signal processor). Furthermore, if the voiced/unvoiced flag (d8) is used instead of the high-band voiced/unvoiced flag (f8) as the control signal of switch 3 (167) in the pulse source / noise source mixing ratio calculator, a conventional (LPC system) speech encoder can be used as is. In addition, the number of quantization levels, the number of codeword bits, the speech encoding frame length, the orders of the linear prediction coefficients and LSF coefficients, the cutoff frequencies of the filters, and so on are not limited to the values used in the embodiment described above; values suited to each case may be adopted.

【００７４】[0074]

[Effects of the Invention] As described above, the speech encoding/decoding method and apparatus of the present invention reduce the buzz that degrades quality in the conventional system (LPC system), improving the quality of the reproduced speech, while allowing a lower bit rate than the conventional system (MELP system). Consequently, when used for wireless communication, they can improve frequency utilization efficiency.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of one embodiment of a speech encoder to which the speech encoding method of the present invention is applied.

FIG. 2 is a block diagram showing the configuration of one embodiment of a speech decoder to which the speech decoding method of the present invention is applied.

FIG. 3 is a diagram for explaining the relationship between pitch period and index.

FIG. 4 is a diagram for explaining the frequency of occurrence of pitch periods.

FIG. 5 is a diagram for explaining the cumulative frequency of pitch periods.

FIG. 6 is a block diagram showing a configuration example of the pulse source / noise source mixing ratio calculator in one embodiment of the speech decoder of the present invention.

FIG. 7 is a block diagram showing a configuration example of the mixed sound source generator in one embodiment of the speech decoder of the present invention.

FIG. 8 is a diagram for explaining the voiced strength of subbands 2, 3, and 4 (voiced frames).

FIG. 9 is a diagram for explaining the voiced strength of subbands 2, 3, and 4 (unvoiced frames).

FIG. 10 is a diagram showing the configuration of a conventional (LPC) speech encoder.

FIG. 11 is a diagram showing the configuration of a conventional (LPC) speech decoder.

FIG. 12 is a diagram for explaining the spectra of the LPC system and the MELP system.

FIG. 13 is a diagram showing the configuration of a conventional (MELP) speech encoder.

FIG. 14 is a diagram showing the configuration of a conventional (MELP) speech decoder.

[Explanation of symbols]

111: framer
112: gain calculator
113: quantizer
114: linear prediction analyzer
115: LSF coefficient calculator
116: quantizer
117: LPC analysis filter
118: peakiness calculator
119: correlation function corrector
120: low-pass filter
121: pitch detector
122: aperiodic flag generator
123: quantizer
124: aperiodic pitch index generator
125: bit packer
126: voiced/unvoiced determiner
127: periodic/aperiodic pitch and voiced/unvoiced information code generator
128: high-pass filter
129: correlation function calculator
130: voiced/unvoiced determiner
131: bit separator
132: voiced/unvoiced information and pitch period decoder
133: jitter setter
134: pulse source / noise source mixing ratio calculator
135: spectral envelope amplitude calculator
136: linear prediction coefficient calculator
137: tilt correction coefficient calculator
138: LSF decoder
139: gain decoder
140: parameter interpolator
141: pitch period calculator
142: pulse source generator
143: noise generator
144: mixed sound source generator
145: adaptive spectral enhancement filter
146: LPC synthesis filter
147: linear prediction coefficient calculator
148: gain adjuster
149: pulse dispersion filter
160: subband 1 voiced strength setter
161: subband 2, 3, 4 average amplitude calculator
162: subband selector
163: subband 2, 3, 4 voiced strength table (for voiced frames)
164: subband 2, 3, 4 voiced strength table (for unvoiced frames)
165 to 167: switches
168: mixing ratio calculator
170, 171: LPF
172 to 175: BPF
176, 177: HPF
178 to 185: multipliers
186, 189 to 192: adders

Continuation of front page
(56) References: JP-A-2000-267700 (JP, A); JP-A-11-143499 (JP, A); JP-A-10-20892 (JP, A); JP-A-7-295593 (JP, A); JP-A-4-116700 (JP, A)
(58) Fields searched (Int. Cl.7, DB name): G10L 19/06

Claims

(57) [Claims]

1. A speech decoding method for reproducing a speech signal from a speech information bit sequence output by a linear prediction analysis/synthesis speech encoder that has encoded the speech signal, the method comprising: separating and decoding the spectral envelope information, voiced/unvoiced identification information, pitch period information, and gain information contained in the speech information bit sequence; obtaining spectral envelope amplitudes from the spectral envelope information and determining, among bands into which the frequency axis is divided, the band in which the spectral envelope amplitude is largest; determining, for each band, based on the determined band and the voiced/unvoiced identification information, a mixing ratio for mixing a pitch pulse generated from the pitch period information with white noise; creating a mixed signal for each band based on the determined mixing ratio and then adding the mixed signals of all bands to create a mixed sound source signal; and generating reproduced speech by applying the spectral envelope information and the gain information to the mixed sound source signal.

2. A speech decoding method for reproducing a speech signal from a speech information bit sequence output by a linear prediction analysis/synthesis speech encoder that has encoded the speech signal, the bit sequence encoding spectral envelope information, voiced/unvoiced identification information of a low frequency band, voiced/unvoiced identification information of a high frequency band, pitch period information, and gain information, the method comprising: separating and decoding the spectral envelope information, the voiced/unvoiced identification information of the low frequency band, the voiced/unvoiced identification information of the high frequency band, the pitch period information, and the gain information contained in the speech information bit sequence; determining, based on the voiced/unvoiced identification information of the low frequency band, a mixing ratio for mixing a pitch pulse generated from the pitch period information with white noise, and creating a mixed signal of the low frequency band; obtaining spectral envelope amplitudes from the spectral envelope information and determining, among bands into which the high frequency band is divided on the frequency axis, the band in which the spectral envelope amplitude is largest; determining, for each band obtained by dividing the high frequency band, based on the determined band and the voiced/unvoiced identification information of the high frequency band, a mixing ratio for mixing the pitch pulse with white noise, creating a mixed signal for each such band, and then adding the mixed signals of all bands of the high frequency band to create a mixed signal of the high frequency band; adding the mixed signal of the low frequency band and the mixed signal of the high frequency band to create a mixed sound source signal; and generating reproduced speech by applying the spectral envelope information and the gain information to the mixed sound source signal.

3. A speech decoding method for reproducing a speech signal from a speech information bit sequence output by a linear prediction analysis/synthesis speech encoder that has encoded the speech signal, the bit sequence encoding spectral envelope information, voiced/unvoiced identification information of a low frequency band, voiced/unvoiced identification information of a high frequency band, pitch period information, and gain information, the method comprising: separating and decoding the spectral envelope information, the voiced/unvoiced identification information of the low frequency band, the voiced/unvoiced identification information of the high frequency band, the pitch period information, and the gain information contained in the speech information bit sequence; determining, based on the voiced/unvoiced identification information of the low frequency band, a mixing ratio for mixing white noise with a pitch pulse generated from the pitch period information linearly interpolated in synchronization with the pitch period; obtaining spectral envelope amplitudes from the spectral envelope information and determining, among bands into which the high frequency band is divided on the frequency axis, the band in which the spectral envelope amplitude is largest; determining, for each band obtained by dividing the high frequency band, based on the determined band and the voiced/unvoiced identification information of the high frequency band, a mixing ratio for mixing the pitch pulse with white noise; linearly interpolating, in synchronization with the pitch period, the spectral envelope information, the pitch period information, the gain information, the mixing ratio of the low frequency band, and the mixing ratio of each band obtained by dividing the high frequency band; mixing the pitch pulse and white noise using the interpolated mixing ratio of the low frequency band to create a mixed signal of the low frequency band; mixing the pitch pulse and white noise using the interpolated mixing ratio of each band obtained by dividing the high frequency band to create a mixed signal for each such band, and then adding the mixed signals of all bands of the high frequency band to create a mixed signal of the high frequency band; adding the mixed signal of the low frequency band and the mixed signal of the high frequency band to create a mixed sound source signal; and generating reproduced speech by applying the interpolated spectral envelope information and the interpolated gain information to the mixed sound source signal.

4. The speech decoding method according to claim 2, wherein the high frequency band is divided into three bands, and, when the voiced/unvoiced identification information of the high frequency band indicates voiced, the mixing ratio of each band obtained by dividing the high frequency band is determined such that: when the spectral envelope amplitude is largest in the lowest-frequency band or the second-lowest-frequency band, the proportion of the pitch pulse decreases monotonically as the frequency of the divided band increases; and when the spectral envelope amplitude is largest in the highest-frequency band, the proportion of the pitch pulse in the second-lowest-frequency band is made smaller than the proportion of the pitch pulse in the lowest-frequency band, and the proportion of the pitch pulse in the highest-frequency band is made larger by a predetermined amount than that in the second-lowest-frequency band.

5. The speech decoding method according to claim 2 or 3, wherein the high frequency band is divided into three bands, and, when the voiced/unvoiced identification information of the high frequency band indicates voiced, the mixing ratio of each band obtained by dividing the high frequency band is determined such that the voiced strength of the band in which the spectral envelope amplitude is largest among the three bands is larger than the voiced strength of that same band when the spectral envelope amplitude is largest in another band.

6. The speech decoding method according to claim 2 or 3, wherein the high frequency band is divided into three bands, and, when the voiced/unvoiced identification information of the high frequency band indicates unvoiced, the mixing ratio of each band obtained by dividing the high frequency band is determined such that the voiced strength of the band in which the spectral envelope amplitude is largest among the three bands is smaller than the voiced strength of that same band when the spectral envelope amplitude is largest in another band.

7. A framer which inputs a sampled and quantized voice sample at a predetermined sample frequency and outputs a predetermined number of voice samples for each voice-encoded frame having a predetermined time length; RM which is the level information of the audio sample for one frame
A gain calculator that calculates the logarithm of the RMS value and outputs the resulting logarithmic RMS value;
a first quantizer that linearly quantizes the logarithmic RMS value and outputs the resulting quantized logarithmic RMS value;
a linear prediction analyzer that performs linear prediction analysis on the one-frame audio samples and outputs linear prediction coefficients of a predetermined order, which constitute spectral envelope information;
an LSF coefficient calculator that converts the linear prediction coefficients into LSF (Line Spectrum Frequencies) coefficients and outputs them;
a second quantizer that quantizes the LSF coefficients and outputs the resulting LSF parameter index;
a low-pass filter that filters the one-frame audio samples at a predetermined cutoff frequency and outputs the low-frequency component of the input signal;
a pitch detector that extracts a pitch period from the low-frequency component of the input signal by calculating a normalized autocorrelation function, and outputs the pitch period and the maximum value of the normalized autocorrelation function;
a third quantizer that logarithmically transforms the pitch period, linearly quantizes the result at a first predetermined number of levels, and outputs the resulting pitch period index;
an aperiodic flag generator that receives the maximum value of the normalized autocorrelation function, sets an aperiodic flag to ON if that value is smaller than a predetermined threshold and to OFF otherwise, and outputs the aperiodic flag;
an LPC analysis filter that uses the linear prediction coefficients as its coefficients to remove the spectral envelope information from the one-frame audio samples, and outputs the resulting residual signal;
a peakiness calculator that receives the residual signal, and calculates and outputs a peakiness value;
a correlation function corrector that corrects the maximum value of the normalized autocorrelation function according to the peakiness value and outputs the corrected maximum value of the normalized autocorrelation function;
a first voiced / unvoiced determiner that determines the frame to be unvoiced if the corrected maximum value of the normalized autocorrelation function is equal to or less than a predetermined threshold and voiced otherwise, and outputs the resulting voiced / unvoiced flag;
an aperiodic pitch index generator that non-uniformly quantizes, at a second predetermined number of levels, the pitch period of a frame whose aperiodic flag indicates aperiodicity, and outputs the resulting aperiodic pitch index;
a periodic / aperiodic pitch / voiced / unvoiced information code generator that receives the voiced / unvoiced flag, the aperiodic flag, the pitch period index, and the aperiodic pitch index, and outputs a periodic / aperiodic pitch / voiced / unvoiced information code in which they are encoded with a predetermined number of bits;
a high-pass filter that filters the one-frame audio samples at a predetermined cutoff frequency and outputs the high-frequency component of the input signal;
a correlation function calculator that calculates, from the high-frequency component of the input signal, the normalized autocorrelation function at the delay given by the pitch period, and outputs it;
a second voiced / unvoiced determiner that determines the high-frequency component to be unvoiced if the maximum value of the normalized autocorrelation function is equal to or less than a predetermined threshold and voiced otherwise, and outputs the resulting high-frequency voiced / unvoiced flag; and
a bit packing unit that receives the quantized logarithmic RMS value, the LSF parameter index, the periodic / aperiodic pitch / voiced / unvoiced information code, and the high-frequency voiced / unvoiced flag, performs bit packing for each frame, and outputs an audio information bit sequence.
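
As an illustration only, several of the encoder-side elements recited above (pitch detection by normalized autocorrelation, the aperiodic flag, the peakiness value, and logarithmic pitch quantization) might be sketched in Python as follows. The function names, the lag range, the threshold, and the number of levels are hypothetical and are not taken from the claim. The peakiness formula (RMS divided by mean absolute value) follows the usual MELP-style definition and is an assumption here, since the claim does not define it.

```python
import math

def normalized_autocorr(x, lag):
    """Normalized autocorrelation of x at the given lag (0.0 if energy is zero)."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    e0 = sum(x[i] * x[i] for i in range(n))
    e1 = sum(x[i + lag] * x[i + lag] for i in range(n))
    den = math.sqrt(e0 * e1)
    return num / den if den > 0.0 else 0.0

def detect_pitch(x, min_lag, max_lag):
    """Return (pitch period, maximum normalized autocorrelation) over the lag range."""
    best_lag, best_r = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        r = normalized_autocorr(x, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

def aperiodic_flag(r_max, threshold):
    """ON (True) when the maximum correlation is below the (hypothetical) threshold."""
    return r_max < threshold

def peakiness(residual):
    """RMS divided by mean absolute value of the residual (assumed definition)."""
    n = len(residual)
    rms = math.sqrt(sum(v * v for v in residual) / n)
    mean_abs = sum(abs(v) for v in residual) / n
    return rms / mean_abs if mean_abs > 0.0 else 0.0

def quantize_log_pitch(pitch, min_pitch, max_pitch, levels):
    """Uniformly quantize log(pitch) to an index in 0 .. levels - 1."""
    lo, hi = math.log(min_pitch), math.log(max_pitch)
    step = (hi - lo) / (levels - 1)
    idx = round((math.log(pitch) - lo) / step)
    return max(0, min(levels - 1, idx))
```

Quantizing the logarithm of the pitch period rather than the period itself gives finer resolution at short periods, which is one plausible reason for the logarithmic transformation recited for the third quantizer.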

8. An audio decoding device for decoding, frame by frame, the audio information bit sequence encoded by the audio encoding device according to claim 7, the audio decoding device comprising:
a bit separator that separates the audio information bit sequence for each frame into its parameters and outputs a periodic / aperiodic pitch / voiced / unvoiced information code, a quantized logarithmic RMS value, an LSF parameter index, and a high-frequency voiced / unvoiced flag;
a voiced / unvoiced information / pitch period decoder that receives the periodic / aperiodic pitch / voiced / unvoiced information code and, if the state of the current frame is unvoiced, sets the pitch period to a predetermined value and outputs it with the voiced / unvoiced flag set to 0, and otherwise decodes the pitch period according to the encoding rules and outputs it with the voiced / unvoiced flag set to 1;
a jitter setting unit that sets and outputs a jitter value when the current frame is unvoiced or aperiodic, and sets the jitter value to 0 and outputs it when the current frame is periodic;
an LSF decoder that decodes the LSF coefficients of the predetermined order from the LSF parameter index and outputs them;
a tilt correction coefficient calculator that calculates a tilt correction coefficient from the LSF coefficients and outputs it;
a gain decoder that decodes the quantized logarithmic RMS value and outputs gain information;
a first linear prediction coefficient calculator that converts the LSF coefficients into linear prediction coefficients and outputs them;
a spectral envelope amplitude calculator that calculates a spectral envelope amplitude from the linear prediction coefficients and outputs it;
a pulse sound source / noise sound source mixing ratio calculator that receives the voiced / unvoiced flag, the high-frequency voiced / unvoiced flag, and the spectral envelope amplitude, and determines and outputs mixing ratio information of a pulse sound source and a noise sound source for each band divided on the frequency axis (hereinafter referred to as a "subband");
a parameter interpolator that linearly interpolates the pitch period, the mixing ratio information, the jitter value, the LSF coefficients, the tilt correction coefficient, and the gain information in synchronization with the pitch period, and outputs the interpolated pitch period, mixing ratio information, jitter value, LSF coefficients, tilt correction coefficient, and gain information;
a pitch period calculator that receives the interpolated pitch period and the interpolated jitter value, adds jitter to the interpolated pitch period, and outputs a pitch period converted into an integer value (hereinafter referred to as an "integer pitch period"); and
a one-pitch waveform decoder that decodes and outputs reproduced audio for the integer pitch period in synchronization with the integer pitch period,
wherein the one-pitch waveform decoder comprises:
a single pulse generator that outputs a single pulse signal within the integer pitch period;
a noise generator that outputs white noise with a length equal to the integer pitch period;
a mixed sound source generator that, based on the interpolated mixing ratio information, mixes the single pulse signal and the white noise for each subband and then combines the subbands to output a mixed sound source signal;
a second linear prediction coefficient calculator that calculates linear prediction coefficients from the interpolated LSF coefficients;
an adaptive spectral enhancement filter that is a cascade connection of an adaptive pole / zero filter, whose coefficients are obtained by applying bandwidth expansion processing to the linear prediction coefficients, and a spectral tilt correction filter, whose coefficient is the interpolated tilt correction coefficient, and that outputs a sound source signal with an improved spectrum;
an LPC synthesis filter that is an all-pole filter using the linear prediction coefficients as its coefficients, adds spectral envelope information to the sound source signal with the improved spectrum, and outputs the signal to which the spectral envelope information has been added;
a gain adjuster that adjusts the gain of the signal to which the spectral envelope information has been added, using the gain information, and outputs a reproduced audio signal; and
a pulse spreading filter that applies pulse spreading processing to the reproduced audio signal and outputs the pulse-spread reproduced audio signal.
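
As an illustration only, the pitch-synchronous linear interpolation and the derivation of the integer pitch period recited above might be sketched in Python as follows. The function names and the 160-sample frame length in the usage note are hypothetical. The claim states only that jitter is added to the interpolated pitch period; scaling the period by (1 + jitter * u), with u drawn uniformly from [-1, 1], is a MELP-style assumption and not something the claim specifies.

```python
import random

def lerp(prev, cur, t):
    """Linear interpolation between previous-frame and current-frame values, t in [0, 1]."""
    return prev + (cur - prev) * t

def integer_pitch_period(pitch, jitter, rng):
    """Perturb the interpolated pitch period by the jitter value and round to an integer.

    Assumption: the perturbation is multiplicative, pitch * (1 + jitter * u).
    """
    u = rng.uniform(-1.0, 1.0)
    return max(1, round(pitch * (1.0 + jitter * u)))

def pitch_periods_for_frame(prev_pitch, cur_pitch, prev_jitter, cur_jitter,
                            frame_len, rng):
    """Walk one frame pitch-synchronously and return the integer pitch periods used."""
    periods = []
    pos = 0
    while pos < frame_len:
        t = pos / frame_len                      # interpolation position in the frame
        pitch = lerp(prev_pitch, cur_pitch, t)   # interpolated pitch period
        jitter = lerp(prev_jitter, cur_jitter, t)  # interpolated jitter value
        period = integer_pitch_period(pitch, jitter, rng)
        periods.append(period)
        pos += period                            # next one-pitch waveform starts here
    return periods
```

For example, `pitch_periods_for_frame(40, 40, 0.0, 0.0, 160, random.Random(0))` yields four periods of 40 samples for a periodic frame, whereas a nonzero jitter value makes successive periods vary around the interpolated pitch period.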