JP3490324B2 - Acoustic signal encoding device, decoding device, these methods, and program recording medium

Info

Publication number: JP3490324B2
Application number: JP03542099A
Authority: JP
Inventors: 仲大室; 一則間野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-02-15
Filing date: 1999-02-15
Publication date: 2004-01-26
Anticipated expiration: 2019-02-15
Also published as: JP2000235400A

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a high-efficiency acoustic signal encoding method that digitally encodes a signal sequence of an acoustic signal, such as speech, with a small amount of information, to the corresponding decoding method, and to devices that implement these methods.

[0002]

[Prior art] High-efficiency speech coding methods are used in digital mobile communication to make efficient use of radio spectrum, and in speech or music storage services to make efficient use of communication lines and storage media. A method currently proposed for high-efficiency speech coding divides the original speech into fixed-length sections of about 5 to 50 ms called frames or subframes (hereinafter collectively "frames"), separates the speech of each frame into two pieces of information, namely the characteristics of a linear filter that represents the envelope of the frequency spectrum and an excitation signal that drives this filter, and encodes each of them. Within this approach, a known way to encode the excitation signal is to separate it into a periodic component, considered to correspond to the pitch period (fundamental frequency) of the speech, and a remaining component, and to encode the two separately. Code-Excited Linear Prediction (CELP) is an example of such an excitation coding method. Details of this technique are described in M. R. Schroeder and B. S. Atal, "Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates", IEEE Proc. ICASSP-85, pp. 937-940, 1985.

[0003] FIG. 3 shows a configuration example of the above encoding method. For speech x input at input terminal 11, linear prediction analysis unit 12 calculates linear prediction parameters a that represent the frequency spectrum envelope of the input speech. The resulting linear prediction parameters a are quantized and encoded in linear prediction parameter encoding unit 13; the quantized values are converted into synthesis filter coefficients aq and sent to synthesis filter 14, and the code ba is sent to code sending unit 15. When the distortion calculation uses spectral information of the input speech, for example to take auditory characteristics into account, the linear prediction parameters a or the quantized linear prediction parameters aq are also sent to waveform distortion calculation unit 16. Details of linear prediction analysis and examples of encoding linear prediction parameters are described in, for example, Sadaoki Furui, "Digital Speech Processing" (Tokai University Press). Linear prediction analysis unit 12, linear prediction parameter encoding unit 13, and synthesis filter 14 may each be replaced with a nonlinear counterpart.

[0004] Excitation vector generation unit 17 generates excitation vector candidates one frame long and sends them to synthesis filter 14. Unit 17 generally consists of adaptive codebook 18 and fixed codebook 19. Adaptive codebook 18 cuts out, at a length corresponding to a certain period, the immediately preceding excitation vector stored in a buffer (the already quantized excitation of the preceding one to several frames) and repeats the cut-out vector until it reaches the frame length, thereby outputting a time-series vector candidate that corresponds to the periodic component of the speech. The "certain period" is chosen so that the distortion in waveform distortion calculation unit 16 becomes small, and the chosen period generally corresponds to the pitch period of the speech. Fixed codebook 19 outputs time-series code vector candidates one frame long that correspond to the aperiodic component of the speech. These candidates are independent of the input speech. Each is either one of a predetermined number of stored candidate vectors, the number depending on the bits allotted to coding, or one of the vectors generated by placing pulses according to a predetermined generation rule. Although fixed codebook 19 nominally corresponds to the aperiodic component of speech, in speech sections with strong pitch periodicity, vowel sections in particular, the fixed code vector may be obtained by applying to the prepared candidate vector a comb filter whose period corresponds to the pitch period or to the pitch used in the adaptive codebook, or by cutting out and repeating the vector as in the adaptive codebook. The time-series vector candidates ca and cr output from adaptive codebook 18 and fixed codebook 19 are multiplied in multipliers 21 and 22 by gain candidates ga and gr output from gain codebook 23, and summed in adder 24 to form excitation vector candidate c. In the configuration of FIG. 3, only adaptive codebook 18 or only fixed codebook 19 may be used during actual operation.
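The construction of an excitation candidate from the two codebooks can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the toy buffer, and the toy fixed code vector are assumptions, and the adaptive codebook is simplified to "take the last `lag` samples and repeat them".

```python
def adaptive_codebook_vector(past_excitation, lag, frame_len):
    """Cut the last `lag` samples of the past excitation and repeat them to frame length."""
    if not 0 < lag <= len(past_excitation):
        raise ValueError("lag must be between 1 and the buffer length")
    segment = past_excitation[-lag:]
    return [segment[i % lag] for i in range(frame_len)]


def excitation_candidate(ca, cr, ga, gr):
    """Gain-scale the two codebook vectors and add them: c = ga*ca + gr*cr."""
    return [ga * a + gr * r for a, r in zip(ca, cr)]


# Toy data (hypothetical): a 6-sample excitation buffer and an 8-sample frame.
past = [0.0, 1.0, -1.0, 0.5, 2.0, -2.0]
ca = adaptive_codebook_vector(past, lag=3, frame_len=8)
cr = [1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0]  # pulse-type fixed code vector
c = excitation_candidate(ca, cr, ga=0.5, gr=0.25)
```

In an encoder, `lag`, the fixed code vector, and the gain pair would each be chosen by the codebook search rather than fixed in advance as they are here.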

[0005] Synthesis filter 14 is a linear filter whose coefficients are the synthesis filter coefficients obtained from the linear prediction parameters aq quantized in linear prediction parameter encoding unit 13. It takes excitation vector candidate c as input and outputs reproduced speech candidate y. The order of synthesis filter 14, that is, the order of the linear prediction analysis, is typically about 10 to 16. As already noted, synthesis filter 14 may also be a nonlinear filter.
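For the linear case, a synthesis filter of this kind is commonly realized as an all-pole recursion. The sketch below assumes the convention A(z) = 1 + Σ aq[k] z^-k, so that y[n] = c[n] - Σ aq[k] y[n-k]; the patent text does not fix a sign convention, and the function name and first-order example are illustrative only.

```python
def synthesis_filter(excitation, aq, state=None):
    """All-pole filter: y[n] = c[n] - sum_k aq[k-1] * y[n-k] (sign convention assumed)."""
    order = len(aq)
    # history holds y[n-1], y[n-2], ... so the filter can continue across frames
    history = list(state) if state is not None else [0.0] * order
    output = []
    for sample in excitation:
        y = sample - sum(a * h for a, h in zip(aq, history))
        output.append(y)
        history = ([y] + history)[:order]
    return output, history


# First-order toy example: impulse response of 1 / (1 - 0.5 z^-1).
y, state = synthesis_filter([1.0, 0.0, 0.0, 0.0], aq=[-0.5])
```

Returning the filter state lets a caller pass it back in for the next frame; a practical coder with order 10 to 16 would do this so that frames join without discontinuities.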

[0006] Waveform distortion calculation unit 16 calculates the distortion d between reproduced speech candidate y, which is the output of synthesis filter 14, and input speech x. This calculation often takes into account the coefficients aq of synthesis filter 14 or the unquantized linear prediction coefficients a, perceptual weighting being a typical example. Codebook search control unit 25 selects the excitation code bc, that is, the period code, fixed (noise) code, and gain code, for which the distortion between reproduced speech candidate y and input speech x is minimal or nearly minimal, and thereby determines the excitation vector for the frame.
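The search described here is an analysis-by-synthesis loop: synthesize each candidate, measure its distortion against the input, keep the best. The following sketch assumes plain squared error with no perceptual weighting and an exhaustive search; `search_codebook` and the `synthesize` callback are hypothetical names.

```python
def squared_error(x, y):
    """Unweighted waveform distortion d between input x and candidate y."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def search_codebook(x, candidates, synthesize):
    """Return (index, distortion) of the candidate whose synthesized output is closest to x."""
    best_index, best_d = -1, float("inf")
    for index, c in enumerate(candidates):
        d = squared_error(x, synthesize(c))
        if d < best_d:
            best_index, best_d = index, d
    return best_index, best_d


# Toy example with an identity "synthesis filter" (assumption, for readability).
x = [0.9, 0.8]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
index, d = search_codebook(x, candidates, synthesize=lambda c: c)
```

Practical coders usually search the adaptive codebook, fixed codebook, and gains sequentially rather than jointly, which is one reason the text allows a "nearly minimal" distortion.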

[0007] The excitation code bc (period code, fixed code, gain code) determined by codebook search control unit 25 and the linear prediction parameter code ba output by linear prediction parameter encoding unit 13 are sent to code sending unit 15 and, depending on the application, either stored in a storage device or sent to the receiving side over a communication channel. FIG. 4 shows a configuration example of the decoding method corresponding to the above encoding method. Of the codes received by code receiving unit 31 from the transmission path or storage medium, the linear prediction parameter code ba is decoded into synthesis filter coefficients aq in linear prediction parameter decoding unit 32 and sent to synthesis filter 33 and, where needed, to post-processing unit 34 (also called a postfilter). The excitation code bc is sent to excitation vector generation unit 35, which generates the excitation vector c corresponding to the code. Synthesis filter 33 takes excitation vector c as input and outputs synthesized speech y, and post-processing unit 34 applies spectral enhancement and pitch enhancement to y to reduce quantization noise perceptually. Because post-processing is a form of speech enhancement, it may be omitted depending on the processing budget or the characteristics of the input signal. In excitation vector generation unit 35, the period code in bc selects time-series vector ca from adaptive codebook 36 and the fixed code selects time-series vector cr from fixed codebook 37. Vectors ca and cr are multiplied in multipliers 38 and 39 by gains ga and gr retrieved from gain codebook 41 according to the gain code, summed in adder 42, and input to synthesis filter 33 as excitation vector c. As noted above, if only adaptive codebook 18 or only fixed codebook 19 is used for encoding during actual operation, then correspondingly only adaptive codebook 36 or only fixed codebook 37 is used in FIG. 4.

[0008]

[Problem to be solved by the invention] Coding schemes based on a speech production model, CELP-type schemes among them, have the following problem. When a speech signal recorded in a quiet environment without background noise is input, high-quality coding can be achieved with a small amount of information. When speech recorded in an environment with background noise, such as an office or a street, is input, however, very unpleasant squealing or crackling sounds are reproduced. The cause is that CELP-type coding models, which exploit pitch periodicity, are based on a model of speech production, whereas background noise has properties different from those of speech. Specifically, the adaptive codebook outputs a signal component corresponding to the pitch period of speech, but background noise generally has no pitch periodicity, so unnatural periodic sounds arise in background noise sections. In speech sections with superimposed background noise, the signal is by nature the sum of speech with pitch periodicity and noise without it, yet a coding model that emphasizes pitch periodicity is applied, so the background noise component again turns into an unnatural periodic sound superimposed on the speech. When the fixed codebook is made periodic at the pitch period, this pitch periodization is a further cause of unnatural periodic sounds. When the structure of the adaptive or fixed codebook does not suit the nature of the signal in this way, the method of determining the gains also becomes a problem. Conventional gain determination presupposes that the properties of the excitation vectors output from the adaptive and fixed codebooks match those of the input signal well. When they do not match, the conventional method yields a signal that fluctuates unnaturally.

[0009] Two typical remedies for this problem are noise reduction and the comfort noise generator. The former applies noise reduction to the input signal as preprocessing to reduce the background noise component relative to the speech, and the unpleasant sounds in the reproduced signal decrease by as much as the noise is reduced. However, noise reduction does not remove the noise completely, so the unpleasant sounds cannot be eliminated. Moreover, when the background noise is nonstationary, a sufficient noise reduction effect is itself hard to obtain. The latter, the comfort noise generator, encodes speech sections with the CELP-type scheme as is, and in noise sections generates and substitutes suitable "comfortable" noise such as white noise. With this method unpleasant squealing sounds are no longer reproduced, but whatever the character of the actual background, whether an office or a crowd, the noise sections of the reproduced sound always have the same character, so the information in the background sound is not conveyed to the receiving side. In addition, when the background noise level is high, it is difficult to switch between speech sections and noise sections without error. Section detection errors can actually degrade the reproduced sound, and even without detection errors the difference in character between speech sections and noise sections is often so large that the result sounds discontinuous.

[0010] An object of the present invention is to provide, for speech coding schemes based on a speech production model, CELP-type schemes among them, encoding and decoding methods and devices that avoid reproducing unpleasant sounds and convey the character of the background sound to the receiving side, thereby achieving more natural reproduced sound.

[0011]

[Means for solving the problem] The present invention views the characteristics of background noise from two standpoints: short-time characteristics (short-time fluctuation components), such as a clunk, a passing car, footsteps, or distant voices, and average long-time characteristics (long-time fluctuation components), such as a steady hubbub or the hum of a rotating motor. Information on both characteristics is carried in the transmitted parameters within the framework of the CELP-type coding model and sent to the receiving side. The reproducing side generates signals that reproduce both characteristics and mixes them, so that natural reproduced sound is output even for speech input with background sound. The key points of the invention are that the noise characteristics are conveyed within a limited amount of information (bit rate) without substantially changing the framework of a coding model specialized for speech, and that detection of noise sections need not be very strict.

[0012] The decoding method of the present invention applies to a speech decoding method in which code vectors taken from an adaptive codebook, a fixed codebook, or both, in units of frames or subframes (hereinafter collectively "frames"), are multiplied by gains taken from a gain codebook to generate an excitation vector, and a synthesis filter is driven by that excitation vector to generate a speech signal or acoustic signal (hereinafter collectively "speech signal"). In this method, information on whether the current frame is a background noise section is received; in frames within background noise sections, the power of the synthesis filter output signal is measured and a power level representing its long-time average is calculated; in frames within background noise sections, an average spectrum representing the long-time average of the spectral parameters that represent the synthesis filter coefficients is calculated; a filter representing the characteristics of the average spectrum is driven by white noise, and the resulting signal is amplitude-adjusted based on the measured power level to generate a stationary component signal of the background noise; and the generated stationary component signal is added to the output signal of the synthesis filter, regardless of whether the frame is a background noise section, to generate the reproduced speech.
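The generation of the stationary component can be sketched as follows. Several choices here are assumptions made for illustration: the average spectrum is represented directly as all-pole filter coefficients, the amplitude is set by normalizing each frame to the target power, the filter state is reset every frame, and all names are hypothetical.

```python
import math
import random


def frame_power(v):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in v) / len(v)


def stationary_component(avg_coeffs, target_power, frame_len, rng):
    """Drive an all-pole 'average spectrum' filter with white noise, then scale to target_power."""
    history = [0.0] * len(avg_coeffs)
    shaped = []
    for _ in range(frame_len):
        y = rng.gauss(0.0, 1.0) - sum(a * h for a, h in zip(avg_coeffs, history))
        shaped.append(y)
        history = ([y] + history)[:len(avg_coeffs)]
    power = frame_power(shaped)
    scale = math.sqrt(target_power / power) if power > 0.0 else 0.0
    return [scale * s for s in shaped]


rng = random.Random(0)
u = stationary_component(avg_coeffs=[-0.9, 0.2], target_power=0.01, frame_len=160, rng=rng)
synth_out = [0.0] * 160                          # stand-in for the synthesis filter output
reproduced = [y + n for y, n in zip(synth_out, u)]  # added in every frame, noise section or not
```

A fuller implementation would carry the filter state across frames and update the average coefficients and power level only in background noise frames.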

[0013] In addition, a stationary sound with a fixed spectral characteristic and a fixed amplitude, independent of the characteristics of the input signal, is added to the synthesis filter output as part of the background noise stationary component signal, together with the signal generated by driving the average-spectrum filter with white noise, to generate the reproduced speech. Further, information on whether the frame is a background noise section or a consonant section is received, the power of the excitation vector is measured for each frame, and a background noise fluctuation component signal, generated by multiplying white noise by an amplitude determined from the excitation vector power, is added to the excitation vector when the frame is a background noise section or a consonant section.

[0014] The encoding method of the present invention applies to an encoding method in which code vectors taken from an adaptive codebook, a fixed codebook, or both, in units of frames or subframes (hereinafter collectively "frames"), are multiplied by gains taken from a gain codebook to generate an excitation vector, and the speech signal or acoustic signal (hereinafter collectively "speech signal") generated by driving a synthesis filter is compared with the input speech signal to select an adaptive code, a fixed code, and a gain code. In this method, the input signal is analyzed to determine whether the current frame corresponds to a background noise section or a consonant section. When it does, the gain codebook is searched and the optimal gain code is selected using either a weighted sum of two distance measures, one based on minimizing the waveform distortion between the synthesis filter output and the input signal and the other based on minimizing the power level difference between the synthesis filter output and the input signal, or the power-level-difference measure alone.
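A gain search with such a combined distance measure might look like the sketch below. The exact form of the power-level measure is not given in this passage, so the squared RMS difference used here is an assumption, as are the mixing parameter `mix` and all function names.

```python
import math


def waveform_distortion(x, y):
    """Squared-error waveform distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rms(v):
    return math.sqrt(sum(s * s for s in v) / len(v))


def power_level_distance(x, y):
    """Assumed power-level measure: squared difference of RMS levels."""
    return (rms(x) - rms(y)) ** 2


def select_gain(x, ya, yr, gain_codebook, mix):
    """mix=0: waveform matching only; mix=1: power-level matching only."""
    best_index, best_d = -1, float("inf")
    for index, (ga, gr) in enumerate(gain_codebook):
        y = [ga * a + gr * r for a, r in zip(ya, yr)]
        d = (1.0 - mix) * waveform_distortion(x, y) + mix * power_level_distance(x, y)
        if d < best_d:
            best_index, best_d = index, d
    return best_index


# Noise-like toy input whose shape the synthesized vector ya cannot match.
x = [1.0, -1.0, 1.0, -1.0]
ya = [1.0, 1.0, -1.0, -1.0]   # synthesized adaptive contribution (orthogonal to x)
yr = [0.0, 0.0, 0.0, 0.0]     # fixed contribution unused in this example
gains = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
waveform_choice = select_gain(x, ya, yr, gains, mix=0.0)
power_choice = select_gain(x, ya, yr, gains, mix=1.0)
```

In this toy case pure waveform matching picks the zero gain, which silences the frame, while the power-level measure picks the gain that preserves the input level. This is one plausible reading of why a power-based measure helps when the excitation does not fit the signal.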

[0015] The present invention may be implemented in hardware using a dedicated signal processing processor, or in software in the form of a computer program.

[0016]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a functional configuration example of the encoding method of the invention; parts corresponding to those in FIG. 3 carry the same reference numerals. FIG. 2 shows a functional configuration example of the decoding method of the invention; parts corresponding to those in FIG. 4 carry the same reference numerals. To convey the idea of the invention more clearly, the decoding method of FIG. 2 is described first. The invention applies not only to speech signals but also to acoustic signals such as music signals; the description below uses speech signals as the representative case.

[0017] The speech signal reproduced by the decoding method of the invention is composed of all or some of the signals generated by excitation vector generation unit 35, background noise fluctuation component generation unit 44, and background noise stationary component generation unit 45, each of which is outlined with a dash-dot line in FIG. 2. The conventional decoding method of FIG. 4 has only excitation vector generation unit 35 of these three, so the reproduced sound was not sufficiently natural when speech with background noise was encoded and decoded. As noted above, this is because the model of excitation vector generation unit 35 is based on a model of the human speech production mechanism and is not necessarily suited to modeling speech with background noise. By contrast, the present invention is based on a model in which a reproduced signal expressed by a noise model is added to the reproduced signal expressed by the speech production model (the component due to the excitation vector). The added noise is modeled by two components: a "background noise fluctuation component", whose characteristics change over short times, and a "background noise stationary component", whose characteristics change slowly over time or do not change.

[0018] First, the code transmitted or stored by the encoder is separated in code receiving unit 31 into excitation code bc, linear prediction parameter code ba, and mode information bm. Mode information bm classifies each frame of the noisy speech signal into a section type such as vowel section, consonant section, silent section, or background noise section. Because it is difficult to assign an input speech signal with background noise to these sections accurately, ambiguous frames may be marked as unclassified.

[0019] As in the conventional method, excitation vector generation unit 35 consists of adaptive codebook 36 and fixed codebook 37, which output adaptive code vector ca and fixed code vector cr corresponding respectively to the period code and the fixed code in the received excitation code. Vectors ca and cr are multiplied by gains ga and gr, which gain codebook 41 outputs according to the gain code in the received excitation code, and are then summed to form excitation vector c.

[0020] The vector output from background noise fluctuation component generation unit 44 corresponds to the component of the background noise whose power and spectral characteristics change from frame to frame. White noise generation unit 51 outputs a white noise sequence vector s1 one frame long. White noise is generally generated from Gaussian random numbers, but uniform random numbers may be used instead, as may approximate methods such as storing a random sequence in a table in advance and cutting it out one frame length at a time.
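The three white noise sources mentioned here (Gaussian, uniform, and a pre-stored table read one frame at a time) can be sketched as below. The function and class names are hypothetical, and wrapping around at the end of the table is an assumption about how a finite table would be reused.

```python
import random


def gaussian_noise_frame(frame_len, rng):
    """One frame of zero-mean, unit-variance Gaussian white noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(frame_len)]


def uniform_noise_frame(frame_len, rng):
    """One frame of uniform white noise in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(frame_len)]


class TableNoise:
    """Reads successive frames from a pre-stored random sequence, wrapping at the end."""

    def __init__(self, table):
        self.table = list(table)
        self.pos = 0

    def next_frame(self, frame_len):
        n = len(self.table)
        frame = [self.table[(self.pos + i) % n] for i in range(frame_len)]
        self.pos = (self.pos + frame_len) % n
        return frame


rng = random.Random(0)
g = gaussian_noise_frame(160, rng)
u = uniform_noise_frame(160, rng)
table = TableNoise([1.0, 2.0, 3.0, 4.0, 5.0])   # toy table; a real one would be much longer
first, second = table.next_frame(3), table.next_frame(3)
```

The table variant trades randomness for determinism and low computational cost, which can matter on fixed-point signal processors.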

[0021] Excitation vector c output from excitation vector generation unit 35 is input to excitation power measurement unit 52, which measures the power Pc of c in the current frame. Based on the measured power Pc and mode information bm, weight generation unit 53 determines weights ws1 and wc, by which white noise vector s1 and excitation vector c are respectively multiplied. After s1 and c are multiplied by ws1 and wc in multipliers 54 and 55, they are summed in adder 56 to form cs1, the input vector to synthesis filter 33.

[0022] Weight generation unit 53 determines weights ws1 and wc so that the power of synthesis filter input cs1 equals the power of excitation vector c (an exact match is not required, but the two should sound equally loud). When the frame is a vowel section, or when its section type is unknown, ws1 is set to 0 or to a small value near 0. In vowel sections the input speech is represented well enough by excitation vector generation unit 35, which is based on the speech production model, so no background noise fluctuation component needs to be added. When the classification of the section is uncertain, it is likewise safer not to add the fluctuation component.
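One way to choose ws1 and wc so that the mixed vector keeps the power Pc is sketched below. It assumes s1 and c are uncorrelated, so that their powers add, and it uses a hypothetical table giving the fraction of power taken from noise for each mode; the text gives no numeric values, so those shown are placeholders.

```python
import math

# Hypothetical fraction of the output power taken from white noise, per mode.
# Vowel and unclassified frames get no noise, as the text recommends.
NOISE_RATIO = {"consonant": 0.3, "background_noise": 0.6}


def make_weights(pc, ps, mode):
    """Return (ws1, wc) such that ws1^2 * ps + wc^2 * pc == pc for uncorrelated s1 and c."""
    ratio = NOISE_RATIO.get(mode, 0.0)
    if pc <= 0.0 or ps <= 0.0:
        return 0.0, 1.0
    return math.sqrt(ratio * pc / ps), math.sqrt(1.0 - ratio)


ws1_vowel, wc_vowel = make_weights(pc=4.0, ps=1.0, mode="vowel")
ws1_noise, wc_noise = make_weights(pc=4.0, ps=1.0, mode="background_noise")
mixed_power = ws1_noise ** 2 * 1.0 + wc_noise ** 2 * 4.0
```

Since only perceptual equality of power is required, an implementation could also normalize the mixed vector after the fact instead of relying on the uncorrelatedness assumption.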

【００２３】図２では背景雑音変動成分生成部４４の出
力は、合成フィルタ３３の手前で駆動音源ベクトルに加
算されているが、合成フィルタ３３が線形フィルタの場
合には、背景雑音変動成分生成部４４の出力を合成フィ
ルタに通したものと、駆動音源ベクトルを合成フィルタ
３３に通したものを加算しても（即ち合成してから加算
しても）結果は等価である。In FIG. 2, the output of the background noise variation component generation unit 44 is added to the driving sound source vector before the synthesis filter 33. However, when the synthesis filter 33 is a linear filter, the result is equivalent if the output of the background noise variation component generation unit 44 passed through the synthesis filter is added to the driving sound source vector passed through the synthesis filter 33 (that is, if the addition is performed after synthesis).
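
The equivalence follows from the linearity of the filter and can be checked numerically. The sketch below uses a small all-pole filter with zero initial state as a stand-in for the synthesis filter 33; the coefficients and signals are arbitrary illustrative values, not taken from the specification.

```python
def all_pole_filter(x, a):
    """y[n] = x[n] - sum_k a[k] * y[n-1-k], starting from zero state."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

a = [-0.5, 0.25]              # illustrative filter coefficients
c = [1.0, 0.0, -1.0, 0.5]     # stands in for the weighted driving vector
s = [0.25, -0.5, 0.75, 0.0]   # stands in for the weighted white noise

# Add first, then filter (the arrangement drawn in FIG. 2).
added_first = all_pole_filter([ci + si for ci, si in zip(c, s)], a)
# Filter each signal separately, then add.
added_after = [p + q for p, q in zip(all_pole_filter(c, a), all_pole_filter(s, a))]
```

The two results agree to within rounding. In a frame-by-frame implementation the filter memory would also have to be handled consistently for the equivalence to hold across frame boundaries.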

【００２４】合成フィルタ３３の出力ｙｓ１は、後処
理部３４において、従来法と同様に、スペクトル包絡や
ピッチ成分が強調される。ただし、音声区間以外は強調
されると逆に不自然になるため、モード情報ｂｍが母
音区間についてのみ従来法と同程度の強調を行い、母音
性が低くなるにしたがって、強調の度合いを弱める。合
成フィルタ３３の出力ｙｓ１は、背景雑音定常成分生
成部４５の、平均雑音パワー測定部６１へも送られる。
平均雑音パワー測定部６１では、確実な背景雑音区間
（無音区間を含む）においてのみ、合成フィルタ３３の
出力信号ｙｓ１のパワーを測定し、フレーム長に対し
て十分に長い時間にわたるパワーの平均値を計算する。
ここで「確実な背景雑音区間」とは、背景雑音区間でも
母音区間や子音区間の疑いのあるフレームは除外するこ
とを意味する。この確実な背景雑音区間だけ合成フィル
タ３３の出力が平均雑音パワー測定部６１へ供給される
ようにモード情報ｂｍによりスイッチ４６が制御され
る。また「フレーム長に対して十分に長い時間」とは、
１秒〜数十秒程度がよいと考えられる。長時間平均の計
算の方法としては、バッファにフレーム毎のパワーを記
憶しておいて、一定の時間毎に平均をとってもよいし、
第ｎフレームにおける瞬時パワーをＰ（ｎ）、第ｎフレ
ームにおける平均パワーをＰave(ｎ) 、第ｎ−１フレー
ムにおける平均パワーをＰave(ｎ−１) ，０＜α＜１と
して、Ｐave(ｎ) ＝（１−α）Ｐave(ｎ−１）＋αＰ( ｎ) のような逐次更新式を用いて近似的に求めてもよい。な
おαは値が小さいほど長時間平均に相当する。計算され
た平均雑音パワーＰｙは、重み作成部６２に送られ、
背景雑音定常成分のパワーを決定する重みｗｕが計算
される。背景雑音定常成分生成部４５から出力される信
号のパワーは、平均雑音パワーＰｙとほぼ同じになる
ように決められるが、多少低くなるように決めると、聴
感上聞きやすい音になることが多い。なお、合成フィル
タ６３の出力ｕ２のパワーは平均雑音スペクトルａ
ｕに依存するため、重み作成部６２で重みｗｕを求め
る際には、平均雑音パワーＰｙと合成フィルタ６３の
フィルタゲインを併用するか、合成フィルタ６３の出力
ｕ２のパワーを実際に測定してその値をもとにｗｕ
を求めるとよい。The output ys1 of the synthesis filter 33 has its spectral envelope and pitch component emphasized in the post-processing unit 34, as in the conventional method. However, because emphasis outside speech sections sounds unnatural, emphasis comparable to the conventional method is applied only when the mode information bm indicates a vowel section, and the degree of emphasis is weakened as the vowel-likeness decreases. The output ys1 of the synthesis filter 33 is also sent to the average noise power measurement unit 61 of the background noise stationary component generation unit 45. The average noise power measurement unit 61 measures the power of the output signal ys1 of the synthesis filter 33 only in certain background noise sections (including silent sections) and calculates the average of the power over a time sufficiently long relative to the frame length. Here, a "certain background noise section" means that frames suspected of being a vowel section or a consonant section are excluded even from among background noise sections. The switch 46 is controlled by the mode information bm so that the output of the synthesis filter 33 is supplied to the average noise power measurement unit 61 only in these certain background noise sections. A "time sufficiently long relative to the frame length" is considered to be about 1 second to several tens of seconds. The long-term average may be calculated by storing the power of each frame in a buffer and averaging at fixed intervals, or it may be approximated with a recursive update of the form Pave(n) = (1 − α) Pave(n − 1) + α P(n), where P(n) is the instantaneous power in the n-th frame, Pave(n) is the average power in the n-th frame, Pave(n − 1) is the average power in the (n − 1)-th frame, and 0 < α < 1. The smaller the value of α, the longer the averaging time. The calculated average noise power Py is sent to the weight creating unit 62, which calculates the weight wu that determines the power of the background noise stationary component. The power of the signal output from the background noise stationary component generation unit 45 is set to be approximately equal to the average noise power Py, although setting it slightly lower often yields a sound that is easier to listen to. Because the power of the output u2 of the synthesis filter 63 depends on the average noise spectrum au, the weight creating unit 62 should obtain the weight wu either by using the average noise power Py together with the filter gain of the synthesis filter 63, or by actually measuring the power of the output u2 of the synthesis filter 63 and deriving wu from that value.
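
The recursive update and the gating by switch 46 can be sketched as follows. The mode label "noise" and the function names are illustrative assumptions; the update rule itself is the one given in this paragraph.

```python
def update_average_power(p_ave, p, alpha):
    """One step of Pave(n) = (1 - alpha) * Pave(n-1) + alpha * P(n)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return (1.0 - alpha) * p_ave + alpha * p

def track_noise_power(frame_powers, modes, alpha, p_ave=0.0):
    """Update the average only in frames labelled as certain background noise."""
    for p, mode in zip(frame_powers, modes):
        if mode == "noise":  # plays the role of switch 46 being closed
            p_ave = update_average_power(p_ave, p, alpha)
    return p_ave
```

For example, `track_noise_power([4.0, 100.0, 4.0], ["noise", "vowel", "noise"], 0.5)` ignores the loud vowel frame and yields 3.0. A practical value of α would be far smaller than 0.5, so that the average spans the one to several tens of seconds mentioned above.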

【００２５】平均雑音スペクトル測定部６４では、平均
雑音パワー測定部６１と同様に、確実な背景雑音区間
（無音区間を含む）においてのみ、復号された線形予測
パラメータａｑから、フレーム長に対して十分に長い
時間にわたるスペクトルの平均値を計算する。このため
確実な背景雑音区間のみ線形予測パラメータａｑが平
均雑音スペクトル測定部６４へ供給されるようにモード
情報ｂｍによりスイッチ４７が制御される。スペクト
ルの平均値は、一般に、線形予測パラメータの一種であ
る、線スペクトル対（ＬＳＰ）の領域で平均操作を行う
ことが多いが、ケプストラムやパワースペクトルの領域
で平均をとってもよい。平均の計算方法は、上記パワー
の平均と同様に、バッファにフレーム毎のスペクトルパ
ラメータを記憶しておいて、一定の時間毎に平均をとっ
てもよいし、逐次更新式を用いてもよい。平均雑音スペ
クトル測定部６４からは、平均雑音スペクトルに対応す
る、線形フィルタ係数ａｕが出力され、合成フィルタ
６３の係数となる。なお、線スペクトル対（ＬＳＰ）か
らの線形フィルタ係数の計算方法は、前述の古井貞煕著
“ディジタル音声処理”（東海大学出版会）にも記載さ
れている。Like the average noise power measurement unit 61, the average noise spectrum measurement unit 64 calculates, only in certain background noise sections (including silent sections), the average of the spectrum over a time sufficiently long relative to the frame length from the decoded linear prediction parameter aq. For this purpose, the switch 47 is controlled by the mode information bm so that the linear prediction parameter aq is supplied to the average noise spectrum measurement unit 64 only in certain background noise sections. The spectrum is usually averaged in the domain of line spectrum pairs (LSP), a type of linear prediction parameter, but the average may also be taken in the cepstrum or power spectrum domain. As with the power average described above, the average may be calculated by storing the spectral parameters of each frame in a buffer and averaging at fixed intervals, or by using a recursive update. The average noise spectrum measurement unit 64 outputs linear filter coefficients au corresponding to the average noise spectrum, which become the coefficients of the synthesis filter 63. A method of calculating linear filter coefficients from line spectrum pairs (LSP) is also described in the aforementioned "Digital Speech Processing" by Sadaoki Furui (Tokai University Press).

【００２６】白色雑音生成部６５は、白色雑音生成部５
１と同様に、１フレーム長の白色雑音ｓ２を出力し、
出力された白色雑音ベクトルｓ２は、合成フィルタ６
３に通されてｕ２となる。定常音生成部６６は、入力
信号の性質に依存しない、完全に一定の音ｕ３を出力
する。背景雑音定常成分生成部４５の出力は、定常音
ｕ３の振幅を乗算部６７で重みｗｕ３を乗算して調
整した信号と、合成フィルタ６３の出力信号ｕ２とを
加算部６８で加算したものに、重み作成部６２で作成さ
れる重みｗｕを乗算部６９で乗算したものとなる。な
お、定常音生成部６６から出力される信号ｕ３は、入
力信号とは独立した音であるので、入力信号を符号化
し、復号して再生するという立場からいえば、用いても
用いなくてもよい。しかし、人間の聴覚特性上、定常な
背景雑音は安心感をもたらし、入力信号の特性と必ずし
も一致していなくても、より自然に感じることが多い。
したがって、定常音生成部６６において、ブーンといっ
た低周波の音や、サーといった定常的な白色雑音を生成
して、合成フィルタ６３の出力レベルに比べて相対的に
低いレベルに振幅を調整して加算すると、より自然な背
景雑音となる。Like the white noise generator 51, the white noise generator 65 outputs a white noise vector s2 one frame long, and the output white noise vector s2 is passed through the synthesis filter 63 to become u2. The stationary sound generator 66 outputs a completely constant sound u3 that does not depend on the properties of the input signal. The output of the background noise stationary component generation unit 45 is obtained as follows: the amplitude of the stationary sound u3 is adjusted by multiplying it by the weight wu3 in the multiplication unit 67; the adjusted signal is added to the output signal u2 of the synthesis filter 63 in the addition unit 68; and the sum is multiplied in the multiplication unit 69 by the weight wu created by the weight creating unit 62. Because the signal u3 output from the stationary sound generator 66 is independent of the input signal, it may or may not be used from the standpoint of encoding, decoding, and reproducing the input signal. However, given the characteristics of human hearing, stationary background noise gives a sense of security and often feels more natural even if it does not exactly match the characteristics of the input signal. Therefore, a more natural background noise is obtained when the stationary sound generator 66 generates a low-frequency sound such as a hum, or stationary white noise such as a hiss, and this sound is added after its amplitude has been adjusted to a level low relative to the output level of the synthesis filter 63.
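
The signal path through the multiplication units 67 and 69 and the addition unit 68 can be sketched as below. The sketch derives wu from the measured power of u2, which is one of the two options the specification mentions; the `level` parameter (a value of 1 or slightly less) and the function names are hypothetical.

```python
import math

def power(v):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in v) / len(v)

def stationary_component(u2, u3, py, wu3, level=1.0):
    """Return wu * (u2 + wu3 * u3) for one frame.

    wu is chosen so that the filtered noise u2 alone would have power
    level * py; u3 is assumed to be mixed in at a low relative level.
    """
    p_u2 = power(u2)
    wu = math.sqrt(level * py / p_u2) if p_u2 > 0.0 else 0.0
    return [wu * (a + wu3 * b) for a, b in zip(u2, u3)]
```

Setting `level` slightly below 1 corresponds to the remark that a somewhat lower stationary noise level is often easier to listen to.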

【００２７】背景雑音定常成分生成部４５の出力は、後
処理部３４の出力信号ｙｓ１ｅに加算部７１で加算さ
れて、再生信号出力となる。ここで注意すべきことは、
背景雑音定常成分のレベルは、平均雑音パワーの測定結
果Ｐｙによって決定され、母音区間、子音区間、背景
雑音区間といった、フレーム毎のモードにかかわらず、
ほぼ一定のレベルで加算されることである。この点は、
背景雑音変動成分が、該当フレームのモード情報によっ
て、フレーム毎に、加算される雑音レベルが制御される
点と異なっている。The output of the background noise stationary component generation unit 45 is added to the output signal ys1e of the post-processing unit 34 in the addition unit 71 to form the reproduced signal output. Note that the level of the background noise stationary component is determined by the measured average noise power Py and is added at a substantially constant level regardless of the mode of each frame, such as vowel section, consonant section, or background noise section. This differs from the background noise variation component, whose added noise level is controlled frame by frame according to the mode information of the frame.

【００２８】なお、背景雑音定常成分生成部４５の出力
は、合成フィルタ３３と後処理部３４との間で加算して
もよいが、後処理部３４は音声を強調するための処理で
あるので、音声強調の度合いが大きい場合には、図２に
示すように後処理部３４の後で加算したほうが処理も簡
単で再生音声の自然性も高い。次に、図１を用いて、符
号化方法を説明する。The output of the background noise stationary component generation unit 45 may instead be added between the synthesis filter 33 and the post-processing unit 34. However, because the post-processing unit 34 performs processing to emphasize speech, when the degree of speech emphasis is large, adding it after the post-processing unit 34 as shown in FIG. 2 is simpler and yields more natural reproduced speech. Next, the encoding method will be described with reference to FIG. 1.

【００２９】図２に示したような、復号方法を用いて自
然な再生音声を得るためには、符号化側では、従来の符
号化方法に加えて、以下の２点を実現しなければならな
い。１点目は、母音区間、子音区間、無音区間、背景雑
音区間といった、フレーム毎のモード分けをして、復号
側にモード情報の全部または一部を送ること、２点目
は、背景雑音区間において、雑音パワーの情報を復号側
に送ることである。To obtain natural reproduced speech with the decoding method shown in FIG. 2, the encoding side must realize the following two points in addition to the conventional encoding method. The first is to classify each frame into a mode, such as vowel section, consonant section, silent section, or background noise section, and send all or part of the mode information to the decoding side. The second is to send noise power information to the decoding side in background noise sections.

【００３０】図１において、入力音声信号ｘは、モー
ド判定部８１にも送られる。モード判定部８１では、入
力信号を分析して、区間の特性を表すモード情報ｂｍ
を出力する。符号化側のモード分けとしては、「母音区
間である」「子音区間である」「無音区間である」「背
景雑音区間である」の４つに、「よくわからない（不
明）」というカテゴリも許すことにする。不明を許すの
は、背景雑音が重畳した音声信号を分析した場合、必ず
中間的な性質のフレームが存在するためで、強制的に４
つの区間のどれかに分類してしまうのは自然な音を再生
するという立場から適当でないと考えられるからであ
る。ただし、不明を含めた５つのモード情報を受信側に
送ろうとすると、送信情報が無駄になるため、不明モー
ドは符号化側のみで利用して、送信する際には「不明」
は母音区間に含めてもよい。上記４つのモード情報を送
る場合、一般的には２ビット必要であるが、最初に母音
であるかそうでないかを１ビットで表すと、母音以外の
区間については、母音区間よりも少ない情報で入力信号
を表現できるようになるため、つまり使用ビットに余り
が生じるため、この余ったビットで子音／無音／背景雑
音の各区間を表すことができ、実質的にはこの発明を実
施するためには、従来法と比べてフレームあたり１ビッ
ト余分に使用すればよいことになる。フレームあたり１
ビットとは、例えばフレーム長が２０ミリ秒であれば５
０ビット／秒に相当し、全体の情報量の４ｋビット／秒
に対し、ごくわずかな情報量増でよい。In FIG. 1, the input speech signal x is also sent to the mode determination unit 81. The mode determination unit 81 analyzes the input signal and outputs mode information bm representing the characteristics of the section. On the encoding side, in addition to the four modes "vowel section," "consonant section," "silent section," and "background noise section," a category of "not sure (unknown)" is also allowed. Unknown is allowed because, when a speech signal with superimposed background noise is analyzed, frames of intermediate character always exist, and forcing them into one of the four sections is considered inappropriate from the standpoint of reproducing natural sound. However, sending five kinds of mode information including unknown to the receiving side would waste transmitted information, so the unknown mode may be used only on the encoding side, and "unknown" may be included in the vowel section for transmission. Sending the four kinds of mode information generally requires 2 bits. However, if one bit first indicates whether or not the frame is a vowel, non-vowel sections can represent the input signal with less information than vowel sections, which leaves spare bits; these spare bits can be used to distinguish the consonant, silent, and background noise sections. In practice, therefore, implementing the present invention requires only one extra bit per frame compared with the conventional method. One bit per frame corresponds to 50 bits/second when the frame length is 20 milliseconds, for example, which is a very small increase relative to a total rate of 4 kbits/second.

【００３１】モードを母音区間／子音区間／無音区間／
背景雑音区間に分ける手法としては、信号のパワー、パ
ワーの変動分、スペクトル包絡の傾き、ピッチ周期性な
どを分析して求め、それぞれをしきい値と比較して判断
する。また、背景雑音区間の場合は、信号の性質が多岐
に渡るので、モードの連続性を考慮したり、過去の背景
雑音区間のパワーや性質と比較して、相対的なしきい値
を用いるとよい。分析して得られた値とそれぞれのしき
い値を比較しても、明確な区間分類ができない場合は、
区間が「不明」とする。Modes are classified into vowel, consonant, silent, and background noise sections by analyzing the signal power, the variation in power, the slope of the spectral envelope, the pitch periodicity, and so on, and comparing each with a threshold. In the case of background noise sections, the properties of the signal vary widely, so it is advisable to take the continuity of modes into account, or to use relative thresholds based on comparison with the power and properties of past background noise sections. If a clear classification cannot be made even after comparing the analyzed values with their respective thresholds, the section is regarded as "unknown."
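
A threshold-based classifier of this kind might look like the sketch below. It uses only two of the listed features (frame power and pitch periodicity) and a threshold relative to a running noise-floor estimate. Every threshold value, feature scale, and name is a hypothetical choice for illustration; the specification does not give concrete values.

```python
def classify_frame(power_db, periodicity, noise_floor_db,
                   margin_db=6.0, voiced_thr=0.6, unvoiced_thr=0.3,
                   silence_db=-60.0):
    """Return 'silence', 'noise', 'vowel', 'consonant' or 'unknown'.

    power_db       frame power in dB (hypothetical scale)
    periodicity    normalized pitch correlation in [0, 1]
    noise_floor_db power of recent background noise frames, which makes
                   the power thresholds relative rather than absolute
    """
    if power_db < silence_db:
        return "silence"
    if power_db < noise_floor_db + margin_db and periodicity < unvoiced_thr:
        return "noise"
    if power_db >= noise_floor_db + 2.0 * margin_db:
        if periodicity >= voiced_thr:
            return "vowel"
        if periodicity < unvoiced_thr:
            return "consonant"
    # No threshold gave a clear answer, so the frame stays unclassified.
    return "unknown"
```

Frames that fall between the thresholds are returned as "unknown" rather than being forced into one of the four sections.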

【００３２】２点目の、背景雑音区間において雑音パワ
ーの情報を復号側に送るために、この発明では、ゲイン
符号帳２３の探索に、歪み計算部１６とパワーレベル差
計算部８２のそれぞれの出力の加重和を用いて符号帳を
検索する。ＣＥＬＰ系の符号化方式において、特に背景
雑音区間や子音区間では、入力信号のパワーが再生音の
パワーに必ずしも反映されない。これは、従来のＣＥＬ
Ｐ系符号化方式の符号帳探索が、サンプル単位の波形歪
みを小さくすることを念頭において行われていることに
対して、符号化のモデルが背景雑音や子音の生成過程に
合っていないことに起因する。したがって、従来法を用
いた場合、背景雑音区間や子音区間では、再生信号のパ
ワーが、入力信号のパワーを正しく表さないだけでな
く、不自然で不安定に変動することが多かった。この発
明では、復号側で、合成フィルタ出力ｙｓ１のパワー
を計算して、背景雑音レベルＰｙを推定するため、従
来の符号帳探索方法を用いるのでは、復号側で間違った
雑音レベルが推定されてしまうことになる。そこで、こ
の発明における符号化側では、合成フィルタ１４の出力
ｙを従来と同様の波形歪み計算部１６に送って、入力
信号とのサンプル単位での波形歪み値ｄを計算するほ
か、合成フィルタ出力ｙをパワーレベル差計算部８２
にも送り、入力信号のパワーと合成信号ｙのパワーの
差も計算する。一般にＣＥＬＰ系符号化では、適応符号
ベクトルｃａと固定符号ベクトルｃｒと、それらに
乗算するゲインｇａ，ｇｒの最適な組み合わせを探
索するが、実際には、これらの同時最適値を探索するに
は膨大な演算量が必要となるため、適応符号帳１８と固
定符号帳１９を先に探索して最適または準最適な適応符
号ベクトルｃａと固定符号ベクトルｃｒを決めた
後、ゲイン符号帳２３を最後に探索することが多い。As for the second point, to send noise power information to the decoding side in background noise sections, the present invention searches the gain codebook 23 using a weighted sum of the outputs of the distortion calculation unit 16 and the power level difference calculation unit 82. In CELP-based coding schemes, the power of the input signal is not necessarily reflected in the power of the reproduced sound, particularly in background noise sections and consonant sections. This is because the codebook search in conventional CELP-based coding schemes is designed to minimize sample-by-sample waveform distortion, whereas the coding model does not match the generation process of background noise or consonants. Consequently, with the conventional method, the power of the reproduced signal in background noise sections and consonant sections not only fails to represent the power of the input signal correctly but also often fluctuates in an unnatural and unstable manner. In the present invention, the decoding side calculates the power of the synthesis filter output ys1 to estimate the background noise level Py, so using the conventional codebook search method would cause the decoding side to estimate an incorrect noise level. Therefore, on the encoding side in the present invention, the output y of the synthesis filter 14 is sent to the waveform distortion calculation unit 16, as in the conventional method, to calculate the sample-by-sample waveform distortion value d with respect to the input signal; in addition, the synthesis filter output y is sent to the power level difference calculation unit 82, which calculates the difference between the power of the input signal and the power of the synthesized signal y. In general, CELP-based coding searches for the optimal combination of the adaptive code vector ca, the fixed code vector cr, and the gains ga and gr by which they are multiplied. In practice, however, searching for their jointly optimal values requires an enormous amount of computation, so the adaptive codebook 18 and the fixed codebook 19 are often searched first to determine the optimal or suboptimal adaptive code vector ca and fixed code vector cr, and the gain codebook 23 is searched last.

【００３３】この発明でも上記探索順序によるものとす
るが、適応符号帳１８と固定符号帳１９の探索には、従
来と同様に、波形歪みｄの最小化に基づいて符号帳を
探索し、ゲイン符号帳２３の探索時には、波形歪みｄ
とパワーレベル差ｄｐを併用して探索する。重み作成
部８３では、当該フレームのモード情報ｂｍに基づい
て、波形歪み値ｄとパワーレベル差ｄｐのそれぞれ
の値に乗算部８４，８５で乗算する重みｗｄ，ｗｐ
を作成する。例えば、母音区間では波形歪み値ｄのみ
を用いるか波形歪み値ｄを優先する重みの組を使い、
子音区間や背景雑音区間ではパワーレベル差ｄｐのみ
を用いるかパワーレベル差ｄｐを優先する重みの組を
使用する。この結果、母音区間では従来と同様の波形歪
みの少ない良好な品質の音が、子音区間や背景雑音区間
では、合成波形ｙの形状は入力信号ｘの形状と相似
性が高く、パワーは入力信号のパワーをできるだけ保存
するような音を再生するための駆動音源符号が選択され
る。なお、「不明」区間は母音区間と子音や背景雑音区
間との中間的な性質であるので、ゲイン符号帳２３を探
索する重みとしてはパワーレベル差ｄｐを重視するよ
うな重みを用いるのがよい。乗算部８４，８５でそれぞ
れ重み付けされた波形歪み値ｄとパワーレベル差ｄ
ｐは加算部８６で加算されて符号帳検索制御部２５に入
力される。The present invention also follows the above search order. The adaptive codebook 18 and the fixed codebook 19 are searched based on minimizing the waveform distortion d, as in the conventional method, while the gain codebook 23 is searched using both the waveform distortion d and the power level difference dp. Based on the mode information bm of the frame, the weight creating unit 83 creates the weights wd and wp by which the waveform distortion value d and the power level difference dp are multiplied in the multiplication units 84 and 85, respectively. For example, in vowel sections, only the waveform distortion value d is used, or a set of weights that prioritizes the waveform distortion value d is used; in consonant sections and background noise sections, only the power level difference dp is used, or a set of weights that prioritizes the power level difference dp is used. As a result, in vowel sections a driving sound source code is selected that reproduces sound of good quality with little waveform distortion, as in the conventional method, while in consonant sections and background noise sections a driving sound source code is selected that reproduces sound in which the shape of the synthesized waveform y is highly similar to that of the input signal x and the power preserves the power of the input signal as far as possible. Because an "unknown" section has properties intermediate between a vowel section and a consonant or background noise section, it is advisable to search the gain codebook 23 with weights that emphasize the power level difference dp. The waveform distortion value d and the power level difference dp, weighted in the multiplication units 84 and 85, respectively, are added in the addition unit 86 and input to the codebook search control unit 25.
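
The mode-dependent gain search can be sketched as follows. The sketch assumes a linear synthesis filter, so that the synthesized signal is y = ga·ya + gr·yr, where ya and yr are the filter responses to the already selected adaptive and fixed code vectors. It also assumes squared error for d and the absolute energy difference for dp. The weight table, these definitions, and the toy data are illustrative assumptions, not values from the specification.

```python
def energy(v):
    return sum(s * s for s in v)

# Hypothetical (wd, wp) pairs per mode.
SEARCH_WEIGHTS = {
    "vowel": (1.0, 0.0),
    "consonant": (0.0, 1.0),
    "noise": (0.0, 1.0),
    "unknown": (0.3, 0.7),
}

def search_gain_codebook(x, ya, yr, gain_codebook, mode):
    """Return the index of the (ga, gr) pair minimizing wd * d + wp * dp."""
    wd, wp = SEARCH_WEIGHTS[mode]
    best_index, best_cost = -1, float("inf")
    for index, (ga, gr) in enumerate(gain_codebook):
        y = [ga * a + gr * r for a, r in zip(ya, yr)]
        d = sum((xs - ys) ** 2 for xs, ys in zip(x, y))  # waveform distortion
        dp = abs(energy(x) - energy(y))                  # power level difference
        cost = wd * d + wp * dp
        if cost < best_cost:
            best_index, best_cost = index, cost
    return best_index

# Toy frame: the synthesized shape yr is uncorrelated with the input x.
x = [1.0, -1.0, 1.0, -1.0]
ya = [0.0, 0.0, 0.0, 0.0]
yr = [1.0, 1.0, -1.0, -1.0]
codebook = [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0)]
```

On this toy frame the waveform-only criterion (mode "vowel") selects the zero-gain entry, while the power criterion (mode "noise") selects the entry whose output energy matches the input. This mirrors the power loss in noise-like frames that the weighted search is meant to avoid.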

【００３４】[0034]

【発明の効果】この発明の効果を調べるために、この発
明をコンピュータプログラムによるシミュレーションの
形で実現し、実際の音声データを用いて主観品質評価実
験を行った。フレーム長は２０ミリ秒とし、フレームは
２つのサブフレームに分割した。ビットレートは４キロ
ビット／秒で設計した。フレームあたりの詳細なビット
配分は、図５に示すように、線形予測パラメータ２０ビ
ット、適応符号帳１３ビット、固定符号帳３４ビットと
し、ゲイン符号帳は母音区間では１２ビット、母音以外
の区間では１０ビットとした。モード情報は母音区間で
１ビット、母音以外の区間で３ビットである。To examine the effects of the present invention, the invention was implemented as a computer simulation, and a subjective quality evaluation experiment was conducted using actual speech data. The frame length was 20 milliseconds, and each frame was divided into two subframes. The bit rate was designed to be 4 kbits/second. As shown in FIG. 5, the detailed bit allocation per frame was 20 bits for the linear prediction parameters, 13 bits for the adaptive codebook, and 34 bits for the fixed codebook; the gain codebook used 12 bits in vowel sections and 10 bits in non-vowel sections. The mode information used 1 bit in vowel sections and 3 bits in non-vowel sections.
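
The bit budget can be checked with simple arithmetic. The sketch below uses only the figures stated in this paragraph; the dictionary keys and function names are illustrative.

```python
FRAME_MS = 20  # frame length in milliseconds

BITS = {
    "vowel":     {"lpc": 20, "adaptive": 13, "fixed": 34, "gain": 12, "mode": 1},
    "non_vowel": {"lpc": 20, "adaptive": 13, "fixed": 34, "gain": 10, "mode": 3},
}

def bits_per_frame(kind):
    """Total bits spent on one frame of the given kind."""
    return sum(BITS[kind].values())

def bit_rate(bits):
    """Bits per second for a given number of bits per frame."""
    return bits * 1000 // FRAME_MS
```

Both frame types total 80 bits, which at 20 milliseconds per frame gives 4000 bits/second. The 2 bits saved in the gain codebook for non-vowel frames are what pay for the longer mode field, and a single bit per frame corresponds to 50 bits/second.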

【００３５】音声データは、修正ＩＲＳ特性と呼ばれる
一般的な電話の特性に準拠したもので、ＳＮ比が１５ｄ
Ｂの自動車雑音付加音声と、ＳＮ比が３０ｄＢのオフィ
ス雑音付加音声を、それぞれ従来法とこの発明を用いて
符号化および復号して、それぞれ再生された音を実際に
聞いて比較した。被験者は一般人２４名で、試験方法
は、原音と符号化音声の品質を比較して、非常に悪い
（−３）、悪い（−２）、少し悪い（−１）、同品質
（０）、少し良い（＋１）、良い（＋２）、非常に良い
（＋３）の７段階で評価した。The speech data conformed to the modified IRS characteristic, a typical telephone characteristic. Speech with added automobile noise at an SN ratio of 15 dB and speech with added office noise at an SN ratio of 30 dB were each encoded and decoded using the conventional method and the present invention, and the reproduced sounds were compared by actual listening. The subjects were 24 members of the general public. In the test, the quality of the coded speech was compared with that of the original sound and rated on a seven-point scale: very bad (−3), bad (−2), slightly bad (−1), same quality (0), slightly good (+1), good (+2), and very good (+3).

【００３６】図６に試験結果を示す。グラフは低いほど
原音に比べて品質が悪いことを示す。図より、この発明
を用いた場合の品質は、この発明を用いない従来の方法
による品質に比べて大きく改善することが示された。従
来法とこの発明による音の性質の差を言葉で表現する
と、従来法による再生音は、子音区間と背景雑音区間に
おいて、大変に不自然で不快な音であったが、この発明
によって再生された音は、自動車騒音とオフィス雑音の
それぞれの雰囲気が再現されているうえに、安心して聞
ける自然な品質の音であった。また、背景雑音区間だけ
でなく母音区間においても、音声と背景雑音が混合し
た、より自然な再生音という観点でこの発明のほうが優
れていた。FIG. 6 shows the test results. A lower value on the graph indicates poorer quality relative to the original sound. The figure shows that the quality obtained with the present invention is greatly improved compared with that of the conventional method without the present invention. To put the difference in sound character into words: with the conventional method, the reproduced sound was very unnatural and unpleasant in consonant sections and background noise sections, whereas the sound reproduced by the present invention recreated the atmosphere of the automobile noise and the office noise, respectively, and had a natural quality that could be listened to comfortably. Moreover, the present invention was superior not only in background noise sections but also in vowel sections, in that the reproduced sound was a more natural mixture of speech and background noise.

【００３７】なお、この発明を背景雑音のない音声に適
用した場合には、「背景雑音」は無音区間として分類さ
れ、「背景雑音」の平均レベルは０であると判断される
ため、理論的に悪影響を及ぼさないことは言うまでもな
い。実際の音声を入力した場合も、母音区間と無音区間
（背景雑音区間）については従来法と同等の音質が、子
音区間については、この発明によるほうが自然な音質で
あった。When the present invention is applied to speech without background noise, the "background noise" is classified as a silent section and the average level of the "background noise" is judged to be 0, so it goes without saying that, in theory, there is no adverse effect. When actual speech was input, the sound quality was equivalent to that of the conventional method in vowel sections and silent sections (background noise sections), and more natural with the present invention in consonant sections.

[Brief description of drawings]

【図１】この発明による符号化装置の機能構成例を示す
ブロック図。FIG. 1 is a block diagram showing a functional configuration example of an encoding device according to the present invention.

【図２】この発明による復号化装置の機能構成例を示す
ブロック図。FIG. 2 is a block diagram showing a functional configuration example of a decoding device according to the present invention.

【図３】従来の音声の符号駆動線形予測符号化（Code-E
xcited Linear Prediction：ＣＥＬＰ）装置の機能構成
を示すブロック図。FIG. 3 is a block diagram showing the functional configuration of a conventional code-excited linear prediction (CELP) speech encoding device.

【図４】従来の音声の符号駆動線形予測符号化（Code-E
xcited Linear Prediction：ＣＥＬＰ）に対応する復号
装置の機能構成を示すブロック図。FIG. 4 is a block diagram showing the functional configuration of a decoding device corresponding to conventional code-excited linear prediction (CELP) speech coding.

【図５】シミュレーション実験におけるビット配合を示
す図。FIG. 5 is a diagram showing the bit allocation in the simulation experiment.

【図６】シミュレーション実験結果を示す図。FIG. 6 is a diagram showing a simulation experiment result.

フロントページの続き (58)調査した分野(Int.Cl.⁷，ＤＢ名) G10L 19/12 Continuation of front page (58) Fields surveyed (Int.Cl.⁷, DB name): G10L 19/12

Claims

(57) [Claims]

1. An acoustic signal decoding device that selects a code vector for each frame from at least one of an adaptive codebook and a fixed codebook, multiplies the selected code vector by a gain selected from a gain codebook to generate a driving sound source vector, and drives a synthesis filter with the driving sound source vector to output an acoustic signal, the device comprising: means for receiving information indicating whether or not the frame is a background noise section and whether or not the frame is a background noise section or a consonant section; means for calculating a power level representing a long-term average of the power of the output signal from the synthesis filter in frames that are background noise sections; means for calculating an average spectrum representing a long-term average of the spectrum from the filter coefficients of the synthesis filter in frames that are background noise sections; a second synthesis filter that is given the average spectrum as filter coefficients and is driven by white noise; means for adjusting the amplitude of the signal generated by the second synthesis filter based on the calculated power level to generate a stationary component signal of background noise; means for adding the stationary component signal to the output signal from the synthesis filter, regardless of whether or not the frame is a background noise section, to output a reproduced acoustic signal; means for measuring the power of the driving sound source vector of the frame; means for determining first and second weights based on the measured power and the information indicating whether or not the frame is a background noise section or a consonant section; means for multiplying the white noise by the first weight to generate a fluctuation component signal of background noise; means for multiplying the driving sound source vector by the second weight; and means for adding, in frames that are background noise sections or consonant sections, the fluctuation component signal to the driving sound source vector multiplied by the second weight, wherein the first and second weights are determined so that the sum of the power of the fluctuation component signal and the power of the driving sound source vector multiplied by the second weight is equal to the specified power.

2. The acoustic signal decoding device according to claim 1, further comprising: means for generating a stationary sound having constant spectral characteristics and constant amplitude; and means for adding the stationary sound to the stationary component signal generated by the means for generating the stationary component signal of background noise and outputting the sum as the stationary component signal.

3. An acoustic signal encoding device that selects a code vector for each frame from at least one of an adaptive codebook and a fixed codebook, multiplies the selected code vector by a gain selected from a gain codebook to generate a driving sound source vector, compares the acoustic signal obtained by driving a synthesis filter with the driving sound source vector against an input acoustic signal, and selects the code vector and the gain, the device comprising: means for analyzing the input acoustic signal, determining whether the frame corresponds to a background noise section, a consonant section, or a vowel section, and outputting a code representing the characteristic of the determined section; means for calculating a power level difference between the output signal of the synthesis filter and the input acoustic signal; and means for selecting from the gain codebook, at the time of the gain selection, the gain that minimizes a distance measure, using the power level difference as the distance measure in frames of the background noise section and the consonant section, and using the waveform distortion of the output signal of the synthesis filter with respect to the input signal as the distance measure in the vowel section.

4. The acoustic signal encoding device according to claim 3, wherein a weighted sum of the waveform distortion and the power level difference, in which the waveform distortion is prioritized in the vowel section and the power level difference is prioritized in the background noise section or the consonant section, is used as the distance measure.

5. An acoustic signal decoding method in which a code vector is selected for each frame from at least one of an adaptive codebook and a fixed codebook, the selected code vector is multiplied by a gain selected from a gain codebook to generate a driving sound source vector, and a synthesis filter is driven with the driving sound source vector to output an acoustic signal, the method comprising: a step of receiving information indicating whether or not the frame is a background noise section and whether or not the frame is a background noise section or a consonant section; a step of calculating a power level representing a long-term average of the power of the output signal from the synthesis filter in frames that are background noise sections; a step of calculating an average spectrum representing a long-term average of the spectrum from the filter coefficients of the synthesis filter; a step of driving, with white noise, a second synthesis filter that uses the average spectrum as filter coefficients; a step of adjusting the amplitude of the signal generated by the second synthesis filter based on the calculated power level to generate a stationary component signal of background noise; a step of adding the stationary component signal to the output signal from the synthesis filter, regardless of whether or not the frame is a background noise section, to output an acoustic signal; a step of measuring the power of the driving sound source vector of the frame; a step of determining first and second weights based on the measured power and the information indicating whether or not the frame is a background noise section or a consonant section; a step of multiplying the white noise by the first weight to generate a fluctuation component signal of background noise; a step of multiplying the driving sound source vector by the second weight; and a step of adding, in frames that are background noise sections or consonant sections, the fluctuation component signal to the driving sound source vector multiplied by the second weight, wherein the first and second weights are determined so that the sum of the power of the fluctuation component signal and the power of the driving sound source vector multiplied by the second weight is equal to the specified power.

6. The acoustic signal decoding method according to claim 5, further comprising: a step of generating a stationary sound having constant spectral characteristics and constant amplitude; and a step of adding the stationary sound to the stationary component signal of the background noise and outputting the sum as the stationary component signal.

7. An acoustic signal encoding method in which a code vector is selected for each frame from at least one of an adaptive codebook and a fixed codebook, the selected code vector is multiplied by a gain selected from a gain codebook to generate a driving sound source vector, the acoustic signal obtained by driving a synthesis filter with the driving sound source vector is compared with an input acoustic signal, and the code vector and the gain are selected, the method comprising: a step of analyzing the input acoustic signal, determining whether the frame corresponds to a background noise section, a consonant section, or a vowel section, and outputting a code representing the characteristic of the determined section; a step of calculating a power level difference between the output signal of the synthesis filter and the input acoustic signal; and a step of selecting from the gain codebook, at the time of the gain selection, the gain that minimizes a distance measure, using the power level difference as the distance measure in frames of the background noise section and the consonant section, and using the waveform distortion of the output signal of the synthesis filter with respect to the input signal as the distance measure in the vowel section.

8. The acoustic signal encoding method according to claim 7, wherein a weighted sum of the waveform distortion and the power level difference, in which the waveform distortion is prioritized in the vowel section and the power level difference is prioritized in the background noise section or the consonant section, is used as the distance measure.

9. A computer-readable recording medium that stores a program for causing a computer to execute the acoustic signal decoding method according to claim 5 or 6.

10. A computer-readable recording medium that stores a program for causing a computer to execute the acoustic signal encoding method according to claim 7 or 8.