JPH09244695A

JPH09244695A - Voice coding device and decoding device

Info

Publication number: JPH09244695A
Application number: JP8046191A
Authority: JP
Inventors: Yoshiro Nishimoto; 善郎西元; Tetsuya Takahashi; 哲也高橋; Toshiaki Shimoda; 敏章下田; Toru Sakatani; 亨坂谷; Takayuki Hiekata; 孝之稗方
Original assignee: Kobe Steel Ltd
Current assignee: Kobe Steel Ltd
Priority date: 1996-03-04
Filing date: 1996-03-04
Publication date: 1997-09-19

Abstract

PROBLEM TO BE SOLVED: To prevent unpleasant sounds caused by background noise by temporally smoothing the fluctuation of speech feature quantities only in unvoiced sections, or by smoothing more strongly in unvoiced sections than in voiced sections. SOLUTION: In the speech coding device, a feature quantity calculation unit calculates, from the input speech signal, the parameters that serve as speech feature quantities. A voiced/unvoiced section identification unit then judges from these feature quantities whether the input signal is currently a voiced section containing speech or an unvoiced section containing only background noise. When the section is judged to be unvoiced, a feature quantity smoothing unit suppresses the fluctuation of the calculated feature quantities. Conversely, when it is judged to be voiced, the calculated values are used as they are, without smoothing. The feature quantities are then quantized in a parameter quantization unit. In addition, when speech is input, an optimum drive signal selection unit searches the codebook unit for the optimum index for reproducing a sound close to the original.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech coding device and a speech decoding device for suppressing the abnormal sounds caused by background noise, which are a problem in mobile phones and similar equipment.

[0002]

[Prior Art] As an example of a speech compression/decompression (coding/decoding) device to which the present invention applies, FIGS. 10 and 11 show the basic configuration of a mobile phone based on CELP (Code Excited Linear Prediction). FIG. 10 shows a configuration example of the speech coding device, and FIG. 11 a configuration example of the speech decoding device. The coding device shown in FIG. 10 comprises a "feature quantity calculation unit" that calculates speech feature quantities from the input original speech, a "parameter quantization unit" that quantizes them to compress the data, a "codebook unit" that stores a plurality of indices corresponding to the drive signals later used to synthesize speech, and an "optimum drive signal selection unit" that selects the optimum index in the codebook for reproducing a sound close to the original. The optimum drive signal selection unit selects, from the plurality of indices in the codebook, the index corresponding to the drive signal that makes the synthesized speech close to the original speech. That is, on receiving input speech, it selects a plurality of indices from the codebook, feeds the corresponding drive signals one after another into a speech synthesis filter to create synthesized sounds, repeating this once per selected index, and then extracts and outputs the index whose synthesized sound is closest to the input speech. During this search, the speech synthesis filter is driven with coefficients corresponding to the speech feature quantities.
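The search loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any standard: the names `synthesize` and `search_codebook` are hypothetical, the codebook is assumed to hold excitation vectors directly, the error measure is a plain squared error (no gains, no perceptual weighting), and the synthesis filter is assumed to be 1/A(z) with A(z) = 1 + sum_i a_i z^{-i}.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter 1/A(z), assuming A(z) = 1 + sum_i lpc[i-1] * z^-i."""
    out = []
    for n, sample in enumerate(excitation):
        acc = sample
        for i, a in enumerate(lpc, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]  # feedback from earlier output samples
        out.append(acc)
    return out


def search_codebook(target, codebook, lpc):
    """Return the index of the entry whose synthesized frame is closest to `target`.

    Assumption: closeness is the unweighted squared error over the frame.
    """
    best_index, best_error = -1, float("inf")
    for index, excitation in enumerate(codebook):
        synthesized = synthesize(excitation, lpc)
        error = sum((t - s) ** 2 for t, s in zip(target, synthesized))
        if error < best_error:
            best_index, best_error = index, error
    return best_index
```

Every candidate is passed through the same filter, which is why the search cost grows with the codebook size.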

[0003] To decode the coded data, the quantized speech feature quantities, the selected index, and so on are transmitted to the decoding device of FIG. 11. There, a "speech feature quantity decoding unit" decodes the feature quantity values, and a "drive signal generation unit" creates the drive signal from the codebook based on the index transmitted from the coding device. The coefficients of the "speech synthesis filter" are determined from the feature quantities representing the spectral envelope shape of the speech, and speech is synthesized by feeding the drive signal into that filter. Here the drive signal corresponds to a sound source such as the vibration of the human vocal cords, and the synthesis filter that creates the spectral envelope shape corresponds to the transfer function, determined by the shape of the vocal tract and so on, through which the source sound passes before it is emitted. Accordingly, the quantization tables used to quantize the speech feature quantities and the indices stored in the codebook are prepared, under various standards, so as to be suited to synthesizing human speech.

[0004] On the other hand, it is known that speech compression/decompression schemes that compress speech to a low bit rate in this way cannot always reproduce the original sound faithfully when sounds other than the human voice are input, and that sound quality deteriorates severely even when only very familiar background noise, such as air-conditioning noise or the sound of rain, is mixed in. One such degradation is called swirling noise, a phenomenon in which the background noise turns into an extremely unpleasant, fluctuating, warbling sound. The cause, as shown in the paper "Improvement of Background Sound coding in Linear Predictive Speech Codes", Proc. ICASSP 1995, and elsewhere, is that the frame normally used to calculate the speech feature quantities representing the spectral envelope shape contains too little data to calculate the feature quantities of noise stably, so the speech parameters fluctuate.

[0005]

[Problems to Be Solved by the Invention] One prior-art technique for solving this problem, described in Japanese Patent Laid-Open No. 7-152395, obtains the spectral feature quantities of the background noise in unvoiced sections and, based on them, attenuates the noise with a filter that suppresses the noise spectrum. Although this method improves the S/N ratio, it cannot eliminate the noise completely, and the feature quantities calculated from the remaining noise still fluctuate and sound unpleasant, so the essence of the problem is not solved. Moreover, on a telephone the background noise is itself information that conveys the atmosphere of the surroundings, so removing the background sound creates a separate problem: the result sounds unnatural.

[0006] As another prior-art technique, in Japanese Patent Laid-Open No. 7-160294 a noise codebook is prepared on the decoding device side, and in unvoiced sections a separate signal selected from this "noise codebook" is used instead of the signal generated from the index and other data transmitted from the coding device. Two methods are shown: replacing the drive signal before it enters the synthesis filter with the signal from the "noise codebook", and replacing the synthesized signal output by the synthesis filter. Both have problems. First, even if the signal before the synthesis filter is changed, the coefficients of the synthesis filter are still determined from the speech feature quantities for the spectral envelope shape; since swirling noise is caused by the fluctuation of the speech parameters, the synthesized speech output from the filter still fluctuates unpleasantly. Conversely, with the method of replacing the speech after the synthesis filter with a signal from the "noise codebook", the original sound cannot be reproduced unless noise signals with every spectral characteristic that might occur are all prepared in the codebook. Furthermore, the codebook search is the most computationally expensive part of the CELP scheme, and performing a codebook search in the decoding device as well as in the coding device would require an extremely fast signal-processing arithmetic unit.

[0007] As yet another prior-art technique, Japanese Patent Laid-Open No. 7-036485 provides a buffer that stores past speech signals and, in unvoiced sections, calculates the speech feature quantities from more frames of data than in the calculation for ordinary speech sections, thereby trying to stabilize the calculation result and suppress the fluctuation of the feature quantities. However, according to the paper cited above, about 320 ms of speech data is used to stabilize the calculation result, which for digital speech data sampled at 8 kHz means as many as 2560 samples. This method therefore requires a large amount of memory for the buffer alone. Accordingly, an object of the present invention is to provide a speech coding device and a speech decoding device that do not produce the unpleasant sound caused by background noise, without requiring a large memory capacity and without using a high-speed arithmetic unit.
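The 2560-sample figure follows directly from the window length and the sampling rate. A small check, using a hypothetical helper name:

```python
def buffer_samples(sample_rate_hz, window_ms):
    """Number of samples needed to hold `window_ms` of audio at `sample_rate_hz`."""
    return sample_rate_hz * window_ms // 1000


print(buffer_samples(8000, 320))  # 2560
```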

[0008]

[Means for Solving the Problems] To achieve the above object, the speech coding device adopted in the present application is a speech coding device that calculates speech feature quantities from a speech signal, compresses and transmits the parameters constituting those feature quantities, and selects from a codebook, according to the speech signal, an index corresponding to a drive signal for speech synthesis and sends it out together with the compressed parameters, the device being characterized by comprising: coding-side voiced/unvoiced section identification means for identifying voiced and unvoiced sections in the speech signal; and coding-side smoothing means for performing either of two smoothing processes, namely temporally smoothing the fluctuation of the speech feature quantities only within unvoiced sections, or smoothing more strongly in unvoiced sections than in voiced sections. In this case, one option is to calculate line spectrum pairs as the speech feature quantities and, in unvoiced sections, to quantize the feature quantities after temporally smoothing the value of each line spectrum pair.

[0009] The first invention relating to the speech decoding device in the present application is a speech decoding device that receives the compressed parameters, the index for drive signal generation, and so on output from a speech coding device, restores the speech feature quantities and the drive signal from them, and synthesizes speech using them, the device being characterized by comprising: decoding-side voiced/unvoiced section identification means for identifying voiced and unvoiced portions of the speech signal from the compressed parameters received from the speech coding device; decoding-side smoothing means for performing either of two smoothing processes, namely temporally smoothing the fluctuation of the speech feature quantities only within unvoiced sections, or smoothing more strongly in unvoiced sections than in voiced sections; and speech synthesis means for synthesizing speech using the smoothed speech feature quantities. In this case, line spectrum pairs can be calculated as the speech feature quantities, and in unvoiced sections the drive signal can be calculated and the speech synthesized after the value of each line spectrum pair has been temporally smoothed.

[0010] The second invention is a speech decoding device that receives the compressed parameters, the index for drive signal generation, and so on output from a speech coding device, restores the speech feature quantities and the drive signal from them, and synthesizes speech using them, the device comprising decoding-side voiced/unvoiced section identification means for identifying voiced and unvoiced portions of the speech signal from the compressed parameters received from the coding device, and being characterized in that, in portions judged to be unvoiced, a third drive signal is created by replacing part or all of a first drive signal, restored from the codebook based on the index, with a second drive signal generated by another method, and speech is synthesized based on this third drive signal. Here, a noise signal generated from random numbers or the like can be used as the second drive signal, and the third drive signal can be created by weighted addition of the first drive signal restored from the codebook and the second drive signal. At the boundary where voiced and unvoiced sections switch, the ratio between the first drive signal and the second drive signal (random numbers or the like) in the weighted addition may be changed continuously when creating the third drive signal. Furthermore, in unvoiced portions the fluctuation of the speech feature quantities may be temporally smoothed, and speech may be synthesized using the smoothed feature quantities and the drive signal.
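The weighted addition and the gradual change of the mixing ratio at voiced/unvoiced boundaries might look like the sketch below. All names, the uniform noise source, and the per-frame step of 0.25 are assumptions made for illustration; the text does not specify how the weight evolves or how the noise is scaled.

```python
import random


def noise_excitation(length, rng):
    """Second drive signal: white noise from a random number generator (assumed uniform)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]


def mix_excitation(first, second, noise_weight):
    """Third drive signal as a weighted sum of the codebook signal and the noise signal.

    noise_weight = 0 keeps the codebook signal; noise_weight = 1 replaces it entirely.
    """
    return [(1.0 - noise_weight) * f + noise_weight * s for f, s in zip(first, second)]


def next_noise_weight(previous, unvoiced, step=0.25):
    """Move the weight toward 1 in unvoiced frames and toward 0 in voiced frames.

    The fixed step per frame is a hypothetical choice that avoids abrupt switching.
    """
    target = 1.0 if unvoiced else 0.0
    if previous < target:
        return min(target, previous + step)
    return max(target, previous - step)
```

A decoder would call `next_noise_weight` once per frame and pass the result to `mix_excitation`; matching the noise energy to the codebook signal is left out of this sketch.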

[0011]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the accompanying drawings to aid understanding of the invention. The embodiments below are examples embodying the invention and do not limit its technical scope. FIG. 1 is a block diagram showing a configuration of a speech coding device according to an embodiment of the present invention; FIG. 2 is a block diagram showing a configuration of a speech decoding device according to an embodiment of the present invention; FIG. 3 is a graph showing the fluctuation of the line spectrum pairs used as feature quantities; FIG. 4 is a graph showing the fluctuation after quantization of the line spectrum pairs shown in FIG. 3 (without smoothing); FIG. 5 is a graph showing the fluctuation after quantization of the line spectrum pairs shown in FIG. 3 (with smoothing); FIG. 6 is a block diagram showing the configuration of a speech decoding device according to an example of the present invention; FIG. 7 is a block diagram showing the configuration that smooths the feature quantity parameters in the speech decoding device shown in FIG. 6; FIG. 8 is a graph showing the spectrum of the background noise in the original speech; FIG. 9 is a graph showing the spectrum of the synthesized background noise; FIG. 10 is a block diagram showing the configuration of a conventional CELP coding device; and FIG. 11 is a block diagram showing the configuration of a conventional CELP decoding device.

[0012] FIGS. 1 and 2 show the configurations of a speech coding device and a speech decoding device according to an embodiment of the present invention. In the speech coding device (FIG. 1), the "feature quantity calculation unit" first calculates, from the input speech signal, the various parameters that serve as speech feature quantities. Next, the "voiced/unvoiced section identification unit" judges from these feature quantities whether the input signal is currently a voiced section containing speech or an unvoiced section containing only background noise. If the section is judged to be unvoiced, the fluctuation of the calculated feature quantities is suppressed in the "feature quantity smoothing unit". Conversely, if it is judged to be voiced, the calculated values are used as they are, without smoothing. From this point on, processing is the same as in the CELP coding device of FIG. 10 described under the prior art: the feature quantities are quantized (compressed) in the "parameter quantization unit". When speech is input, the optimum drive signal selection unit searches the "codebook unit" and retrieves, from among the plurality of indices, the optimum index for reproducing a sound close to the original. The quantized feature quantity parameters output by the coding device and the index resulting from the codebook search are then transmitted to the decoding device.

[0013] The decoding device synthesizes speech from these, as shown in FIG. 2. First, the "speech feature quantity decoding unit" converts the quantized parameters back into speech feature quantity values. Next, the "voiced/unvoiced section identification unit" judges from the feature quantities whether the speech is in a voiced section or an unvoiced section. The result identified by the "voiced/unvoiced section identification unit" of the coding device may, of course, instead be included explicitly as a transmitted parameter. If the section is judged to be unvoiced, the fluctuation of the feature quantities is suppressed in the "feature quantity smoothing unit". Conversely, if it is judged to be voiced, the values are used as they are, without smoothing. The parameters of the speech synthesis filter are determined from the spectral envelope feature quantities that have passed through this smoothing process. The drive signal used as the input for synthesis is also restored from the index, and by feeding it into the "synthesis filter unit" determined by the feature quantity parameters, speech close to the original sound is synthesized. In many cases a speech decoding device has a "post-filter unit" (not shown) after the speech synthesis filter unit, which improves the perceptual quality of the synthesized speech before it is output.

[0014] The specific calculation methods for each part are shown next. Commonly used speech signal processing algorithms are explained in reference books such as "Basics of Speech Information Processing" (Ohmsha), and the processing that is the same as in the general CELP scheme is described in detail in standards documents such as "RCR Standard 27C", so they are omitted here. The main parameters calculated by the "feature quantity calculation unit" include the speech volume, the spectral envelope shape, and the pitch. The feature quantities are calculated by analyzing temporally consecutive speech data X(n), n = 1, 2, ..., N_A, called a frame. First, as the volume parameter, the average amplitude A of the speech signal within the frame, as in the following equation, can be adopted, among others.
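The equation for the average amplitude A is not reproduced in this text. One plausible reading, assumed here, is the mean absolute sample value over the frame; a root-mean-square definition would be another option. The function name is hypothetical.

```python
def average_amplitude(frame):
    """Mean absolute value of the samples X(1), ..., X(N_A) of one frame (assumed definition)."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    return sum(abs(x) for x in frame) / len(frame)
```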

[0015]

[Equation 1]

As the feature quantities for the spectral envelope shape, there are the linear prediction coefficients α_i, i = 1, 2, ..., M, obtained by solving the equation below.

[0016]

[Equation 2]

Here, R_i, i = 0, 1, 2, ..., M, is the autocorrelation calculated from the consecutive speech data X(n), n = 1, 2, ..., N_a, by the following equation. Normally, however, the N_a speech samples used to calculate R_i are first multiplied by a Hamming window or a similar window.
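A windowed autocorrelation of this kind can be sketched as below. The function names are hypothetical, the standard Hamming window 0.54 - 0.46 cos(2πn/(N-1)) is used, and the autocorrelation is assumed to be the usual finite sum over the windowed frame, R_i = sum_n x(n) x(n+i), since the original equation is not reproduced in this text.

```python
import math


def hamming_window(n_samples):
    """Standard Hamming window of length n_samples."""
    if n_samples < 2:
        return [1.0] * n_samples
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def autocorrelation(frame, order):
    """R_0 .. R_order of the Hamming-windowed frame (assumed finite-sum definition)."""
    windowed = [x * w for x, w in zip(frame, hamming_window(len(frame)))]
    return [sum(windowed[n] * windowed[n + i] for n in range(len(windowed) - i))
            for i in range(order + 1)]
```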

[0017]

[Equation 3]

It is well known that this equation can be solved efficiently using Durbin's recursive method or the like. Methods for obtaining the pitch include the autocorrelation method. It is known that if the speech X(n) is passed through the inverse synthesis filter expressed by the z-transform below and the autocorrelation function of the resulting prediction residual signal is calculated, the autocorrelation function has a peak at the lag corresponding to the pitch of the speech.
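Durbin's recursion can be sketched as follows. Because the normal equations are not reproduced in this text, the sign convention is an assumption: the prediction error filter is taken as A(z) = 1 + sum_i α_i z^{-i}, so the coefficients solve sum_j α_j R_|i-j| = -R_i. The function name is hypothetical.

```python
def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelations r[0..order].

    Returns (coefficients alpha_1..alpha_order, final prediction error energy).
    Assumes r[0] > 0 and the convention A(z) = 1 + sum_i alpha_i * z^-i.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                      # reflection coefficient for stage m
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= (1.0 - k * k)                # error energy shrinks at each stage
    return a[1:], err
```

The recursion needs on the order of M squared operations rather than the M cubed of a general linear solver, which is the efficiency the text refers to.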

[0018]

[Equation 4]

Besides this calculation method, another well-known way of handling pitch is to select the drive signal with an evaluation function in the codebook search described later, so that the drive signal has the optimum periodicity. As means for detecting unvoiced portions in "voiced/unvoiced section identification", methods based on the speech volume and methods focusing on the periodicity of the signal are used. That is, known methods include judging a section to be unvoiced when frames with relatively low volume and little fluctuation continue, and judging portions with little periodicity to be unvoiced, since voiced portions produce a signal with periodicity due to the pitch. As a parameter expressing periodicity, the peak value of the autocorrelation function of the prediction residual signal, obtained with the linear prediction coefficients from the feature quantity calculation, is often used.
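The two detection cues described above (low volume, low periodicity of the prediction residual) can be combined in a sketch such as the following. The function names, the thresholds, the lag range, the single-frame decision, and the OR combination are all assumptions for illustration; the inverse filter is assumed to be A(z) = 1 + sum_i a_i z^{-i}.

```python
def prediction_residual(frame, lpc):
    """Pass the frame through the inverse filter A(z) = 1 + sum_i lpc[i-1] * z^-i."""
    residual = []
    for n in range(len(frame)):
        acc = frame[n]
        for i, a in enumerate(lpc, start=1):
            if n - i >= 0:
                acc += a * frame[n - i]
        residual.append(acc)
    return residual


def periodicity_peak(residual, min_lag, max_lag):
    """Largest normalized autocorrelation of the residual over candidate pitch lags.

    Returns (peak value, lag); (0.0, 0) if the residual has no energy.
    """
    energy = sum(x * x for x in residual)
    if energy == 0.0:
        return 0.0, 0
    best, best_lag = 0.0, 0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(residual[n] * residual[n + lag]
                   for n in range(len(residual) - lag)) / energy
        if corr > best:
            best, best_lag = corr, lag
    return best, best_lag


def is_unvoiced(frame, lpc, amplitude_threshold, periodicity_threshold,
                min_lag=20, max_lag=147):
    """Hypothetical rule: a frame is unvoiced if it is quiet or shows little periodicity."""
    amplitude = sum(abs(x) for x in frame) / len(frame)
    peak, _ = periodicity_peak(prediction_residual(frame, lpc), min_lag, max_lag)
    return amplitude < amplitude_threshold or peak < periodicity_threshold
```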

[0019] The "feature quantity smoothing unit" temporally smooths the spectral envelope parameters in particular, since they contribute most to the unpleasant fluctuation of sound that arises with background noise. However, if the linear prediction coefficients above are smoothed directly, there is no guarantee that the synthesis filter built from the smoothed parameters is stable. If the synthesis filter becomes unstable, the synthesized output signal turns into an oscillating tone, which is highly undesirable. To solve this problem, the linear prediction coefficients α_i, i = 1, 2, ..., M, can be converted into equivalent parameters called line spectrum pairs ω_i, i = 1, 2, ..., M, and smoothed after the conversion. If M is even, the conversion to line spectrum pairs can be done by finding the roots of the following polynomials determined by the α_i (written here for M = 10):

P(z) = 1 + (α_1 + α_{10}) z^{-1} + (α_2 + α_9) z^{-2} + ... + (α_{10} + α_1) z^{-10} + z^{-11}

Q(z) = 1 + (α_1 - α_{10}) z^{-1} + (α_2 - α_9) z^{-2} + ... + (α_{10} - α_1) z^{-10} - z^{-11}
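The conversion to line spectrum pairs can be sketched without a general root finder, because for a stable filter the non-trivial roots of P(z) and Q(z) lie on the unit circle. The sketch below assumes A(z) = 1 + sum_i α_i z^{-i} with even M, evaluates each polynomial on the unit circle as a real function of the angle, and locates sign changes by bisection. The function names and the grid size are hypothetical, and two roots closer together than one grid step would be missed.

```python
import math


def lsp_polynomials(lpc):
    """Coefficients of P(z) and Q(z) in powers of z^-1, for A(z) = 1 + sum_i lpc[i-1] z^-i."""
    a = [1.0] + list(lpc) + [0.0]       # A(z) padded to degree M + 1
    rev = a[::-1]                        # z^-(M+1) * A(1/z)
    return [x + y for x, y in zip(a, rev)], [x - y for x, y in zip(a, rev)]


def _on_circle(coeffs, omega, symmetric):
    """Real function whose zeros in omega are the unit-circle roots of the polynomial."""
    half = (len(coeffs) - 1) / 2.0
    trig = math.cos if symmetric else math.sin
    return sum(c * trig((half - k) * omega) for k, c in enumerate(coeffs))


def _roots_in_open_interval(coeffs, symmetric, grid=512, iterations=60):
    """Angles in (0, pi) where the polynomial vanishes, by grid scan plus bisection."""
    roots, step = [], math.pi / grid
    prev_w, prev_f = step, _on_circle(coeffs, step, symmetric)
    for k in range(2, grid):
        w = k * step
        f = _on_circle(coeffs, w, symmetric)
        if prev_f != 0.0 and prev_f * f <= 0.0:
            lo, hi, f_lo = prev_w, w, prev_f
            for _ in range(iterations):
                mid = 0.5 * (lo + hi)
                f_mid = _on_circle(coeffs, mid, symmetric)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            roots.append(0.5 * (lo + hi))
        prev_w, prev_f = w, f
    return roots


def lpc_to_lsp(lpc):
    """Line spectrum pairs (radians, ascending) of a stable even-order LPC filter."""
    p, q = lsp_polynomials(lpc)
    return sorted(_roots_in_open_interval(p, True) + _roots_in_open_interval(q, False))
```

The trivial roots of P(z) at z = -1 and of Q(z) at z = 1 fall on the interval endpoints and are deliberately excluded from the scan.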

[0020] If the filter is stable, the roots of these two polynomials P(z) and Q(z) have the form z_i = e^{-jω_i}; the root angles 0 < ω_1 < ω_3 < ω_5 < ... < ω_{M-1} < π of P(z), together with 0 < ω_2 < ω_4 < ω_6 < ... < ω_M < π of Q(z), are called the line spectrum pairs. It is known that the filter is stable as long as the line spectrum pairs satisfy 0 < ω_i < ω_j < π (i < j), so if the original parameters are stable, stability is easily maintained after smoothing. For example, if the line spectrum pairs of the k-th frame are ω_i(k), i = 1, 2, ..., M, smoothing can be performed as follows:

Ω_i(k) = β · Ω_i(k-1) + (1 - β) · ω_i(k)   (with the initial state Ω_i(1) = ω_i(1))

Here 0 < β < 1, and the larger β is, the stronger the smoothing. Clearly, if the line spectrum pairs ω_i(k) before smoothing satisfy the stability condition above, the smoothed Ω_i(k) also satisfy it.
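The recursion above needs only the M smoothed values from the previous frame as state. A minimal sketch, with hypothetical function names, follows. The ordering is preserved because each Ω_i(k) is a weighted average, with positive weights β and 1 - β, of two vectors that are both strictly increasing and lie inside (0, π).

```python
import math


def smooth_lsp(previous, current, beta):
    """One step of Omega_i(k) = beta * Omega_i(k-1) + (1 - beta) * omega_i(k).

    `previous` is None for the first frame, in which case Omega_i(1) = omega_i(1).
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must satisfy 0 < beta < 1")
    if previous is None:
        return list(current)
    return [beta * p + (1.0 - beta) * c for p, c in zip(previous, current)]


def is_stable_ordering(lsp):
    """Check the stability condition 0 < w_1 < w_2 < ... < w_M < pi."""
    bounded = [0.0] + list(lsp) + [math.pi]
    return all(a < b for a, b in zip(bounded, bounded[1:]))
```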

[0021] In unvoiced sections, building the synthesis filter from the smoothed line spectrum pairs Ω_i(k), i = 1, 2, ..., M, obtained in this way reduces the unpleasant fluctuation of sound quality in background-noise portions. As for the synthesis filter coefficients corresponding to the smoothed line spectrum pairs, the polynomials P(z) and Q(z) above can be constructed by expanding the polynomials that have e^{-jΩ_i} as roots, so the quantized linear prediction coefficients a_i, i = 1, 2, ..., M, corresponding to α_i can be obtained as follows.
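The equation that follows is not reproduced in this text, so the sketch below shows one standard way to carry out the step just described, under stated assumptions: M is even, the odd-numbered angles belong to P(z) and the even-numbered ones to Q(z), P(z) carries the extra factor (1 + z^{-1}) and Q(z) the factor (1 - z^{-1}), and A(z) = (P(z) + Q(z)) / 2. The function names are hypothetical.

```python
import math


def _poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def lsp_to_lpc(lsp):
    """Rebuild a_1..a_M of A(z) = 1 + sum_i a_i z^-i from ascending line spectrum pairs."""
    if len(lsp) % 2 != 0:
        raise ValueError("this sketch assumes an even order M")
    p, q = [1.0, 1.0], [1.0, -1.0]               # trivial roots at z = -1 and z = +1
    for index, omega in enumerate(lsp):
        # Each conjugate root pair e^{+-j omega} contributes 1 - 2 cos(omega) z^-1 + z^-2.
        factor = [1.0, -2.0 * math.cos(omega), 1.0]
        if index % 2 == 0:                        # omega_1, omega_3, ... belong to P(z)
            p = _poly_mul(p, factor)
        else:                                     # omega_2, omega_4, ... belong to Q(z)
            q = _poly_mul(q, factor)
    a = [0.5 * (x + y) for x, y in zip(p, q)]
    return a[1:len(lsp) + 1]
```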

[0022]

[Equation 5]

Smoothing the parameters in voiced sections would distort the spectrum of the human voice, so either the original ω_i are used without smoothing, or light smoothing with a small β may be applied in order to reduce noise-induced parameter fluctuation to the extent that the voice is not distorted.
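Choosing β per frame according to the voiced/unvoiced decision can be sketched as below. The function name and the default values 0.9 and 0.0 are hypothetical; setting the voiced-section value to 0 reproduces "no smoothing", and a small positive value gives the light smoothing mentioned above.

```python
def smooth_sequence(frames, unvoiced_flags, beta_unvoiced=0.9, beta_voiced=0.0):
    """Smooth a sequence of line-spectrum-pair vectors with a frame-dependent beta.

    frames: list of LSP vectors, one per frame.
    unvoiced_flags: list of booleans, True where the frame is judged unvoiced.
    """
    smoothed, state = [], None
    for lsp, unvoiced in zip(frames, unvoiced_flags):
        beta = beta_unvoiced if unvoiced else beta_voiced
        if state is None:
            state = list(lsp)                     # initial state: Omega(1) = omega(1)
        else:
            state = [beta * s + (1.0 - beta) * w for s, w in zip(state, lsp)]
        smoothed.append(state)
    return smoothed
```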

[0023] Smoothing is most effective when applied to the feature quantities representing the spectral envelope shape, but it may of course also be applied to other feature quantities such as volume and pitch. For "parameter quantization", the "codebook", and the "codebook search", various schemes are used in the various standards, and schemes conforming to them can be used in this example as well. For example, the "parameter quantization unit" reduces the number of parameter bits by scalar quantization or "vector quantization", thereby compressing the data. The "codebook" generally consists of an adaptive codebook, which stores past drive signals and produces a periodic waveform, and a fixed codebook, which compensates for the part that the periodic waveform alone cannot represent.

[0024] The "codebook search" selects, from among the drive signals that can be generated from this codebook, the optimum drive signal for synthesizing speech close to the original input. The usual method is to select the waveform that minimizes an evaluation function representing perceptually weighted distortion. For the adaptive codebook, which produces a periodic waveform, the pitch and gain used to produce the periodic waveform from previously selected drive signals are determined and output as indexes. For the codebook unit that holds fixed signal waveforms as a table, the optimum combination of waveforms selected from those signals, together with their signs and gains, is output as indexes. The quantized parameters and codebook indexes output from the encoding device are input to the decoding device, where the "feature quantity decoding unit" restores the feature quantity values. The drive signal selected by the encoding device is regenerated from the codebook indexes. The restored feature quantities can be smoothed in the decoding device just as in the encoding device. Thus, even if the encoding device at the transmitting end has no feature quantity smoothing function, the decoding device can provide the effect of smoothing. The characteristics of the "synthesis filter" are set by the feature quantities representing the decoded spectral envelope shape. In this embodiment, the smoothed line spectrum pairs are converted into linear prediction coefficients a_i, i = 1, 2, ..., M, and speech is synthesized by inputting the drive signal to the filter represented by the following z-transform.

[0025]

(Equation 6)

Speech synthesized in this way is usually processed by some kind of "post filter" to improve the perceived sound quality. Filters for high-frequency emphasis, spectral shaping, pitch emphasis, and the like are commonly used as post filters.
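The synthesis filter and the codebook search described above can be sketched together, since the search evaluates each candidate drive signal by passing it through the filter. The sketch below rests on stated assumptions: it takes the filter to be the all-pole form H(z) = 1 / (1 + Σ a_i z^-i) (the sign convention of the source equation is not recoverable here), it searches only a fixed codebook, and it minimizes plain squared error rather than the perceptually weighted distortion the text mentions. The names `synthesize` and `search_fixed_codebook` are hypothetical.

```python
import numpy as np

def synthesize(excitation, a, state=None):
    """All-pole synthesis: s[n] = e[n] - sum_i a[i-1] * s[n-i].

    Assumes H(z) = 1 / (1 + sum_i a_i z^-i); the sign convention is an assumption.
    Returns the output samples and the filter memory (most recent sample first).
    """
    a = np.asarray(a, dtype=float)
    history = np.zeros(len(a)) if state is None else np.array(state, dtype=float)
    out = np.empty(len(excitation))
    for n, e in enumerate(excitation):
        s = e - np.dot(a, history)
        out[n] = s
        # Shift the memory so that history[0] is always s[n-1].
        history = np.concatenate(([s], history[:-1]))
    return out, history

def search_fixed_codebook(target, codebook, a):
    """Exhaustive search for the entry and gain minimizing the squared error.

    Perceptual weighting is omitted for brevity (a simplification).
    Returns (index, gain, error).
    """
    target = np.asarray(target, dtype=float)
    best = (None, 0.0, np.inf)
    for index, code in enumerate(codebook):
        y, _ = synthesize(code, a)
        energy = np.dot(y, y)
        if energy == 0.0:
            continue
        gain = np.dot(target, y) / energy  # optimal gain for this entry
        error = np.sum((target - gain * y) ** 2)
        if error < best[2]:
            best = (index, gain, error)
    return best
```

A decoder needs only `synthesize`: it regenerates the selected entry from the received index and gain and filters it with the decoded coefficients.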

[0026] In the embodiment described above, the decoding device has a function for determining unvoiced and voiced sections from the speech parameters and other data sent from the encoding device, and in sections judged to be unvoiced it uses, without modification, the drive signal generated from the codebook according to the indexes sent from the encoding device. Because the drive signals that the codebook can generate are suited only to the human voice, the result sounds unnatural when they are used for background noise. An example that can turn this into a sound resembling more natural noise is described next. In this case, to ease the unnatural impression of the background noise quality changing abruptly between voiced and unvoiced portions, the weights may be varied continuously in the boundary portion when a third drive signal is produced, by weighted addition or the like, from the first drive signal restored from the codebook and a second drive signal generated by another method such as random numbers. Furthermore, in unvoiced portions, combining this drive signal processing with temporal smoothing of the speech feature quantities sent from the encoding device, which suppresses the feature quantity fluctuation caused by background noise, reduces the unpleasantness of the background noise still further.
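The weighted addition with continuously varying weights can be sketched as follows. This is an illustration under assumptions, not the formula of the source: the mixing ratio `k` runs from 0 (codebook signal only) to 1 (noise only), and the gains are chosen so that, if the two signals are uncorrelated, the mixed signal keeps roughly the power of the first drive signal. The function names, the gain formula, and the ramp step are all hypothetical.

```python
import numpy as np

def mix_drive_signals(e1, e2, k):
    """Return e3 = g1*e1 + g2*e2 for a mixing ratio k in [0, 1].

    k = 0 keeps the codebook signal e1 unchanged; k = 1 replaces it by e2.
    Assumed gains: g1 = sqrt(1 - k), g2 = sqrt(k * P1 / P2), which keep the
    power of e3 close to that of e1 when e1 and e2 are uncorrelated.
    """
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    p1 = np.mean(e1 ** 2)
    p2 = np.mean(e2 ** 2)
    g1 = np.sqrt(1.0 - k)
    g2 = np.sqrt(k * p1 / p2) if p2 > 0.0 else 0.0
    return g1 * e1 + g2 * e2

def ramp_mixing_ratio(voiced_flags, step=0.25):
    """Move k gradually toward 0 in voiced frames and toward 1 in unvoiced frames."""
    k, ratios = 0.0, []
    for voiced in voiced_flags:
        target = 0.0 if voiced else 1.0
        k = min(k + step, target) if target > k else max(k - step, target)
        ratios.append(k)
    return ratios
```

Feeding each frame's ratio from `ramp_mixing_ratio` into `mix_drive_signals` avoids an abrupt switch of the drive signal at a voiced/unvoiced boundary; a smaller `step` gives a longer transition.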

[0027]

[Embodiment] First, the encoding device outputs the parameters obtained by quantizing the speech feature quantities calculated from the original speech, together with the indexes for generating the drive signal from the codebook; these are input to the decoding device. The "feature quantity decoding" unit of the decoding device restores the feature quantity values. The drive signal selected by the encoding device is regenerated from the codebook indexes. In the decoding device, as shown in FIG. 6, voiced and unvoiced sections can be judged by volume, using the quantized volume parameter sent from the encoding device. In addition, the drive signal generated from the codebook corresponds to the pitch-period vibration produced by the human vocal cords, so periodicity can be detected from the autocorrelation function of this drive signal and used for the voiced/unvoiced decision.

[0028] The "drive signal generation unit" generates the first drive signal from the "codebook" unit according to the codebook indexes selected by the encoding device. This embodiment also provides a "random noise generation unit", based on random numbers or the like, which produces a second drive signal whose source waveform, unlike that of the first drive signal, is more natural for ordinary background noise. When a portion is identified as unvoiced, the "drive signal processing unit" replaces all or part of the first drive signal with the second drive signal to produce a third drive signal. As a specific example, let the first drive signal be e1(i) and the second drive signal be e2(i), i = 1, 2, ..., N (N is the frame size). The third drive signal can then be formed as the following weighted linear sum:

e3(i) = g1·e1(i) + g2·e2(i)

Here g1 and g2 are the weights for adding e1 and e2. For example, if e1 and e2 are independent and uncorrelated, choosing the weights as in the following equation yields an e3 with about the same power as the original drive signal e1.

[0029]

(Equation 7)

Of course, g1 and g2 can also be made smaller than these values so that the background noise in unvoiced portions is output at a somewhat reduced level. In voiced portions the first drive signal is basically used as is (K = 0). However, to reduce the abrupt change in background noise quality at the boundary between voiced and unvoiced portions caused by this replacement of the drive signal, K can also be varied continuously when switching between voiced and unvoiced portions. The generated drive signal is passed through the "synthesis filter" to synthesize speech. The characteristics of the synthesis filter are set by the feature quantities representing the decoded spectral envelope shape. A well-known parameter set is the linear prediction coefficients a_i, i = 1, 2, ..., M. Using these, speech is synthesized by inputting the drive signal to the filter represented by the following z-transform.

[0030]

(Equation 8)

The parameters representing the spectral envelope used in the synthesis filter are calculated from the original speech in the encoding device, quantized, and passed to the decoding device. In many schemes they are first converted into parameters such as PARCOR coefficients or line spectrum pairs and then quantized, for purposes such as reducing sensitivity to quantization error. These parameters are equivalent to the linear prediction coefficients and can be converted into one another. The fundamentals of this speech signal processing are described in detail in reference books (see, for example, "音声情報処理の基礎" (Fundamentals of Speech Information Processing), Ohmsha), so a detailed description is omitted here.
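As one illustration of this equivalence, the conversion between PARCOR (reflection) coefficients and linear prediction coefficients can be sketched with the standard step-up and step-down recursions. The sketch assumes the convention A(z) = 1 + Σ a_i z^-i with k_m equal to the last coefficient of the order-m predictor; sign conventions differ between textbooks, and the function names are hypothetical.

```python
def parcor_to_lpc(k):
    """Step-up recursion: reflection coefficients k_1..k_M -> a_1..a_M.

    Assumed convention: A(z) = 1 + sum_i a_i z^-i, with a_m of order m equal to k_m.
    """
    a = []
    for k_m in k:
        # a_i(m) = a_i(m-1) + k_m * a_{m-i}(m-1), then append a_m(m) = k_m.
        a = [a_i + k_m * a_rev for a_i, a_rev in zip(a, reversed(a))] + [k_m]
    return a

def lpc_to_parcor(a):
    """Step-down recursion, the inverse of parcor_to_lpc (requires |k_m| != 1)."""
    a = list(a)
    k = []
    while a:
        k_m = a[-1]
        k.append(k_m)
        denom = 1.0 - k_m * k_m
        head = a[:-1]
        # Recover the order m-1 predictor from the order m predictor.
        a = [(a_i - k_m * a_rev) / denom for a_i, a_rev in zip(head, reversed(head))]
    return k[::-1]
```

Converting a coefficient set in one direction and back returns the original values up to rounding, which is what makes it possible to quantize in one domain and filter in another.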

[0031] In particular, with line spectrum pairs, stability is guaranteed even after temporal smoothing. Smoothing the parameters in unvoiced portions and the like therefore suppresses unnatural fluctuation of the background noise and can reduce the unpleasantness further. FIG. 7 shows an embodiment that also performs such parameter smoothing. The synthesized speech is usually processed by some kind of "post filter" (not shown) to improve the perceived sound quality. Filters for high-frequency emphasis, spectral shaping, pitch emphasis, and the like are commonly used as post filters (see, for example, RCR standard 27C).
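The stability property can be checked numerically. A weighted average of two ascending line spectrum pair vectors is itself ascending, and an ascending set of line spectrum frequencies in (0, π) corresponds to a stable all-pole filter. The sketch below assumes one common convention: an even order M, frequencies in radians, odd-numbered frequencies assigned to the symmetric polynomial P(z) and even-numbered ones to the antisymmetric polynomial Q(z), and A(z) = (P(z) + Q(z)) / 2. The function names are hypothetical.

```python
import numpy as np

def lsp_to_lpc(omega):
    """Convert ascending LSP frequencies (radians, even count M) to [1, a_1, ..., a_M].

    Assumed convention: P(z) carries the factor (1 + z^-1) and the 1st, 3rd, ...
    frequencies; Q(z) carries (1 - z^-1) and the 2nd, 4th, ... frequencies.
    """
    p = np.array([1.0, 1.0])
    q = np.array([1.0, -1.0])
    for i, w in enumerate(np.asarray(omega, dtype=float)):
        # Each frequency contributes a conjugate root pair on the unit circle.
        factor = np.array([1.0, -2.0 * np.cos(w), 1.0])
        if i % 2 == 0:
            p = np.convolve(p, factor)
        else:
            q = np.convolve(q, factor)
    a = 0.5 * (p + q)
    return a[:-1]  # the z^-(M+1) terms of P and Q cancel

def is_stable(a):
    """True if every pole of 1/A(z) lies strictly inside the unit circle."""
    return bool(np.all(np.abs(np.roots(a)) < 1.0))

# Two frames of LSPs and a smoothed (weighted-average) frame between them.
lsp_prev = np.array([0.3, 0.6, 1.0, 1.5, 2.0, 2.5])
lsp_curr = np.array([0.2, 0.9, 1.2, 1.4, 2.2, 2.8])
lsp_smooth = 0.7 * lsp_prev + 0.3 * lsp_curr
```

Here `lsp_smooth` remains in ascending order, so the filter obtained from it by `lsp_to_lpc` passes the `is_stable` check just as the two original frames do.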

[0032]

[Effects of the Invention] The effects obtained by the present invention are as follows. FIG. 3 shows the line spectrum pair values calculated from input speech in which rain noise of about 20 dB is present as background noise. The line spectrum pairs were calculated using 20 ms (160 samples) of data, including data before and after the frame. Quantizing these values by the conventional method, without smoothing, gives for example the result in FIG. 4. With the present invention, by contrast, the line spectrum pairs are smoothed in the encoding device and the decoding device, giving the result in FIG. 5. This shows that the feature quantity fluctuation of the background noise in unvoiced sections is suppressed, and in listening tests as well the unpleasantness caused by unnatural fluctuation in sound quality was reduced.

[0033] The characteristics of the synthesis filter in unvoiced portions represent the spectral characteristics of the background noise. Therefore, even when speech is synthesized with e3 as the input, the output spectrum resembles that of the original background noise, as FIGS. 8 and 9 show. FIG. 8 shows the spectrum of air-conditioning noise as the background noise, and FIG. 9 shows the spectrum of the sound synthesized by the decoding device of the present invention. The background noise synthesized by this method does of course differ from the original sound, but listening tests showed that the sound synthesized with the third drive signal is more natural and less unpleasant than the sound output by the conventional method, which uses the first drive signal as is.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of an encoding device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a decoding device according to an embodiment of the present invention.

FIG. 3 is a graph showing the variation of the line spectrum pairs (values at the feature quantity calculation unit).

FIG. 4 is a graph showing the variation of the quantized line spectrum pairs (values without smoothing).

FIG. 5 is a graph showing the variation of the quantized line spectrum pairs (smoothed values).

FIG. 6 is a block diagram showing the configuration of a speech decoding device according to an example of the present invention.

FIG. 7 is a block diagram showing a configuration that smooths the feature quantity parameters in the speech decoding device shown in FIG. 6.

FIG. 8 is a graph showing the spectrum of the background noise in the original speech.

FIG. 9 is a graph showing the spectrum of the synthesized background noise.

FIG. 10 is a block diagram showing the configuration of a conventional CELP encoding device.

FIG. 11 is a block diagram showing the configuration of a conventional CELP decoding device.

Continuation of the front page
(72) Inventor: Toru Sakatani, c/o Kobe Corporate Research Laboratories, Kobe Steel, Ltd., 1-5-5 Takatsukadai, Nishi-ku, Kobe-shi, Hyogo Prefecture
(72) Inventor: Takayuki Hiekata, c/o Kobe Corporate Research Laboratories, Kobe Steel, Ltd., 1-5-5 Takatsukadai, Nishi-ku, Kobe-shi, Hyogo Prefecture

Claims

[Claims]

1. A speech coding apparatus that calculates speech feature quantities from a speech signal, compresses and transmits the parameters constituting the speech feature quantities, and selects from a codebook, according to the speech signal, and transmits an index corresponding to a drive signal for speech synthesis, the speech coding apparatus comprising: coding voiced/unvoiced section identification means for identifying voiced and unvoiced sections in the speech signal; and coding smoothing means for performing either smoothing processing that temporally smooths the fluctuation of the speech feature quantities only in the unvoiced sections, or smoothing processing that is stronger in the unvoiced sections than in the voiced sections.

2. The speech coding apparatus according to claim 1, wherein a line spectrum pair is calculated as the speech feature quantity, and in the unvoiced section, the value of each line spectrum pair is temporally smoothed and then the speech feature quantity is quantized.

3. A speech decoding apparatus that receives compressed parameters, indexes for generating a drive signal, and the like output from a speech coding apparatus, restores speech feature quantities and a drive signal from these parameters, indexes, and the like, and synthesizes speech using them, the speech decoding apparatus comprising: decoding voiced/unvoiced section identification means for identifying voiced and unvoiced portions in the speech signal from the compressed parameters from the speech coding apparatus; smoothing means for performing either smoothing processing that temporally smooths the fluctuation of the speech feature quantities only in the unvoiced sections, or smoothing processing that is stronger in the unvoiced sections than in the voiced sections; and speech synthesis means for synthesizing speech using the smoothed speech feature quantities.

4. The speech decoding apparatus according to claim 3, wherein line spectrum pairs are calculated as the speech feature quantities, and in the unvoiced sections the value of each line spectrum pair is temporally smoothed, after which the drive signal is calculated and speech is synthesized.

5. A speech decoding apparatus that receives compressed parameters, indexes for generating a drive signal, and the like output from a speech coding apparatus, restores speech feature quantities and a drive signal from these parameters, indexes, and the like, and synthesizes speech using them, the speech decoding apparatus comprising decoding voiced/unvoiced section identification means for identifying voiced and unvoiced portions in the speech signal from the compressed parameters from the coding apparatus, wherein, in a portion judged to be an unvoiced section, a third drive signal is produced by replacing part or all of a first drive signal, restored from the codebook according to the index, with a second drive signal generated by another method, and speech is synthesized on the basis of the third drive signal.

6. The speech decoding apparatus according to claim 5, wherein a noise signal generated by random numbers or the like is used as the second drive signal, and the third drive signal is produced by weighted addition of the first drive signal restored from the codebook and the second drive signal.

7. The speech decoding apparatus according to claim 5, wherein, at a boundary portion where voiced and unvoiced portions switch, the ratio of the first drive signal to the second drive signal, such as random numbers, in the weighted addition that produces the third drive signal is varied continuously.

8. In the unvoiced portions, the fluctuation of the speech feature quantities is temporally smoothed, and speech is synthesized using the smoothed feature quantities and the drive signal. The speech decoding apparatus according to any one of claims.