JPH06202695A

JPH06202695A - Speech signal processor

Info

Publication number: JPH06202695A
Application number: JP5001368A
Authority: JP
Inventors: Atsushi Matsumoto; 淳松本; Masayuki Nishiguchi; 正之西口; Shinobu Ono; 忍小野
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1993-01-07
Filing date: 1993-01-07
Publication date: 1994-07-22

Abstract

PURPOSE: To improve the articulation of synthesized speech without impairing its naturalness by emphasizing the high-frequency formants of the frequency spectrum through direct manipulation of frequency-domain parameters. CONSTITUTION: Quantized amplitude data from an input terminal 11 are sent to an inverse vector quantization part 14 and inversely quantized; the result is sent to a data quantity inverse conversion part 15 and converted back into per-band amplitude values, which are sent to a high-frequency formant emphasis part 16. The high-frequency formant emphasis part 16 directly manipulates the per-band amplitude values, which are frequency-domain parameters, to perform high-frequency formant emphasis and high-frequency emphasis. The emphasized amplitude data are sent to a voiced sound synthesis part 17 and a voiceless sound synthesis part 20. The voiced and voiceless signals synthesized by the synthesis parts 17 and 20 and returned to the time base are added by an addition part 31 at a suitable fixed mixing ratio, and the reproduced speech signal is output from an output terminal 32.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech signal processing apparatus used in a speech synthesis system, and more particularly to a speech signal processing apparatus suitable for application to the post filter of the speech synthesis system in a multiband excitation (MBE) coding speech decoder.

[0002]

[Prior Art] Various coding methods are known that compress a signal by exploiting the statistical properties of speech signals in the time and frequency domains together with the characteristics of human hearing. These speech coding methods are broadly classified into time-domain coding, frequency-domain coding, analysis-synthesis coding, and the like.

[0003] Specific examples of such speech signal coding include MBE (Multiband Excitation) coding, SBE (Singleband Excitation) coding, harmonic coding, SBC (Sub-band Coding), LPC (Linear Predictive Coding), DCT (discrete cosine transform), MDCT (modified DCT), and FFT (fast Fourier transform).

[0004]

[Problems to Be Solved by the Invention] In speech analysis/synthesis systems centered on frequency-domain processing, such as the MBE coding above, quantization error causes spectral distortion, and the degradation is often especially severe in the high band, where fewer bits are normally allocated. As a result, speech synthesized from this spectrum loses articulation because high-band formants vanish or lack power and because the high band as a whole lacks power, and a so-called stuffy-nose quality becomes noticeable.

[0005] To correct this, formant emphasis filters that perform the compensation in the time domain, for example using an IIR (infinite impulse response) filter, have been used. In that case, however, the filter coefficients for emphasizing the formants must be calculated for every speech processing frame, which makes real-time processing difficult. The stability of the filter must also be taken into account, and the benefit is small relative to the amount of computation.

[0006] The present invention has been made in view of the above circumstances, and its object is to provide a speech signal processing apparatus in which processing such as formant emphasis in a speech synthesis system is simplified so that real-time processing is easy.

[0007]

[Means for Solving the Problems] To solve the above problems, the speech signal processing apparatus according to the present invention is used in a speech synthesis system centered on frequency-domain processing and is characterized in that the high-band formants of the frequency spectrum are emphasized by directly manipulating frequency-domain parameters.

[0008] Here, not only the high-band formants of the frequency spectrum but also the high band as a whole may be emphasized by directly manipulating frequency-domain parameters. A speech signal processing apparatus with these features is preferably applied to the post filter of the speech synthesis system in a multiband excitation (MBE) coding speech decoder.

[0009]

[Operation] Because the parameters are manipulated directly in the frequency domain, only the portion to be emphasized can be emphasized accurately with a simple configuration and simple operations, without using a time-domain high-band emphasis filter (for example an IIR filter), and real-time processing is easy.

[0010]

[Embodiment] An embodiment of the speech signal processing apparatus according to the present invention is described below with reference to the drawings. FIG. 1 shows the schematic configuration of a multiband excitation (MBE) coding speech decoder, a speech synthesis system to which a speech signal processing apparatus according to one embodiment of the invention is applied.

[0011] Input terminal 11 in FIG. 1 is supplied with quantized amplitude data transmitted from an MBE speech encoder (a so-called MBE vocoder), described later. In the MBE vocoder, the spectrum of each processing frame of the input speech signal is divided into bands in units of the pitch of the speech signal, the amplitude values of the bands are converted to a fixed number of data items that does not depend on the pitch, and the result is vector quantized to give this quantized amplitude data. Input terminals 12 and 13 are supplied, respectively, with the pitch data encoded by the MBE vocoder and with V/UV discrimination data indicating whether each band is voiced or unvoiced.

[0012] The quantized amplitude data from input terminal 11 is sent to inverse vector quantization unit 14 and inversely quantized, then sent to data number inverse conversion unit 15 and converted back to the per-band amplitude values, and then sent to high-band formant emphasis unit 16, the main part of this embodiment. High-band formant emphasis unit 16 performs high-band formant emphasis and high-band emphasis by directly manipulating the per-band amplitude values, which are frequency-domain parameters. The emphasized amplitude data is sent to voiced sound synthesis unit 17 and unvoiced sound synthesis unit 20.

[0013] The encoded pitch data from input terminal 12 is decoded by pitch decoding unit 18 and sent to data number inverse conversion unit 15, voiced sound synthesis unit 17, and unvoiced sound synthesis unit 20. The V/UV discrimination data from input terminal 13 is sent to voiced sound synthesis unit 17 and unvoiced sound synthesis unit 20. Voiced sound synthesis unit 17 synthesizes a voiced waveform on the time axis, for example by cosine wave synthesis, and sends it to addition unit 31.
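The cosine wave synthesis mentioned for voiced sound synthesis unit 17 can be pictured with the following minimal sketch. It assumes zero initial phase and a constant pitch and amplitude within the frame, none of which is stated in the text; the function name and parameters are hypothetical.

```python
import math

def synthesize_voiced(amplitudes, voiced, pitch_lag, num_samples):
    """Sum one cosine per voiced harmonic (illustrative assumptions only).

    amplitudes[n-1] is the amplitude of harmonic n, voiced[n-1] its V/UV flag,
    and pitch_lag is the pitch period in samples, assumed constant in the frame.
    """
    omega1 = 2.0 * math.pi / pitch_lag  # fundamental angular frequency, rad/sample
    out = []
    for t in range(num_samples):
        s = 0.0
        for n, (a, v) in enumerate(zip(amplitudes, voiced), start=1):
            if v:  # unvoiced bands are left to the noise branch
                s += a * math.cos(n * omega1 * t)
        out.append(s)
    return out
```

A practical decoder would also interpolate amplitude and phase between frames to avoid discontinuities; that is omitted here.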

[0014] In unvoiced sound synthesis unit 20, the time-domain white noise waveform from white noise generation unit 21 is first windowed to a predetermined length (for example 256 samples) with a suitable window function (for example a Hamming window), and STFT processing unit 22 applies an STFT (short-term Fourier transform) to obtain the power spectrum of the white noise signal on the frequency axis. The power spectrum from STFT processing unit 22 is sent to band amplitude processing unit 23, which multiplies the bands judged UV (unvoiced) by the amplitude |A_m|_UV and sets the amplitude of the other bands, judged V (voiced), to 0. Band amplitude processing unit 23 is supplied with the amplitude data, the pitch data, and the V/UV discrimination data. The output of band amplitude processing unit 23 is sent to ISTFT processing unit 24, which converts it to a time-domain signal by an inverse STFT that reuses the phase of the original white noise. The output of ISTFT processing unit 24 is sent to overlap-add unit 25, which repeatedly overlaps and adds the frames with suitable weighting on the time axis (chosen so that the original continuous noise waveform can be restored) to synthesize a continuous time-domain waveform. The output signal of overlap-add unit 25 is sent to addition unit 31.

[0015] The voiced and unvoiced signals synthesized by synthesis units 17 and 20 and returned to the time axis in this way are added by addition unit 31 at a suitable fixed mixing ratio, and the reproduced speech signal is taken from output terminal 32.

[0016] Next, the high-band formant emphasis and high-band emphasis performed in high-band formant emphasis unit 16 are described with reference to the flowchart of FIG. 2. Let A(n) denote the per-band amplitude values that form the spectral information from the data number inverse conversion unit of FIG. 1, that is, the amplitude of each band obtained on the MBE vocoder side by dividing the spectrum of the input speech signal in frequency according to the pitch. Here n is the harmonic number, or band index: an integer that increments by one for each pitch interval along the frequency axis.

[0017] First, step S1 of FIG. 2 detects the maximum of the per-band amplitude values A(n). Step S2 then determines whether this maximum is greater than a predetermined threshold J₁. If NO, processing ends without any high-band formant emphasis. If YES, processing proceeds to step S3 and begins the high-band formant emphasis described below. This reflects the fact that when the spectral values (per-band amplitudes) are small to begin with, emphasis makes the result sound less natural, so the emphasis below is applied only when the maximum exceeds J₁.
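Steps S1 and S2 amount to a single comparison, sketched below. The function name is hypothetical, and the text gives no numeric value for J₁, so the threshold is left as a parameter.

```python
def should_emphasize(amplitudes, j1):
    """Steps S1 and S2: find the largest band amplitude and compare it with J1."""
    peak = max(amplitudes)  # step S1
    return peak > j1        # step S2: True means continue to step S3
```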

[0018] As a concrete example of the formant emphasis processing, suppose the per-band amplitude values A(n) are as shown in FIG. 3. The portions marked F₁ and F₂ in the figure have clearly visible peaks, so they can be recognized as the first and second formants of the speech, but the shape of the portion marked F₃ is indistinct, and it is unclear whether it is a formant. The envelope of A(n) is therefore represented, or estimated, by the moving average of A(n). Depending on whether A(n) is larger or smaller than the value A'(n) of this envelope in each band, A(n) is made larger still where it is larger and smaller still where it is smaller, which emphasizes the formants.

[0019] That is, step S3 of FIG. 2 calculates the moving average A'(n) of the per-band amplitude values A(n) as follows.

[0020]

[Equation 1]

[0021] The moving average is calculated by this equation (1).

[0022] In equation (1), B is the total bandwidth over which the moving average is taken, and n_max is the maximum value of n within the prescribed frequency range; it varies with the angular frequency ω₁ of the pitch.
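Equation (1) itself is not reproduced in this text, so the following is only one plausible reading: a centered moving average spanning about B bands, with the window truncated at the first band and at n_max. The function name, the edge handling, and the interpretation of B as a band count are all assumptions.

```python
def moving_average(amplitudes, width):
    """Centered moving average over `width` bands, truncated at both ends.

    amplitudes[i] holds A(n) for n = i + 1; the last index plays the role of n_max.
    """
    half = width // 2
    last = len(amplitudes) - 1
    smoothed = []
    for i in range(len(amplitudes)):
        lo = max(0, i - half)
        hi = min(last, i + half)
        span = amplitudes[lo:hi + 1]
        smoothed.append(sum(span) / len(span))
    return smoothed
```

For instance, with amplitudes 1 to 5 and a width of 3, the interior values are unchanged while the two end values are pulled toward their single neighbor.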

[0023] FIG. 4 shows an example in which the per-band amplitude values A(n) and the moving average A'(n) are plotted. In FIG. 4, the buried formant F₃ on the high-band side is emphasized by the following equation (2).

Ae(n) = A(n) + f(A(n) − A'(n)) · a(n)   ... (2)

[0024] In equation (2), Ae(n) is the original spectrum (per-band amplitude values) A(n) with its formants emphasized. That is, where the original spectrum is larger than the moving-average envelope it is made larger still, and where it is smaller it is made smaller still. The function f in equation (2) may be, for example, a function such as the following equation (3).

[0025]

[Equation 2]

[0026] In equation (3), sgn(p) is a function giving the sign (positive or negative polarity) of p, and K is a constant at which the output saturates once p reaches or exceeds the fixed value K. The function of equation (3) is shown in FIG. 5.
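Equation (3) is not reproduced in this text. Based only on the description of sgn(p) and the saturation constant K, one plausible form is a simple clipping function, sketched here as an assumption: f(p) = p while |p| < K, and f(p) = sgn(p) · K otherwise.

```python
def sgn(p):
    """Sign of p: +1.0 for positive, -1.0 for negative, 0.0 for zero."""
    return (p > 0) - (p < 0) + 0.0

def f(p, k):
    """Pass p through unchanged until |p| reaches K, then hold it at sgn(p) * K."""
    return p if abs(p) < k else sgn(p) * k
```

The saturation limits how far any single band can be pushed away from the envelope, whatever the exact shape of the curve in FIG. 5.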

[0027] Further, a(n) in equation (2) is data that varies the degree of emphasis from band to band; it may be set to 0 or a very small value on the low-band side and to a nonzero value on the high-band side so that emphasis is applied there.

[0028] However, when the amplitude value A(n) itself is at or above a predetermined threshold J₂, so that the emphasized value Ae(n) would be expected to become very large, the emphasis is not applied. For this purpose, step S4 of FIG. 2 determines whether A(n) is smaller than J₂. Only if YES does processing proceed to step S5, which evaluates equation (2) to obtain the emphasized amplitude Ae(n) and then proceeds to step S7. If NO, processing proceeds to step S6, which takes the original amplitude A(n) as Ae(n) (without emphasis) and then proceeds to step S7.
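Steps S4 to S6 can be sketched as a per-band loop around equation (2). The function below is illustrative: the envelope A'(n), the weights a(n), the threshold J₂, and the function f are supplied by the caller, and all names are hypothetical.

```python
def emphasize_formants(amplitudes, envelope, weights, j2, f):
    """Apply equation (2) per band, skipping bands whose amplitude is J2 or more."""
    emphasized = []
    for a, a_avg, w in zip(amplitudes, envelope, weights):
        if a < j2:                                   # step S4
            emphasized.append(a + f(a - a_avg) * w)  # step S5, equation (2)
        else:
            emphasized.append(a)                     # step S6: leave unchanged
    return emphasized
```

With a weight of 0 on low bands the loop leaves them untouched, which is how a(n) confines the emphasis to the high band.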

[0029] In addition to the high-band formant emphasis described above, to correct the overall lack of power in the high band, high-band emphasis is performed by increasing the high-band components as in the following equation (4), based on high-band emphasis data set in advance.

Aee(n) = Ae(n) + em(n)   ... (4)

In equation (4), em(n) is a data sequence holding, for each band n, how much emphasis to apply.

[0030] For this high-band emphasis, when the pitch of the speech being analyzed is high (a female voice, for example), the number of bands within the analyzed frequency range (for example 200 to 3400 Hz) is small and the error over the whole spectrum is also small, so em(n) is set to a smaller value, for example half, em(n)/2. That is, step S7 determines whether Np, the pitch of the analyzed speech expressed as a number of samples, is smaller than a predetermined threshold P. If YES, processing proceeds to step S8, sets em(n) = em(n)/2, and then proceeds to step S9; if NO, it proceeds directly to step S9. The pitch threshold P may be, for example, P = 40 (equivalent to a 200 Hz pitch at 8 kHz sampling), in which case Np < 40 means the speech pitch is higher than 200 Hz. Step S9 performs the high-band emphasis of equation (4).
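Steps S7 to S9 can be sketched as follows. The default threshold of 40 samples comes from the text (8000 Hz / 40 samples = 200 Hz); the function name and argument layout are hypothetical.

```python
def emphasize_high_band(emphasized, em, pitch_lag, pitch_threshold=40):
    """Steps S7 to S9: add em(n), halved when the pitch lag Np is below P."""
    scale = 0.5 if pitch_lag < pitch_threshold else 1.0       # steps S7 and S8
    return [ae + scale * e for ae, e in zip(emphasized, em)]  # step S9, equation (4)
```

A pitch lag of 30 samples (about 267 Hz at 8 kHz) therefore receives half the boost that a lag of 50 samples (160 Hz) receives.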

[0031] Depending on the level after emphasis, the high-band formant emphasis and high-band emphasis described above can produce unnatural-sounding speech; the measures in steps S2, S4, S7, and S8 are taken to avoid this. In addition, the bands subject to high-band formant emphasis may be restricted; for example, it may be set in advance that no formant emphasis is applied to bands whose harmonic number n is 12 or less. This band restriction may also be implemented through the function a(n) in equation (2).

[0032] Next, a concrete example of an MBE (Multiband Excitation) vocoder, a kind of analysis-synthesis coder for speech signals (a so-called vocoder) to which the speech signal processing apparatus according to the present invention can be applied, is described with reference to the drawings. This MBE vocoder is disclosed in D. W. Griffin and J. S. Lim, "Multiband Excitation Vocoder," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223-1235, Aug. 1988. Whereas conventional vocoders such as the PARCOR (PARtial auto-CORrelation) vocoder switch between voiced and unvoiced segments block by block or frame by frame when modeling speech, the MBE vocoder models speech on the assumption that voiced and unvoiced regions coexist along the frequency axis at the same instant (within the same block or frame).

[0033] FIG. 6 is a block diagram showing the overall schematic configuration of an embodiment of the MBE vocoder. In FIG. 6, a speech signal is supplied to input terminal 101 and sent to filter 102, such as an HPF (high-pass filter), which removes the DC offset and at least the low-band components (200 Hz and below) for band limiting (for example to 200 to 3400 Hz). The signal obtained through filter 102 is sent to pitch extraction unit 103 and to windowing unit 104. In pitch extraction unit 103, the input speech data is divided into blocks of a predetermined number of samples N (for example N = 256), or equivalently cut out with a rectangular window, and the pitch of the speech signal within each block is extracted. These blocks (256 samples) are advanced along the time axis at a frame interval of L samples (for example L = 160), so consecutive blocks overlap by N − L samples (for example 96 samples). Windowing unit 104 applies a predetermined window function, for example a Hamming window, to each block of N samples, and the windowed block is likewise advanced along the time axis at intervals of one frame of L samples. Orthogonal transform unit 105 applies an orthogonal transform such as an FFT (fast Fourier transform) to the windowed output data.
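The block and frame arithmetic (N = 256, L = 160, overlap N − L = 96) can be illustrated with a short sketch. The Hamming coefficients 0.54 and 0.46 are the standard definition; the function names are hypothetical, and blocks that would run past the end of the input are simply dropped here.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_blocks(samples, block_len=256, frame_step=160):
    """Yield Hamming-windowed blocks that advance by frame_step samples."""
    window = hamming(block_len)
    for start in range(0, len(samples) - block_len + 1, frame_step):
        block = samples[start:start + block_len]
        yield [w * s for w, s in zip(window, block)]
```

With 576 input samples this yields three blocks starting at samples 0, 160, and 320, each sharing 96 samples with its neighbor.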

[0034] Pitch extraction unit 103 determines the peak period using, for example, the autocorrelation of the center-clipped waveform. Several peaks are first found from the autocorrelation data belonging to the current frame (the autocorrelation is computed over one block of N samples). If the largest of these peaks is at or above a predetermined threshold, its position is taken as the pitch period. Otherwise, a peak is sought within a pitch range that satisfies a predetermined relationship to the pitch found in a frame other than the current one, for example the preceding or following frame (for example within ±20% of the previous frame's pitch), and the pitch of the current frame is determined from that peak position. Pitch extraction unit 103 thus performs a relatively rough open-loop pitch search, and the extracted pitch data is sent to high-precision (fine) pitch search unit 106, where a closed-loop high-precision (fine) pitch search is performed.
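A rough open-loop search of this kind can be sketched as below. It is a simplification under stated assumptions: the clipping ratio, the lag range of 20 to 147 samples, and the peak threshold are invented for illustration, and the strongest lag is taken directly instead of first collecting several candidate peaks. Only the ±20% fallback range comes from the text.

```python
def center_clip(samples, ratio=0.3):
    """Zero everything within +/- ratio * peak and shift the rest toward zero."""
    limit = ratio * max(abs(s) for s in samples)
    clipped = []
    for s in samples:
        if s > limit:
            clipped.append(s - limit)
        elif s < -limit:
            clipped.append(s + limit)
        else:
            clipped.append(0.0)
    return clipped

def rough_pitch(samples, min_lag=20, max_lag=147, threshold=0.3, previous=None):
    """Return an integer pitch lag from the autocorrelation of the clipped block."""
    x = center_clip(samples)
    energy = sum(v * v for v in x) or 1.0
    corr = {lag: sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
            for lag in range(min_lag, max_lag + 1)}
    best = max(corr, key=corr.get)
    if corr[best] >= threshold or previous is None:
        return best
    # Weak peak: search only within +/-20% of the previous frame's pitch.
    near = [lag for lag in corr if 0.8 * previous <= lag <= 1.2 * previous]
    return max(near, key=corr.get) if near else best
```

The integer lag returned here corresponds to the rough pitch that is handed on to the fine search.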

[0035] High-precision (fine) pitch search unit 106 is supplied with the rough integer-valued pitch data extracted by pitch extraction unit 103 and with the frequency-axis data produced by orthogonal transform unit 105, for example by FFT. Unit 106 varies the pitch by a few samples above and below the rough pitch value in steps of 0.2 to 0.5 and converges on the optimum fractional (floating-point) fine pitch value. The fine search uses the so-called analysis-by-synthesis method, choosing the pitch so that the synthesized power spectrum is closest to the power spectrum of the original sound.

[0036] The optimum pitch and amplitude |A_m| data from high-precision pitch search unit 106 is sent to voiced/unvoiced discrimination unit 107, which makes a voiced/unvoiced decision for each band. The NSR (noise-to-signal ratio) is used for this decision. That is, when the NSR of a band is greater than a predetermined threshold (for example 0.3), meaning the error is large, the band is judged UV (unvoiced). Otherwise the approximation is judged to be reasonably good, and the band is judged V (voiced).
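The per-band decision can be sketched as follows. The text does not define the NSR, so this sketch assumes it is the squared error between the original band spectrum and its synthesized approximation, divided by the energy of the original; the function name and argument layout are hypothetical.

```python
def band_is_voiced(original, synthetic, threshold=0.3):
    """Judge one band voiced when the noise-to-signal ratio is at or below threshold.

    original and synthetic hold the spectral magnitudes of the band's bins.
    """
    signal = sum(s * s for s in original)
    noise = sum((s - m) ** 2 for s, m in zip(original, synthetic))
    nsr = noise / signal if signal > 0.0 else float("inf")
    return nsr <= threshold
```

A band whose harmonic model matches the input closely has a small NSR and is marked V; a band the model cannot explain is marked UV.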

[0037] Amplitude re-evaluation unit 108 is supplied with the frequency-axis data from orthogonal transform unit 105, the fine pitch and the evaluated amplitude |A_m| from high-precision pitch search unit 106, and the V/UV (voiced/unvoiced) discrimination data from voiced/unvoiced discrimination unit 107. For the bands judged unvoiced (UV) by voiced/unvoiced discrimination unit 107, amplitude re-evaluation unit 108 obtains the amplitude |A_m|_UV afresh.

[0038] The data from amplitude re-evaluation unit 108 is sent to data number conversion unit 109, which performs a kind of sampling rate conversion. Because the number of bands into which the frequency axis is divided depends on the pitch, the number of data items (in particular the number of amplitude values) also varies, and data number conversion unit 109 makes that number constant. For example, with an effective band up to 3400 Hz, the band is divided into 8 to 63 bands depending on the pitch, so the number of amplitude values |A_m| (including the UV-band amplitudes |A_m|_UV) obtained for these bands also varies from 8 to 63. Data number conversion unit 109 therefore converts this variable number of amplitude values into a fixed number N_C (for example 44).

[0039] In this embodiment, dummy data that interpolates from the last value in the block to the first value in the block is appended to the amplitude data for one block of the effective band on the frequency axis, extending the number of data items to N_F. Band-limited oversampling by a factor of K_OS (for example 8) then yields K_OS times as many amplitude values. These ((m_MX + 1) × K_OS) amplitude values are linearly interpolated to expand them to a still larger number N_M (for example 2048), and the N_M values are decimated to give the fixed number N_C (for example 44) of data items.
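The goal of the conversion, mapping 8 to 63 amplitude values onto a fixed 44, can be illustrated with a much simpler stand-in. The sketch below resamples by direct linear interpolation and deliberately omits the dummy data and the band-limited K_OS oversampling described above, so it shows the change in count only, not the embodiment's actual procedure. The function name is hypothetical.

```python
def to_fixed_count(amplitudes, target=44):
    """Resample a variable-length amplitude list to `target` values."""
    if len(amplitudes) == 1:
        return [amplitudes[0]] * target
    step = (len(amplitudes) - 1) / (target - 1)
    out = []
    for i in range(target):
        pos = i * step
        lo = min(int(pos), len(amplitudes) - 2)
        frac = pos - lo
        out.append((1.0 - frac) * amplitudes[lo] + frac * amplitudes[lo + 1])
    return out
```

Whatever the pitch, the vector quantizer downstream then always receives vectors of the same dimension.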

[0040] The data from the data number conversion unit 109 (the fixed number N_C of amplitude data) are sent to the vector quantization unit 110, where they are grouped into vectors of a predetermined number of data each and vector-quantized. The quantized output data from the vector quantization unit 110 are taken out via the output terminal 111. The high-precision (fine) pitch data from the high-precision pitch search unit 106 are encoded by the pitch encoding unit 115 and taken out via the output terminal 112. The voiced/unvoiced (V/UV) discrimination data from the voiced/unvoiced discrimination unit 107 are taken out via the output terminal 113. The data from these output terminals 111 to 113 are converted into signals of a predetermined transmission format and transmitted.
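The grouping and quantization step can be illustrated with a minimal nearest-codeword search. This is a sketch under stated assumptions: the text does not give the vector dimension, the codebook, or the distance measure, so `dim`, `codebook`, and the squared Euclidean distance used here are hypothetical choices.

```python
def nearest_codeword(vector, codebook):
    """Return the index of the codeword closest to vector (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def vector_quantize(data, dim, codebook):
    """Group data into vectors of length dim and replace each by a codebook index."""
    if len(data) % dim != 0:
        raise ValueError("data length must be a multiple of the vector dimension")
    return [nearest_codeword(data[i:i + dim], codebook)
            for i in range(0, len(data), dim)]
```

With a fixed N_C of 44 and a hypothetical dimension of 4, for example, each frame would yield 11 indices; only those indices, not the amplitudes themselves, would then need to be transmitted.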

[0041] Each of these data is obtained by processing the data in a block of the N samples (for example, 256 samples). Because the block advances along the time axis in units of the L-sample frame, the data to be transmitted are obtained in units of frames. That is, the pitch data, the V/UV discrimination data, and the amplitude data are updated at the frame period.
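The relationship between blocks and frames can be sketched as a sliding window. The block length of 256 samples comes from the text; the frame length of 160 samples is an assumption made only for this illustration, since the value of L is not stated in this passage, and `iter_blocks` is a hypothetical name.

```python
def iter_blocks(samples, block_len=256, frame_len=160):
    """Yield (start, block) pairs; each block holds block_len samples and
    successive blocks start frame_len samples apart, so they overlap
    whenever frame_len is smaller than block_len."""
    start = 0
    while start + block_len <= len(samples):
        yield start, samples[start:start + block_len]
        start += frame_len
```

One set of pitch, V/UV, and amplitude data would be produced per yielded block, so the parameters are refreshed once every `frame_len` samples.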

[0042] Although the configuration of the speech analysis side (encoding side) in FIG. 6 and that of the speech synthesis side (decoding side) in FIG. 1 are described in terms of hardware, they can also be realized by a software program using a so-called DSP (digital signal processor) or the like.

【００４３】[0043]

[Effects of the Invention] According to the speech signal processing device of the present invention, in a speech signal processing device used in a speech synthesis system centered on frequency-domain processing, the high-frequency formants of the frequency spectrum are emphasized by directly operating on frequency-domain parameters. Consequently, only the portion to be emphasized can be emphasized accurately with a simple configuration and simple operation, and the clarity of the synthesized speech can be improved without impairing its naturalness. Furthermore, the calculation of filter pole positions, which is indispensable when processing in the time domain with a time-axis high-frequency emphasis filter (for example, an IIR filter), becomes unnecessary, so real-time processing can be performed easily and adverse effects caused by filter instability are avoided entirely.
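To make the idea of operating directly on frequency-domain parameters concrete, the following sketch boosts band amplitudes that rise above a moving-average envelope, and only in the upper bands. It is an assumption-laden illustration rather than the emphasis rule of the embodiment: the window width, the gain, the starting band, and the linear boost are hypothetical and stand in for the emphasis function shown in FIG. 5.

```python
def moving_average(values, half_width):
    """Smooth values with a centered window, shortened at the edges."""
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def emphasize_high_formants(amps_db, start_band, half_width=2, gain=0.5):
    """Boost bands at or above start_band whose amplitude exceeds the envelope.

    amps_db: per-band amplitudes (assumed here to be in dB).
    start_band, half_width, gain: hypothetical tuning parameters.
    """
    envelope = moving_average(amps_db, half_width)
    out = list(amps_db)
    for i in range(start_band, len(amps_db)):
        excess = amps_db[i] - envelope[i]
        if excess > 0:  # a spectral peak relative to its neighbourhood
            out[i] = amps_db[i] + gain * excess
    return out
```

Because the operation is a per-band gain on amplitude values, no filter coefficients or pole positions are involved, which is the point this paragraph makes about stability and real-time processing.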

[Brief description of drawings]

FIG. 1 is a functional block diagram showing a schematic configuration of the speech decoding apparatus on the synthesis side (decoding side) of a speech synthesis-analysis coding apparatus, as a specific example of an apparatus to which an embodiment of the speech signal processing device according to the present invention can be applied.

FIG. 2 is a flowchart for explaining the operation of the above embodiment.

FIG. 3 is a diagram showing the amplitude value of each band, which constitutes the spectrum data on the frequency axis.

FIG. 4 is a diagram showing spectrum data on the frequency axis and the spectral envelope obtained by taking its moving average.

FIG. 5 is a diagram showing the function that defines how emphasis is applied during formant emphasis.

FIG. 6 is a functional block diagram showing a schematic configuration of the analysis side (encoding side) of the speech synthesis-analysis coding apparatus that sends signals to the speech decoding apparatus to which the above embodiment of the speech signal processing device according to the present invention is applied.

[Explanation of symbols]

11: Quantized amplitude data input terminal
12: Encoded pitch data input terminal
13: V/UV discrimination data input terminal
16: High-frequency formant emphasis unit

Claims

[Claims]

1. A speech signal processing device for use in a speech synthesis system that performs processing mainly in the frequency domain, wherein high-frequency formants of the frequency spectrum are emphasized by directly operating on parameters in the frequency domain.

2. The speech signal processing device according to claim 1, wherein the entire high-frequency band of the frequency spectrum is emphasized by directly operating on parameters in the frequency domain.

3. The speech signal processing device according to claim 1, wherein the speech synthesis system is a speech synthesis system of a speech decoding device for multi-band excitation coding.