JPS60260098A

JPS60260098A - Drive signal generation method for speech synthesis

Info

Publication number: JPS60260098A
Application number: JP59115927A
Authority: JP
Inventors: 新居　康彦; 古屋　正久; 利光蓑輪
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1984-06-06
Filing date: 1984-06-06
Publication date: 1985-12-23

Abstract

(57) [Abstract] This publication contains application data that predates electronic filing, so no abstract data is recorded.

Description

Detailed Description of the Invention

Field of Industrial Application

The present invention relates to a method of generating a drive signal for speech synthesis, used to obtain high-quality synthesized speech in speech analysis-synthesis.

Configuration of the Conventional Example and Its Problems

In conventional speech analysis-synthesis methods, as shown in Fig. 1(a) and (b), a fixed-length window function, for example a 30 ms Hamming window, is applied to the discrete speech signal to cut out a finite number of samples. From these samples, spectral parameters (linear prediction coefficients, PARCOR coefficients, line spectrum pairs, etc.) that express the spectral information of the speech and source parameters (amplitude, pitch period, and voiced/unvoiced decision) that express the source information are extracted separately, and the original speech signal is reconstructed from the extracted parameters.
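The framing step described above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, and the 10 kHz sampling rate and 10 ms frame shift are borrowed from the evaluation conditions reported later in this description.

```python
import math

def hamming_window(length):
    """Standard Hamming window: 0.54 - 0.46 * cos(2*pi*i / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def windowed_frames(signal, window_length, frame_shift):
    """Cut overlapping frames from `signal` and apply a Hamming window to each."""
    window = hamming_window(window_length)
    frames = []
    for start in range(0, len(signal) - window_length + 1, frame_shift):
        segment = signal[start:start + window_length]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames

# Assumed 10 kHz sampling: a 30 ms window is 300 samples, a 10 ms shift is 100.
SAMPLE_RATE_HZ = 10_000
WINDOW_LENGTH = 30 * SAMPLE_RATE_HZ // 1000
FRAME_SHIFT = 10 * SAMPLE_RATE_HZ // 1000
```

Each returned frame would then be passed to the spectral and source parameter analysis.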

The spectral parameters define the transfer characteristics of the vocal tract filter, and the source parameters define the drive signal of the vocal tract filter.

A speech signal contains periodic voiced portions and noise-like unvoiced portions. The voiced/unvoiced decision parameter is used to switch the excitation function (drive waveform) of the vocal tract filter between voiced and unvoiced sounds.

Normally, a pulse waveform or triangular waveform is used as the excitation function when synthesizing voiced sounds, and random pulses are used when synthesizing unvoiced sounds.
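A minimal sketch of this conventional excitation scheme, driving an all-pole vocal tract filter, might look like the following. The function names, the random-sign form of the unvoiced pulses, and the sign convention A(z) = 1 + sum of a_k z^-k are assumptions made for illustration; the text does not specify them.

```python
import random

def conventional_excitation(num_samples, voiced, pitch_period, amplitude, seed=0):
    """Pulse train for voiced frames, random-sign pulses for unvoiced frames."""
    if voiced:
        return [amplitude if n % pitch_period == 0 else 0.0
                for n in range(num_samples)]
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [amplitude * rng.choice((-1.0, 1.0)) for _ in range(num_samples)]

def vocal_tract_filter(excitation, lpc):
    """All-pole filter 1/A(z), assuming A(z) = 1 + sum_k lpc[k-1] * z^-k."""
    output = []
    for n, e in enumerate(excitation):
        feedback = sum(a * output[n - k]
                       for k, a in enumerate(lpc, start=1) if n - k >= 0)
        output.append(e - feedback)
    return output
```

The voiced/unvoiced flag selects the branch, which corresponds to the switching role of the voiced/unvoiced decision parameter.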

The spectral parameters are determined so that the spectrum of the residual signal, obtained by passing the speech signal through the vocal tract inverse filter, is whitened. As source parameters, the amplitude is extracted from the residual signal by an energy calculation, and the presence or absence of periodicity (the voiced/unvoiced decision) and the pitch period are extracted by the autocorrelation method. To synthesize speech, it therefore suffices to generate from the source parameters a drive signal corresponding to the residual signal obtained during analysis, and to feed it to the vocal tract filter. The usual way to produce the drive signal for voiced sounds is to use a pulse waveform with a flat spectral distribution and to control its repetition period and amplitude. The reason is that, because the spectral parameters are extracted so as to whiten the spectrum of the residual signal, driving the filter with a white-spectrum signal is ideal at synthesis time as well.
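The source parameter extraction described above can be illustrated with a small sketch. It assumes a residual frame is already available; the function name, the lag search range, and the voicing threshold of 0.3 are hypothetical choices, not values from the patent.

```python
import math

def source_parameters(residual, min_lag, max_lag, voicing_threshold=0.3):
    """Return (amplitude, voiced, pitch_period) for one residual frame.

    Amplitude comes from the frame energy; voicing and pitch come from the
    peak of the energy-normalized autocorrelation within [min_lag, max_lag].
    """
    energy = sum(v * v for v in residual)
    amplitude = math.sqrt(energy / len(residual))  # RMS amplitude
    if energy == 0.0:
        return amplitude, False, 0

    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(residual[t] * residual[t + lag]
                   for t in range(len(residual) - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr

    voiced = best_corr >= voicing_threshold  # threshold is an assumption
    return amplitude, voiced, (best_lag if voiced else 0)
```

For a residual that is an ideal impulse train with a period of 8 samples, the autocorrelation peaks at lag 8, so the frame is judged voiced with a pitch period of 8.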

In actual speech analysis, however, the inverse filter has only about 8 to 10 stages, and the inverse filter model does not necessarily match the production model of the speech signal, so the spectrum of the residual signal is not necessarily whitened ideally. The residual signal therefore contains spectral information that the spectral parameters cannot express, and replacing this residual signal with a repeated pulse or triangular wave is one cause of degraded synthesized speech quality. One conventional way of exploiting the spectral information contained in the residual signal has been to obtain the long-term average power spectrum of the residual signal produced by inverse filtering the speech signal, and to use as the drive waveform the symmetric time waveform obtained by an inverse Fourier transform with the phase term set to zero. For non-musical signals such as speech, however, phase information is known to contribute to the timbre of the transition from unvoiced to voiced portions. The conventional method, which uses only the power spectrum information of the residual signal, therefore reproduces the timbre of the synthesized sound poorly; for example, the synthesized speech sounds nasal.
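The zero-phase construction criticized here can be sketched as below, using a plain O(N²) inverse DFT in place of an FFT to keep the example dependency-free. The function names are hypothetical, and the input is assumed to be a full-length power spectrum that is symmetric (P[k] equals P[N-k]), as it is for a real signal.

```python
import cmath
import math

def inverse_dft(spectrum):
    """Naive inverse DFT; adequate for a short illustrative frame."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

def zero_phase_waveform(power_spectrum):
    """Inverse-transform sqrt(power) with every phase term set to zero."""
    amplitudes = [math.sqrt(p) for p in power_spectrum]
    return [value.real for value in inverse_dft(amplitudes)]
```

The result is even-symmetric (sample t equals sample N-t) with its peak at t = 0, which after a circular shift is the center of the waveform. All phase information of the original residual has been discarded.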

Object of the Invention

The present invention eliminates the problems of the conventional example described above. Its object is to provide a method of generating a drive signal for speech synthesis that improves the reproducibility of the timbre of synthesized speech.

Configuration of the Invention

To achieve the above object, the present invention derives a representative residual signal from the prediction residual signal of the voiced portions of a speech signal, and concatenates the autocorrelation function of that representative residual signal at every pitch period to form the drive signal for voiced sections. This improves the reproducibility of the timbre of the synthesized speech.

Description of the Embodiments

The drive signal generation method according to the present invention is described below. First, a representative residual signal is derived from the residual signal. The representative residual signal referred to here is a time waveform one analysis interval long (10 to 20 ms) in which the spectral information of the original residual signal is preserved.

According to the literature (櫛木, 新居 et al., "A Study of Drive Waveforms in a PARCOR-Type Speech Synthesis LSI," Acoustical Society of Japan, Speech Research Committee, S81-40, October 1981), using as the drive signal a residual waveform one pitch period long, cut from the frame in which the residual signal of a vowel has maximum energy, yields better-quality synthesized speech than conventional pulse drive.

Another reference (置屋, 新居 et al., "An LSP Speech Synthesizer Using a General-Purpose DIP," Proceedings of the Acoustical Society of Japan, 1-7-7, October 1982) shows the use, as the drive waveform, of the symmetric waveform obtained by applying an inverse FFT (with the phase term set to 0) to the long-term average power spectrum of the residual signal of the voiced portions.

Based on these examples, the present invention preserves in the representative residual signal either
(1) the power spectrum of the single frame in which the residual signal of a vowel (for example /a/) has maximum energy, or
(2) the long-term average power spectrum of the residual signal of the voiced portions.
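The two power spectrum choices (1) and (2) can be sketched as follows, assuming the residual has already been split into equal-length frames. The helper names are hypothetical, and a naive DFT stands in for an FFT so that the example needs only the standard library.

```python
import cmath
import math

def dft(frame):
    """Naive forward DFT; adequate for short illustrative frames."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def power_spectrum(frame):
    return [abs(c) ** 2 for c in dft(frame)]

def max_energy_frame_spectrum(vowel_frames):
    """Choice (1): power spectrum of the frame with the largest energy."""
    loudest = max(vowel_frames, key=lambda f: sum(v * v for v in f))
    return power_spectrum(loudest)

def long_term_average_spectrum(voiced_frames):
    """Choice (2): power spectrum averaged over all voiced frames."""
    spectra = [power_spectrum(f) for f in voiced_frames]
    return [sum(column) / len(spectra) for column in zip(*spectra)]
```

Either function returns one value per frequency bin, which then serves as the power spectrum preserved in the representative residual signal.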

To obtain a time waveform that preserves the power spectrum of (1) or (2) above (that is, the representative residual signal), it suffices to supply a phase term and apply an inverse FFT.

There are two ways to supply the phase term: supplying zero phase, and supplying the long-term average phase. With the former, the time waveform is symmetric and its energy is concentrated at the center, so it is unsuited to obtaining a representative residual signal. With the latter, computing the short-time phase angle (that is, the phase angle of each analysis frame) has the drawback that the quadrant of the phase angle must be determined from the signs of the real and imaginary parts. The present invention therefore uses either
(1)′ the phase angle of the single frame in which the residual signal of a vowel (for example /a/) has maximum energy, or
(2)′ the cosine and sine values obtained from the average of the real part over all frames and the average of the imaginary part over all frames, after an FFT has been applied to the residual signal of the voiced portions frame by frame.

In case (1)′, the cosine value $\cos\theta_K$ and sine value $\sin\theta_K$ of each frequency component $K$ are expressed as

$$\cos\theta_K = \frac{a_K}{\sqrt{a_K^2 + b_K^2}}, \qquad \sin\theta_K = \frac{b_K}{\sqrt{a_K^2 + b_K^2}} \qquad [1]$$

where $a_K$ and $b_K$ are the real and imaginary parts of the $K$-th component. In case (2)′, the averages over all frames are used in place of $a_K$ and $b_K$, so the cosine and sine values are obtained by replacing $a_K$ and $b_K$ in equation [1] with

$$\bar{a}_K = \frac{1}{M}\sum_{m=1}^{M} a_K^{(m)}, \qquad \bar{b}_K = \frac{1}{M}\sum_{m=1}^{M} b_K^{(m)} \qquad [2]$$

where the sums run over the $M$ analysis frames. From equation [1], the real component $a'_K$ and imaginary component $b'_K$ to be inverse-FFT'd are

$$a'_K = \sqrt{P_K}\,\cos\theta_K, \qquad b'_K = \sqrt{P_K}\,\sin\theta_K \qquad [3]$$

Here $P_K$ denotes either the power spectrum of the single (maximum-energy) frame of the vowel residual or the average power spectrum of the voiced-portion residual.

In the present invention the representative residual signal is derived by applying an inverse FFT to equation [3], so there is no need to determine which quadrant the phase angle lies in, and no need to compute the phase angle itself. This has the advantage that neither inverse trigonometric nor trigonometric function evaluations are required.
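As a hedged illustration of this derivation, the sketch below obtains the cosine and sine values directly from real and imaginary parts by normalization, with no arctangent and no quadrant test, and then builds the representative residual by an inverse transform. All names are hypothetical, the scaling by the square root of the power spectrum is an assumption about how the power and phase are combined, and a naive DFT replaces the FFT.

```python
import cmath
import math

def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def inverse_dft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]

def cos_sin_from_parts(real_parts, imag_parts):
    """(cos, sin) per bin as a/sqrt(a^2+b^2), b/sqrt(a^2+b^2)."""
    pairs = []
    for a, b in zip(real_parts, imag_parts):
        magnitude = math.hypot(a, b)
        # A bin with no energy has no defined phase; zero phase is arbitrary.
        pairs.append((a / magnitude, b / magnitude) if magnitude > 0.0
                     else (1.0, 0.0))
    return pairs

def frame_averaged_parts(frames):
    """Phase option (2)': average real and imaginary parts over all frames."""
    spectra = [dft(frame) for frame in frames]
    count, bins = len(spectra), len(spectra[0])
    real_avg = [sum(s[k].real for s in spectra) / count for k in range(bins)]
    imag_avg = [sum(s[k].imag for s in spectra) / count for k in range(bins)]
    return real_avg, imag_avg

def representative_residual(power, real_parts, imag_parts):
    """Combine a power spectrum with cos/sin values, then inverse-transform."""
    pairs = cos_sin_from_parts(real_parts, imag_parts)
    spectrum = [complex(math.sqrt(p) * c, math.sqrt(p) * s)
                for p, (c, s) in zip(power, pairs)]
    return [value.real for value in inverse_dft(spectrum)]
```

When the power spectrum and the phase both come from the same single frame, this construction reproduces that frame, which gives a convenient sanity check for the sketch.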

Next, the method of deriving the drive signal from the representative residual signal is described.

Suppose that the drive signal $E(n)$, $n = 0, 1, 2, \ldots$, is generated as the output of the linear filter of equation [4], which is excited by an impulse at every pitch period:

$$E(n) = \sum_{i=0}^{I-1} \beta_i\, x_{n-i} \qquad [4]$$

Here $I$ is the number of samples, chosen so that $E(n)$ converges before the next pitch period. For example, with 10 kHz sampling, about $I = 31$ for a female voice and $I = 63$ for a male voice is appropriate. In the present invention, the autocorrelation coefficients of the representative residual signal derived above are used as $\beta_i$ in equation [4]. The drive signal of equation [4] therefore preserves the spectral information of the original residual signal, with the further advantage that the pitch period can be controlled freely.
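Equation [4] amounts to an all-zero (FIR) filter driven by an impulse train. A minimal sketch follows, assuming unnormalized one-sided autocorrelation coefficients and unit-height impulses; the function names are hypothetical and the patent does not state a normalization.

```python
def autocorrelation(signal, num_lags):
    """Unnormalized autocorrelation for lags 0 .. num_lags - 1."""
    n = len(signal)
    return [sum(signal[t] * signal[t + lag] for t in range(n - lag))
            for lag in range(num_lags)]

def drive_signal(beta, pitch_period, num_samples):
    """E(n) = sum_i beta[i] * x[n - i], with x an impulse every pitch period."""
    x = [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    return [sum(b * x[n - i] for i, b in enumerate(beta) if n - i >= 0)
            for n in range(num_samples)]
```

With `beta = autocorrelation(representative, 31)` the filter has the 31 taps suggested for a female voice at 10 kHz. As long as the pitch period is at least as long as `beta`, successive responses do not overlap and the output is simply `beta` repeated at each pitch mark.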

In actual hardware this is very easy to implement: the waveform of equation [4] for $n = 0$ to $(I - 1)$ is stored in memory and read out repeatedly at every pitch period.
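The table-lookup realization can be imitated in software as below. This is a sketch with hypothetical names. Restarting the readout when a pitch period is shorter than the stored table, and padding with zeros when it is longer, are assumptions about behavior the text does not spell out.

```python
def read_out_drive(table, pitch_periods):
    """Read the stored waveform from its start at every pitch mark.

    `table` holds the precomputed waveform for n = 0 .. I - 1, and
    `pitch_periods` lists the length in samples of each successive period.
    """
    output = []
    for period in pitch_periods:
        for n in range(period):
            # Past the end of the table the response has died out (zeros);
            # a short period simply cuts the readout off early.
            output.append(table[n] if n < len(table) else 0.0)
    return output
```

Because only the readout restart time changes, the pitch can vary from period to period without recomputing the stored waveform.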

Fig. 2(a), (b), and (c) show the results of a perceptual evaluation of the drive signal according to the present invention. As described above, four kinds of representative residual can be derived in the present invention, depending on the choice of power spectrum and of phase. In Fig. 2, (A1) uses the power spectrum of (1) with the phase of (1)′, (A2) the power spectrum of (1) with the phase of (2)′, (V1) the power spectrum of (2) with the phase of (1)′, and (V2) the power spectrum of (2) with the phase of (2)′. A weather forecast message of about 6 seconds (male voice) was sampled at 10 kHz, then LSP-analyzed and synthesized with a window length of 30 ms and a frame shift of 10 ms, and the resulting speech was evaluated on a seven-point rating scale. In Fig. 2, (a) is the rating for richness, (b) the rating for ease of listening, and (c) the rating for speaker likeness (whether the synthesized speech resembles the original voice). In every case the rating is one grade higher than with conventional pulse drive, which shows that the effect of the present invention is significant.

Effects of the Invention

The present invention, configured as described above, provides the following effects.

(a) Because the cosine and sine values obtained from the average of the real parts and the average of the imaginary parts are used, the phase angle itself need not be computed, which simplifies the calculation.

(b) Because the drive signal is derived from a representative residual signal that preserves both the power spectrum and the phase information of the original residual signal, it carries more information (twice as much) than a symmetric waveform with the phase term set to 0, and the quality of the synthesized speech is markedly improved.

Fig. 2 is a diagram showing the perceptual evaluation results for speech synthesized using drive signals produced by the drive signal generation method for speech synthesis in one embodiment of the present invention.

Agent: Patent attorney Toshio Nakao and one other

Fig. 1

Fig. 2 (a) (b) (c)

Claims

[Claims]

A drive signal generation method for speech synthesis, characterized in that an inverse Fourier transform is performed using (i) the power spectrum of the frame in which the power of the vowel residual signal is maximum, or the average power spectrum over all frames of the voiced-portion residual signal, and (ii) the phase spectrum (the cosine and sine values of each frequency component) of the frame in which the power of the vowel residual signal is maximum, or the cosine and sine values of each frequency component derived from the averages over all frames of the real and imaginary parts obtained by short-time Fourier transform of the voiced-portion residual signal; and in that the drive signal is generated as the impulse response of an all-zero filter whose coefficients are the autocorrelation coefficients of the resulting time waveform (the representative residual signal).