JPH06308997A

JPH06308997A - Voice synthesizing method

Info

Publication number: JPH06308997A
Application number: JP5094359A
Authority: JP
Inventors: Hideyuki Mizuno; 秀之水野; Masanobu Abe; 匡伸阿部; Tomohisa Hirokawa; 智久広川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1993-04-21
Filing date: 1993-04-21
Publication date: 1994-11-04
Anticipated expiration: 2017-08-26
Also published as: JP3317458B2

Abstract

PURPOSE: To reduce the quality deterioration caused by changing the pitch frequency when speech is synthesized by editing speech waveforms. CONSTITUTION: Based on an input phoneme and an input pitch frequency, a speech waveform is selected from a waveform dictionary (S1), and a target formant frequency is determined from the pitch frequency of the selected waveform and the input pitch frequency by referring to a change table (S2). The pitch frequency of the selected waveform is changed in accordance with the input pitch frequency (S3). The formant frequency of the speech waveform obtained by changing the pitch frequency is then changed to the target formant frequency (S4).

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention is applied, for example, to converting arbitrary text into speech. It relates to a speech synthesis method in which a speech waveform is selected according to an input phoneme and an input pitch frequency, the pitch frequency of the selected speech waveform is controlled according to the input pitch frequency, the selected speech waveform is given a length according to an input phoneme duration and a power according to an input magnitude, and the waveforms are sequentially superimposed to obtain synthesized speech.

[0002]

[Prior Art] The conventional speech synthesis technique in general use so far synthesizes speech from spectral parameters obtained by LPC analysis or the like and a driving sound source signal such as a pulse. This method models speech and represents it with a finite number of parameters, so speech can be analyzed and synthesized easily by following a fixed procedure. There is also a method that selects an appropriate waveform from a waveform dictionary and performs pitch control using a waveform superposition method to generate synthesized speech (Japanese Patent Laid-Open No. 01-284898). Because this method uses natural speech waveforms as they are, the speech quality is good and speech equivalent to natural speech can be obtained.

[0003]

[Problems to Be Solved by the Invention] The former method, which uses spectral parameters and a driving sound source signal, has problems such as poor speech quality. The latter, editing-based synthesis does not take into account the correlation between the pitch frequency and the spectral structure, so the pitch frequency changing process degrades quality in part of the speech.

[0004] An object of the present invention is to provide a speech synthesis method that generates synthesized speech by performing pitch control with a waveform superposition method and that suffers little quality deterioration from the pitch frequency changing process.

[0005]

[Means for Solving the Problems] According to the present invention, in a method of generating synthesized speech by performing pitch control with a waveform superposition method, a target formant frequency is determined according to the input pitch frequency and the pitch frequency of the selected waveform, and the formant frequency of the pitch-controlled speech waveform is converted according to the target formant frequency.

[0006]

[Embodiment] FIG. 1 shows a flow chart of an embodiment of the method of the present invention. In the present invention, a phoneme (input phoneme) and a pitch frequency (input pitch frequency) are input. When text is converted into speech, for example, the text is analyzed into a phoneme sequence, and the following are then set: a pitch pattern that takes into account the pitch frequency of each phoneme and the continuity of the pitch frequency between phonemes, a phoneme duration for each phoneme, and a power pattern that takes into account the power (magnitude) of each phoneme and its continuity between phonemes.

[0007] The essential part of the present invention mainly concerns each input phoneme of the phoneme sequence and the input pitch frequency for each of those phonemes. First, in the waveform selection step S1, a waveform is selected from the waveform dictionary using a suitable evaluation function, based on the input phoneme and the input pitch frequency. For the evaluation function and the selection procedure, a waveform selection method based on an evaluation function can be used, for example (reference: Hirokawa et al., "Waveform selection method in waveform editing type rule-based speech synthesis", Proc. Acoustical Society of Japan meeting, 1-2-21 (1988-10)). In this method, the target phoneme type, the pitch within the phoneme, the duration, the phoneme match rate, and so on are used as parameters, and the optimum waveform, that is, the waveform giving the smallest W, is selected from the waveform dictionary based on an evaluation function W such as equation (1).
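The selection in step S1 can be sketched as follows. This is a minimal illustration, assuming that equation (1) is a weighted sum of absolute differences over purely numeric parameters and that the candidates have already been restricted to the requested phoneme type. The parameter names, weights, and dictionary entries are hypothetical and do not come from the patent or the cited paper.

```python
from typing import Dict, List


def evaluation_w(target: Dict[str, float],
                 candidate: Dict[str, float],
                 weights: Dict[str, float]) -> float:
    """Weighted sum of absolute differences between target and candidate values."""
    return sum(w * abs(target[name] - candidate[name]) for name, w in weights.items())


def select_waveform(target: Dict[str, float],
                    dictionary: List[dict],
                    weights: Dict[str, float]) -> dict:
    """Return the dictionary entry whose parameters give the smallest W."""
    return min(dictionary, key=lambda entry: evaluation_w(target, entry["params"], weights))


# Hypothetical weights, target values and dictionary entries (illustration only).
weights = {"pitch_hz": 1.0, "duration_ms": 0.5}
target = {"pitch_hz": 120.0, "duration_ms": 80.0}
dictionary = [
    {"id": "a_001", "params": {"pitch_hz": 100.0, "duration_ms": 90.0}},
    {"id": "a_002", "params": {"pitch_hz": 125.0, "duration_ms": 70.0}},
    {"id": "a_003", "params": {"pitch_hz": 150.0, "duration_ms": 80.0}},
]
best = select_waveform(target, dictionary, weights)
```

With these made-up numbers the three candidates score 25, 10, and 30, so the second entry is selected.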

[0008]

$$W = \sum_{i=1}^{n} \omega_i \left| P_{i,t} - P_{i,s} \right| \qquad (1)$$

Here the sum runs from i = 1 to n, \omega_i is the weight coefficient of the i-th parameter, P_{i,t} is the target value (input value) of the i-th parameter, and P_{i,s} is the actual value of the i-th parameter (the value for the waveform in the waveform dictionary). Next, in the formant frequency setting step S2, the formant frequency is determined from the input pitch frequency and the pitch frequency of the selected waveform, using a table such as that shown in FIG. 2. For example, when the pitch frequency of the selected waveform is P1 (P_{i-1} \le P1 < P_i) and it is to be changed to the input pitch frequency P2 (P_{j-1} \le P2 < P_j), the first formant frequency F1_t is set from FIG. 2 as

$$F1_t = F1_0 + f1_{ij} \ \text{[Hz]} \qquad (2)$$

where F1_0 is the first formant frequency of the selected waveform. Such a table is prepared for, for example, the first through third formants, which are considered to be strongly affected by pitch, and the target formant frequency is determined for each formant by looking up the table with the pitch frequency of the selected waveform and the input pitch frequency.
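The table lookup of step S2 and equation (2) can be sketched as below. This is only an illustration: the pitch bin edges and the shift values f1_ij are placeholders and are not taken from FIG. 2, and the names `PITCH_EDGES_HZ`, `F1_SHIFT_HZ`, `pitch_bin`, and `target_formant` are hypothetical. Bins are 0-based here, whereas the description indexes them from 1.

```python
from bisect import bisect_right

# Hypothetical pitch bin edges P_0 ... P_4 in Hz (not from FIG. 2).
PITCH_EDGES_HZ = [80.0, 120.0, 160.0, 200.0, 240.0]

# Hypothetical shifts f1_ij in Hz: row = bin of the selected waveform's pitch,
# column = bin of the input pitch. Placeholder values only.
F1_SHIFT_HZ = [
    [0.0, 10.0, 20.0, 30.0],
    [-10.0, 0.0, 10.0, 20.0],
    [-20.0, -10.0, 0.0, 10.0],
    [-30.0, -20.0, -10.0, 0.0],
]


def pitch_bin(pitch_hz: float) -> int:
    """Return the 0-based bin k with PITCH_EDGES_HZ[k] <= pitch_hz < PITCH_EDGES_HZ[k + 1]."""
    if not PITCH_EDGES_HZ[0] <= pitch_hz < PITCH_EDGES_HZ[-1]:
        raise ValueError("pitch frequency outside the table range")
    return bisect_right(PITCH_EDGES_HZ, pitch_hz) - 1


def target_formant(formant_hz: float, source_pitch_hz: float,
                   input_pitch_hz: float, shift_table) -> float:
    """Equation (2): formant of the selected waveform plus the table entry f_ij."""
    i = pitch_bin(source_pitch_hz)
    j = pitch_bin(input_pitch_hz)
    return formant_hz + shift_table[i][j]


f1_target = target_formant(500.0, 100.0, 170.0, F1_SHIFT_HZ)
```

A separate shift table of the same shape would be held for each formant that is to be corrected, for example the first through third.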

[0009] In the pitch control step S3, the pitch frequency of the selected waveform is changed according to the input pitch frequency. As the pitch frequency changing method, a waveform superposition method can be used, for example (reference: Hirokawa et al., "Examination of pitch control method in waveform editing type rule-based synthesis", Proc. Acoustical Society of Japan meeting, 1-4-7 (1990-3)). In this method, as shown in FIG. 3, with the input pitch period denoted T, a window 11 that has a length of 2T and tapers gradually from its center toward the front and the rear is used to cut out the selected waveform 12 one pitch at a time. That is, the waveform is cut out with the window 11 at each pitch period T centered on a large peak of the waveform 12, and the cut-out waveforms are superimposed at the input pitch period T to obtain the waveform 13 having the input pitch period.
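The windowing and superposition of step S3 can be sketched in plain Python as follows. The description only requires a window of length 2T that tapers toward both ends; the Hann shape used here is an assumption, as are the function names and the premise that the peak positions (pitch marks) of the selected waveform are already known.

```python
import math
from typing import List


def tapered_window(length: int) -> List[float]:
    """Periodic Hann window (assumed shape): 0 at the start, 1 at the center."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def overlap_add_pitch_change(signal: List[float],
                             peak_positions: List[int],
                             target_period: int) -> List[float]:
    """Cut a 2T-long windowed segment around each peak and re-space the segments at T."""
    window = tapered_window(2 * target_period)
    output = [0.0] * (target_period * (len(peak_positions) + 1))
    for k, peak in enumerate(peak_positions):
        start_in = peak - target_period   # segment is centered on the peak
        start_out = k * target_period     # consecutive segments are T samples apart
        for n in range(2 * target_period):
            idx = start_in + n
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            output[start_out + n] += window[n] * sample
    return output


# Toy input: unit impulses every 10 samples, re-spaced to a period of 8 samples.
source = [0.0] * 50
marks = [10, 20, 30, 40]
for m in marks:
    source[m] = 1.0
resynth = overlap_add_pitch_change(source, marks, target_period=8)
```

In the toy input each segment contains exactly one impulse, so the output carries impulses 8 samples apart instead of 10, which corresponds to a higher pitch frequency.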

[0010] Next, in the formant conversion step S4, the pitch-controlled waveform obtained in step S3 is used to generate a waveform whose formant frequency has been changed to the target formant frequency determined in the formant frequency setting step S2. Here the formants are changed using, for example, the "formant control method" (Japanese Patent Application No. 4-261825). That is, as shown in FIG. 4, the spectral envelope (spectral density function) P(ω) of the selected waveform whose pitch was controlled in the pitch control step S3 is extracted (S₁), the speech waveform is converted into a spectrum X(ω) by a fast Fourier transform (FFT) (S₂), and the formants are extracted (S₃). The frequencies of the extracted formants are changed to the target frequencies, that is, to the target formant frequencies determined in the formant frequency setting step S2 (S₄). For the pitch-controlled selected speech waveform, the spectral envelope corresponding to the speech waveform whose formant frequencies have been changed to the target frequencies is obtained (S₅). The distortion D between the spectral density P'(2πΔT·F'_i) at the converted formant frequency F'_i and the desired spectral density Pt(2πΔT·F'_i) is obtained (S₆). If the distortion D is not sufficiently small (S₇), the bandwidth of the formant being changed is modified and the process returns to step S₅ (S₈). When the distortion D obtained in step S₆ is judged to be sufficiently small, the spectrum X'(ω) corresponding to the spectral envelope P'(ω) finally obtained in step S₅ is computed from P(ω) and X(ω) (S₉), and the spectral distortion d between X'(ω) and X(ω) is obtained (S₁₀). If the distortion d is not sufficiently small (S₁₁), X'(ω) is taken as X(ω), the spectral envelope P"(ω) of X'(ω) is taken as P'(ω), and the process returns to step S₉ (S₁₂). This is repeated, and once the distortion d has become sufficiently small, the X'(ω) finally obtained in step S₉ is converted into a speech waveform by an inverse FFT (S₁₃).
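The inner loop of the formant conversion (steps S₅ to S₈) can be sketched as a generic control flow. This is not the procedure of Japanese Patent Application No. 4-261825, whose details are not given here. It assumes that the distortion D is an absolute difference, that the spectral density at the formant decreases as the bandwidth widens, and that a proportional bandwidth update is acceptable. `fit_bandwidth`, `density_at_formant`, and the toy model are hypothetical.

```python
from typing import Callable


def fit_bandwidth(density_at_formant: Callable[[float], float],
                  desired_density: float,
                  bandwidth_hz: float,
                  tolerance: float = 1e-3,
                  max_iters: int = 100) -> float:
    """Adjust one formant's bandwidth until the envelope matches the desired density."""
    for _ in range(max_iters):
        density = density_at_formant(bandwidth_hz)      # S5: envelope value at F'_i
        distortion = abs(density - desired_density)     # S6: distortion D (assumed form)
        if distortion < tolerance:                      # S7: small enough, stop
            break
        # S8: change the bandwidth. Assumed rule: a narrower bandwidth raises the peak,
        # so scale the bandwidth by the ratio of current to desired density.
        bandwidth_hz *= density / desired_density
    return bandwidth_hz


def toy_density(bandwidth_hz: float) -> float:
    """Toy envelope model (assumption): peak density inversely proportional to bandwidth."""
    return 100.0 / bandwidth_hz


fitted = fit_bandwidth(toy_density, desired_density=2.0, bandwidth_hz=80.0)
```

With the toy model the first pass gives a density of 1.25, the bandwidth is scaled from 80 Hz to 50 Hz, and the second pass meets the desired density of 2.0. The outer loop over steps S₉ to S₁₂ would wrap a comparable convergence test on the spectral distortion d.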

[0011] In this way, the pitch frequency of the selected waveform is changed to the input pitch frequency, and the formant frequency of the pitch-changed waveform is changed to the target formant frequency to obtain a speech waveform. The length and power of each phoneme of this speech waveform are then controlled according to the phoneme duration set for the input text and the power pattern, which takes into account the power of each phoneme and its continuity between phonemes, and the waveforms are sequentially superimposed to obtain synthesized speech.
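The power control mentioned above can be illustrated with a simple gain adjustment. The description does not state how power is measured or applied, so this sketch assumes that the target power is expressed as an RMS amplitude over the phoneme segment; `rms` and `scale_to_power` are hypothetical names.

```python
import math
from typing import List


def rms(samples: List[float]) -> float:
    """Root-mean-square amplitude of a segment."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def scale_to_power(samples: List[float], target_rms: float) -> List[float]:
    """Apply one gain to the whole segment so that its RMS equals target_rms."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # a silent segment cannot be scaled to a non-zero power
    gain = target_rms / current
    return [gain * x for x in samples]


segment = [1.0, -1.0, 1.0, -1.0]
scaled = scale_to_power(segment, target_rms=0.5)
```

A power pattern that varies within a phoneme would need a time-varying gain instead of the single gain used in this sketch.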

[0012] The present invention is not limited to converting text into speech. It can be applied whenever speech is synthesized by selecting a speech waveform according to an input phoneme and an input pitch frequency.

[0013]

[Effects of the Invention] As described above, according to the present invention, the correlation between the pitch frequency and the spectral structure of natural speech is taken into account: the pitch frequency of the speech waveform is set to the target frequency, and the formant frequency, that is, the spectrum, is changed according to both pitch frequencies. Speech whose pitch frequency has been changed is therefore obtained without impairing speech quality, and synthesized speech of a quality close to that of natural speech can be obtained.

[Brief description of drawings]

FIG. 1 is a flow chart showing an embodiment of the present invention.

FIG. 2 is a diagram showing a formant frequency change table.

FIG. 3 is a waveform diagram showing an outline of the pitch frequency changing process.

FIG. 4 is a flow chart showing an example of the formant conversion process.

Claims

1. A speech synthesis method in which a speech waveform is selected according to an input phoneme and an input pitch frequency, the pitch frequency of the selected speech waveform is controlled according to the input pitch frequency, the selected speech waveform is given a length according to an input phoneme duration and a power according to an input magnitude, and the waveforms are sequentially superimposed to obtain synthesized speech, the method comprising: a step of determining a target formant frequency according to the input pitch frequency and the pitch frequency of the selected waveform; and a formant conversion step of converting the formant frequency of the pitch-controlled speech waveform according to the target formant frequency.