JP5345967B2 - Speech synthesis apparatus, speech synthesis method, and speech synthesis program

Info

Publication number: JP5345967B2
Application number: JP2010073006A
Authority: JP
Inventors: 信行西澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2010-03-26
Filing date: 2010-03-26
Publication date: 2013-11-20
Anticipated expiration: 2030-03-26
Also published as: JP2011203653A

Description

The present invention relates to a speech synthesizer, a speech synthesis method, and a speech synthesis program that generate a synthesized speech waveform from speech synthesis information configured as a set of phonemes.

A typical application of speech synthesis technology is text-to-speech (TTS) conversion. Hereinafter, a device that takes as input symbols describing the phoneme types and prosodic features obtained from text analysis or similar processing, and generates a speech waveform from them, is called a speech synthesizer. A speech synthesizer is a component of a text-to-speech system.

The symbols input to this speech synthesizer are hereinafter called speech synthesis symbols. Speech synthesis symbols can take various forms; here we consider a notation that describes both the phonological information making up a stretch of speech and the prosodic information expressed mainly as pauses and voice pitch. One example is JEITA (Japan Electronics and Information Technology Industries Association) standard IT-4002, "Symbols for Japanese text-to-speech synthesis" (see Non-Patent Document 1). The speech synthesizer generates the speech waveform corresponding to such symbols. In general, however, a speech waveform is strongly influenced not only by the phoneme being synthesized but also by the types of the neighboring phonemes and by prosodic features, so the correspondence between symbols and waveforms is complicated.

There are various methods by which a speech synthesizer can generate a speech waveform. The main background art uses the short-time spectral features of speech, voiced/unvoiced information, and the fundamental frequency (F0) directly as parameters and generates the waveform from them. A representative method is speech synthesis based on the source-filter model, in which a speech waveform is synthesized by signal processing: an articulation filter that shapes the resonance of speech is driven by a suitable excitation source.

When a relatively simple source such as an impulse train or a white noise source is used, switching between the impulse train and the white noise can be controlled by the voiced/unvoiced information, and the fundamental frequency of the impulse train can be controlled by the F0 parameter. MFCCs (Mel-Frequency Cepstral Coefficients) or linear prediction coefficients are used as the parameters representing spectral features. As the articulation filter, an AR (autoregressive) filter is used or, particularly when MFCCs are the parameters, an MLSA (Mel Log Spectrum Approximation) filter, which takes MFCCs directly as its parameters (see Non-Patent Document 2).
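The source switching described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular synthesizer: the function name `excitation`, the 16 kHz sample rate, the unit-amplitude pulses, and the unit-variance Gaussian noise are all assumptions made for the example, and the articulation filter itself is omitted.

```python
import random


def excitation(f0_per_frame, voiced_flags, sample_rate=16000, frame_ms=5, seed=0):
    """Build an excitation signal frame by frame.

    Voiced frames get an impulse train whose period follows the frame's F0 (Hz);
    unvoiced frames get white (Gaussian) noise. All defaults are illustrative.
    """
    rng = random.Random(seed)
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    signal = []
    next_pulse = 0.0  # sample position of the next impulse
    for frame_idx, (f0, voiced) in enumerate(zip(f0_per_frame, voiced_flags)):
        start = frame_idx * frame_len
        for n in range(start, start + frame_len):
            if voiced:
                if n >= next_pulse:
                    signal.append(1.0)
                    # Schedule the next impulse one pitch period later.
                    next_pulse = max(next_pulse, n) + sample_rate / f0
                else:
                    signal.append(0.0)
            else:
                signal.append(rng.gauss(0.0, 1.0))
                next_pulse = n + 1  # restart the pulse phase after the noise
    return signal


# Two voiced frames at 200 Hz followed by one unvoiced frame.
sig = excitation([200.0, 200.0, 0.0], [True, True, False])
```

At 16 kHz and 200 Hz the pitch period is 80 samples, which equals one 5 ms frame, so each voiced frame in this example starts with exactly one impulse.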

For example, synthesizing sounds such as consonants requires the speech synthesis parameters to change over time. In this approach, the parameters are therefore typically updated at a fixed period of, for example, about 5 ms, and speech is synthesized while its characteristics change. One such period is generally called a frame. Consequently, synthesizing speech generally requires creating, from the speech synthesis symbols, time-series data of the speech synthesis parameters at the frame period.

The simplest method would be to prepare in advance, for each required phoneme, frame-period time-series data covering the length of that phoneme, and to concatenate these speech synthesis parameter time series according to the phoneme sequence of the speech to be generated, yielding the parameter time series for one utterance. As noted above, however, even the same phoneme can differ greatly in its characteristics depending on the types of the preceding and following phonemes, the speaking rate, the pitch, and the temporal distance from the immediately preceding or following pause. Handling such cases requires a complex phoneme classification that takes the neighboring phonemes and prosodic features into account, but with such a classification the number of phoneme types becomes enormous, and it is difficult to create and store in advance the full set of required parameter time series.

In practice, therefore, the temporal variation of the speech synthesis parameter time series is represented by a suitable model. The model parameters are first predicted from the speech synthesis symbols, and the parameter time series is then generated from the resulting model, so that arbitrary speech can be synthesized. Hereinafter, this model is called the speech generation model.

For example, suppose the speech synthesis parameter features of a phoneme are divided in time into three states. Let the statistical distribution parameter vectors for the number of frames in each state be d1, d2, and d3 in order from the first state, and concatenate the elements of these three vectors into a single vector d. Let the statistical distribution parameter vectors of the speech synthesis parameters in each state be v1, v2, and v3 in order from the first state. The features of the speech synthesis parameters for synthesizing that phoneme can then be expressed by the four vectors d, v1, v2, and v3, which constitute the parameters of the speech generation model. Furthermore, by building in advance a predictor that generates these parameter vectors from speech synthesis symbols and using it at synthesis time, speech can be synthesized from a relatively small amount of data.

A representative method of this kind is HMM-based speech synthesis, which assumes a speech generation model based on an HMM (hidden Markov model). The vectors that constitute the parameters of the speech generation model are each determined from the speech synthesis symbols by a decision tree, in the same way as in the state-tying HMMs used in speech recognition (see Non-Patent Document 3). The decision trees are built (trained) using training speech prepared in advance and the corresponding speech synthesis symbols.

When synthesizing the speech of one utterance, the speech generation models of the individual unit speeches are first concatenated to form a speech generation model for the whole utterance. The speech synthesis parameter time series that maximizes the likelihood for this model is then obtained and used for speech waveform generation. The likelihood of the speech generation model with respect to a speech synthesis parameter time series is expressed, for example, as follows.

That is, suppose the value x_i of a speech synthesis parameter x at frame i follows a normal distribution with mean μ_i and variance σ_i², independently of the other types of speech synthesis parameters, and the speech is n frames long in total. The log-likelihood of the speech generation model for the time series x_i of the parameter x over one utterance is then given by the following equation.
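Under the assumptions just stated (frames that are independent and normally distributed with mean μ_i and variance σ_i²), the log-likelihood takes the standard Gaussian form. The expression below is reconstructed from those assumptions; the label L for the log-likelihood is introduced here for illustration.

```latex
L(x_1,\dots,x_n)
  = -\frac{1}{2}\sum_{i=1}^{n}
    \left[\,\log\!\left(2\pi\sigma_i^{2}\right)
          + \frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\right]
```

Each frame contributes its own term, so the sum is maximized frame by frame at x_i = μ_i.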

However, when the frame-period speech synthesis parameters are modeled directly with a small number of normal distributions, the maximum-likelihood parameter sequence simply outputs the mean of the normal distribution continuously within each state, and its value becomes discontinuous when the state changes. The result is a staircase-shaped parameter time series. Because this differs from the characteristics of real speech, the feature vector is formed by combining the speech synthesis parameters themselves (hereinafter called static features) with dynamic features such as the first-order difference (delta) and second-order difference (delta-delta) of the parameter time series, so that continuous changes in the parameters are also modeled (see Non-Patent Document 4).
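The staircase behavior is easy to see in a small sketch. With static features only, the maximum-likelihood value for each frame is the mean of the state that frame belongs to, so the trajectory is just each state mean repeated for that state's duration. The function name and the numbers (log-F0-like means, durations in frames) are hypothetical.

```python
def stepwise_trajectory(state_means, state_durations):
    """Repeat each state's mean for that state's number of frames."""
    frames = []
    for mean, duration in zip(state_means, state_durations):
        frames.extend([mean] * duration)
    return frames


# Hypothetical three-state phoneme: per-state means and durations in frames.
trajectory = stepwise_trajectory([4.8, 5.0, 4.9], [2, 3, 2])

# Frame-to-frame differences are zero inside a state and jump at state changes.
jumps = [b - a for a, b in zip(trajectory, trajectory[1:])]
```

The only non-zero entries of `jumps` sit at the two state boundaries, which is the discontinuity the dynamic features are meant to remove.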

The delta Δx_i and delta-delta Δ²x_i of the value x_i of a speech synthesis parameter x at the i-th frame are given, for example, by Equations (2) and (3), respectively.
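The text presents the difference formulas only as an example. One commonly used choice in HMM-based synthesis, shown here as an assumption rather than as the exact content of Equations (2) and (3), is a central difference for the delta and a second difference for the delta-delta:

```latex
\Delta x_i = \frac{1}{2}\left(x_{i+1} - x_{i-1}\right),
\qquad
\Delta^{2} x_i = x_{i+1} - 2x_i + x_{i-1}
```

Both are linear in the static values x_i, which is what allows the whole feature sequence to be written as a matrix product of a fixed matrix and the static sequence.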

The method for calculating the time-series data of the speech synthesis parameters is described below. First, for the purpose of explanation, let o_i denote the feature vector at frame i. In the equations, uppercase letters and bold lowercase letters denote vectors (the same applies hereinafter).

The length of the speech is n frames. The following matrices are also defined, where a superscript T denotes the transpose and a superscript -1 denotes the inverse (the same applies hereinafter).

Furthermore, let W denote the transformation matrix that maps the time series X of static features to the time series O of feature vectors, which include the dynamic features defined by Equations (2) and (3). That is, the following relationship holds, where W is a matrix with 3n rows and n columns.
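As a rough sketch of what such a 3n-by-n matrix looks like, the code below builds W for a scalar parameter and applies it to a static sequence. The window coefficients (central difference and second difference) and the treatment of out-of-range neighbors as zero are assumptions made for this example; `build_w` and `apply_w` are hypothetical helper names.

```python
# Assumed window coefficients over frames (i-1, i, i+1).
WINDOWS = (
    (0.0, 1.0, 0.0),    # static
    (-0.5, 0.0, 0.5),   # delta: central difference
    (1.0, -2.0, 1.0),   # delta-delta: second difference
)


def build_w(n):
    """Return W as a list of 3n rows, each with n columns."""
    w = [[0.0] * n for _ in range(3 * n)]
    for i in range(n):
        for k, window in enumerate(WINDOWS):
            for offset, coeff in zip((-1, 0, 1), window):
                j = i + offset
                if 0 <= j < n:  # neighbors outside the utterance are dropped
                    w[3 * i + k][j] = coeff
    return w


def apply_w(w, x):
    """Compute O = W X for a static sequence x."""
    return [sum(c * v for c, v in zip(row, x)) for row in w]


w = build_w(4)
o = apply_w(w, [1.0, 2.0, 4.0, 7.0])
# o[3:6] holds (static, delta, delta-delta) for the second frame.
```

Rows 3i, 3i+1, and 3i+2 of W produce the static, delta, and delta-delta features of frame i, so O stacks the per-frame feature vectors o_i in order.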

When the parameter distributions are normal, the log-likelihood p(X) of X is given by the following equation, where m_i is the mean vector of the distribution of o_i and U_i is its variance-covariance matrix. m_i and U_i are obtained from the speech synthesis symbols using decision trees trained in advance.

The X that maximizes the log-likelihood p(X) satisfies the following relationship.

Solving Equation (11) for X yields the following equation.
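For orientation, the chain from the feature relationship to the closed-form solution has the following standard form in the HMM-based synthesis literature (see Non-Patent Document 4). The stacked mean vector M and the block-diagonal matrix U are names assumed here, and the equation numbers are omitted because the correspondence to the numbered equations is inferred from the surrounding text.

```latex
\begin{aligned}
O &= W X, \qquad
M = \left[m_1^{T}, \dots, m_n^{T}\right]^{T}, \qquad
U = \operatorname{diag}\left(U_1, \dots, U_n\right) \\
p(X) &= -\frac{1}{2}\,(WX - M)^{T} U^{-1} (WX - M)
        - \frac{1}{2}\log\lvert U\rvert
        - \frac{3n}{2}\log 2\pi \\
W^{T} U^{-1} W X &= W^{T} U^{-1} M \\
X &= \left(W^{T} U^{-1} W\right)^{-1} W^{T} U^{-1} M
\end{aligned}
```

The third line follows from setting the gradient of p(X) with respect to X to zero; the fourth solves that linear system for X.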

That is, calculating Equation (12) yields the parameter time series that takes dynamic features into account under the maximum-likelihood criterion.
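A small floating-point sketch of this calculation is given below, assuming diagonal covariances, the window coefficients used in the comments (an assumption, not taken from the source), and a dense Gaussian-elimination solver chosen for brevity. All function names are hypothetical; a practical implementation would exploit the banded structure of the system matrix.

```python
# Assumed windows over frames (i-1, i, i+1): static, delta, delta-delta.
WINDOWS = ((0.0, 1.0, 0.0), (-0.5, 0.0, 0.5), (1.0, -2.0, 1.0))


def build_w(n):
    """3n-by-n matrix mapping static values to (static, delta, delta-delta)."""
    w = [[0.0] * n for _ in range(3 * n)]
    for i in range(n):
        for k, window in enumerate(WINDOWS):
            for offset, coeff in zip((-1, 0, 1), window):
                if 0 <= i + offset < n:
                    w[3 * i + k][i + offset] = coeff
    return w


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def generate_trajectory(means, variances):
    """Solve (W^T U^-1 W) X = W^T U^-1 M for diagonal U.

    means and variances are per-frame triples (static, delta, delta-delta).
    """
    n = len(means)
    rows = 3 * n
    w = build_w(n)
    m = [value for frame in means for value in frame]
    precision = [1.0 / value for frame in variances for value in frame]
    a = [[sum(w[r][i] * precision[r] * w[r][j] for r in range(rows))
          for j in range(n)] for i in range(n)]
    b = [sum(w[r][i] * precision[r] * m[r] for r in range(rows))
         for i in range(n)]
    return solve_linear(a, b)
```

When the predicted means happen to be exactly the features of some trajectory, that trajectory is recovered whatever the variances are; in general the variances decide how strongly the static and dynamic targets pull on the result.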

Non-Patent Document 1: "Symbols for Japanese text-to-speech synthesis", JEITA standard IT-4002, March 2005.
Non-Patent Document 2: Satoshi Imai, Kazuo Sumita, Chieko Furuichi, "Mel Log Spectrum Approximation (MLSA) Filter for Speech Synthesis", IEICE Transactions (A), J66-A, 2, Feb. 1983, pp. 122-129.
Non-Patent Document 3: Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura, "Simultaneous Modeling of Spectrum, Pitch, and Duration in HMM-Based Speech Synthesis", IEICE Transactions (D-II), J83-D-II, 11, Nov. 2000, pp. 2099-2107.
Non-Patent Document 4: Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Satoshi Imai, "HMM-Based Speech Synthesis Using Dynamic Features", IEICE Transactions (D-II), J79-D-II, 12, Dec. 1996, pp. 2184-2190.

Generating a parameter time series in this way requires high computational precision. However, when the decision tree that predicts the parameter vectors of the fundamental frequency distribution is trained with explanatory variables that include not only the phonological phoneme types but also linguistic prosodic features such as accent type and accent phrase boundaries, the feature vector distributions predicted by the trained tree often show a tendency for the variances of the elements related to delta and delta-delta features to be smaller than those of the elements related to static features. This is thought to arise because linguistic prosodic features correlate more strongly with short-time changes in the fundamental frequency than with its absolute value.

Consequently, when fundamental frequency time-series data are generated for speech synthesis under the maximum-likelihood criterion, the generated trajectory tends to reproduce the delta and delta-delta features of the speech used for decision tree training, which are weighted heavily, rather than the absolute value of its fundamental frequency. Such a method is good at accurately reproducing short-time changes in the fundamental frequency, but poor at reproducing, over longer spans, the same fundamental frequency distribution as the training speech. This causes the synthesized speech to sound unnatural.

Analyzed in terms of the log-likelihood expression for the generated parameter time series, this tendency arises because the values of the terms related to static features become extremely small compared with those of the terms related to dynamic features. In practice, however, if the generated parameters depart from the learned fundamental frequency distribution, the log-likelihood decreases, if only very slightly, so the fundamental frequency of the synthesized speech rarely departs drastically from the distribution of the training speech.

However, on a device such as a mobile terminal, where computing resources are limited, fixed-point arithmetic is required, and the range of representable values (for example, the ratio of the maximum to the minimum value) cannot be made sufficiently wide, loss of significant digits in the likelihood calculation can make the terms related to the static parameters zero regardless of the fundamental frequency. When these terms become zero, the constraint on the static parameters disappears, and the generated fundamental frequency time-series data may be shifted far up or down along the frequency axis compared with the fundamental frequency of the training speech.
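The mechanism can be illustrated with a toy fixed-point evaluation of one weighted squared-error term, (x - mean)² / variance. The Q8 format (8 fractional bits), the truncating conversion, and the variances below are hypothetical values picked to make the effect visible; they are not taken from the source.

```python
def to_fixed(value, frac_bits):
    """Convert to a fixed-point integer with frac_bits fractional bits (truncating)."""
    return int(value * (1 << frac_bits))


def weighted_sq_error_fixed(x, mean, variance, frac_bits):
    """Evaluate (x - mean)^2 / variance in fixed point and return it as a float."""
    precision_q = to_fixed(1.0 / variance, frac_bits)  # becomes 0 for large variance
    diff_q = to_fixed(x - mean, frac_bits)
    # The product carries 3 * frac_bits fractional bits; keep frac_bits of them.
    term_q = (precision_q * diff_q * diff_q) >> (2 * frac_bits)
    return term_q / (1 << frac_bits)


FRAC_BITS = 8

# Static term with a large variance: the weight 1/1024 underflows to zero,
# so the term is zero no matter how far x is from the mean.
static_near = weighted_sq_error_fixed(5.0, 0.0, 1024.0, FRAC_BITS)
static_far = weighted_sq_error_fixed(50.0, 0.0, 1024.0, FRAC_BITS)

# Dynamic term with a small variance: the weight survives quantisation.
dynamic = weighted_sq_error_fixed(0.5, 0.0, 0.25, FRAC_BITS)
```

Because `static_near` and `static_far` are both zero, an optimizer working with these quantities sees no penalty for shifting the whole trajectory, while the dynamic term still constrains its shape.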

The present invention has been made in view of these circumstances, and its object is to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program that can reduce the deviation of the fundamental frequency distribution from that of the training speech even when the computational precision for the predicted fundamental frequency parameters is insufficient.

(1) To achieve the above object, the speech synthesizer of the present invention generates a synthesized speech waveform from speech synthesis information that describes the types of unit speech contained in a sequence of unit speeches, and comprises: a first fundamental frequency time-series data generation unit that predicts and generates first fundamental frequency time-series data using distribution information of a first feature vector based on given speech synthesis information; a second fundamental frequency time-series data generation unit that predicts and generates second fundamental frequency time-series data using distribution information of a second feature vector based on the given speech synthesis information; and a fundamental frequency time-series data correction unit that corrects the first fundamental frequency time-series data using the second fundamental frequency time-series data. The speech synthesizer generates a synthesized speech waveform based on the corrected first fundamental frequency time-series data.

Because the speech synthesizer of the present invention thus corrects the time-series data of the first feature vector using the time-series data of the second feature vector, the deviation of the fundamental frequency distribution from that of the training speech can be reduced even when the computational precision for the predicted fundamental frequency parameters is insufficient. As a result, speech synthesis close to the training speech can be achieved even on a device, such as a mobile terminal, whose processor cannot perform floating-point arithmetic.

(2) In the speech synthesizer of the present invention, the first fundamental frequency time-series data generation unit uses distribution information of a first feature vector that includes, as elements, dynamic features representing temporal change, and the second fundamental frequency time-series data generation unit uses distribution information of a second feature vector that includes, as elements, static features that do not directly represent temporal change. Because time-series data generated from feature vectors containing dynamic features are corrected using time-series data generated from feature vectors containing static features, the deviation of the fundamental frequency distribution that tends to arise between the training speech and the synthesized speech can be reduced.

(3) In the speech synthesizer of the present invention, the fundamental frequency time-series data correction unit makes the mean of the first fundamental frequency time-series data equal to the mean of the second fundamental frequency time-series data in each predetermined time-series section. This allows an upward or downward shift in frequency of the fundamental frequency time-series data to be corrected using the mean. As a result, fundamental frequency time-series data can be generated that, in the long-term average, strongly reflect the output of the second fundamental frequency distribution parameter generation unit.

(4) In the speech synthesizer of the present invention, the fundamental frequency time-series data correction unit makes the variance of the first fundamental frequency time-series data equal to the variance of the second fundamental frequency time-series data in each predetermined time-series section. This allows a deviation in the frequency distribution of the fundamental frequency time-series data to be corrected using the variance. As a result, the minimum and maximum of the parameter variation within one utterance unit can be brought closer to those of the training speech, and speech can be synthesized that reproduces the training speech more accurately.

(5) In the speech synthesizer of the present invention, the fundamental frequency time-series data correction unit corrects the first fundamental frequency time-series data in each time-series section over which voicing is continuous. This allows fundamental frequency time-series data to be generated that strongly reflect the output of the second fundamental frequency distribution parameter generation unit 133 over shorter time units.

(6) The speech synthesis method of the present invention generates a synthesized speech waveform from speech synthesis information configured as a set of phonemes, and includes the steps of: predicting and generating first fundamental frequency time-series data using distribution information of a first feature vector based on given speech synthesis information; predicting and generating second fundamental frequency time-series data using distribution information of a second feature vector based on the given speech synthesis information; and correcting the first fundamental frequency time-series data using the second fundamental frequency time-series data. A synthesized speech waveform is generated based on the corrected first fundamental frequency time-series data. This suppresses the deviation of the fundamental frequency distribution from that of the training speech even when the computational precision for the predicted fundamental frequency parameters is insufficient.

(7) The speech synthesis program of the present invention is executed by a computer to generate a synthesized speech waveform from speech synthesis information that describes the types of unit speech contained in a sequence of unit speeches, and includes: a process of predicting and generating first fundamental frequency time-series data using distribution information of a first feature vector based on given speech synthesis information; a process of predicting and generating second fundamental frequency time-series data using distribution information of a second feature vector based on the given speech synthesis information; and a process of correcting the first fundamental frequency time-series data using the second fundamental frequency time-series data. A synthesized speech waveform is generated based on the corrected first fundamental frequency time-series data. This suppresses the deviation of the fundamental frequency distribution from that of the training speech even when the computational precision for the predicted fundamental frequency parameters is insufficient.
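The corrections described in items (3) to (5) can be sketched as follows, under the assumption that matching is done by a linear shift (and optional scaling) of the first trajectory within each run of consecutive voiced frames. The function names, the linear form of the correction, and the sample values are illustrative assumptions; the text only requires that the section means (and, optionally, variances) be made equal.

```python
import math


def voiced_runs(voiced_flags):
    """Return (start, end) pairs, end exclusive, of maximal runs of voiced frames."""
    runs, start = [], None
    for i, voiced in enumerate(voiced_flags):
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(voiced_flags)))
    return runs


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    mu = _mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def correct_section(first, second, match_variance=False):
    """Shift (and optionally scale) first so its mean (and variance) match second."""
    mean1, mean2 = _mean(first), _mean(second)
    scale = 1.0
    if match_variance:
        std1 = _std(first)
        if std1 > 0.0:  # a flat section cannot be rescaled
            scale = _std(second) / std1
    return [(x - mean1) * scale + mean2 for x in first]


def correct_f0(first, second, voiced_flags, match_variance=False):
    """Correct the first F0 trajectory section by section; unvoiced frames are untouched."""
    corrected = list(first)
    for start, end in voiced_runs(voiced_flags):
        corrected[start:end] = correct_section(
            first[start:end], second[start:end], match_variance)
    return corrected


# Hypothetical log-F0 values; the fourth frame is unvoiced.
first = [5.0, 6.0, 7.0, 0.0, 9.0, 9.0]
second = [4.0, 4.0, 4.0, 0.0, 3.0, 5.0]
voiced = [True, True, True, False, True, True]
corrected = correct_f0(first, second, voiced)
```

In this example the shape of the first trajectory within each voiced run is preserved while its level is moved to that of the second trajectory, which is the division of roles the two generation units are meant to have.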

According to the present invention, the deviation of the fundamental frequency distribution from that of the training speech can be reduced even when the computational precision for the predicted fundamental frequency parameters is insufficient.

FIG. 1 is a block diagram of the speech synthesizer according to the present invention. A further figure is a flowchart showing an example of the operation of the speech synthesizer according to the present invention.

An embodiment of the present invention is described below. In the following description, "unit speech" means the smallest unit of speech processed by the speech synthesizer. Examples of unit speech include phonemes, syllables, and words. Unit speeches are classified taking into account differences in phonological environment, such as the types of the preceding and following phonemes, and differences in prosodic features such as accent, intonation, and speaking rate. A "unit utterance" is a sequence of unit speeches with continuous characteristics, corresponding to the utterance of one sentence or to a breath group (a unit read in one breath). "Speech synthesis symbols" are a series of symbols describing the type of each unit speech contained in the speech of one unit utterance. In the following, all fundamental frequencies are handled on a logarithmic fundamental frequency axis.

[First Embodiment]
(Functional configuration of speech synthesizer)
An example of the functional configuration of the speech synthesizer 100 is described with reference to the drawings. FIG. 1 is a block diagram of the speech synthesizer 100. The speech synthesizer 100 generates a synthesized speech waveform from speech synthesis information such as speech synthesis symbols. As shown in FIG. 1, the speech synthesizer 100 includes a state duration sequence generation unit 110, a spectral feature distribution parameter generation unit 120, a fundamental frequency time-series data generation module 130, a spectral feature time-series data generation unit 140, and a speech waveform generation unit 150.

The state duration sequence generation unit 110 predicts, for each speech unit corresponding to the speech synthesis symbol string, the duration (number of frames) of each state constituting that speech unit, and outputs the durations for one utterance unit together as a state duration sequence. The spectral feature distribution parameter generation unit 120 generates the spectral feature distribution parameters of each state in the state sequence corresponding to the speech synthesis symbol string.

The voiced/unvoiced parameter generation unit 131 generates voiced/unvoiced probability parameters for the state sequence corresponding to the speech synthesis symbol string. The first fundamental frequency distribution parameter generation unit 132 generates the fundamental frequency feature distribution parameters of each state in the state sequence corresponding to the speech synthesis symbol string. The first fundamental frequency distribution parameter generation unit 132 is a predictor, built using training speech, that predicts the distribution parameters of the fundamental frequency.

The second fundamental frequency distribution parameter generation unit 133 likewise generates fundamental frequency feature distribution parameters for each state in the state sequence corresponding to the speech synthesis symbol string. The spectral features and the fundamental frequency features each include not only static features but also dynamic features such as delta features and delta-delta features. The second fundamental frequency distribution parameter generation unit 133 is a predictor that predicts the distribution parameters of the fundamental frequency using training speech.

All of the above features are assumed to follow normal distributions, and each set of distribution parameters consists of a mean and a variance. Each of the above parameters can be generated using a decision tree. The decision trees used here are trained in advance, using training speech, on the relationship between the speech synthesis symbols and the corresponding features.

Next, the spectral feature time-series data generation unit 140 generates time-series data of spectral feature vectors from the state duration sequence and the per-state spectral feature distribution parameters. This generation assumes that the distribution of each feature parameter in each state persists for the number of frames stored in the state duration sequence, and finds the parameter time series that maximizes the likelihood over one utterance unit. The series can be obtained by evaluating Equation (12).

Similarly, the first fundamental frequency time-series data generation unit 134 generates time-series data of the fundamental frequency parameter (fundamental frequency time-series data) from the state duration sequence and the per-state fundamental frequency feature distribution parameters output by the first fundamental frequency distribution parameter generation unit 132. For the fundamental frequency, however, the per-state voiced/unvoiced information is also consulted, and the parameter time-series generation is performed only for voiced states. In this way, the first fundamental frequency time-series data is predicted and generated using the distribution information of the first feature vector based on the given speech synthesis information. The first feature vector is a vector that includes, as elements, dynamic features representing temporal change.

During this processing, delta and delta-delta features cannot be defined for a voiced frame adjacent to an unvoiced frame, so the variances of the delta and delta-delta feature distributions are set to infinity for such frames. As a result, when Equation (12) is expanded and actually evaluated, the terms related to the delta and delta-delta features of those frames become 0, and the dynamic features of those frames can be ignored in the calculation. In the following, an appropriate special value is also assigned to the fundamental frequency data in the unvoiced case and treated as an unvoiced code, so that voiced/unvoiced information is embedded in the fundamental frequency time-series data. The special value may be 0 or, for example, 65535, the maximum value when the fundamental frequency is represented as a 16-bit unsigned integer.
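The unvoiced-code convention described above can be illustrated with a minimal sketch. The function names `embed_voicing` and `is_voiced`, and the choice of 0 as the default code, are assumptions made for this illustration and do not appear in the specification.

```python
# Hypothetical helper names; the specification only says that a special value
# such as 0 or 65535 marks unvoiced frames in the F0 time series.
UNVOICED = 0


def embed_voicing(f0_values, voiced_flags, unvoiced_code=UNVOICED):
    """Return an F0 series in which every unvoiced frame carries the code."""
    return [f0 if voiced else unvoiced_code
            for f0, voiced in zip(f0_values, voiced_flags)]


def is_voiced(value, unvoiced_code=UNVOICED):
    """A frame is voiced when its value differs from the unvoiced code."""
    return value != unvoiced_code


series = embed_voicing([120, 125, 999, 999, 130],
                       [True, True, False, False, True])
# series == [120, 125, 0, 0, 130]
```

With 16-bit unsigned data, passing `unvoiced_code=65535` keeps 0 available as an ordinary value; either way the code must lie outside the range of legitimate fundamental frequency values.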

Likewise, the second fundamental frequency time-series data generation unit 135 generates time-series data of the fundamental frequency parameter from the state duration sequence and the per-state fundamental frequency feature distribution parameters output by the second fundamental frequency distribution parameter generation unit 133. Unlike the first fundamental frequency time-series data generation unit 134, however, dynamic features are not considered here, so the generated fundamental frequency time-series data is step-shaped: it takes a constant value within each state and is discontinuous whenever the state changes. In this way, fundamental frequency time-series data is predicted and generated using the distribution information of the second feature vector based on the given speech synthesis information. The second feature vector, which serves as the reference for long-term characteristics, is a vector whose elements are static features that do not represent temporal change in the fundamental frequency.
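A minimal sketch of how such a step-shaped series could be built is given below. It assumes that, with only static features, the most likely value in each state is simply that state's mean, repeated for the state's duration; the function name `staircase_f0` and the argument layout are hypothetical.

```python
def staircase_f0(state_means, state_durations, state_voiced, unvoiced_code=0):
    """Repeat each state's static mean for its duration in frames.

    Unvoiced states are filled with the unvoiced code instead of a mean.
    """
    series = []
    for mean, frames, voiced in zip(state_means, state_durations, state_voiced):
        value = mean if voiced else unvoiced_code
        series.extend([value] * frames)
    return series


steps = staircase_f0([5.0, 5.2, 4.9], [2, 3, 1], [True, False, True])
# steps == [5.0, 5.0, 0, 0, 0, 4.9]
```

No variance enters this computation, which is consistent with the remark that this path is less exposed to precision loss than the path that uses dynamic features.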

On the other hand, unlike the first fundamental frequency time-series data generation unit 134, this generation is not affected by the variances of the distributions and is less prone to loss of significant digits, so a relatively accurate output is obtained in terms of computational precision. As with the first fundamental frequency time-series data generation unit 134, the output of the second fundamental frequency time-series data generation unit 135 is assumed to have voiced/unvoiced information embedded in it.

The fundamental frequency time-series data correction unit 136 corrects the fundamental frequency time-series data generated by the first fundamental frequency time-series data generation unit 134 (the first fundamental frequency time-series data) using the fundamental frequency time-series data generated by the second fundamental frequency time-series data generation unit 135 (the second fundamental frequency time-series data).

As a result, even when the computational precision of the predicted fundamental frequency parameter is insufficient, the deviation of the fundamental frequency distribution from that of the training speech can be reduced. Speech close to the training speech can therefore be synthesized even on processors without floating-point arithmetic, such as those in mobile terminals and household appliances. In particular, when fundamental frequency time-series data generated from a feature vector that includes dynamic features representing temporal change is corrected using fundamental frequency time-series data generated from a vector whose elements are static features that do not represent temporal change, the deviation in fundamental frequency distribution that tends to arise between training speech and synthesized speech can be reduced.

In the correction process, the average value over a predetermined time-series section is first computed for the fundamental frequency time-series data generated by the second fundamental frequency time-series data generation unit 135. Next, a constant is added to every element of the fundamental frequency time-series data generated by the first fundamental frequency time-series data generation unit 134 so that its average value equals the previously obtained average value of the output of the second fundamental frequency distribution parameter generation unit 133, and the result of the addition is output.

This makes it possible to correct, using the average value, a shift of the fundamental frequency time-series data toward higher or lower frequencies. As a result, fundamental frequency time-series data can be generated that, on a long-term average, strongly reflects the output of the second fundamental frequency distribution parameter generation unit 133. The predetermined time-series section is, for example, a voiced section.
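The mean-matching correction can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: both series are assumed to have the same length and the same voiced/unvoiced pattern, the averaging section is taken to be all voiced frames, 0 is used as the unvoiced code, and the names `voiced_mean` and `shift_to_match_mean` are hypothetical.

```python
def voiced_mean(series, unvoiced_code=0):
    """Average over voiced frames only; None when no frame is voiced."""
    voiced = [v for v in series if v != unvoiced_code]
    return sum(voiced) / len(voiced) if voiced else None


def shift_to_match_mean(first, second, unvoiced_code=0):
    """Add one constant to every voiced frame of `first` so that its
    voiced mean equals the voiced mean of `second`."""
    m1 = voiced_mean(first, unvoiced_code)
    m2 = voiced_mean(second, unvoiced_code)
    if m1 is None or m2 is None:
        return list(first)          # nothing to correct
    offset = m2 - m1                # the single constant that is added
    return [v if v == unvoiced_code else v + offset for v in first]


first = [100, 110, 0, 120]          # smooth series from dynamic features
second = [120, 120, 0, 150]         # step-shaped series from static features
corrected = shift_to_match_mean(first, second)
# corrected == [120.0, 130.0, 0, 140.0]
```

The shape of the first series is preserved; only its overall level moves. On an integer-only processor the division in `voiced_mean` would be replaced by fixed-point arithmetic, which this sketch does not attempt to model.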

The voiced/unvoiced parameter generation unit 131, the first fundamental frequency distribution parameter generation unit 132, the second fundamental frequency distribution parameter generation unit 133, the first fundamental frequency time-series data generation unit 134, the second fundamental frequency time-series data generation unit 135, and the fundamental frequency time-series data correction unit 136 together constitute the fundamental frequency time-series data generation module 130.

The speech waveform generation unit 150 consists of a sound source 151 and an articulation filter 152. The sound source 151 outputs a source waveform based on the corrected fundamental frequency time-series data, and this waveform drives the articulation filter 152, which is controlled according to the spectral feature time-series data, to synthesize a speech waveform. In this way, a synthesized speech waveform based on the corrected time-series data of the first feature vector is generated.

(Operation of the speech synthesizer)
Next, an example of the operation of the speech synthesizer 100 will be described. FIG. 2 is a flowchart showing an example of the operation of the speech synthesizer 100. The flow of processing for generating a synthesized speech waveform from speech synthesis symbols is described below in order.

As shown in FIG. 2, the speech synthesizer 100 first generates a state duration sequence based on the input speech synthesis symbols (step S1). Next, it generates spectral feature distribution parameters based on the speech synthesis symbols (step S2). It then generates time-series data of the spectral feature parameters using the generated state duration sequence and spectral feature distribution parameters (step S3). It also generates voiced/unvoiced parameters based on the speech synthesis symbols (step S4).

Then, the first fundamental frequency distribution parameters are generated based on the speech synthesis symbols (step S5). Using the generated first fundamental frequency distribution parameters, the state duration sequence, and the voiced/unvoiced parameters, time-series data of the first fundamental frequency parameter is generated based on the speech synthesis symbols (step S6).

Meanwhile, the second fundamental frequency distribution parameters are generated based on the speech synthesis symbols (step S7). Using the generated second fundamental frequency distribution parameters, the state duration sequence, and the voiced/unvoiced parameters, time-series data of the second fundamental frequency parameter is generated based on the speech synthesis symbols (step S8). The time-series data of the first fundamental frequency parameter is then corrected using the time-series data of the second fundamental frequency parameter (step S9). Finally, a speech waveform is generated using the time-series data of the spectral feature parameters and the corrected time-series data of the first fundamental frequency parameter (step S10). In this way, a synthesized speech waveform is output from the input speech synthesis symbols.

[Second Embodiment]
In the above embodiment, the fundamental frequency time-series data correction unit 136 performs only a translation along the frequency axis, but corrections other than translation may also be performed. For example, the variance of the output of the second fundamental frequency distribution parameter generation unit 133 may be computed in addition to its average value, and the feature parameters may be expanded or contracted along the frequency axis so that the variance of the fundamental frequency time-series data output by the fundamental frequency time-series data correction unit 136 equals the variance of the output of the second fundamental frequency distribution parameter generation unit 133. This brings the minimum and maximum of the parameter variation within one utterance unit closer to those of the training speech, so that speech can be synthesized that reproduces the training speech more accurately.
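One way to realize the expansion or contraction described in this embodiment is a linear rescaling about the mean, sketched below. The rescaling formula, the use of the population variance over voiced frames, the unvoiced code 0, and the function names are all assumptions made for this illustration.

```python
import math


def mean_and_variance(values):
    """Population mean and variance of a non-empty list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def match_mean_and_variance(first, second, unvoiced_code=0):
    """Rescale the voiced frames of `first` so that their mean and variance
    equal those of the voiced frames of `second`."""
    v1 = [v for v in first if v != unvoiced_code]
    v2 = [v for v in second if v != unvoiced_code]
    if not v1 or not v2:
        return list(first)
    m1, var1 = mean_and_variance(v1)
    m2, var2 = mean_and_variance(v2)
    # Stretch factor along the frequency axis; 1.0 when `first` is flat.
    scale = math.sqrt(var2 / var1) if var1 > 0 else 1.0
    return [v if v == unvoiced_code else m2 + scale * (v - m1) for v in first]


first = [2.0, 4.0, 0, 6.0]
second = [9.0, 12.0, 0, 15.0]
rescaled = match_mean_and_variance(first, second)
```

Setting `scale` to 1.0 reduces this to the pure translation of the first embodiment, so the two corrections can share one code path.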

[Third Embodiment]
In the above embodiment, the first fundamental frequency distribution parameter generation unit 132 outputs the parameters of both the static feature distribution and the dynamic feature distribution of the fundamental frequency, but it may instead be configured to output only the dynamic features. In that case, the first fundamental frequency distribution parameter generation unit 132 requires some constraint on the static feature distribution, or an initial-value constraint on the fundamental frequency. That constraint may, however, be any suitable fixed distribution or value. The fundamental frequency time-series data generated by the first fundamental frequency time-series data generation unit 134 is ultimately corrected by the fundamental frequency time-series data correction unit 136, based on the output of the second fundamental frequency distribution parameter generation unit 133, so that it falls within an appropriate frequency range. In this case, the predictor of the first fundamental frequency distribution parameter generation unit 132 may be trained by referring to dynamic features only, or by referring to static and dynamic features together as in the conventional approach.

[Fourth Embodiment]
In the above embodiment, the second fundamental frequency distribution parameter generation unit 133 outputs only the static feature distribution, which is used by the second fundamental frequency time-series data generation unit 135. However, as long as the long-term characteristics of the output of the second fundamental frequency time-series data generation unit 135 are more accurate than those of the output of the first fundamental frequency time-series data generation unit 134, distributions of other features may be used to generate the second fundamental frequency time-series data. Feature vectors of two or more dimensions may also be used. In that case, the processing of the second fundamental frequency time-series data generation unit 135 is the same as that of the first fundamental frequency time-series data generation unit 134.

[Fifth Embodiment]
In the above embodiment, the control unit is one utterance unit, but the correction amount may instead be determined and applied independently for each shorter time unit, for example each run of consecutive voiced states. This makes it possible to generate fundamental frequency time-series data that strongly reflects the output of the second fundamental frequency distribution parameter generation unit 133 over shorter time units. However, because the section over which the average is taken becomes shorter, the output of the second fundamental frequency distribution parameter generation unit 133 needs to be more accurate than when the correction is made per utterance unit.
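The per-section correction of this embodiment can be sketched as below, with each run of consecutive voiced frames receiving its own offset. The sketch assumes both series share the same voiced/unvoiced pattern and use 0 as the unvoiced code; `voiced_runs` and `correct_per_voiced_run` are hypothetical names.

```python
def voiced_runs(series, unvoiced_code=0):
    """Yield (start, end) index pairs, end exclusive, for each voiced run."""
    start = None
    for i, v in enumerate(series):
        if v != unvoiced_code and start is None:
            start = i
        elif v == unvoiced_code and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(series)


def correct_per_voiced_run(first, second, unvoiced_code=0):
    """Shift each voiced run of `first` so its mean matches the mean of the
    corresponding frames of `second`."""
    corrected = list(first)
    for start, end in voiced_runs(first, unvoiced_code):
        ref = [v for v in second[start:end] if v != unvoiced_code]
        if not ref:
            continue                # no reference value for this run
        seg = first[start:end]
        offset = sum(ref) / len(ref) - sum(seg) / len(seg)
        for i in range(start, end):
            corrected[i] = first[i] + offset
    return corrected


first = [100, 110, 0, 200, 220]
second = [120, 120, 0, 190, 190]
result = correct_per_voiced_run(first, second)
# result == [115.0, 125.0, 0, 180.0, 200.0]
```

Because each offset is estimated from fewer frames, an error in the reference series affects the result more strongly than in the per-utterance case, which is the trade-off noted above.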

[Sixth Embodiment]
For the static features, the average over several preceding and following frames may be used instead, in order to remove short-term noise. In addition, the feature vector including the dynamic features need only be one from which the fundamental frequency time-series data generation module 130 can determine time-series data, and is not limited to a three-dimensional feature vector. For example, a two-dimensional vector composed of a static feature and a delta feature, or a higher-dimensional feature vector that includes higher-order features, may be used.

[Seventh Embodiment]
The above embodiments are discussed on a logarithmic frequency axis, but the fundamental frequency parameter is not limited to this. When only a sufficiently narrow range of fundamental frequencies is handled, for example when the speaker is restricted, the same operations may be performed on a linear frequency axis.
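The difference between the two axes can be made explicit with a short worked relation. The symbols $f(t)$ for the fundamental frequency in Hz, $d$ for the offset on the logarithmic axis, $\Delta$ for an offset in Hz, and $f_0$ for a representative frequency are introduced here for illustration only, and a natural logarithm is assumed.

```latex
% Adding a constant d on the logarithmic axis multiplies the frequency in Hz:
\log f'(t) = \log f(t) + d
\quad\Longleftrightarrow\quad
f'(t) = e^{d}\, f(t).

% Adding a constant \Delta on the linear axis instead gives
f'(t) = f(t) + \Delta = \Bigl(1 + \tfrac{\Delta}{f(t)}\Bigr) f(t),

% which is close to a single multiplicative factor only when f(t) stays
% near one value f_0 throughout the section:
1 + \tfrac{\Delta}{f(t)} \approx 1 + \tfrac{\Delta}{f_0}
\qquad \text{for } f(t) \approx f_0 .
```

This is why the linear-axis variant is reasonable only when the range of fundamental frequencies is sufficiently narrow.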

[Eighth Embodiment]
The above embodiments show an example of speech synthesis based on a source-filter model, but the part concerned with generating the fundamental frequency time-series data may be used on its own in combination with other speech synthesis methods. For example, in concatenative synthesis that uses fundamental frequency time-series data as an explicit synthesis target, the present invention can be used as the predictor of that fundamental frequency time-series data.

In the above embodiments, the speech synthesizer 100 has two fundamental frequency distribution parameter generation units as predictors, but it may have three or more. However, because the amount of data to be processed grows in proportion to the number of predictors, two predictors are preferable.

DESCRIPTION OF SYMBOLS
100 Speech synthesizer
110 State duration sequence generation unit
120 Spectral feature distribution parameter generation unit
130 Fundamental frequency time-series data generation module
131 Voiced/unvoiced parameter generation unit
132 First fundamental frequency distribution parameter generation unit
133 Second fundamental frequency distribution parameter generation unit
134 First fundamental frequency time-series data generation unit
135 Second fundamental frequency time-series data generation unit
136 Fundamental frequency time-series data correction unit
140 Spectral feature time-series data generation unit
150 Speech waveform generation unit
151 Sound source
152 Articulation filter

Claims

A speech synthesizer that generates a synthesized speech waveform from speech synthesis information describing the types of unit speech included in a sequence of unit speech, comprising:
a first fundamental frequency time-series data generation unit that predicts and generates first fundamental frequency time-series data based on given speech synthesis information, using distribution information of a first feature vector that includes, as an element, a dynamic feature representing temporal change;
a second fundamental frequency time-series data generation unit that predicts and generates second fundamental frequency time-series data based on the given speech synthesis information, using distribution information of a second feature vector that includes, as an element, a static feature that does not directly represent temporal change; and
a fundamental frequency time-series data correction unit that uses the second fundamental frequency time-series data to correct a shift in the frequency direction of the first fundamental frequency time-series data,
wherein the speech synthesizer generates a synthesized speech waveform based on the corrected first fundamental frequency time-series data.

The speech synthesizer, wherein the first fundamental frequency time-series data generation unit uses distribution information of a first feature vector that includes, as an element, a dynamic feature representing temporal change,
and the second fundamental frequency time-series data generation unit uses distribution information of a second feature vector that includes, as an element, a static feature that does not directly represent temporal change.

The speech synthesizer according to claim 1 or 2, wherein the fundamental frequency time-series data correction unit matches the average value of the first fundamental frequency time-series data to the average value of the second fundamental frequency time-series data for each predetermined time-series section.

The speech synthesizer according to claim 3, wherein the fundamental frequency time-series data correction unit matches the variance of the first fundamental frequency time-series data to the variance of the second fundamental frequency time-series data for each predetermined time-series section.

The speech synthesizer according to any one of claims 1 to 4, wherein the fundamental frequency time-series data correction unit corrects the first fundamental frequency time-series data for each time-series section in which voiced frames continue.

A speech synthesis method for generating a synthesized speech waveform from speech synthesis information configured as a set of phonemes, comprising:
a step of predicting and generating first fundamental frequency time-series data based on given speech synthesis information, using distribution information of a first feature vector that includes, as an element, a dynamic feature representing temporal change;
a step of predicting and generating second fundamental frequency time-series data based on the given speech synthesis information, using distribution information of a second feature vector that includes, as an element, a static feature that does not directly represent temporal change;
a step of correcting a shift in the frequency direction of the first fundamental frequency time-series data using the second fundamental frequency time-series data; and
a step of generating a synthesized speech waveform based on the corrected first fundamental frequency time-series data.

A speech synthesis program that is executed by a computer to generate a synthesized speech waveform from speech synthesis information describing the types of unit speech included in a sequence of unit speech, the program causing the computer to execute:
a process of predicting and generating first fundamental frequency time-series data based on given speech synthesis information, using distribution information of a first feature vector that includes, as an element, a dynamic feature representing temporal change;
a process of predicting and generating second fundamental frequency time-series data based on the given speech synthesis information, using distribution information of a second feature vector that includes, as an element, a static feature that does not directly represent temporal change;
a process of correcting a shift in the frequency direction of the first fundamental frequency time-series data using the second fundamental frequency time-series data; and
a process of generating a synthesized speech waveform based on the corrected first fundamental frequency time-series data.