JP5474713B2 - Speech synthesis apparatus, speech synthesis method, and speech synthesis program

Info

Publication number: JP5474713B2
Application number: JP2010199288A
Authority: JP
Inventors: 信行西澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2010-09-06
Filing date: 2010-09-06
Publication date: 2014-04-16
Anticipated expiration: 2030-09-06
Also published as: JP2012058343A

Description

The present invention relates to a speech synthesis apparatus, a speech synthesis method, and a speech synthesis program that generate a synthesized speech waveform from speech synthesis information configured as a set of phonemes.

A typical application of speech synthesis technology is text-to-speech (TTS) conversion. Hereinafter, a device that takes as input symbols describing the phoneme types and prosodic features obtained from text analysis or similar processing, and generates a speech waveform, is called a speech synthesizer. A speech synthesizer is a component of a text-to-speech system.

The symbols input to the speech synthesizer are hereinafter called speech synthesis symbols. Such symbols can take various forms. Here we consider a notation that describes both the phonological information making up a stretch of speech and the prosodic information expressed mainly as pauses and voice pitch. One example is the JEITA (Japan Electronics and Information Technology Industries Association) standard IT-4002, "Symbols for Japanese Text-to-Speech Synthesis" (see Non-Patent Document 1). The speech synthesizer generates the speech waveform that corresponds to such speech synthesis symbols. However, a speech waveform is strongly influenced not only by the phoneme being synthesized but also by the types of the neighboring phonemes and by prosodic features, so the correspondence between symbols and speech waveforms is generally complex.

There are various methods by which a speech synthesizer can generate a speech waveform. The main background art uses the short-time spectral features of speech, voiced/unvoiced information, and the fundamental frequency (F0) directly as parameters and generates the waveform from them. A representative method is speech synthesis based on the source-filter model, in which a speech waveform is synthesized by signal processing: an articulation filter that shapes the resonance of speech is driven by an appropriate sound source.

When a relatively simple sound source such as an impulse train or a white noise source is used, switching between the impulse train and the white noise source can be controlled from the voiced/unvoiced information, and the fundamental frequency of the impulse train can be controlled from the F0 parameter. As parameters representing spectral features, MFCCs (mel-frequency cepstral coefficients) or linear prediction coefficients are used. As the articulation filter, an AR (autoregressive) filter is used or, particularly when MFCCs are the parameters, an MLSA (mel log spectrum approximation) filter, which takes the MFCCs directly as its parameters (see Non-Patent Document 2).

To synthesize sounds such as consonants, the speech synthesis parameters must change over time. In this method it is therefore common to update the speech synthesis parameters at a constant period of, for example, about 5 ms, and to synthesize speech while its characteristics change. One such period is generally called a frame. In general, then, synthesizing speech requires creating frame-period time-series data for the speech synthesis parameters from the speech synthesis symbols.
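The frame-wise control of the sound source described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `make_excitation`, the 16 kHz sample rate, and the convention that `f0 <= 0` marks an unvoiced frame are all assumptions made for the example. The articulation filter that the excitation would drive is omitted.

```python
import random
from typing import List

def make_excitation(f0_per_frame: List[float],
                    sample_rate: int = 16000,
                    frame_period_s: float = 0.005,
                    seed: int = 0) -> List[float]:
    """Build an excitation signal frame by frame.

    Voiced frames (f0 > 0) emit an impulse train at f0 Hz;
    unvoiced frames (f0 <= 0, an assumed convention) emit white noise.
    """
    rng = random.Random(seed)
    samples_per_frame = int(round(sample_rate * frame_period_s))
    excitation: List[float] = []
    phase = 0.0  # fraction of a pitch period elapsed since the last pulse
    for f0 in f0_per_frame:
        for _ in range(samples_per_frame):
            if f0 > 0.0:
                phase += f0 / sample_rate
                if phase >= 1.0:
                    phase -= 1.0
                    excitation.append(1.0)   # one pulse per pitch period
                else:
                    excitation.append(0.0)
            else:
                phase = 0.0
                excitation.append(rng.gauss(0.0, 1.0))  # white noise sample
    return excitation

# 20 voiced frames at 100 Hz followed by 2 unvoiced frames (5 ms each).
signal = make_excitation([100.0] * 20 + [0.0] * 2)
print(len(signal))
```

Each 5 ms frame here corresponds to 80 samples, so updating `f0_per_frame` once per frame changes the source characteristics at the frame period.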

The simplest approach would be to prepare, in advance and for each required phoneme, frame-period time-series data covering the length of that phoneme, and to concatenate these speech synthesis parameter time series according to the phoneme sequence of the speech to be generated, yielding the time series for one utterance. As noted above, however, even the same phoneme can differ greatly in its characteristics depending on the types of the neighboring phonemes, the speaking rate, the pitch, and the temporal distance from the immediately preceding or following pause. Handling such cases requires a complex phoneme classification that takes neighboring phonemes and prosodic features into account. With such a classification the number of phoneme types becomes enormous, and it is difficult to create and store in advance the full set of required speech synthesis parameter time series.

In practice, therefore, the temporal variation of the speech synthesis parameter time series is represented by a suitable model. The model parameters are first predicted from the speech synthesis symbols, and the speech synthesis parameter time series is then generated from the resulting model, which makes it possible to synthesize arbitrary speech. Hereinafter this model is called the speech generation model.

For example, suppose the characteristics of a phoneme's speech synthesis parameter are divided in time into three states. Let the statistical distribution parameter vectors for the number of frames in each state be d1, d2, and d3 in order from the first state, and concatenate their elements into a single vector d. Let the statistical distribution parameter vectors of the speech synthesis parameter in each state be v1, v2, and v3 in order from the first state. The characteristics of the speech synthesis parameters needed to synthesize that phoneme can then be expressed by the four vectors d, v1, v2, and v3, which constitute the parameters of the speech generation model. Furthermore, by building in advance a predictor that generates these parameter vectors from speech synthesis symbols and using it at synthesis time, speech can be synthesized from a relatively small amount of data.
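As an illustration of the three-state description above, the following sketch stores the duration distributions (d1, d2, d3) and the per-state parameter distributions (v1, v2, v3) for one phoneme and expands them to one mean per frame. The class names, the choice of a mean and a variance as the distribution parameters, the use of the rounded mean duration, and all numeric values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gaussian:
    mean: float
    variance: float

@dataclass
class PhonemeModel:
    durations: List[Gaussian]  # d1, d2, d3: frame-count distribution per state
    states: List[Gaussian]     # v1, v2, v3: parameter distribution per state

def frame_means(model: PhonemeModel) -> List[float]:
    """Expand the model to one mean per frame, using each rounded mean duration."""
    means: List[float] = []
    for duration, state in zip(model.durations, model.states):
        n_frames = max(1, round(duration.mean))
        means.extend([state.mean] * n_frames)
    return means

# Hypothetical phoneme with state durations of 2, 4, and 3 frames.
model = PhonemeModel(
    durations=[Gaussian(2.0, 0.5), Gaussian(4.0, 1.0), Gaussian(3.0, 0.5)],
    states=[Gaussian(1.0, 0.1), Gaussian(2.5, 0.2), Gaussian(1.5, 0.1)],
)
print(frame_means(model))
```

The expanded sequence is constant within each state and jumps at state boundaries, which is the stepwise behavior that motivates the dynamic features discussed later in the description.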

A representative technique based on this approach is HMM-based speech synthesis, which assumes a model based on an HMM (hidden Markov model) as the speech generation model. The vectors that constitute the parameters of the speech generation model are each determined from the speech synthesis symbols by means of decision trees, in the same way as in the state-tying HMMs used in speech recognition (see Non-Patent Document 3). The decision trees are constructed (trained) from training speech prepared in advance and the speech synthesis symbols that correspond to it.

To synthesize the speech of one utterance, the speech generation models for the individual unit speech segments are first concatenated to form a speech generation model for the whole utterance. The speech synthesis parameter time series that maximizes the likelihood under this model is then obtained and used for waveform generation. The likelihood of the speech generation model with respect to a speech synthesis parameter time series is expressed, for example, as follows.

That is, suppose that the value x_i of a speech synthesis parameter x at frame i is statistically independent of the other types of speech synthesis parameters and follows a normal distribution with mean μ_i and variance σ_i². If the speech is n frames long in total, the log likelihood of the speech generation model for the time series x_i of the speech synthesis parameter x over one utterance is given by the following equation.
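The equation itself is not preserved in this text. Under the assumptions just stated (independent normal distributions per frame), the log likelihood takes the standard form below. The label (1) is inferred from the later references to Equations (2) and (3), and the symbol L for the log likelihood is introduced here for illustration.

```latex
L = \sum_{i=1}^{n} \log \mathcal{N}\!\left(x_i \mid \mu_i, \sigma_i^{2}\right)
  = -\frac{1}{2} \sum_{i=1}^{n}
    \left[ \log\!\left(2\pi\sigma_i^{2}\right)
         + \frac{(x_i - \mu_i)^{2}}{\sigma_i^{2}} \right]
\tag{1}
```

Each squared deviation is divided by the variance σ_i², so a frame with a small variance constrains the generated value more strongly than a frame with a large one.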

However, when frame-period speech synthesis parameters are modeled directly with a small number of normal distributions, the maximum likelihood parameter sequence simply outputs the mean of the normal distribution continuously within each state, and its value becomes discontinuous where the state switches. In other words, the parameter time series is stepwise. Because this differs from the characteristics of real speech, the feature vector is formed by combining the speech synthesis parameter itself (hereinafter called the static feature) with dynamic features such as the first-order difference (delta) and second-order difference (delta-delta) of the speech synthesis parameter time series. This allows the modeling to account for continuous changes in the speech synthesis parameters (see Non-Patent Document 4). The delta Δx_i and delta-delta Δ²x_i of the value x_i of a speech synthesis parameter x at the i-th frame are given, for example, by Equation (2) and Equation (3), respectively.
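Equations (2) and (3) are not preserved in this text, and the description presents them only as one example. A commonly used choice of difference windows, shown here purely as an assumption and not necessarily the one in the original, is:

```latex
\Delta x_i = \frac{1}{2}\left(x_{i+1} - x_{i-1}\right)
\tag{2}

\Delta^{2} x_i = x_{i-1} - 2x_i + x_{i+1}
\tag{3}
```

Both are linear in the static values, which is what allows the mapping from the static time series to the full feature vector time series to be written as a single matrix product later in the description.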

The method for calculating the time-series data of a speech synthesis parameter is described below. For the explanation, let o_i be the feature vector at frame i. In the equations, uppercase letters and bold lowercase letters denote vectors (the same applies hereinafter).

Let the length of the speech be n frames, and define the following matrices. A superscript T denotes the transpose and a superscript -1 denotes the inverse (the same applies hereinafter).

Furthermore, let W denote the transformation matrix, determined by Equations (2) and (3), that maps the static feature time series X to the feature vector time series O, which includes the dynamic features. That is, the following relationship holds, where W is a matrix with 3n rows and n columns.
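The relationship described here is O = WX. A minimal sketch of how such a 3n × n matrix can be built is shown below. It assumes the example windows Δx_i = (x_{i+1} − x_{i−1})/2 and Δ²x_i = x_{i−1} − 2x_i + x_{i+1}, an interleaved row order (static, delta, delta-delta for each frame), and repetition of the edge frame at the boundaries. None of these details is confirmed by the source.

```python
from typing import List

def build_w(n: int) -> List[List[float]]:
    """Return the 3n x n matrix W that maps static values X to O = W X."""
    w = [[0.0] * n for _ in range(3 * n)]
    for i in range(n):
        prev_i = max(i - 1, 0)        # assumed boundary rule: repeat the edge frame
        next_i = min(i + 1, n - 1)
        w[3 * i][i] = 1.0             # static row: x[i]
        w[3 * i + 1][next_i] += 0.5   # delta row: (x[i+1] - x[i-1]) / 2
        w[3 * i + 1][prev_i] -= 0.5
        w[3 * i + 2][prev_i] += 1.0   # delta-delta row: x[i-1] - 2 x[i] + x[i+1]
        w[3 * i + 2][i] -= 2.0
        w[3 * i + 2][next_i] += 1.0
    return w

def mat_vec(a: List[List[float]], x: List[float]) -> List[float]:
    """Multiply matrix a by vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

x = [1.0, 1.0, 2.0, 4.0]            # static time series X, n = 4
o = mat_vec(build_w(len(x)), x)     # feature time series O, length 3n = 12
print(o)
```

Every third entry of `o` starting at index 0 is the static value itself; the two entries that follow it are that frame's delta and delta-delta.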

When the parameters follow a normal distribution, the log likelihood p(X) of X is given by the following equation, where μ is the mean vector of the distribution of O and U is the variance-covariance matrix of the distribution of O. Each element of μ and U is determined from the speech synthesis symbols by decision trees trained in advance.

The X that maximizes the log likelihood p(X) satisfies the following relationship.
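The two equations referred to here are not preserved in this text. With O = WX normally distributed with mean μ and covariance U, they take the standard forms below. The labels (8) and (9) follow the references in the surrounding description, and writing the log likelihood as log p(X) is a notational choice made here.

```latex
\log p(X) = -\frac{1}{2}\,(WX - \mu)^{T} U^{-1} (WX - \mu)
            - \frac{1}{2}\log\lvert U\rvert
            - \frac{3n}{2}\log 2\pi
\tag{8}

\frac{\partial \log p(X)}{\partial X}
  = -\,W^{T} U^{-1} (WX - \mu) = 0
\tag{9}
```

Equation (9) is linear in X: it is the system of normal equations WᵀU⁻¹W X = WᵀU⁻¹μ, in which the mean μ always appears weighted by the inverse covariance U⁻¹.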

Solving Equation (8) and Equation (9) for X yields the following equation.

That is, calculating Equation (10) gives a parameter time series that takes the dynamic features into account and is based on the maximum likelihood criterion. The same applies when the speech synthesis parameter x is extended to a multidimensional vector.
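The solution of the normal equations has the standard closed form X = (WᵀU⁻¹W)⁻¹WᵀU⁻¹μ, which is presumably what Equation (10) expresses. The following sketch computes it in floating point for a diagonal U. The helper names, the difference windows inside `build_w`, the boundary handling, and the numeric values are assumptions made for the example.

```python
from typing import List

def build_w(n: int) -> List[List[float]]:
    """3n x n matrix W; the delta windows and edge handling are assumptions."""
    w = [[0.0] * n for _ in range(3 * n)]
    for i in range(n):
        prev_i, next_i = max(i - 1, 0), min(i + 1, n - 1)
        w[3 * i][i] = 1.0
        w[3 * i + 1][next_i] += 0.5
        w[3 * i + 1][prev_i] -= 0.5
        w[3 * i + 2][prev_i] += 1.0
        w[3 * i + 2][i] -= 2.0
        w[3 * i + 2][next_i] += 1.0
    return w

def mat_vec(a: List[List[float]], x: List[float]) -> List[float]:
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ml_trajectory(mu: List[float], precision: List[float]) -> List[float]:
    """X = (W^T U^-1 W)^-1 W^T U^-1 mu, with U^-1 = diag(precision)."""
    n = len(mu) // 3
    w = build_w(n)
    rows = range(3 * n)
    a = [[sum(w[r][i] * precision[r] * w[r][j] for r in rows) for j in range(n)]
         for i in range(n)]
    b = [sum(w[r][i] * precision[r] * mu[r] for r in rows) for i in range(n)]
    return solve(a, b)

# Two states with static means 1.0 and 3.0; the dynamic-feature means are zero.
mu: List[float] = []
for static_mean in [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]:
    mu += [static_mean, 0.0, 0.0]
precision = [1.0, 10.0, 10.0] * 6   # 1 / variance for static, delta, delta-delta
trajectory = ml_trajectory(mu, precision)
print([round(v, 3) for v in trajectory])
```

Because the delta and delta-delta precisions are larger than the static precision in this example, the generated trajectory is pulled away from the stepwise sequence of static means toward a smoother curve.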

Non-Patent Document 1: "Symbols for Japanese Text-to-Speech Synthesis," JEITA standard IT-4002, March 2005.
Non-Patent Document 2: Satoshi Imai, Kazuo Sumita, Chieko Furuichi, "Mel Log Spectrum Approximation (MLSA) Filter for Speech Synthesis," IEICE Transactions (A), J66-A, 2, Feb. 1983, pp. 122-129.
Non-Patent Document 3: Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura, "Simultaneous Modeling of Spectrum, Pitch, and Duration in HMM-Based Speech Synthesis," IEICE Transactions (D-II), J83-D-II, 11, Nov. 2000, pp. 2099-2107.
Non-Patent Document 4: Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Satoshi Imai, "HMM-Based Speech Synthesis Using Dynamic Features," IEICE Transactions (D-II), J79-D-II, 12, Dec. 1996, pp. 2184-2190.

Suppose the decision trees that predict the parameter vectors of the speech synthesis parameter distributions are trained with explanatory variables that include not only the phonological phoneme types but also differences in linguistic prosodic features such as accent type and accent phrase boundaries. In the feature vector distributions predicted by the trained trees, the variances of the elements associated with delta and delta-delta features then often tend to be smaller than the variances of the elements associated with static features. This is thought to be because linguistic prosodic features correlate more strongly with the short-time changes of a speech synthesis parameter than with its absolute value.

According to Equation (8), the log likelihood of the parameter time series to be generated, the distribution means are always weighted by the reciprocal of the distribution variances when the log likelihood is calculated. Given the tendency above, values carrying delta and delta-delta feature information therefore often become relatively larger during the calculation than values carrying static feature information.

On a device such as a mobile terminal, where computing resources are limited and fixed-point arithmetic is required, calculated values must be kept below a certain magnitude to prevent overflow. If the range of values that can be handled (for example, the ratio between the maximum and minimum values) is not wide enough, information represented by small values is rounded through loss of significant digits, and errors arise easily. In other words, accurate information about the static feature distributions is easily lost while the speech synthesis parameter time series is being generated. Because the static features determine the absolute position of a feature parameter, this rounding error can produce a speech synthesis parameter time series that is shifted up or down along the feature axis, and that shift makes the synthesized speech sound unnatural.
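The loss of precision described above can be illustrated with a small sketch. It assumes a signed 16-bit fixed-point format whose full scale is set by the largest value that must be represented; the function name `quantize` and the numeric values are hypothetical and are not taken from the source.

```python
def quantize(value: float, max_abs: float, bits: int = 16) -> float:
    """Round value to a signed fixed-point grid whose full scale is max_abs."""
    step = max_abs / (2 ** (bits - 1) - 1)   # smallest representable increment
    return round(value / step) * step

def relative_error(value: float, max_abs: float) -> float:
    return abs(quantize(value, max_abs) - value) / abs(value)

# The scale is chosen so that the large (dynamic-feature) term does not overflow.
large_term = 1000.0    # e.g. a mean weighted by a small variance
small_term = 0.0123    # e.g. a mean weighted by a large variance

print(relative_error(large_term, max_abs=large_term))
print(relative_error(small_term, max_abs=large_term))
```

When both terms share one scale, the grid spacing is dictated by the large term, and a small term can fall below half a step and be rounded away entirely.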

The present invention has been made in view of these circumstances. Its object is to provide a speech synthesis apparatus, a speech synthesis method, and a speech synthesis program that are based on an accurate speech synthesis parameter time series even when high-precision calculation is difficult.

(1) To achieve the above object, the speech synthesizer of the present invention generates a synthesized speech waveform from speech synthesis information that describes the types of the unit speech segments contained in a sequence of unit speech segments. It comprises: a first speech synthesis parameter generation unit that generates first speech synthesis parameter time-series data, which has a large numerical range, using distribution information of feature vectors based on the given speech synthesis information; a second speech synthesis parameter generation unit that modifies the distribution information of the feature vectors based on the given speech synthesis information and generates, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data whose numerical range is smaller than that of the first speech synthesis parameter time-series data; and a speech synthesis parameter addition unit that adds the first and second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data. The synthesizer generates a synthesized speech waveform based on the third speech synthesis parameter time-series data.

In this way, the first speech synthesis parameter generation unit generates first speech synthesis parameter time-series data with a large numerical range, and the second speech synthesis parameter generation unit generates second speech synthesis parameter time-series data whose numerical range is smaller than that of the first.

As a result, the first speech synthesis parameter generation unit mainly processes the feature information for which rounding error was a problem in the conventional method and excludes the other features from its processing. Rounding error during processing can thereby be suppressed compared with processing all the feature information at once, and the calculation error of the final speech synthesis parameter time-series data can be reduced as a whole.

(2) In the speech synthesizer of the present invention, the second speech synthesis parameter generation unit modifies the distribution information of the feature vectors by carrying out the general calculation of a parameter time series based on the maximum likelihood criterion with the mean parameter in the distribution information of the feature vectors replaced by its difference from the feature vector that corresponds to the first speech synthesis parameter time-series data.

The modification of the feature distribution parameters needed to obtain the second speech synthesis parameter can thus be realized by replacing the parameter μ, which relates to the distribution mean in the conventional method, with its difference from the feature vector, including dynamic features, that corresponds to the first speech synthesis parameter time-series data X_0.

In this case, the sum of the first and second speech synthesis parameter time-series data coincides exactly, as a matter of mathematics, with the speech synthesis parameter time-series data obtained by the conventional method. A more accurate speech synthesis parameter time series can therefore be generated than with techniques that involve approximate generation of the time-series data.
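The exact agreement can be checked with a short derivation. It assumes that the conventional solution has the closed form X = (WᵀU⁻¹W)⁻¹WᵀU⁻¹μ and that the second time series X_1 is obtained from the same formula with μ replaced by μ − WX_0, as described above. The symbol X_1 is introduced here for illustration.

```latex
\begin{aligned}
X_1 &= \left(W^{T}U^{-1}W\right)^{-1} W^{T}U^{-1}\left(\mu - WX_0\right) \\
    &= \left(W^{T}U^{-1}W\right)^{-1} W^{T}U^{-1}\mu
       \;-\; \left(W^{T}U^{-1}W\right)^{-1}\left(W^{T}U^{-1}W\right) X_0 \\
    &= X - X_0 ,
\end{aligned}
\qquad\text{hence}\qquad X_0 + X_1 = X .
```

The identity holds for any choice of X_0, so the freedom in choosing X_0 can be used to keep the values handled during the calculation of X_1 small.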

(3) In the speech synthesizer of the present invention, the first speech synthesis parameter generation unit generates speech synthesis parameter time-series data from the distribution information of the static features, which do not directly represent temporal change. This reduces the influence of rounding error on values related to the static features, which was a problem in the conventional method, so that an accurate speech synthesis parameter time series can ultimately be generated.

(4) In the speech synthesizer of the present invention, the speech synthesis parameter time-series data generated from the distribution information of the static feature vectors is the time series of the distribution mean parameters of the static features. In this case, in the vector (μ − WX_0) that results from modifying the feature distribution parameters to obtain the second speech synthesis parameter, the elements corresponding to the static feature distribution means are all 0. Because no rounding error arises in calculations on the value 0, the influence of rounding error on values related to the static features, which was a problem in the conventional method, is reduced, and a more accurate speech synthesis parameter time series than in the conventional method can ultimately be generated.
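The two-stage generation can be sketched as follows, taking X_0 to be the time series of static means. This is a floating-point illustration under stated assumptions, not the patent's fixed-point implementation: the helper names, the difference windows in `build_w`, the diagonal covariance, and the numeric values are all hypothetical.

```python
from typing import List

def build_w(n: int) -> List[List[float]]:
    """3n x n matrix W; the delta windows and edge handling are assumptions."""
    w = [[0.0] * n for _ in range(3 * n)]
    for i in range(n):
        prev_i, next_i = max(i - 1, 0), min(i + 1, n - 1)
        w[3 * i][i] = 1.0
        w[3 * i + 1][next_i] += 0.5
        w[3 * i + 1][prev_i] -= 0.5
        w[3 * i + 2][prev_i] += 1.0
        w[3 * i + 2][i] -= 2.0
        w[3 * i + 2][next_i] += 1.0
    return w

def mat_vec(a: List[List[float]], x: List[float]) -> List[float]:
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def generate(mu: List[float], precision: List[float]) -> List[float]:
    """X = (W^T U^-1 W)^-1 W^T U^-1 mu, with U^-1 = diag(precision)."""
    n = len(mu) // 3
    w = build_w(n)
    rows = range(3 * n)
    a = [[sum(w[r][i] * precision[r] * w[r][j] for r in rows) for j in range(n)]
         for i in range(n)]
    b = [sum(w[r][i] * precision[r] * mu[r] for r in rows) for i in range(n)]
    return solve(a, b)

# Per-frame means in the order [static, delta, delta-delta].
static_means = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
delta_means = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]
mu: List[float] = []
for s, d in zip(static_means, delta_means):
    mu += [s, d, 0.0]
precision = [1.0, 20.0, 20.0] * len(static_means)

x_conventional = generate(mu, precision)          # single-stage result X

x0 = static_means[:]                              # first time series: static means
w = build_w(len(x0))
mu_modified = [m - wx for m, wx in zip(mu, mat_vec(w, x0))]   # mu - W X_0
x1 = generate(mu_modified, precision)             # second time series
x3 = [a + b for a, b in zip(x0, x1)]              # third time series: X_0 + X_1

print(mu_modified[0::3])                          # static entries of mu - W X_0
print(max(abs(a - b) for a, b in zip(x3, x_conventional)))
```

In this sketch the first stage involves no arithmetic beyond copying the static means, and the static entries of `mu_modified` are exactly zero, so only the dynamic-feature terms enter the second-stage calculation.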

(5) In the speech synthesizer of the present invention, the first speech synthesis parameter generation unit, by generating the first speech synthesis parameter, stores numerical range information for each time segment of the third speech synthesis parameter time-series data that is ultimately to be generated. The second speech synthesis parameter generation unit, by generating the second speech synthesis parameter, calculates the numerical change within each of those time segments of the third speech synthesis parameter time-series data. The speech synthesis parameter addition unit, through the addition, reflects the stored numerical range information in the calculated numerical change.

In this way, the value for each time segment of the speech synthesis parameter time-series data that is ultimately to be generated is merely stored temporarily and then reflected in the second speech synthesis parameter. No substantial calculation is performed on the first speech synthesis parameter, so no error arises there.

(6) The speech synthesis method of the present invention generates a synthesized speech waveform from speech synthesis information that describes the types of the unit speech segments contained in a sequence of unit speech segments. It includes: a step of generating first speech synthesis parameter time-series data, which has a large numerical range, using distribution information of feature vectors based on the given speech synthesis information; a step of modifying the distribution information of the feature vectors based on the given speech synthesis information and generating, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data whose numerical range is smaller than that of the first speech synthesis parameter time-series data; and a step of adding the first and second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data. A synthesized speech waveform is generated based on the third speech synthesis parameter time-series data.

As a result, the feature information for which rounding error was a problem in the conventional method becomes the main processing target and the other features are excluded from that processing, so rounding error during processing can be suppressed compared with processing all the feature information at once.

(7) The speech synthesis program of the present invention is executed by a computer to generate a synthesized speech waveform from speech synthesis information that describes the types of the unit speech segments contained in a sequence of unit speech segments. It includes: a process of generating first speech synthesis parameter time-series data, which has a large numerical range, using distribution information of feature vectors based on the given speech synthesis information; a process of modifying the distribution information of the feature vectors based on the given speech synthesis information and generating, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data whose numerical range is smaller than that of the first speech synthesis parameter time-series data; and a process of adding the first and second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data. A synthesized speech waveform is generated based on the third speech synthesis parameter time-series data.

The first speech synthesis parameter generation unit generates the first speech synthesis parameter time-series data mainly from the information for which calculation error was a problem in the conventional method. The second speech synthesis parameter generation unit generates, as the second speech synthesis parameter time-series data, the difference between the final speech synthesis parameter time-series data and the first speech synthesis parameter time-series data.

Because the first speech synthesis parameter generation unit mainly processes the feature information for which rounding error was a problem in the conventional method and excludes the other features from its processing, rounding error during processing can be suppressed compared with the conventional method, which processed all the feature information at once. The calculation error of the final speech synthesis parameter time-series data can thereby be reduced as a whole.

A block diagram showing the speech synthesizer of the present invention.
A flowchart showing the operation of the speech synthesizer of the present invention.
(a) to (c) Diagrams showing an example of each speech synthesis parameter time-series data.
(a) to (c) Diagrams showing an example of each speech synthesis parameter time-series data.

In the following description, a "unit speech" is the smallest unit of speech processed by the speech synthesizer. Specific examples include phonemes, syllables, and words. Unit speech segments are classified by taking into account differences in phonological environment, such as the types of the neighboring phonemes, and differences in prosodic features, such as accent, intonation, and speaking rate. A "unit utterance" is a sequence of unit speech segments with continuous characteristics and corresponds to the utterance of one sentence or to a breath group (the unit read in one breath). "Speech synthesis symbols" are a series of symbols that describe the type of each unit speech segment contained in the speech of one unit utterance.

The speech synthesizer 100 ultimately generates a speech waveform from a speech synthesis parameter time series. However, it is not limited to systems that generate the speech waveform from the speech synthesis parameter time-series data by signal processing with a sound source and an articulation filter. For example, the invention also covers unit-concatenation speech synthesis systems, in which a speech unit database is built in advance from pre-recorded speech data and speech is synthesized by selecting and concatenating the speech unit sequence that corresponds to the speech synthesis parameter time-series data. The speech synthesis parameter may be a multidimensional vector.

(Configuration of speech synthesizer)
FIG. 1 is a block diagram showing the speech synthesizer 100. The speech synthesizer 100 outputs a synthesized speech waveform in response to an input of speech synthesis symbols. As shown in FIG. 1, the speech synthesizer 100 consists of a speech feature distribution parameter generation unit 105, a first speech synthesis parameter generation unit 110, a second speech synthesis parameter generation unit 120, a speech synthesis parameter addition unit 130, and a speech waveform generation unit 140. The second speech synthesis parameter generation unit 120 in turn consists of a speech feature distribution parameter correction unit 121 and a speech synthesis parameter time series calculation unit 122.

Each unit is described below, following the flow of processing that generates a synthesized speech waveform from speech synthesis symbols. The speech feature distribution parameter generation unit 105 generates speech feature distribution parameters from the speech synthesis symbol string. Here, the speech features include not only static features but also dynamic features such as their delta and delta-delta features. The speech feature distribution parameter generation unit has a predictor, trained on training speech, that predicts the distribution parameters of the speech features. All of the above features are assumed to follow a normal distribution, and the distribution parameters consist of the mean vector and the variance-covariance matrix. Each of these parameters can be generated with a decision tree. The decision trees used here are trained in advance, using training speech, on the relationship between the speech synthesis symbols and the corresponding features.
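As an illustration of how a decision tree can map symbol context to distribution parameters, the following is a minimal sketch. The `Node` class, the context keys (`phoneme`, `accented`), the questions, and the leaf values are all hypothetical and are not taken from this description; a real tree would be trained on speech data and would return mean vectors and covariance matrices rather than scalars.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Node:
    """One node of a binary decision tree over symbol context."""
    question: Optional[Callable[[dict], bool]] = None  # None at a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    leaf: Optional[Tuple[float, float]] = None  # (mean, variance) at a leaf


def predict(node: Node, context: dict) -> Tuple[float, float]:
    """Descend the tree by answering context questions until a leaf is reached."""
    while node.leaf is None:
        node = node.yes if node.question(context) else node.no
    return node.leaf


# Hypothetical tree: questions and leaf values are invented for illustration.
VOWELS = {"a", "i", "u", "e", "o"}
tree = Node(
    question=lambda c: c["phoneme"] in VOWELS,
    yes=Node(
        question=lambda c: c["accented"],
        yes=Node(leaf=(5.2, 0.04)),
        no=Node(leaf=(4.8, 0.05)),
    ),
    no=Node(leaf=(0.0, 1.0)),
)

mean, variance = predict(tree, {"phoneme": "a", "accented": True})
```

In this sketch one tree returns both the mean and the variance; the description only states that each parameter can be generated with a decision tree, so separate trees per parameter or per feature stream would be equally compatible with it.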

The first speech synthesis parameter generation unit 110 generates first speech synthesis parameter time-series data X₀ from the speech feature distribution parameters. The first speech synthesis parameter time-series data need not be identical to the speech synthesis parameter time-series data finally used by the speech waveform generation unit 140.

The second speech synthesis parameter generation unit 120 receives the speech feature distribution parameters and the first speech synthesis parameter time-series data X₀ as inputs, and generates, as second speech synthesis parameter time-series data X₁, the time-series data of the difference between the speech synthesis parameter time-series data X finally used by the speech waveform generation unit 140 and the first speech synthesis parameter time-series data X₀. That is, the following relationship holds.

X = X₀ + X₁

The speech synthesis parameter addition unit 130 receives the first speech synthesis parameter time-series data X₀ and the second speech synthesis parameter time-series data X₁ as inputs, and outputs the series of their sums at each time as the final speech synthesis parameter time-series data, that is, the third speech synthesis parameter time-series data X. Finally, the speech waveform generation unit 140 synthesizes and outputs the speech waveform corresponding to the speech synthesis parameter time-series data X.

In the second speech synthesis parameter generation unit 120, the speech feature distribution parameter correction unit 121 corrects the input speech feature distribution parameters μ and U using X₀. The speech synthesis parameter time series calculation unit 122 then calculates the second speech synthesis parameter time-series data X₁.

From equations (10) and (11), X₁ can be obtained by the following calculation.

Equation (12) shows that the correction of the feature distribution parameters needed to obtain X₁ can be realized by replacing the parameter μ, which relates to the distribution means in the general calculation process (equation (10)), with its difference from the feature vector WX₀, which corresponds to X₀ and includes the dynamic features.
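The substitution can be written out as a short derivation. This is a hedged reconstruction, not the equations as printed: it assumes that equation (10) has the usual maximum-likelihood form $W^{\top} U^{-1} W X = W^{\top} U^{-1} \mu$, with $U$ taken to be the covariance matrix of the feature distribution and $W$ the matrix that maps a parameter series to its static and dynamic features, and that equation (11) is the relation $X = X_0 + X_1$.

```latex
\begin{align}
W^{\top} U^{-1} W \,(X_0 + X_1) &= W^{\top} U^{-1} \mu \\
W^{\top} U^{-1} W \,X_1 &= W^{\top} U^{-1} \mu - W^{\top} U^{-1} W X_0 \\
                        &= W^{\top} U^{-1} \,(\mu - W X_0)
\end{align}
```

Under these assumptions the last line is the same linear system as in the general case, with μ replaced by μ − WX₀, which matches the description of equation (12) above.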

Any time-series data can be set as the first speech synthesis parameter time-series data X₀ output by the first speech synthesis parameter generation unit 110. It is, however, preferable to set X₀ so that the final calculation error becomes small. One such X₀ is the series composed of the distribution means of the static features at each time. In this case, all elements of the vector (μ − WX₀) that correspond to the distribution mean parameters of the static features are 0. Since no rounding error arises in calculations on the value 0, the influence of rounding errors in the static-feature values, which was a problem in the conventional method, is reduced when the second speech synthesis parameter generation unit calculates X₁, and speech synthesis parameter time-series data more accurate than with the conventional method can ultimately be generated.
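The zero-valued static elements of (μ − WX₀) can be checked with a small sketch. It assumes a one-dimensional parameter track with one static and one delta feature per frame, and a delta window of 0.5 · (x[t+1] − x[t−1]) with edge replication; the window, the function name `apply_window`, and the numeric values are assumptions made for illustration.

```python
def apply_window(x):
    """Compute W x for a 1-D track: one (static, delta) pair per frame.

    The delta window 0.5 * (x[t+1] - x[t-1]) with edge replication is an
    assumption made for this sketch.
    """
    last = len(x) - 1
    feats = []
    for t, value in enumerate(x):
        prev = x[max(t - 1, 0)]
        nxt = x[min(t + 1, last)]
        feats.append((value, 0.5 * (nxt - prev)))
    return feats


# Hypothetical distribution means per frame: (static mean, delta mean).
mu = [(10.25, 0.0), (10.25, 4.0), (30.5, 4.0), (30.5, 0.0)]

# X0: the static-feature distribution mean at each time.
x0 = [static_mean for static_mean, _ in mu]

# Corrected mean vector (mu - W X0), kept as (static, delta) pairs.
residual = [
    (m_s - w_s, m_d - w_d)
    for (m_s, m_d), (w_s, w_d) in zip(mu, apply_window(x0))
]
static_residual = [r_s for r_s, _ in residual]
delta_residual = [r_d for _, r_d in residual]
```

Every entry of `static_residual` is exactly zero because each static mean is subtracted from itself, while `delta_residual` carries the remaining information about how the trajectory should deviate from X₀.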

Alternatively, by setting as X₀ a series whose difference from X is expected to be small, such as a series obtained by temporally smoothing the time series of static-feature distribution mean parameters with a low-pass filter, the range of values of the generated X₁ can be made narrower than the range of values of X in the conventional method. This allows more fractional digits to be used in fixed-point arithmetic and reduces rounding errors during calculation.
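The link between value range and fractional precision can be made concrete with a small sketch. It assumes a signed 16-bit fixed-point word and round-to-nearest quantization; the word length, the function names, and the example ranges (40.0 for X, 1.5 for X₁) are hypothetical.

```python
def fractional_bits(max_abs, word_bits=16):
    """Fractional bits left in a signed fixed-point word after reserving
    enough integer bits to hold magnitudes up to max_abs."""
    int_bits = 0
    while (1 << int_bits) <= max_abs:
        int_bits += 1
    return word_bits - 1 - int_bits  # one bit is used for the sign


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2 ** -frac_bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


# Hypothetical ranges: X spans up to 40.0 in magnitude, X1 only up to 1.5.
bits_for_x = fractional_bits(40.0)
bits_for_x1 = fractional_bits(1.5)

value = 0.123456
error_wide = abs(quantize(value, bits_for_x) - value)
error_narrow = abs(quantize(value, bits_for_x1) - value)
```

With these example ranges the narrower series leaves more bits for the fraction, so the worst-case rounding step shrinks by a factor of two for every integer bit that is no longer needed.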

(Operation of speech synthesizer)
The operation of the speech synthesizer 100 configured as described above is explained next. FIG. 2 is a flowchart showing the operation of the speech synthesizer 100. First, speech feature distribution parameters are generated from the speech synthesis symbol string (step S1). Next, the first speech synthesis parameter time-series data X₀ is generated from the speech feature distribution parameters according to a preset criterion (step S2). The preset criterion is, for example, one that separates the calculation result into a component with a large numerical range and a component with a small numerical range.

Next, the speech feature distribution parameters used to generate the second speech synthesis parameter time-series data are corrected based on the above criterion (step S3). The second speech synthesis parameter time-series data X₁ is then generated from the corrected speech feature distribution parameters (step S4). The two speech synthesis parameter time-series data X₀ and X₁ obtained in this way are added to generate the third speech synthesis parameter time-series data X (step S5). A speech waveform is then generated using the third speech synthesis parameter time-series data X (step S6). This series of processes can be carried out by executing a program installed on a mobile terminal or the like. The method can also be described as follows: the numerical range information of the time-series data to be finally generated is stored first, the remaining numerical change is calculated, and the stored numerical range information is then reflected in the calculated numerical change.
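Steps S2 to S5 can be sketched end to end for a one-dimensional track. The sketch rests on several assumptions that are not stated in this description: a diagonal covariance (stored here as per-element precisions), a delta window of 0.5 · (x[t+1] − x[t−1]) with edge replication, a linear system of the usual maximum-likelihood form, and floating-point rather than fixed-point arithmetic. All function names and numeric values are hypothetical.

```python
def window_matrix(T):
    """Build W (2T x T): per frame, a static row and a delta row.

    Delta is 0.5 * (x[t+1] - x[t-1]) with edge replication (an assumption).
    """
    W = []
    for t in range(T):
        static = [0.0] * T
        static[t] = 1.0
        delta = [0.0] * T
        delta[max(t - 1, 0)] -= 0.5
        delta[min(t + 1, T - 1)] += 0.5
        W.extend([static, delta])
    return W


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def ml_trajectory(W, mean, precision):
    """Solve W^T P W x = W^T P mean, with P = diag(precision)."""
    T = len(W[0])
    rows = range(len(W))
    A = [[sum(W[k][i] * precision[k] * W[k][j] for k in rows) for j in range(T)]
         for i in range(T)]
    b = [sum(W[k][i] * precision[k] * mean[k] for k in rows) for i in range(T)]
    return solve(A, b)


# S1: distribution parameters (hypothetical values; prediction is omitted).
mu = [10.0, 0.0, 10.0, 5.0, 30.0, 5.0, 30.0, 0.0, 30.0, 0.0]  # static, delta, ...
precision = [1.0, 4.0] * 5                                    # inverse variances
W = window_matrix(5)

x0 = mu[0::2]                                             # S2: static means
corrected = [m - wx for m, wx in zip(mu, matvec(W, x0))]  # S3: mu - W X0
x1 = ml_trajectory(W, corrected, precision)               # S4: solve for X1
x = [a + b for a, b in zip(x0, x1)]                       # S5: X = X0 + X1
# S6 (waveform generation from x) is outside this sketch.

x_direct = ml_trajectory(W, mu, precision)  # single solve, for comparison
```

In exact arithmetic `x` and `x_direct` coincide; the benefit described in the text appears only when step S4 runs in limited-precision fixed-point arithmetic, which this floating-point sketch does not model.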

(Example of speech synthesis parameter time-series data)
An example of the speech synthesis parameter time-series data obtained with the above embodiment is described next. FIGS. 3(a) to 3(c) are diagrams showing an example of each speech synthesis parameter time-series data. The horizontal axis represents time, and the vertical axis represents the value of one dimension of the speech feature vector. FIG. 3(a) shows the first speech synthesis parameter time-series data X₀ obtained as the average value for each divided time segment. This corresponds to a speech generation model that assumes the feature distribution parameters are constant within each segment. The first speech synthesis parameter spans a wide numerical range, but because the average can be calculated independently for each segment, errors are unlikely to arise in the calculation. FIG. 3(b) shows the second speech synthesis parameter time-series data X₁ obtained as the difference between X₀ and the third speech synthesis parameter time-series data X that is ultimately sought. The second speech synthesis parameter changes over time in a complicated way, but is confined to a narrow numerical range. FIG. 3(c) shows the third speech synthesis parameter time-series data X obtained by adding the first speech synthesis parameter time-series data X₀ and the second speech synthesis parameter time-series data X₁.

FIGS. 4(a) to 4(c) are likewise diagrams showing an example of each speech synthesis parameter time-series data. The horizontal axis represents time, and the vertical axis represents the value of one dimension of the speech feature vector. FIG. 4(a) shows the first speech synthesis parameter time-series data X₀ obtained by smoothing the series of per-segment average values with a polyline approximation. The first speech synthesis parameter time-series data X₀ spans a wide numerical range, but its change over time is simple, so errors are unlikely to arise in the calculation. In this case, X₀ is closer to the speech synthesis parameters ultimately sought than the staircase-shaped series of averages calculated for each segment. FIG. 4(b) shows the second speech synthesis parameter time-series data X₁ obtained as the difference between X₀ and the third speech synthesis parameter time-series data X that is ultimately sought. The second speech synthesis parameter time-series data X₁ changes over time in a complicated way, but is confined to an even narrower numerical range than in the case of FIG. 3(b). FIG. 4(c) shows the third speech synthesis parameter time-series data X obtained by adding the first speech synthesis parameter time-series data X₀ and the second speech synthesis parameter time-series data X₁.
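One possible way to build a polyline-smoothed X₀ is to interpolate linearly between the centres of adjacent segments. The description does not specify how the polyline approximation is computed, so the knot placement at segment centres, the constant hold before the first and after the last knot, and the function name `polyline_series` are assumptions of this sketch.

```python
def polyline_series(segment_means, durations):
    """Linearly interpolate between segment centres to get a frame-level series.

    Frames before the first centre or after the last centre hold the nearest
    segment mean (an assumption made for this sketch).
    """
    centres = []
    start = 0
    for n_frames in durations:
        centres.append(start + (n_frames - 1) / 2.0)
        start += n_frames
    total_frames = start

    series = []
    for t in range(total_frames):
        if t <= centres[0]:
            series.append(segment_means[0])
        elif t >= centres[-1]:
            series.append(segment_means[-1])
        else:
            # Index of the last knot at or before frame t.
            k = max(i for i, c in enumerate(centres) if c <= t)
            frac = (t - centres[k]) / (centres[k + 1] - centres[k])
            series.append(
                segment_means[k] + frac * (segment_means[k + 1] - segment_means[k])
            )
    return series


# Hypothetical example: two segments of three frames each.
x0_polyline = polyline_series([10.0, 30.0], [3, 3])
```

Compared with a staircase series built from the same averages, this series moves gradually across the segment boundary, so the remaining difference X₁ has less to correct there.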

(Modification)
In the above description, the speech waveform is generated from a single vector X. However, the speech synthesis vector time series may instead be calculated independently for each type of acoustic feature of speech, such as spectrum and fundamental frequency, and the results may be combined in the speech waveform generation process.

In the above embodiment, the first speech synthesis parameter generation unit and the second speech synthesis parameter generation unit each generate a speech synthesis parameter time series from the same speech feature distribution parameters, which are generated from the speech synthesis symbol string. However, the two units may use different speech feature distribution parameters. For example, to simplify the generation of X₀, X₀ can be generated from speech feature distribution parameters produced by a simpler speech feature distribution parameter generation process.

100 Speech synthesizer
105 Speech feature distribution parameter generation unit
110 First speech synthesis parameter generation unit
120 Second speech synthesis parameter generation unit
121 Speech feature distribution parameter correction unit
122 Speech synthesis parameter time series calculation unit
130 Speech synthesis parameter addition unit
140 Speech waveform generation unit

Claims

A speech synthesizer that generates a synthesized speech waveform from speech synthesis information that describes the type of unit speech included in a series of unit speech sequences,
A first speech synthesis parameter generation unit that generates first speech synthesis parameter time-series data having a large numerical range using feature vector distribution information based on given speech synthesis information;
A second speech synthesis parameter generation unit that corrects the feature vector distribution information based on the given speech synthesis information by performing, in a general parameter time-series calculation process based on the maximum likelihood criterion, a calculation that replaces the mean parameter in the feature vector distribution information with the difference between that mean parameter and the feature vector for the first speech synthesis parameter time-series data, and that generates, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data having a numerical range smaller than that of the first speech synthesis parameter time-series data;
A speech synthesis parameter addition unit that adds the first speech synthesis parameter time-series data and the second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data;
A speech synthesizer for generating a synthesized speech waveform based on the third speech synthesis parameter time-series data.

The speech synthesizer according to claim 1, wherein
the first speech synthesis parameter generation unit, by generating the first speech synthesis parameter, stores numerical range information for each divided time of the third speech synthesis parameter time-series data to be finally generated,
the second speech synthesis parameter generation unit, by generating the second speech synthesis parameter, calculates a numerical change for each divided time of the third speech synthesis parameter time-series data, and
the speech synthesis parameter addition unit reflects, by the addition, the stored numerical range information in the calculated numerical change.

A speech synthesizer that generates a synthesized speech waveform from speech synthesis information that describes the type of unit speech included in a series of unit speech sequences,
A first speech synthesis parameter generation unit that generates first speech synthesis parameter time-series data having a large numerical range from the distribution information of static features, which do not directly represent temporal change, among the feature vector distribution information based on given speech synthesis information;
A second speech synthesis parameter generation unit that corrects the feature vector distribution information based on the given speech synthesis information and generates, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data having a numerical range smaller than that of the first speech synthesis parameter time-series data;
A speech synthesis parameter addition unit that adds the first speech synthesis parameter time-series data and the second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data;
A speech synthesizer for generating a synthesized speech waveform based on the third speech synthesis parameter time-series data.

4. The speech synthesizer according to claim 3, wherein the speech synthesis parameter time series data generated from the feature vector distribution information of the static features is a time series of static feature distribution average parameters.

A speech synthesis method for generating a synthesized speech waveform from speech synthesis information that describes a type of unit speech included in a series of unit speech sequences,
Generating first speech synthesis parameter time-series data having a large numerical range using feature vector distribution information based on given speech synthesis information;
Correcting the feature vector distribution information based on the given speech synthesis information by performing, in a general parameter time-series calculation process based on the maximum likelihood criterion, a calculation that replaces the mean parameter in the feature vector distribution information with the difference between that mean parameter and the feature vector for the first speech synthesis parameter time-series data, and generating, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data having a numerical range smaller than that of the first speech synthesis parameter time-series data;
Adding the first speech synthesis parameter time-series data and the second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data;
A speech synthesis method, comprising: generating a synthesized speech waveform based on the third speech synthesis parameter time-series data.

A speech synthesis method for generating a synthesized speech waveform from speech synthesis information that describes a type of unit speech included in a series of unit speech sequences,
Generating first speech synthesis parameter time-series data having a large numerical range from the distribution information of static features, which do not directly represent temporal change, among the feature vector distribution information based on given speech synthesis information;
Correcting the feature vector distribution information based on the given speech synthesis information and generating, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data having a numerical range smaller than that of the first speech synthesis parameter time-series data;
Adding the first speech synthesis parameter time-series data and the second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data;
A speech synthesis method, comprising: generating a synthesized speech waveform based on the third speech synthesis parameter time-series data.

A speech synthesis program that is executed by a computer to generate a synthesized speech waveform from speech synthesis information that describes a type of unit speech included in a series of unit speech sequences,
Processing for generating first speech synthesis parameter time series data having a large numerical range using distribution information of feature vectors based on given speech synthesis information;
Processing for correcting the feature vector distribution information based on the given speech synthesis information by performing, in a general parameter time-series calculation process based on the maximum likelihood criterion, a calculation that replaces the mean parameter in the feature vector distribution information with the difference between that mean parameter and the feature vector for the first speech synthesis parameter time-series data, and for generating, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data having a numerical range smaller than that of the first speech synthesis parameter time-series data;
Adding the first speech synthesis parameter time-series data and the second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data,
A speech synthesis program for generating a synthesized speech waveform based on the third speech synthesis parameter time-series data.

A speech synthesis program that is executed by a computer to generate a synthesized speech waveform from speech synthesis information that describes a type of unit speech included in a series of unit speech sequences,
Processing for generating first speech synthesis parameter time-series data having a large numerical range from the distribution information of static features, which do not directly represent temporal change, among the feature vector distribution information based on given speech synthesis information;
Processing for correcting the feature vector distribution information based on the given speech synthesis information and generating, as time-series data of the difference from the first speech synthesis parameter, second speech synthesis parameter time-series data having a numerical range smaller than that of the first speech synthesis parameter time-series data;
Adding the first speech synthesis parameter time-series data and the second speech synthesis parameter time-series data to generate third speech synthesis parameter time-series data,
A speech synthesis program for generating a synthesized speech waveform based on the third speech synthesis parameter time-series data.