JP2020064143A

JP2020064143A - Time series data generation device, method and program

Info

Publication number: JP2020064143A
Application number: JP2018195035A
Authority: JP
Inventors: 信行西澤; Nobuyuki Nishizawa
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2020-04-23
Anticipated expiration: 2038-10-16
Also published as: JP6959901B2

Abstract

To provide a time series data generation device that can limit the amount of computation while preserving the quality of the result in the feature parameter trajectory generation required for speech synthesis and similar processing. SOLUTION: A time series data generation device takes as input a plurality of types of feature distribution parameter time series, each defined at each time as a combination of values at a plurality of times, and outputs multi-dimensional time series data optimized as a series under a prescribed criterion. At each processing unit step, the device reads the feature distribution parameter time series as frame data within a window that moves forward in time according to the step count, and computes the time series data of each dimension of the multi-dimensional time series data in a prescribed repeating order that depends on the step count. The repeating order is arranged so that dimensions in which an error would have a large influence are placed apart from one another in the order. SELECTED DRAWING: Figure 4

Description

The present invention relates to a time series data generation device, method, and program for feature parameter trajectory generation, which produces smooth time series data for speech synthesis and other purposes, and which can limit the amount of computation while preserving the quality of the computed result.

Text-to-speech conversion is a typical application of speech synthesis technology. In the following, a device that takes as input symbols describing phoneme types and prosodic features obtained from text analysis or the like (hereinafter, speech synthesis symbols) and generates a speech waveform is called a speech synthesis device. Because the speech synthesis device performs the second process below, it can be used as a component of a text-to-speech system that performs both the first and second processes.
(First process) Takes text as input and outputs speech synthesis symbols.
(Second process) Takes speech synthesis symbols as input and outputs a speech waveform.

Speech synthesis symbols given as input to a speech synthesis device can take various forms. Here we consider a notation that describes both the phonemes making up an utterance and the prosodic information, expressed mainly as pauses and pitch. One example of such a notation is JEITA (Japan Electronics and Information Technology Industries Association) standard IT-4006, "Symbols for Japanese Text-to-Speech Synthesis" (see Non-Patent Document 1). The speech synthesis device generates the speech waveform corresponding to such speech synthesis symbols. However, a speech waveform is generally influenced strongly not only by the phoneme being synthesized but also by the types and prosodic features of the neighboring phonemes, so the correspondence between speech synthesis symbols and speech waveforms is generally complex.

There are various methods by which a speech synthesis device can generate a speech waveform. In one of them, the short-time spectral features of the speech, the voiced/unvoiced information, and the fundamental frequency (F0) are used as feature parameters, the feature parameters are updated at intervals of about 5 ms (milliseconds), and the speech waveform is generated by signal processing from the updated feature parameters. In the following, this update interval is called one frame. A widely used approach takes, as training data, pairs consisting of the feature parameters of each frame obtained by analyzing natural speech and the speech synthesis symbols corresponding to that frame, uses machine learning to estimate a model that predicts the speech feature parameters from the speech synthesis symbols, and then performs speech synthesis with the trained model.

One such approach, HMM (hidden Markov model) speech synthesis, models the change of speech features along the time axis as transitions between hidden states of the HMM. Speech feature parameters change even within a single phoneme, but the state transitions within a phoneme are not given explicitly as training data. They are treated as hidden states, and the state transitions are learned at the same time. In the end, decision trees that predict each HMM parameter from the speech synthesis symbols are learned. At synthesis time, on the other hand, the hidden state transitions are determined deterministically from the trained model. As a result, when the state switches at some time during synthesis, the HMM output also changes discontinuously at that time. Because the features of real speech change continuously over time, using the HMM output directly for synthesis yields speech that sounds discontinuous compared with natural speech. To address this problem, a technique is used in which dynamic features of the original features, specifically delta parameters (first-order differences) and delta-delta parameters (second-order differences), are also modeled as features (see Non-Patent Document 2).

The method of Non-Patent Document 2 is as follows.

In the following, a feature that corresponds directly to the original speech feature parameter at a given time is called a static feature. A dynamic feature at a given time is assumed to be computable as a weighted sum of the static feature values at multiple times within a predetermined neighborhood of that time (a range extending in both the past and future directions). This relationship can be expressed by equation (1):
$$ o = Wc \qquad (1) $$

In equation (1), c is a vector of the static features arranged in time order (that is, a vector representing the time series data (trajectory) of the speech synthesis parameter); o is the HMM output vector, formed by concatenating the static and dynamic features at each time and arranging the results in time order; and W is the transformation matrix from c to o.
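As an illustration of equation (1), the following minimal sketch builds W for one feature dimension and applies it to a short trajectory. The three window coefficient sets (static, delta, delta-delta) and the choice to repeat the edge frames at the utterance boundaries are assumptions made for this example; the text does not specify them, and `build_w` and `matvec` are hypothetical helper names.

```python
def build_w(num_frames, windows):
    """Return W as a list of rows; per frame, one row per window (static, delta, ...)."""
    rows = []
    for t in range(num_frames):
        for win in windows:
            half = len(win) // 2
            row = [0.0] * num_frames
            for j, coef in enumerate(win):
                # Assumption: frames outside the utterance repeat the edge frame.
                tau = min(max(t + j - half, 0), num_frames - 1)
                row[tau] += coef
            rows.append(row)
    return rows


def matvec(mat, vec):
    """Plain matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


# Assumed windows: static, first-order difference, second-order difference.
WINDOWS = [[0.0, 1.0, 0.0], [-0.5, 0.0, 0.5], [1.0, -2.0, 1.0]]

c = [0.0, 1.0, 4.0, 9.0, 16.0]   # static trajectory (one dimension, 5 frames)
W = build_w(len(c), WINDOWS)
o = matvec(W, c)                 # static and dynamic features, frame by frame
print(o[6:9])                    # entries for frame 2: [4.0, 4.0, 2.0]
```

With these windows, each frame contributes three entries to o, so o is three times as long as c.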

What speech synthesis ultimately needs is c. In HMM speech synthesis, the parameters of the distribution of the HMM output vector o are first predicted from the speech synthesis symbols using decision trees or the like, and the vector c that maximizes the likelihood of the model of o defined by those distribution parameters is then used for synthesis. When the distribution of o is modeled as a normal distribution with mean vector μ and variance-covariance matrix Σ, the vector c that maximizes the log-likelihood is obtained by solving equation (2):
$$ D_c\{(Wc-\mu)^{T}\Sigma^{-1}(Wc-\mu)\} = 0 \qquad (2) $$

In equation (2), D_c{·} denotes the partial differentiation ∂/∂c with respect to each element of the vector c. The superscript T denotes transposition, here and below. Solving equation (2) gives the vector c as in equation (3):
$$ c = (W^{T}\Sigma^{-1}W)^{-1}W^{T}\Sigma^{-1}\mu \qquad (3) $$
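The step from equation (2) to equation (3) can be written out as follows. This is a sketch under the assumptions that Σ is symmetric positive definite and that W has full column rank, so that the matrix being inverted is nonsingular; the text does not state these conditions explicitly.

```latex
\begin{aligned}
D_c\{(Wc-\mu)^{T}\Sigma^{-1}(Wc-\mu)\} &= 2\,W^{T}\Sigma^{-1}(Wc-\mu) = 0 \\
\Rightarrow\quad W^{T}\Sigma^{-1}W\,c &= W^{T}\Sigma^{-1}\mu \\
\Rightarrow\quad c &= (W^{T}\Sigma^{-1}W)^{-1}\,W^{T}\Sigma^{-1}\mu
\end{aligned}
```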

The c computed in this way is a feature parameter trajectory that takes the dynamic features into account. Because the dynamic features act as a strong constraint on the temporal continuity of the speech feature parameter trajectory, the computed c is continuous to some extent even when the HMM output o changes discontinuously as a result of determining the HMM state transitions deterministically. (Accordingly, the speech obtained from c is more natural than when dynamic features are not considered.) More generally, the method can be described as follows: define a vector o, obtained as o = Wc (equation (1)), that has more dimensions than the vector c of original speech feature parameters and contains some continuity-related features as elements, and obtain a continuity-aware feature parameter trajectory by finding the c that maximizes the likelihood of the model of o expressed by the distribution parameters of o. Processing based on this computation is called feature parameter trajectory generation. Although the description so far has used an HMM to determine the distribution of o, the feature parameter trajectory generation described here does not require an HMM.
It is sufficient that the distribution of o in each frame is determined by some method. For example, the distribution of o in each frame may be determined by a decision tree, or by a neural network, from the relative position from the start of the phoneme and the speech synthesis symbols. In that case, if the phoneme durations are likewise determined with a decision tree or a neural network, the phoneme and the relative position from its start are determined for every frame, so the distribution of o in every frame can be determined from the speech synthesis symbols.
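The smoothing effect described above can be checked numerically. The sketch below solves equation (3) for one dimension with a diagonal Σ, using a static mean that jumps from 0 to 1 (as at a deterministic state switch) and a delta mean of 0. The window coefficients, variances, edge handling, and all function names are assumptions for illustration; a dense Gaussian elimination stands in for whatever solver a real system would use.

```python
def build_w(num_frames, windows):
    """W of equation (1); edge frames are repeated at the boundaries (assumption)."""
    rows = []
    for t in range(num_frames):
        for win in windows:
            half = len(win) // 2
            row = [0.0] * num_frames
            for j, coef in enumerate(win):
                tau = min(max(t + j - half, 0), num_frames - 1)
                row[tau] += coef
            rows.append(row)
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def generate_trajectory(w, mu, var):
    """Equation (3) with diagonal Sigma = diag(var)."""
    n = len(w[0])
    a = [[sum(row[i] * row[j] / v for row, v in zip(w, var)) for j in range(n)]
         for i in range(n)]
    b = [sum(row[i] * m / v for row, m, v in zip(w, mu, var)) for i in range(n)]
    return solve(a, b)


T = 8
W = build_w(T, [[0.0, 1.0, 0.0], [-0.5, 0.0, 0.5]])   # static and delta only
mu, var = [], []
for t in range(T):
    mu += [0.0 if t < T // 2 else 1.0, 0.0]  # static mean steps; delta mean is 0
    var += [1.0, 0.1]                        # assumed variances
c = generate_trajectory(W, mu, var)
largest_jump = max(abs(c[t + 1] - c[t]) for t in range(T - 1))
print([round(x, 3) for x in c], round(largest_jump, 3))
```

The static means jump by 1 between adjacent frames, whereas the largest frame-to-frame change in the generated c is smaller than 1, because the delta term penalizes abrupt changes.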

A diagonal matrix is often adopted as the variance-covariance matrix Σ. This corresponds to assuming that there is no correlation between the dimensions of the vector o, and therefore that the correlation between the dimensions of the multi-dimensional features making up c can also be ignored. It is known empirically that ignoring the inter-dimensional correlation of multi-dimensional speech features does not have a large effect. Furthermore, if W is defined so that o is computed from the elements of c in a limited interval around the target time, the matrix (W^T Σ^-1 W) in equation (3) is a symmetric band matrix whose bandwidth is ((number of frames considered) − 1), and the amount of computation needed to obtain c is proportional only to the size of c. The size of c corresponds to the number of frames of the speech to be synthesized. Therefore, when the speech to be synthesized is long and c is computed all at once, the processing time grows in proportion to the size of c, which lengthens the time from the synthesis request to the start of playback, that is, the response delay of the system. In addition, c must be stored temporarily for the processing that generates the actual synthesized waveform from c by signal processing (waveform generation), so the required memory also grows as c becomes larger.
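The band structure mentioned above can be verified on a small case. In this sketch, each row of W touches at most three consecutive frames (assumed three-tap windows) and Σ is the identity, so W^T Σ^-1 W should have no nonzero entry more than 3 − 1 = 2 positions away from the diagonal. The window values and helper names are assumptions for illustration.

```python
def build_w(num_frames, windows):
    """W of equation (1); edge frames are repeated at the boundaries (assumption)."""
    rows = []
    for t in range(num_frames):
        for win in windows:
            half = len(win) // 2
            row = [0.0] * num_frames
            for j, coef in enumerate(win):
                tau = min(max(t + j - half, 0), num_frames - 1)
                row[tau] += coef
            rows.append(row)
    return rows


def normal_matrix(w, var):
    """Return W^T Sigma^-1 W for diagonal Sigma = diag(var)."""
    n = len(w[0])
    return [[sum(row[i] * row[j] / v for row, v in zip(w, var)) for j in range(n)]
            for i in range(n)]


WINDOWS = [[0.0, 1.0, 0.0], [-0.5, 0.0, 0.5], [1.0, -2.0, 1.0]]  # assumed
T = 8
W = build_w(T, WINDOWS)
A = normal_matrix(W, [1.0] * len(W))

# Largest distance from the diagonal at which a nonzero entry occurs.
half_bandwidth = max(abs(i - j) for i in range(T) for j in range(T)
                     if A[i][j] != 0.0)
print(half_bandwidth)  # 2, i.e. (frames per window) - 1
```

Because the band width is fixed, a banded solver needs work proportional to the number of frames, which is the property the paragraph relies on.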

To address these problems, and mainly to shorten the response delay of the system, a method has been proposed that performs feature parameter trajectory generation with a computation equivalent to the RLS (recursive least squares) algorithm. With this method, the parameters can be computed sequentially, frame by frame (see Non-Patent Document 3).

Non-Patent Document 1: "Symbols for Japanese Text-to-Speech Synthesis," JEITA Standard IT-4006, Japan Electronics and Information Technology Industries Association, March 2010.
Non-Patent Document 2: Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, and Satoshi Imai, "HMM-based Speech Synthesis Using Dynamic Features," IEICE Transactions (D-II), vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.
Non-Patent Document 3: Kazuhito Koishida, Keiichi Tokuda, Takashi Masuko, and Takao Kobayashi, "Vector quantization of speech spectral parameters using statistics of static and dynamic features," IEICE Trans. Inf. & Syst., vol. E84-D, no. 10, pp. 1427-1434, October 2001.

Although Non-Patent Document 3 shortens the response delay of the system as described above, such conventional techniques leave room for further improvement in optimizing the amount of computation for feature parameter trajectory generation.

Specifically, the computational cost of the method of Non-Patent Document 3 is proportional to the square of the number of past output frames considered in the parameter calculation, so increasing that number greatly increases the processing amount for a whole utterance. In other words, even though the response delay can be shortened, a faster computer is needed to cope with the increased processing. Conversely, if the number of past output frames considered is simply reduced in order to limit the processing amount, the error relative to the base method (the method of Non-Patent Document 2), which minimizes the error over the whole utterance, grows, and the quality of the synthesized speech is likely to degrade.

The approaches shown in FIGS. 1 to 3 are also conceivable as ways of optimizing the feature parameter trajectory computation on the basis of the methods of Non-Patent Documents 2 and 3. As explained below, however, these approaches also leave room for improvement.

The method of FIG. 1 is based on the following consideration. In speech synthesis, a synthesis request is usually given per utterance, so when synthesis of an utterance starts, the input information up to the end of that utterance is already fixed. Therefore, unless extreme memory reduction or response delay reduction is required, it is not necessary to compute the parameters sequentially for each frame. Instead, several tens to several hundreds of frames can be treated as one block, each block can be computed with the base method (Non-Patent Document 2), and this can be repeated from the start of the utterance until the parameters for the whole utterance have been computed. FIG. 1 shows four such blocks, C(k1), C(k2), C(k3), and C(k4). This method also shortens the processing delay and reduces the memory requirement compared with computing c for the whole utterance once at the start.

The method of FIG. 1 additionally makes the blocks overlap, based on the following consideration. When the computation is divided into blocks, continuity of the parameter trajectory at the head of a block can be ensured by reflecting, in the dynamic features, the already determined values of the frames preceding the block. Near the end of the block, however, the trajectory is computed without considering the information beyond the block, so the difference from the trajectory obtained by computing the whole utterance at once is expected to be large (for example, the trailing part of block C(k1), shown as [3] in FIG. 1). Moreover, because the determined values affect the next block, in practice not only the end but also the head of a block is affected in the same way (for example, the leading part of block C(k1), shown as [1] in FIG. 1). The middle part between them is affected little, and the difference in the parameter trajectory there is small (for example, the middle part of block C(k1), shown as [2] in FIG. 1).

A simple way to mitigate this effect is to extend the parameter trajectory computation of each block to frames beyond the end of the block and then discard the results that fall outside the block range. In FIG. 1, the blocks overlap so that the results for the trailing part of each block can be discarded in this way. The larger the number of trailing frames that are computed and then discarded, the closer the result is expected to be to that of computing the whole utterance at once with the base method (Non-Patent Document 2), but the processing amount increases. For example, if one block is 100 frames long and each block's computation extends 10 frames beyond the block end, the processing amount for the whole utterance increases by 10% compared with the conventional method.
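The overhead figure above can be reproduced with a small planning sketch. The function name `plan_blocks`, the tuple layout, and the 1000-frame utterance length are hypothetical choices for this example; the sketch only counts frames and does not perform the trajectory computation itself.

```python
def plan_blocks(num_frames, block_len, lookahead):
    """Return (start, keep_end, calc_end) per block, with end indices exclusive.

    Frames in [start, keep_end) are kept; frames in [keep_end, calc_end)
    are computed only to stabilize the block end and are then discarded.
    """
    plans = []
    for start in range(0, num_frames, block_len):
        keep_end = min(start + block_len, num_frames)
        calc_end = min(keep_end + lookahead, num_frames)  # no frames past the utterance
        plans.append((start, keep_end, calc_end))
    return plans


NUM_FRAMES = 1000   # assumed utterance length
plans = plan_blocks(NUM_FRAMES, block_len=100, lookahead=10)
computed = sum(calc_end - start for start, _, calc_end in plans)
overhead = computed / NUM_FRAMES - 1.0
print(plans[0], computed, round(overhead, 3))   # (0, 100, 110) 1090 0.09
```

In this sketch the last block has no frames after it, so the overhead for a 1000-frame utterance comes out at 9% and approaches the 10% quoted in the text as the utterance gets longer.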

Even with the method of FIG. 1, however, the actual feature parameter trajectory computation leaves open the question of the order in which the parameters of the individual dimensions of the multi-dimensional feature parameters should be computed.

Regarding this order, it is first of all desirable to compute the parameter trajectory of one particular feature dimension per unit of speech synthesis output, ideally per frame, so that the trajectory computations for many dimensions are not concentrated within a particular period. For example, FIG. 2 shows a case in which the speech waveform generation is performed after the parameter trajectories of all dimensions have been generated, and FIG. 3 shows a case in which the parameter trajectory of one dimension is computed for each frame of speech waveform generation. Here, F0(k1:k2) denotes the parameter trajectory generation for the fundamental frequency F0 from time k1 (a frame time) to k2; Cm(k1:k2) denotes the parameter trajectory generation for the m-th mel-cepstral coefficient (a feature representing the spectral envelope; m = 0, 1, 2, ..., 7 in the examples of FIGS. 2 and 3) from time k1 to k2; and W(k) denotes the speech waveform generation for frame time k. Note that F0(k1:k2) and Cm(k1:k2) mean that the final generation target is the range from time k1 to k2. As explained for the method of FIG. 1, the actual generation may also take into account frames after frame time k2: the computation extends to a frame time k3 later than k2, but the parameters generated for frame times k with k2 < k ≤ k3 are not used for speech synthesis.

In FIG. 2, the horizontal axis is the processing time (the processing time t, which corresponds to the real time at which each parameter trajectory generation is performed and is distinct from the frame time k); [1] shows the parameter trajectory generation and [2] shows the speech waveform generation. (The horizontal axis of FIG. 1, described earlier, is also the processing time t in this sense.) In FIG. 3, likewise, the horizontal axis is the processing time t, with the parameter trajectory generation shown in [1] and the speech waveform generation shown in [2].

To keep the actual output of synthesized speech from being interrupted, the timing of FIG. 2 requires that the feature parameter trajectory processing ([1]) be made as fast as possible (that is, that the completion times t1, t2, and so on be as early as possible), and that the speech waveform generation results (the results of [2]) be buffered so that the sound does not break up while the trajectory processing is running. In line with the notation already defined for the parameter trajectories, time t1 in FIG. 2 is the time at which generation of all parameters F0, C0, ..., C7 for frame times k to k+8 is complete, and time t2 is the time at which generation of all parameters F0, C0, ..., C7 for frame times k+9 to k+17 is complete. With the processing order of FIG. 3, the processing load is smoothed over time (the distribution of completion times is evened out) and the memory needed for buffering is reduced. This makes processing possible even on a low-performance computer.

In FIG. 3, the completion time of generation for a parameter X (for example, X = F0(k:k+8)) is written t[X]. As shown (at least in part) in the figure, the processing order of FIG. 3 switches the type of parameter to be generated in the fixed order F0, C0, C1, ..., C7 while shifting the frame time range of the parameter to be generated forward (in the direction of advancing time) by one frame at a time: (k:k+8), (k+1:k+9), (k+2:k+10), ..., (k+8:k+16). In parallel, once parameter generation for the time range (k+i:k+i+8) (i = 0, 1, 2, ..., 8) is complete, all types of parameters F0, C0, C1, ..., C7 at frame time k+i have been generated, so the speech waveform W(k+i) is generated. The same periodic generation of parameters and speech waveforms is performed for frame times k+9 and later and for frame times before k, which lie outside the range shown.
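The processing order of FIG. 3 described above can be simulated to confirm that frame k+i has all nine parameters available exactly when the range (k+i:k+i+8) has been processed. In this sketch the step index doubles as the first frame of the window, and k = 9 is chosen so that step k falls on F0; both are assumptions made to keep the example short, and `schedule` is a hypothetical helper.

```python
PARAMS = ["F0"] + [f"C{m}" for m in range(8)]   # F0, C0, ..., C7
SPAN = len(PARAMS)                              # 9 frames per window, 9 parameter types


def schedule(first_step, last_step):
    """Yield (parameter, first_frame, last_frame) per step; frame range is inclusive."""
    for s in range(first_step, last_step + 1):
        yield PARAMS[s % SPAN], s, s + SPAN - 1


k = 9                      # assumed frame time at which step k handles F0
done = {}                  # frame -> set of parameter types generated so far
ready_frames = []          # frames whose waveform W(frame) can be generated
for name, lo, hi in schedule(k - (SPAN - 1), k + SPAN - 1):
    for frame in range(lo, hi + 1):
        done.setdefault(frame, set()).add(name)
    if len(done[lo]) == SPAN:
        ready_frames.append(lo)   # all of F0, C0..C7 exist for frame lo

print(ready_frames)        # [9, 10, 11, 12, 13, 14, 15, 16, 17]
```

Each step therefore releases exactly one frame for waveform generation, which is what spreads the load evenly over processing time.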

In the methods of FIGS. 2 and 3, however, each dimension is computed in the same way as in FIG. 1. The same problem therefore arises: if the number of frames considered beyond the end of the block is reduced, the difference from the result of the base method (Non-Patent Document 2) becomes large.

As described above, when the method of FIG. 1 computes the parameter trajectory by repeating the base method (Non-Patent Document 2) in order to reduce memory consumption and response delay, reducing the number of frames considered beyond the end of the block in order to cut the processing amount is expected to enlarge the difference from the result of computing the whole utterance with the base method, particularly near the block boundaries (frames near the head and near the end of a block). With the timing of FIG. 2 in particular, the times at which the difference becomes large are similar across all features, so the effect is concentrated in short intervals and becomes more pronounced. With the method of FIG. 3, in contrast, the times at which the difference becomes large differ from feature to feature, and the effect is smaller than in FIG. 2. Even so, for several features whose differences have a larger impact, the times at which the difference becomes large may fall close together, and the difference may still become large over a short interval at those points.

As described above, there is a trade-off between the quality of the computed result, that is, the deviation from the result of the base method (the larger the deviation, the less natural the resulting synthesized speech), and the amount of computation, evaluated as memory consumption (space complexity) and response delay (time complexity). The problem with the conventional techniques is that the feature parameter trajectory computation has not necessarily been optimized within this trade-off. Although the description above used synthesized speech as an example, the same problem applies to feature parameter trajectory computation for time series data in general.

In view of these problems of the conventional techniques, an object of the present invention is to provide a time series data generation device, method, and program that can limit the amount of computation while preserving the quality of the computed result.

To achieve the above object, the present invention is a time series data generation device that takes as input a plurality of types of feature distribution parameter time series, each defined at each time as a combination of values at a plurality of times, and outputs multi-dimensional time series data optimized as a series under a predetermined criterion. At each processing unit step, the device reads the feature distribution parameter time series as frame data within a window that moves forward in time according to the step count, and computes the time series data of each dimension of the multi-dimensional time series data in a predetermined repeating order that depends on the step count. The predetermined repeating order is arranged so that, among the dimensions of the multi-dimensional time series data, the dimensions in which an error would have a large influence are placed apart from one another in the order. The invention also provides a method and a program corresponding to the device.

According to the present invention, the feature parameters of each dimension are obtained from the frame data in the window in a repeating order configured so that dimensions for which an error has a large influence are separated from one another in the order. As a result, the occurrences of high-influence errors in the resulting feature parameter trajectory are dispersed along the time axis, and the feature parameter trajectory can be generated with a reduced amount of calculation while the quality of the calculation result is ensured.

FIG. 1 is a diagram showing one method of feature parameter trajectory calculation.
FIG. 2 is a diagram showing one method of feature parameter trajectory calculation.
FIG. 3 is a diagram showing one method of feature parameter trajectory calculation.
FIG. 4 is a functional block diagram of a time series data generation device according to one embodiment.
FIG. 5 is a flowchart of the operation of the time series data generation device according to one embodiment.
FIG. 6 is a diagram showing a schematic example of data processed by the operation of the flowchart of FIG. 5.
FIG. 7 is a diagram showing a schematic example of data processed by the operation of the flowchart of FIG. 5.
FIG. 8 is a diagram showing a comparative example for FIG. 6.
FIG. 9 is a diagram showing a comparative example for FIG. 7.
FIG. 10 is a diagram showing a concrete calculation process in which the setting unit applies the method of reversing the indices in binary representation.

FIG. 4 is a functional block diagram of the generation device according to one embodiment. The generation device 10 includes a generation unit 1, a synthesis unit 2, and a setting unit 3. When handling speech, its overall operation can perform the second processing described above: it takes as input the distribution parameters at each frame time k, obtained by converting the speech synthesis symbols to be synthesized (through the preprocessing described below), processes these distribution parameters, and sequentially outputs multidimensional feature parameters for speech waveform synthesis as a time series over the frame times k. It can further synthesize speech from the feature parameters and output synthesized speech. As described later, the generation device 10 can handle time series data other than speech in the same way, but where the following description refers to a concrete example, speech is used.

When handling speech, the input data of the time series data generation device 10 may be prepared by performing the following preprocessing in advance, although the preprocessing may also be performed inside the time series data generation device 10. Specifically, for speech synthesis symbols containing phoneme and prosody information converted from text (a symbol sequence in the time progression direction), the number of frames over which each phoneme continues is determined from the speech synthesis symbols, and the parameters of the model of the distribution of static and dynamic features for each frame (mean and variance in the case of a normal distribution; hereinafter "feature distribution parameters") are read sequentially. The generation unit 1 can then sequentially output, as a time series, the multidimensional feature parameters for speech waveform synthesis as the maximum-likelihood output time series for the model of the value distribution determined by the input feature distribution parameters.

Sequential reading and output in the generation unit 1 follow an order set in advance by the setting unit 3. The order set by the setting unit 3 and the corresponding sequential processing in the generation unit 1 are described in detail later with reference to FIGS. 5 to 7.

The synthesis unit 2 performs further synthesis processing on the multidimensional time series data obtained from the generation unit 1; when handling speech, for example, it synthesizes and outputs speech as an actual speech waveform. The synthesis unit 2 may be omitted from the generation device 10, in which case the device outputs the feature parameters generated by the generation unit 1 rather than synthesized speech.

FIG. 5 is a flowchart of the operation of the time series data generation device 10 according to one embodiment. The premises for explaining the flowchart are described first. As illustrated, the flowchart of FIG. 5 as a whole realizes sequential processing in the generation unit 1 and other units by repeating steps S10, S11, S12 and S13. The repetition count i = 1, 2, 3, ... corresponds to the number of processing unit steps in the generation unit 1 and other units, and follows the time progression of the calculation processing there. The count i need not be proportional to the actual elapsed calculation time; because calculation time elapses as i advances, the count i also serves as the current time number i. Each step of FIG. 5 is described below on the assumption that the steps are repeated with this count (current time number) i = 1, 2, 3, ....

The flow of FIG. 5 starts with the count (current time number) i = 1 (initial value) and proceeds to step S10. In step S10, the generation unit 1 acquires as input the parameter type p(i) and the frame data f(i) corresponding to the current time number i, and then proceeds to step S11. (Both belong to the individual dimensions of the feature distribution parameters, but in the description of the flow of FIG. 5 they are referred to by these terms to make the processing of the time series data clear.)

In step S10, let N be the number of dimensions of the multidimensional feature parameters output by the generation unit 1. The parameter type p(i) corresponding to the current time i is acquired as one of the N dimensions. Specifically, it is uniquely determined by the remainder "i mod N" obtained by dividing the current time number i by N, so that p(i) = p(i mod N). The parameter types p(i mod N) acquired according to the remainder express the processing order in the generation unit 1, and this order is set in advance by the setting unit 3.

An example of how the setting unit 3 sets this order is described here. As a premise, the feature parameters obtained by the generation unit 1 consist of the fundamental frequency, which is a one-dimensional parameter, and mel-cepstrum coefficients of a predetermined number of dimensions. The number of mel-cepstrum dimensions differs from system to system, but for simplicity it is taken to be 8 here (coefficients of order 0 to 7). (This is the same setting as in the examples of FIGS. 2 and 3.) Actual systems use a higher number, for example about 30 to 50 dimensions, but using 8 dimensions does not reduce the generality of the description. In this case, one fundamental frequency and eight mel-cepstrum coefficients give N = 1 + 8 = 9 dimensions.

In general, the lower-order coefficients of the mel-cepstrum have larger absolute values and influence the spectral envelope more strongly. The fundamental frequency also has a large perceptual influence. (That is, when the error in the calculated value of such a parameter is large, the degradation in quality of the resulting synthesized speech is more pronounced.) Here, taking the fundamental frequency (F0) and the mel-cepstrum coefficients (C0 to C7) together, the order of decreasing influence is assumed to be
F0, C0, C1, C2, C3, C4, C5, C6, C7 ... (influence order 1)
The setting unit 3 rearranges (influence order 1) to set a processing order in which parameters strongly affected by errors are separated from one another. For example, the following (processing order 1) can be set on the basis of (influence order 1). (The method for setting it is described in detail later with reference to FIG. 10 and elsewhere.) Here the high-influence F0 is first of the nine and C0 is sixth of the nine, so F0 and C0 are clearly separated in the order.
F0, C7, C3, C1, C5, C0, C4, C2, C6 ... (processing order 1)

With (processing order 1) above, the parameter type p(i) = p(i mod N) = p(i mod 9) acquired in step S10 is as follows.
p(1)=F0, p(2)=C7, p(3)=C3, p(4)=C1, p(5)=C5, p(6)=C0, p(7)=C4, p(8)=C2, p(9)=p(0)=C6
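As a minimal sketch of the lookup in step S10, the mapping from the current time number i to the parameter type could be written as below. The names PROCESSING_ORDER_1 and parameter_type are hypothetical and chosen only for illustration. Because the text numbers the steps from i = 1 and sets p(9) = p(0), the sketch indexes a zero-based list with (i - 1) mod N, which is an assumption about one convenient implementation rather than something the specification prescribes.

```python
# Hypothetical illustration of p(i) = p(i mod N) for N = 9.
PROCESSING_ORDER_1 = ["F0", "C7", "C3", "C1", "C5", "C0", "C4", "C2", "C6"]


def parameter_type(i, order=PROCESSING_ORDER_1):
    """Return the parameter type handled at step i (i = 1, 2, 3, ...)."""
    n = len(order)
    # p(1) is the first entry and p(N) = p(0) is the last entry,
    # so step i maps to zero-based list position (i - 1) mod N.
    return order[(i - 1) % n]


# Parameter types for the first 18 steps: the order simply repeats.
schedule = [parameter_type(i) for i in range(1, 19)]
print(schedule)
```

Steps 1 to 9 walk through (processing order 1) once, and steps 10 to 18 repeat it.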

The parameter type p(i) acquired in step S10 has been described above. In step S10, the frame data f(i) of the feature distribution parameters corresponding to the current time number i is also acquired, as follows. The frame data of the feature distribution parameters are indexed by j = 1, 2, 3, ..., where d(j) denotes the data at frame time position j. The frame data f(i) corresponding to the current time number i can be acquired as the data group {d(j)} lying within a predetermined window W(i) that moves along the data series d(j) in the direction of increasing frame number j as the current time number i advances.

In step S11, the generation unit 1 takes the frame data f(i) acquired in step S10 as input and performs the calculation of equation (3) described above for the parameter type p(i) acquired in step S10, thereby obtaining a feature parameter trajectory pt(i) of fixed length for that parameter type p(i), and then proceeds to step S12.

In step S11, the calculation of equation (3) is performed on the mean vector μ and variance-covariance matrix Σ of equation (3) that correspond to the input frame data f(i) and the parameter type p(i) being calculated (as described above, μ and Σ are obtained in the preprocessing from an HMM, decision tree, neural network or the like trained in advance for speech synthesis). The output c of this calculation gives the value pt(i) of the multidimensional feature parameter trajectory at time i. The base method of Non-Patent Document 2 cited above may be used for the concrete calculation. Alternatively, a calculation algorithm corresponding to the RLS (recursive least squares) algorithm of Non-Patent Document 3 cited above may be used.

In step S11, the value pt(i) of the feature parameter trajectory at time i may also be obtained by first computing from equation (3) the series of values of the output c over a certain interval for the dimension being calculated, and then deleting from the output c the data of a predetermined length at the leading end in the time progression direction. That is, as described for [3] of FIG. 1, the end portion of predetermined length on the future side in time, where the deviation is considered to become large, may be deleted to obtain pt(i).

However, when one dimension of features is calculated per execution of S11 as described here, the deletion is performed so that the length in the time direction of the feature parameter values pt(i) retained after deletion is at least N. If the retained length is made shorter, more feature dimensions must be calculated in a single execution of S11 so that speech synthesis remains possible in step S12 described later (that is, so that all N feature parameters are calculated for the data d(j) at frame time j of a given distribution parameter). Likewise, the position (and range) of the predetermined window W(i), which moves according to the current time i in step S10, is set so that speech synthesis is possible in step S12.

In step S12, if at the current time number i all N feature parameters P(j) have, for the first time, been calculated for the data d(j) at frame time j of a feature distribution parameter (through accumulation of the results of the series of steps S11 up to the current time i), the generation unit 1 outputs the complete set of N feature parameters P(j), and the process proceeds to step S13. In step S12, the synthesis unit 2 may further synthesize speech data as a speech waveform from the feature parameters P(j) output for one or more times. The obtained feature parameters P(j) need not be passed to speech synthesis immediately; the timing at which the synthesis unit 2 produces synthesized speech may be adjusted by predetermined buffering or similar processing.

In step S13, the current time number i is updated to the next number i+1, and the process returns to step S10. Thus, as already explained as a premise of the flowchart of FIG. 5, steps S10 to S13 described above are repeated for each time number (processing step count) i = 1, 2, 3, ....

FIGS. 6 and 7 show schematic examples of the data processed by the operation of the flowchart of FIG. 5 for the case in which the N = 9 parameters F0, C0, ..., C7 given as an example for step S10 are processed according to (processing order 1). In both figures, the horizontal axis represents the frame position j = 1, 2, 3, ... of the feature distribution parameter data d(j) that is read, and the vertical axis represents the count (current time number) i = 1, 2, 3, ... in the flowchart of FIG. 5. FIG. 6 shows an example of the feature parameter trajectories pt(i) generated in the series of steps S11 for i = 1, 2, 3, .... FIG. 7 overlays on FIG. 6 the complete sets of N = 9 feature parameters P(j) (j = 9, 10, 11, ...) output in the series of steps S12.

As shown in FIG. 6, at time i = 1 the parameter type is p(1) = F0, and its feature parameter trajectory of length 9, pt(1) = F0(1), F0(2), ..., F0(9), is obtained (the notation is the same as described for FIGS. 1 and 2, and likewise below). At time i = 2 the parameter type is p(2) = C7, and its trajectory of length 9, pt(2) = C7(2), ..., C7(10), is obtained. In the same way, at each time i = 1 to 18 (i = 1 being the initial time), the parameter trajectory pt(i) of parameter type p(i) is obtained in the order of (processing order 1), moving together with the moving window W(i).

Further, as shown in FIG. 7, at time i = 9 the complete set of N = 9 feature parameters P(9) = {F0(9), C0(9), ..., C7(9)} is output for the first time as the conversion result of the data d(9) at frame position j = 9; at time i = 10 the complete set P(10) = {F0(10), C0(10), ..., C7(10)} is output for the first time as the conversion result of the data d(10) at frame position j = 10; and so on. Thus at times i = 9, 10, 11, ..., 18 the complete sets of N = 9 feature parameters P(j) = {F0(j), C0(j), ..., C7(j)} for frame positions j = 9, 10, 11, ..., 18 are output respectively.
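The bookkeeping behind FIGS. 6 and 7 can be sketched as follows. This is only an illustration under stated assumptions: each step is taken to produce a trajectory of length N covering frames i to i+N-1 (as in pt(1) = F0(1), ..., F0(9)), and the actual trajectory values from equation (3) are not computed, only which parameter types are available for which frame. The function name simulate and the variable names are hypothetical.

```python
ORDER = ["F0", "C7", "C3", "C1", "C5", "C0", "C4", "C2", "C6"]  # (processing order 1)


def simulate(num_steps, order=ORDER):
    """Track which frames have all N parameter types after each step.

    Returns a dict mapping step i to the list of frame positions j whose
    complete parameter set P(j) first becomes available at that step.
    """
    n = len(order)
    available = {}  # frame j -> set of parameter types computed so far
    emitted = {}    # step i -> frames newly completed at step i
    for i in range(1, num_steps + 1):
        p = order[(i - 1) % n]        # step S10: parameter type p(i)
        for j in range(i, i + n):     # step S11: trajectory covers frames i..i+N-1
            available.setdefault(j, set()).add(p)
        # step S12: output every frame for which all N types are now present
        done = [j for j in sorted(available) if len(available[j]) == n]
        for j in done:
            del available[j]
        emitted[i] = done             # step S13 is the loop increment
    return emitted


emitted = simulate(18)
for i, frames in emitted.items():
    print(i, frames)
```

Under these assumptions nothing is output for i = 1 to 8, and from i = 9 onward exactly one frame, j = i, is completed per step, which agrees with the description of FIG. 7.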

As already explained for (processing order 1), the high-influence parameters F0 and C0 are separated from each other in the rearranged order. As also explained, F0 and C0 can noticeably impair the quality of the synthesized speech when they are calculated at the leading end of the window W(i) in the time progression direction; in FIGS. 6 and 7 the data at risk, F0(9), C0(14), F0(18) and C0(23), are marked with burst symbols. As FIG. 7 shows, however, among the output feature parameters P(j) (j = 9, 10, 11, ..., 18) the marked locations fall in P(9), P(14) and P(18), which are separated from one another on the (frame) time axis j. This separation on the time axis j is the effect of setting (processing order 1). According to the present invention, the occurrences of feature parameters that may impair the quality of the synthesized speech can thus be dispersed in time along the frame time axis j of the output feature parameters P(j), so that the amount of calculation is reduced while the quality of the synthesized speech is ensured.

FIGS. 8 and 9 show, as comparative examples for FIGS. 6 and 7 respectively, the case in which (influence order 1) is used directly as the processing order instead of (processing order 1). In FIGS. 8 and 9, the F0 and C0 data located at the trailing end that may impair the quality of the synthesized speech, F0(9), C0(10), F0(18) and C0(19), are marked with burst symbols. As FIG. 9 shows, among the output feature parameters P'(j) (j = 9, 10, 11, ..., 18) the marked locations fall in P'(9), P'(10), P'(17) (and P'(18), not shown), which are close to one another on the (frame) time axis j. Because the degradation (calculated parameters deviating greatly from their proper values) appears at adjacent positions on the time axis, the degradation of synthesized speech quality is also pronounced with the comparative method of FIGS. 8 and 9.
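The positions of the burst marks under the two orders can be reproduced with a short sketch. It assumes, as in the example above, that the trajectory computed at step i covers frames i to i+N-1, so that the value at the window edge is the one for frame i+N-1. The function name leading_edge_frames is hypothetical, and the sketch only locates the frames; it does not model the size of any error.

```python
INFLUENCE_ORDER_1 = ["F0", "C0", "C1", "C2", "C3", "C4", "C5", "C6", "C7"]
PROCESSING_ORDER_1 = ["F0", "C7", "C3", "C1", "C5", "C0", "C4", "C2", "C6"]


def leading_edge_frames(order, targets, num_steps):
    """List (parameter, frame) pairs where a target parameter sits at the window edge."""
    n = len(order)
    marks = []
    for i in range(1, num_steps + 1):
        p = order[(i - 1) % n]
        if p in targets:
            # The trajectory covers frames i..i+N-1; the last one is the edge frame.
            marks.append((p, i + n - 1))
    return marks


proposed = leading_edge_frames(PROCESSING_ORDER_1, {"F0", "C0"}, 18)
comparative = leading_edge_frames(INFLUENCE_ORDER_1, {"F0", "C0"}, 18)
print(proposed)     # F0(9), C0(14), F0(18), C0(23)
print(comparative)  # F0(9), C0(10), F0(18), C0(19)
```

With (processing order 1) the edge frames for F0 and C0 are at least four frames apart, whereas with (influence order 1) they fall on adjacent frames.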

The comparative method of FIGS. 8 and 9 corresponds to the method of FIG. 3 described above.

The method by which the setting unit 3 rearranges (influence order 1) to obtain (processing order 1) is described in detail below. As stated above, this example uses N = 9 dimensions for the multidimensional feature parameters, but the method applies as-is to any N.

One example of the method is to reverse the indices in binary representation. First, virtual indices starting from 0 are assigned in sequence, beginning with the most influential feature. For (influence order 1) this gives:
0:F0, 1:C0, 2:C1, 3:C2, 4:C3, 5:C4, 6:C5, 7:C6, 8:C7

Next, the bit order of the binary representation of each virtual index is reversed. Since the number of feature dimensions here is N = 9, the features are first extended to 16 dimensions, 16 being the smallest power of two that is at least 9. Specifically, virtual indices 9 to 15 are assigned as
9:φ9, 10:φ10, 11:φ11, 12:φ12, 13:φ13, 14:φ14, 15:φ15
adding seven dimensions to the features. φ9 to φ15 are virtual features with no actual content (dummy features introduced for the rearrangement).

Arranging the values obtained by reversing the bit order of the binary representation of the virtual indices in ascending order gives, in terms of the virtual indices,
0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15
Replacing these with the actual features and deleting the φ entries at the same time gives
F0, C7, C3, C1, C5, C0, C4, C2, C6
so that the processing interval between the high-influence F0 and C0 can be made large.

By obtaining a processing order that rearranges the parameters listed in decreasing order of influence in this way, even if large errors do occur, the portions where large errors in high-influence features are likely to arise are less likely to coincide in time, and the influence of the errors can be spread out over time and mitigated.

FIG. 10 shows a concrete calculation process in which the above method of reversing the indices in binary representation is applied as described. In [1], seven dummy dimensions φ9 to φ15 are provided together with indices in addition to the nine dimensions F0, C0, ..., C7; in [2], the indices are expressed in binary; in [3], the binary representations are bit-reversed; in [4], the bit-reversed values are arranged in ascending order; and in [5], the dummy dimensions φ9 to φ15 are removed from the arrangement to obtain the final (processing order 1).
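Steps [1] to [5] of FIG. 10 can be sketched in a few lines. This is an illustrative sketch rather than the implementation of the setting unit 3: the function name bit_reversed_order is hypothetical, and the dummy dimensions are represented implicitly as index values of N or more instead of explicit φ entries.

```python
def bit_reversed_order(features):
    """Reorder features, given in decreasing order of influence, by bit-reversed index."""
    n = len(features)
    # [1] Pad the index range to the smallest power of two >= n.
    bits = max(1, (n - 1).bit_length())
    size = 1 << bits

    def reverse_bits(x):
        # [2], [3] Write x as a fixed-width binary string and reverse it.
        return int(format(x, "0{}b".format(bits))[::-1], 2)

    # [4] Arrange the indices 0..size-1 in ascending order of bit-reversed value.
    ranked = sorted(range(size), key=reverse_bits)
    # [5] Drop the dummy indices (>= n) and map back to the real features.
    return [features[k] for k in ranked if k < n]


influence_order_1 = ["F0", "C0", "C1", "C2", "C3", "C4", "C5", "C6", "C7"]
processing_order_1 = bit_reversed_order(influence_order_1)
print(processing_order_1)
# ['F0', 'C7', 'C3', 'C1', 'C5', 'C0', 'C4', 'C2', 'C6']
```

For N = 9 the index range is padded to 16, and the result matches (processing order 1).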

The method based on bit reversal in binary representation has been described here, but the invention is not limited to it. For example, if the four most influential features are, in decreasing order, F0, C0, C1 and C2, and the relative order of influence among the remaining five features matters little, then only F0, C0, C1 and C2 may be kept at the same positions as in (processing order 1) of the preceding example, with the others placed in sequence, so that processing follows an order such as
"F0", C3, C4, "C1", C5, "C0", C6, "C2", C7
Features at the same positions as in (processing order 1) are shown in quotation marks above. Alternatively, the processing order may be obtained from manual rules set in advance for a specified number of dimensions N and order of influence.
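The alternative order above can be reproduced with a small sketch. It assumes that "placed in sequence" means that the remaining features fill the free slots in their original order of influence. The function name pinned_order and the parameter num_pinned are hypothetical and introduced only for this illustration.

```python
def pinned_order(influence_order, reference_order, num_pinned):
    """Keep the top num_pinned features at their reference positions; fill the rest in sequence."""
    pinned = set(influence_order[:num_pinned])
    # Remaining features, kept in their original order of influence.
    rest = iter([f for f in influence_order if f not in pinned])
    # Walk the reference order: pinned features stay where they are,
    # every other slot receives the next remaining feature.
    return [f if f in pinned else next(rest) for f in reference_order]


influence_order_1 = ["F0", "C0", "C1", "C2", "C3", "C4", "C5", "C6", "C7"]
processing_order_1 = ["F0", "C7", "C3", "C1", "C5", "C0", "C4", "C2", "C6"]
alternative = pinned_order(influence_order_1, processing_order_1, 4)
print(alternative)
# ['F0', 'C3', 'C4', 'C1', 'C5', 'C0', 'C6', 'C2', 'C7']
```

F0, C1, C0 and C2 remain at positions 1, 4, 6 and 8, as in (processing order 1), and C3 to C7 occupy the other positions in ascending order.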

As described above, according to the present invention, the parameter trajectory calculations of features that strongly affect the synthesized speech are separated in time. Such features include the fundamental frequency and low-order mel-cepstrum coefficients described above and, in systems that represent spectral features by means other than the mel-cepstrum, part of the multidimensional feature quantity, such as low-order cepstrum coefficients or low-order LSP coefficients. This prevents the differences from the conventional method (in which, when handling speech, an entire utterance is processed in one batch) from concentrating around particular frames, spreads the occurrence of such differences over time, and keeps perceptual differences small.

The generation device 10 can be realized as a computer of general configuration. That is, the generation device 10 can be configured from a general computer comprising a CPU (central processing unit), a main storage device that provides a work area for the CPU, an auxiliary storage device that can be configured from a hard disk, SSD or the like, an input interface that receives input from the user such as a keyboard, mouse or touch panel, a communication interface for connecting to a network and communicating, a display, a camera, and a bus connecting these. A speaker for audio output may also be provided. The processing of each unit of the generation device 10 shown in FIG. 4 can be realized by the CPU reading and executing a program that causes the processing to be executed, but any part of the processing may instead be realized in a separate dedicated circuit or the like.

The present invention is not limited to speech synthesis. For example, it can also be applied to feature parameter trajectory calculation in a system that parameterizes the shape of the lips and produces moving images from the parameters. The input likewise need not be phoneme or prosody information; any symbols from which the distributions of static and dynamic features can be predicted will do. For example, the invention can be used to generate smooth motion data in response to motion commands for a robot or the like.

10: time series data generation device, 1: generation unit, 2: synthesis unit, 3: setting unit

Claims

A time-series data generation device that takes as input a plurality of types of feature distribution parameter time series, each defined at each time as a combination of values at a plurality of times, and outputs multidimensional time-series data optimized as a series based on a predetermined criterion, wherein
the time-series data generation device reads, at each processing unit step, the feature distribution parameter time series as frame data in a window that moves in the time-advancing direction according to the step count, and calculates the time-series data of each dimension of the multidimensional time-series data in a predetermined repeating order according to the step count, and
the predetermined repeating order is configured such that, among the dimensions of the multidimensional time-series data, dimensions that are greatly affected when an error is present are separated from each other in the order.
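As an illustration only, one possible reading of the stepping described in this claim can be sketched in Python. The names `Step` and `schedule`, the fixed hop size, and the choice of computing exactly one dimension per step are all assumptions made for this sketch; the claim itself only states that the window position and the dimension order both depend on the step count.

```python
from typing import Iterator, NamedTuple, Sequence


class Step(NamedTuple):
    step: int          # processing unit step count
    window_start: int  # first frame read in this step (inclusive)
    window_end: int    # last frame read in this step (exclusive)
    dimension: int     # dimension whose time series is calculated


def schedule(num_frames: int, window_length: int, hop: int,
             order: Sequence[int]) -> Iterator[Step]:
    """Yield one Step per processing unit step.

    Hypothetical reading: the window advances by `hop` frames per step,
    and the dimension to calculate cycles through `order`.
    """
    if hop <= 0 or window_length <= 0 or not order:
        raise ValueError("hop and window_length must be positive; order non-empty")
    step = 0
    start = 0
    while start < num_frames:
        end = min(start + window_length, num_frames)
        yield Step(step, start, end, order[step % len(order)])
        step += 1
        start = step * hop  # the window moves forward with the step count


steps = list(schedule(num_frames=10, window_length=4, hop=2, order=[0, 2, 1, 3]))
```

In this sketch the repeating order is passed in as a plain list, so any ordering that keeps error-sensitive dimensions apart can be substituted without changing the stepping logic.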

The time-series data generation device according to claim 1, wherein the predetermined repeating order is the order obtained by expressing in binary the indexes assigned to the dimensions of the multidimensional time-series data in order of increasing influence, and reversing the upper and lower digits of that binary representation.
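The bit-reversed ordering described in this claim can be illustrated with a short Python sketch. The function names `bit_reversed_order` and `dimension_for_step` are hypothetical, and the sketch assumes the number of dimensions is a power of two, which the claim does not state.

```python
def bit_reversed_order(num_bits: int) -> list[int]:
    """Return the indexes 0 .. 2**num_bits - 1 in bit-reversed order."""
    order = []
    for index in range(1 << num_bits):
        reversed_index = 0
        for bit in range(num_bits):
            # Move bit `bit` of index to position num_bits - 1 - bit.
            if index & (1 << bit):
                reversed_index |= 1 << (num_bits - 1 - bit)
        order.append(reversed_index)
    return order


def dimension_for_step(step: int, order: list[int]) -> int:
    """Pick the dimension index for a given step by cycling through the order."""
    return order[step % len(order)]


order = bit_reversed_order(3)
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

With eight dimensions, indexes 0 and 1, which are adjacent in the influence ranking, end up four positions apart in the repeating order, which matches the intent of separating error-sensitive dimensions.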

The time-series data generation device according to claim 1 or 2, wherein, for each window, the time-series data generation device also reads the frame data from the end position of the window up to a predetermined range further ahead in the time-advancing direction, calculates the time-series data, and then discards the time-series data for that range.

The time-series data generation device according to any one of claims 1 to 3, wherein the plurality of types of feature distribution parameter time series relate to speech, and
the multidimensional time-series data includes a fundamental frequency and feature quantities representing spectral envelope characteristics of a predetermined number of dimensions.

A time-series data generation method that takes as input a plurality of types of feature distribution parameter time series, each defined at each time as a combination of values at a plurality of times, and outputs multidimensional time-series data optimized as a series based on a predetermined criterion, wherein
the time-series data generation method reads, at each processing unit step, the feature distribution parameter time series as frame data in a window that moves in the time-advancing direction according to the step count, and calculates the time-series data of each dimension of the multidimensional time-series data in a predetermined repeating order according to the step count, and
the predetermined repeating order is configured such that, among the dimensions of the multidimensional time-series data, dimensions that are greatly affected when an error is present are separated from each other in the order.

A program causing a computer to function as the time-series data generation device according to any one of claims 1 to 4.