JP2018013721A

JP2018013721A - Voice synthesis parameter generating device and computer program for the same

Info

Publication number: JP2018013721A
Application number: JP2016144765A
Authority: JP
Inventors: Kentaro Tachibana (橘 健太郎); Tomoki Toda (戸田 智基)
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2016-07-22
Filing date: 2016-07-22
Publication date: 2018-01-25

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesis parameter generating device that is easy to control and modify and that generates highly accurate voice synthesis parameters.

SOLUTION: A voice synthesis parameter generating device 270 includes a memory that stores a program, an HMM acoustic model 136, and a DNN acoustic model 280, and a processor that executes the program and outputs a voice parameter series for each frame. The program executed by the processor includes the steps of: outputting, by the HMM acoustic model 136, the HMM parameter series corresponding to a label string 132; outputting, by the DNN acoustic model 280, the DNN parameter series corresponding to the label string for each frame; integrating the HMM parameter series and the DNN parameter series for each frame within a Product-of-Experts framework and outputting integrated model parameters (step 284); and generating and outputting voice synthesis parameters on the basis of the integrated model parameters output in step 284 (step 290).

SELECTED DRAWING: Figure 5

Description

The present invention relates to statistical parametric speech synthesis (SPSS), and more particularly to statistical parametric speech synthesis that can produce synthesized speech of higher quality than the prior art.

SPSS is a framework that uses a statistical model to estimate the parameters needed to generate a speech signal and then synthesizes speech from the estimated parameters. SPSS has various advantages; for example, flexible speech synthesis such as control of speaking style is easy to realize. On the other hand, speech synthesized with SPSS tends, for various reasons, to be of lower quality than natural speech. Improving the quality of synthesized speech is therefore an important research topic in SPSS.

There are two main approaches to SPSS: one based on the hidden Markov model (HMM) and one based on the deep neural network (DNN). These two conventional approaches are described below.

An HMM treats an utterance as arising from transitions between states that correspond to phonemes. The states and their transitions cannot be observed directly, but by observing and statistically processing the sequence of information obtained from the utterance that is output as the states change, the states and their transitions, that is, the phoneme sequence, can be estimated. To obtain this information, the speech signal is divided into frames of a predetermined length at a predetermined shift (frames may overlap), and features are computed from the speech signal of each frame. Multiple values are used as features, forming a feature vector. In speech synthesis, the features commonly used are mel-cepstral coefficients and power, together with their deltas (differences between adjacent frames) and delta-deltas (differences between adjacent deltas).
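The delta and delta-delta features mentioned above can be illustrated with a minimal sketch. It assumes the simplest definition given in the text (a plain difference between adjacent frames) and pads the first frame with zeros; the function name `add_deltas` and the zero padding are illustrative assumptions, and practical systems often use wider regression windows instead.

```python
import numpy as np

def add_deltas(static):
    """Append delta and delta-delta features to a (frames, dims) array.

    Assumption: delta[t] = x[t] - x[t-1], with the first frame set to zero.
    """
    static = np.asarray(static, dtype=float)
    delta = np.zeros_like(static)
    delta[1:] = static[1:] - static[:-1]      # difference between adjacent frames
    delta2 = np.zeros_like(static)
    delta2[1:] = delta[1:] - delta[:-1]       # difference between adjacent deltas
    return np.concatenate([static, delta, delta2], axis=1)

# Four frames of a one-dimensional static feature (illustrative values).
features = add_deltas([[0.0], [1.0], [3.0], [6.0]])
```

Each row of `features` then holds the static value, its delta, and its delta-delta for one frame, which together form the feature vector of that frame.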

The three-state HMM 60 shown in FIG. 1 is described here. In the following description, identical components are given the same reference numerals, and their detailed description is not repeated.

Referring to FIG. 1, HMM 60 includes a state S12 corresponding to the phoneme that HMM 60 represents, a state S11 corresponding to the preceding phoneme, and a state S13 corresponding to the following phoneme. For each phoneme that appears in an utterance, a model like HMM 60 is prepared for each combination with its preceding and following phonemes. The same phoneme therefore uses different HMMs when its neighboring phonemes differ (although, to save storage, several combinations may be represented by a single HMM). The number of states is not limited to three; other numbers of states may be used.

Referring to FIG. 1, a transition probability a12 is assigned to the transition 80 from state S11 to state S12, a transition probability a23 to the transition 82 from state S12 to state S13, and a transition probability a3e to the transition 84 from state S13 to the end of HMM 60. States S11, S12, and S13 also have self-transitions 70, 72, and 74, to which transition probabilities a11, a22, and a33 are assigned, respectively. These transition probabilities are computed in advance from training data.
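The left-to-right topology of FIG. 1 can be written as a transition matrix, as in the following sketch. The probability values are invented for illustration, and the function name `left_to_right_transitions` is hypothetical. The last lines use a standard property of HMMs (a state with self-transition probability a_ii is occupied for 1 / (1 - a_ii) frames on average); this is general background, not a statement about how the duration model 134 works.

```python
import numpy as np

def left_to_right_transitions(self_probs):
    """Build the transition matrix of a left-to-right HMM.

    Row i holds a_ii (stay) and 1 - a_ii (advance); the extra last
    column is the exit from the final state (a3e in FIG. 1).
    """
    n = len(self_probs)
    A = np.zeros((n, n + 1))
    for i, a_ii in enumerate(self_probs):
        A[i, i] = a_ii
        A[i, i + 1] = 1.0 - a_ii
    return A

# Illustrative self-transition probabilities a11, a22, a33 (not from the patent).
A = left_to_right_transitions([0.6, 0.8, 0.5])

# Expected number of frames spent in each state under a geometric duration.
expected_frames = 1.0 / (1.0 - np.diag(A[:, :3]))
```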

Referring to FIG. 2, a conventional speech synthesizer 100 using HMMs includes a parameter generation unit 110 that, given input text 102, outputs parameters for generating speech corresponding to the input text, and a speech synthesis unit 112 that synthesizes and outputs a speech signal 104 using the parameters generated by the parameter generation unit 110.

The parameter generation unit 110 includes: a text analysis processing unit 130 that performs morphological analysis, syntactic analysis, and the like on the input text 102 and outputs a label sequence 132 consisting of labels representing the phonemes to be uttered and their context information; a duration model 134 that consists of decision trees dependent on the label sequence 132 and outputs the duration of each phoneme corresponding to the label sequence 132; an HMM acoustic model 136 that outputs, for each HMM state, the probability density function of the speech synthesis parameters based on decision trees dependent on the label sequence 132; an HMM state sequence determination unit 138 that takes the label sequence 132 as input and outputs an HMM state sequence based on the decision trees of the duration model 134; and a speech synthesis parameter calculation unit 140 that outputs an F0 parameter 142, a voiced/unvoiced parameter 144, and a spectral envelope parameter 146 for articulation, based on the probability density function of each state estimated with the HMM acoustic model 136 from the HMM state sequence output by the HMM state sequence determination unit 138. The spectral envelope parameter 146 may be, for example, cepstral coefficients, mel-cepstral coefficients, or linear prediction coefficients.

The speech synthesis unit 112 includes a source signal generation unit 150 that receives the F0 parameter 142 and the voiced/unvoiced parameter 144 from the speech synthesis parameter calculation unit 140 and generates a source signal, and a speech synthesis filter 152 that outputs the speech signal 104 by modulating the source signal generated by the source signal generation unit 150 according to the spectral envelope parameter 146.

During speech synthesis, the HMM state sequence determination unit 138 takes the label sequence 132 as input, searches the decision trees of the duration model 134 to determine the duration of each state in the HMM state sequence, and supplies the label sequence annotated with this timing information to the speech synthesis parameter calculation unit 140. For each frame of the utterance, the speech synthesis parameter calculation unit 140 uses the output probability density function of the state corresponding to that frame to estimate the maximum-likelihood speech synthesis parameter sequence, generates the F0 parameter 142, the voiced/unvoiced parameter 144, and the spectral envelope parameter 146, and supplies them to the speech synthesis unit 112.
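The step in which each frame is assigned the output distribution of its state can be sketched as follows. This is only an illustration of expanding per-state Gaussians into a per-frame sequence once the state durations are known; it assumes diagonal covariances stored as variance vectors, and the function name `expand_states_to_frames` and all values are hypothetical. The maximum-likelihood parameter generation that follows in the actual device is not shown.

```python
import numpy as np

def expand_states_to_frames(means, variances, durations):
    """Repeat each state's Gaussian parameters for the frames it occupies.

    means, variances: (states, dims) arrays (diagonal covariance assumed).
    durations: number of frames assigned to each state.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    durations = np.asarray(durations, dtype=int)
    frame_means = np.repeat(means, durations, axis=0)
    frame_vars = np.repeat(variances, durations, axis=0)
    # Index of the state that each frame belongs to.
    frame_to_state = np.repeat(np.arange(len(durations)), durations)
    return frame_means, frame_vars, frame_to_state

# Three states lasting 2, 1, and 3 frames (illustrative values).
frame_means, frame_vars, frame_to_state = expand_states_to_frames(
    means=[[1.0], [2.0], [3.0]],
    variances=[[0.1], [0.2], [0.3]],
    durations=[2, 1, 3],
)
```

Within one state every frame shares the same mean and variance, which is the state-level resolution that the description contrasts with frame-level DNN modeling.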

The source signal generation unit 150 of the speech synthesis unit 112 generates a source signal according to the F0 parameter 142 and the voiced/unvoiced parameter 144 from the speech synthesis parameter calculation unit 140. The speech synthesis filter 152 modulates this source signal with characteristics determined by the spectral envelope parameter 146 and outputs the speech signal 104.
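As a rough illustration of how a source signal can be derived from F0 and voiced/unvoiced parameters, the sketch below uses a common textbook scheme: a pulse train at the pitch period for voiced frames and white noise for unvoiced frames. The patent does not specify how the source signal generation unit 150 works internally, so the scheme, the function name `make_excitation`, the sampling rate, and the frame shift are all assumptions.

```python
import numpy as np

def make_excitation(f0, voiced, sample_rate=16000, frame_shift=80, seed=0):
    """Generate a simple excitation: pulses when voiced, noise when unvoiced.

    f0: per-frame fundamental frequency in Hz.
    voiced: per-frame voiced/unvoiced flags.
    """
    rng = np.random.default_rng(seed)
    out = np.zeros(len(f0) * frame_shift)
    next_pulse = 0.0                      # sample position of the next pulse
    for t, (f, v) in enumerate(zip(f0, voiced)):
        start = t * frame_shift
        end = start + frame_shift
        if v and f > 0:
            period = sample_rate / f      # pitch period in samples
            next_pulse = max(next_pulse, start)
            while next_pulse < end:
                out[int(next_pulse)] = 1.0
                next_pulse += period
        else:
            out[start:end] = rng.standard_normal(frame_shift)
            next_pulse = end              # restart pulses at the next frame
    return out

# Four voiced frames at 100 Hz: one pulse every 160 samples.
excitation = make_excitation([100.0] * 4, [True] * 4)
```

In a full synthesizer this excitation would then be shaped by a filter derived from the spectral envelope parameters; that filtering step is omitted here.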

An advantage of HMM speech synthesis is that many years of accumulated knowledge have produced established techniques for controlling and manipulating voice quality and speaking style. Consequently, even when the generated parameters produce abnormal sounds, the problem is easy to identify and correct. On the other hand, modeling at the state level and the hard clustering performed by the decision trees when determining the HMM state sequence degrade the quality of the synthesized speech.

DNN speech synthesis, in contrast, uses a DNN such as the DNN 170 outlined in FIG. 3 to generate parameters. The DNN 170 shown in FIG. 3 takes as input a vector generated from the label sequence 132 and outputs the speech parameters corresponding to that label sequence. The network weights and biases are computed in advance from training data.

The DNN 170 includes an input layer 172 whose nodes hold a vector obtained by converting the label sequence 132 into binary or numerical form, an output layer 178 whose nodes hold the speech parameters, and one or more hidden layers 174 and 176 arranged in order between the input layer 172 and the output layer 178. In FIG. 3, every layer has the same number of nodes to keep the figure simple, but the number of nodes in the hidden layers is not limited to this. FIG. 3 shows two hidden layers, but there may be one, or three or more. In the DNN 170 of FIG. 3 the input propagates sequentially from the input layer 172 toward the output layer 178, but other types of NN may be used, such as a so-called recurrent NN, which has a path that feeds part of the output of an intermediate hidden layer back to its input.
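The feed-forward structure described for DNN 170 can be sketched in a few lines. The layer sizes, the tanh activation, the random initialization, and the function name `forward` are illustrative assumptions; the patent does not state the activation function or dimensions, and in practice the weights and biases come from training.

```python
import numpy as np

def forward(x, layers):
    """Propagate an input vector through a feed-forward network.

    layers: list of (W, b) pairs; hidden layers use tanh, the output
    layer is linear so it can produce real-valued speech parameters.
    """
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:           # no activation on the output layer
            h = np.tanh(h)
    return h

# Hypothetical sizes: 10-dim label vector, two hidden layers of 8, 4 outputs.
rng = np.random.default_rng(0)
sizes = [10, 8, 8, 4]
layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
label_vector = rng.integers(0, 2, size=10)   # binary-coded label, one frame
speech_params = forward(label_vector, layers)
```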

Referring to FIG. 4, a conventional speech synthesizer 200 using a DNN includes a parameter generation unit 210 that receives the input text 102 and outputs an F0 parameter 242, a voiced/unvoiced parameter 244, and a spectral envelope parameter 246, and the same speech synthesis unit 112 as in FIG. 2, which receives the F0 parameter 242, the voiced/unvoiced parameter 244, and the spectral envelope parameter 246 output by the parameter generation unit 210 and outputs a speech signal 204.

The parameter generation unit 210 includes: the text analysis processing unit 130, which receives the input text 102 and outputs the label sequence 132; a DNN acoustic model 230 consisting of a DNN that receives the label sequence 132 as input and outputs, for each frame, the mean vector of the probability density function of the speech parameters; a normalization parameter storage unit 234 that stores a global mean vector and a covariance matrix of the speech parameters computed in advance from training data; and a speech synthesis parameter calculation unit 232 that computes denormalized speech synthesis parameters from the mean vector supplied by the DNN acoustic model 230 and the global mean vector and covariance matrix read from the normalization parameter storage unit 234, and then outputs to the speech synthesis unit 112 the sequences of the F0 parameter 242, the voiced/unvoiced parameter 244, and the spectral envelope parameter 246 that maximize the likelihood. The F0 parameter 242 and the voiced/unvoiced parameter 244 are supplied to the source signal generation unit 150, and the spectral envelope parameter 246 to the speech synthesis filter 152. The global mean vector and covariance matrix stored in the normalization parameter storage unit 234 are computed from the training data when the DNN is trained and are shared by all frames.

The label sequence 132 output by the text analysis processing unit 130 is supplied to the input of the DNN acoustic model 230 frame by frame. In response, the DNN acoustic model 230 outputs, for each frame, the mean vector of the output probability density function. This vector is supplied to the speech synthesis parameter calculation unit 232, which reads the normalization parameters from the normalization parameter storage unit 234 and, according to the probability density function obtained by combining them with the mean vector from the DNN acoustic model 230, generates and outputs the F0 parameter 242, the voiced/unvoiced parameter 244, and the spectral envelope parameter 246.
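The denormalization mentioned above might look like the following sketch. It assumes that the DNN was trained on targets standardized per dimension (subtract the global mean, divide by the global standard deviation) and that the covariance is diagonal; the patent text does not spell out the exact normalization, so this scheme and the function name `denormalize` are assumptions.

```python
import numpy as np

def denormalize(normalized_means, global_mean, global_var):
    """Map per-frame normalized DNN outputs back to the parameter domain.

    Assumption: targets were standardized as (x - global_mean) / std,
    with std taken from the diagonal of the global covariance matrix.
    """
    normalized_means = np.asarray(normalized_means, dtype=float)
    std = np.sqrt(np.asarray(global_var, dtype=float))
    return normalized_means * std + np.asarray(global_mean, dtype=float)

# Two frames, two dimensions (illustrative values).
frame_means = denormalize(
    normalized_means=[[0.0, 1.0], [-1.0, 2.0]],
    global_mean=[10.0, 20.0],
    global_var=[4.0, 9.0],
)
```

The same global mean and variance are applied to every frame, which matches the statement that these statistics are shared by all frames.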

The source signal generation unit 150 and the speech synthesis filter 152 of the speech synthesis unit 112 synthesize and output the speech signal 204 in the same way as in FIG. 2.

DNN speech synthesis models speech frame by frame using a DNN and can generate speech of higher quality than HMM speech synthesis. However, techniques for controlling and manipulating the DNN at its core are still limited. Its drawbacks are that abnormal sounds are difficult to correct when they occur and that, because DNNs have only recently been applied to speech synthesis, little knowledge has accumulated.

Non-Patent Document 1: B. Chen, Z. Chen, J. Xu and K. Yu, "An investigation of context clustering for statistical speech synthesis with deep neural network," in Proc. ICASSP, pp. 2212-2216, 2015.
Non-Patent Document 2: H. Zen, M. Gales, Y. Nankaku and K. Tokuda, "Product of experts for statistical parametric speech synthesis," IEEE Transactions on Audio, Speech, and Language Processing, 20(3), pp. 794-805, 2012.

Various efforts have been made to improve the quality of SPSS. One of them is to integrate different models. There are two main ways to integrate models: connecting them in series, and connecting them in parallel.

As a series connection, a technique that combines DNN and HMM speech synthesis, the representative SPSS approaches, has been proposed (Non-Patent Document 1). In this technique, the estimation result of the HMM is used as input to the DNN. Because the models are connected in series, however, it is difficult to change a model flexibly or to integrate several models.

Parallel connection, in contrast, is highly flexible and makes it easy to integrate several models. One example is integration using the Product-of-Experts (PoE) framework (Non-Patent Document 2). The technique described in Non-Patent Document 2 integrates two different kinds of HMM according to PoE. However, because this technique still relies on HMM speech synthesis, the probability density function changes only from state to state, so its performance may fall short of a DNN, which can model speech precisely frame by frame.

HMM speech synthesis estimates a probability density function for each state from the label sequence and outputs the result as a sequence. In the resulting sequence of probability distributions, the mean vector and covariance matrix therefore change from state to state. DNN speech synthesis estimates the mean vector of the probability density function with high accuracy for each frame, but uses a fixed, precomputed covariance matrix for all frames. To exploit the advantages of both HMM and DNN speech synthesis, the present invention provides a means of combining the two within the PoE framework.

A speech synthesis parameter generation device according to a first aspect of the present invention receives a label sequence consisting of labels that represent the phonemes to be uttered and their context information, outputs a sequence of speech parameters representing the characteristics of the speech for each frame, a frame being the temporal unit of speech synthesis, and generates speech synthesis parameters for speech synthesis from the speech parameters. The device includes a memory that stores a program, a first acoustic model consisting of an HMM, and a second acoustic model consisting of a neural network (NN), and a processor that outputs the speech parameter sequence for each frame by executing the program stored in the memory. The processor is programmed to execute a method that includes: a state sequence determination step of outputting the HMM state sequence corresponding to the label sequence by referring to a duration model that outputs an HMM state sequence corresponding to a label sequence; a first model parameter generation step of referring to the first acoustic model and, with the state sequence as input, outputting for each state an HMM parameter sequence estimated from the HMM; a second model parameter generation step of referring to the second acoustic model and, with the state sequence and the label sequence as input, outputting for each frame an NN parameter sequence estimated from the NN; a model integration step of integrating, frame by frame within the PoE framework, the HMM parameter sequence output in the first model parameter generation step and the NN parameter sequence output in the second model parameter generation step, and outputting the result as integrated model parameters; and a speech synthesis parameter generation step of generating and outputting a speech synthesis parameter sequence based on the integrated model parameters output in the model integration step.

Preferably, each of the speech parameters constituting the speech parameter sequence is a vector with multiple elements. The output probability density function of each HMM state and the second probability density function modeled by the NN each include a Gaussian distribution with the same number of dimensions as the speech parameter vector.

More preferably, each of the state parameters constituting the state sequence output in the first model parameter generation step includes a mean vector and a covariance matrix that define a Gaussian distribution serving as the output probability density function of each state. Each of the NN parameters constituting the NN parameter sequence generated in the second model parameter generation step includes a mean vector that defines a Gaussian distribution for each frame. The model integration step includes a step of multiplying, frame by frame within the PoE framework, the Gaussian distribution defined by the mean vector and covariance matrix output in the first model parameter generation step by the Gaussian distribution defined by the mean vector output in the second model parameter generation step together with a precomputed global mean vector and a fixed covariance matrix, and outputting the result as integrated model parameters.

Each of the HMM parameters constituting the state sequence output in the first model parameter generation step includes the mean vector and covariance matrix of a Gaussian distribution defined as the probability density function for each state. The NN parameters generated in the second model parameter generation step include the mean vector of a Gaussian distribution defined for each frame. The model integration step includes a step of multiplying, frame by frame and with weights within the PoE framework, the Gaussian distribution defined by the mean vector and covariance matrix output in the first model parameter generation step by the Gaussian distribution defined by the mean vector output in the second model parameter generation step together with a precomputed, fixed global mean vector and covariance matrix, and outputting the result as integrated model parameters.

Each of the HMM parameters constituting the state sequence output in the first model parameter generation step may include a mean vector and a covariance matrix that define a Gaussian distribution for each state. The NN parameters generated in the second model parameter generation step may include a mean vector that defines a Gaussian distribution for each frame. The second model parameter generation step may include a step of referring to the second acoustic model and outputting, for each frame, the NN parameters of the phoneme corresponding to the state assigned to that frame in the first model parameter generation step.

A computer program according to a second aspect of the present invention causes a computer to function as any of the speech synthesis parameter generation devices described above.

FIG. 1 is a schematic diagram showing the conceptual structure of an HMM.
FIG. 2 is a block diagram of a speech synthesizer using the conventional HMM speech synthesis method.
FIG. 3 is a schematic diagram showing the conceptual structure of a DNN.
FIG. 4 is a block diagram of a speech synthesizer using the conventional DNN speech synthesis method.
FIG. 5 is a block diagram of a speech synthesizer according to an embodiment of the present invention.
FIG. 6 is a graph illustrating the result of multiplying probability density functions by PoE.
FIG. 7 is a schematic flowchart of a computer program that implements the process of generating speech synthesis parameters for each frame in the speech synthesizer according to the embodiment of the present invention.
FIG. 8 is a block diagram of a model training device that trains the HMM and the DNN used in the speech synthesizer according to the embodiment of the present invention.
FIG. 9 is a graph showing how the mean squared error of the F0 parameter changes with the weights used when integrating the HMM and the DNN in the model used by the speech synthesizer according to the present invention.
FIG. 10 is a diagram showing the external appearance of a computer system for implementing the speech synthesizer according to each embodiment of the present invention.
FIG. 11 is a block diagram showing the hardware configuration of the computer that constitutes the computer system shown in FIG. 10.

In the following description and drawings, identical components are given the same reference numerals, and their detailed description is not repeated. All of the embodiments described below combine a DNN and an HMM using the PoE framework. Although the present embodiment uses a DNN, the NN is not limited to this; a recurrent NN or the like may also be used.

＜Configuration＞
FIG. 5 is a schematic block diagram of a speech synthesizer that includes a speech synthesis parameter generation device according to an embodiment of the present invention. Referring to FIG. 5, the speech synthesizer 260 includes a parameter generation unit 270 that receives the input text 102, estimates speech synthesis parameters for the input text 102 using an HMM and a DNN, and generates an F0 parameter 292, a voiced/unvoiced parameter 294, and a spectral envelope parameter 296, and a speech synthesis unit 112, with the same configuration as described for the conventional techniques, that receives the F0 parameter 292, the voiced/unvoiced parameter 294, and the spectral envelope parameter 296 from the parameter generation unit 270 and outputs a speech signal 262.

The parameter generation unit 270 includes: the text analysis processing unit 130, which performs the same text analysis as in the conventional devices on the input text 102 and outputs the label sequence 132 containing phoneme information and context information; a DNN acoustic model 280 consisting of a DNN that receives the label sequence 132 as input and outputs, for each frame, the mean vector of the probability density function of the speech parameters; the duration model 134 and the HMM state sequence determination unit 138, connected in the same way as in FIG. 2; and the HMM acoustic model 136.

The parameter generation unit 270 further includes: a speech parameter integration unit 284 that integrates the speech parameters by multiplying, frame by frame and with weights according to the PoE framework, (a) the probability density function defined by the mean vector output for each frame by the DNN acoustic model 280 together with the global mean vector and covariance matrix computed in advance when the DNN acoustic model 280 was trained, and (b) the probability density function defined by the mean vector and covariance matrix output for each state by the HMM acoustic model 136 according to the HMM state sequence output by the HMM state sequence determination unit 138, and that outputs the integrated speech parameters; a weight storage unit 288 that stores the weights used by the speech parameter integration unit 284 when integrating the speech parameters and supplies them to the speech parameter integration unit 284; a normalization parameter storage unit 286 that stores the fixed global mean vector and covariance matrix corresponding to the DNN acoustic model 280 and supplies them to the speech parameter integration unit 284 in order to normalize the speech parameters output by the DNN acoustic model 280; and a speech synthesis parameter calculation unit 290 that uses the integrated speech parameters output by the speech parameter integration unit 284 to generate and output speech synthesis parameters consisting of the F0 parameter 292, the voiced/unvoiced parameter 294, and the spectral envelope parameter 296. In the present embodiment, the probability density functions mentioned above are Gaussian distributions defined by a mean vector and a covariance matrix.

In this embodiment, as described above, the speech parameters obtained from the HMM and the DNN are integrated according to the PoE framework. This integration is described below. PoE combines multiple probability density functions into a single probability density function by taking their product.

As shown in FIG. 6, the multiplication yields a probability density function that satisfies both the HMM and the DNN at the same time. In FIG. 6, the horizontal axis represents the random variable and the vertical axis represents the probability density.
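As a numerical illustration of the multiplication shown in FIG. 6, the following minimal sketch multiplies two one-dimensional Gaussian densities and returns the mean and variance of the resulting renormalized Gaussian. The function name and the example values are illustrative assumptions and do not appear in the specification.

```python
def product_of_gaussians(mean_a, var_a, mean_b, var_b):
    """Return (mean, variance) of the Gaussian proportional to
    N(x; mean_a, var_a) * N(x; mean_b, var_b)."""
    # Precisions (inverse variances) add when Gaussian densities are multiplied.
    precision = 1.0 / var_a + 1.0 / var_b
    var = 1.0 / precision
    # The new mean is the precision-weighted average of the two means.
    mean = var * (mean_a / var_a + mean_b / var_b)
    return mean, var


# Hypothetical example: one expert centred at 0, the other at 2, both with unit variance.
poe_mean, poe_var = product_of_gaussians(0.0, 1.0, 2.0, 1.0)
print(poe_mean, poe_var)
```

The product is narrower than either input density and lies between the two means, which is the sense in which it satisfies both models at once.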

In the present embodiment, the probability density function predicted by the DNN and the probability density function predicted by the HMM are integrated as follows. In this embodiment as well, the global mean vector and covariance matrix of the DNN probability density function are obtained in advance from the training data and are common to all frames.

According to this equation, on the one hand, the mean vector of the PoE probability density function varies from frame to frame, following the variation of the mean vector obtained from the DNN. On the other hand, its covariance matrix varies from state to state, following the change in covariance matrix that accompanies HMM state transitions. The probability density function is therefore defined by a mean vector that tracks frame-to-frame movement and a covariance matrix that tracks HMM state changes, and so combines the strengths of the DNN and the HMM.

In the present embodiment, weights are additionally introduced for combining the two models, and the PoE model combination is performed according to the following equation.
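The equation itself is not reproduced in this text. As a reference point only, the standard closed form for a weighted product of two Gaussians is sketched below. The symbols are notation assumed here (superscript H for the HMM, D for the DNN, s_t for the HMM state containing frame t, and alpha for the weights) and may differ from the numbered equations of the specification.

```latex
\begin{align}
p(\mathbf{o}_t) &\propto
  \mathcal{N}\!\left(\mathbf{o}_t;\ \boldsymbol{\mu}^{(H)}_{s_t},\ \boldsymbol{\Sigma}^{(H)}_{s_t}\right)^{\alpha_H}
  \mathcal{N}\!\left(\mathbf{o}_t;\ \boldsymbol{\mu}^{(D)}_{t},\ \boldsymbol{\Sigma}^{(D)}\right)^{\alpha_D}
  \ \propto\ \mathcal{N}\!\left(\mathbf{o}_t;\ \boldsymbol{\mu}_t,\ \boldsymbol{\Sigma}_t\right), \\
\boldsymbol{\Sigma}_t &=
  \left(\alpha_H \left(\boldsymbol{\Sigma}^{(H)}_{s_t}\right)^{-1}
      + \alpha_D \left(\boldsymbol{\Sigma}^{(D)}\right)^{-1}\right)^{-1}, \\
\boldsymbol{\mu}_t &=
  \boldsymbol{\Sigma}_t \left(\alpha_H \left(\boldsymbol{\Sigma}^{(H)}_{s_t}\right)^{-1} \boldsymbol{\mu}^{(H)}_{s_t}
      + \alpha_D \left(\boldsymbol{\Sigma}^{(D)}\right)^{-1} \boldsymbol{\mu}^{(D)}_{t}\right).
\end{align}
```

Setting both weights to 1 gives the unweighted product of the two densities.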

FIG. 7 is a flowchart showing the control structure of a computer program for implementing the speech parameter integration unit 284 shown in FIG. 5 on a computer. Referring to FIG. 7, this program includes a step 330 of executing the model combination step 332, described below, for each frame of the speech synthesis.

The model combination step 332 includes: a step 340 of reading, from the HMM state sequence output by the HMM state sequence determination unit 138 in FIG. 5, the HMM parameters, namely the mean vector and covariance matrix of the state that contains the time of the frame being processed; and a step 342 of reading the DNN parameters, namely the mean vector that the DNN acoustic model 280 in FIG. 5 outputs for the frame being processed and the fixed global mean vector and covariance matrix stored in the normalization parameter storage unit 286.
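The two read steps can be sketched as follows. This is only an illustration under stated assumptions: state durations are given as frame counts, covariance matrices are diagonal and stored as lists of variances, and all function and variable names are hypothetical.

```python
def frame_to_state(durations):
    """Expand per-state durations (in frames) into one state index per frame."""
    mapping = []
    for state_index, n_frames in enumerate(durations):
        mapping.extend([state_index] * n_frames)
    return mapping


def read_parameters(frame, mapping, hmm_states, dnn_means, global_mean, global_var):
    """Return the HMM and DNN parameters for one frame (steps 340 and 342)."""
    # Step 340: mean and variance of the HMM state that contains this frame.
    hmm_mean, hmm_var = hmm_states[mapping[frame]]
    # Step 342: per-frame DNN mean plus the fixed global statistics.
    dnn_params = (dnn_means[frame], global_mean, global_var)
    return (hmm_mean, hmm_var), dnn_params


# Hypothetical data: two states lasting 2 and 3 frames, one-dimensional parameters.
durations = [2, 3]
hmm_states = [([0.0], [1.0]), ([5.0], [2.0])]
dnn_means = [[0.1], [0.2], [4.8], [5.1], [5.3]]
mapping = frame_to_state(durations)
hmm_params, dnn_params = read_parameters(3, mapping, hmm_states, dnn_means, [2.5], [0.5])
print(mapping, hmm_params, dnn_params)
```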

The DNN acoustic model 280 and the HMM acoustic model 136 according to the above embodiment may be trained independently of each other, but it is preferable that they be optimized for integration by PoE. A joint training method for the DNN and the HMM, which optimizes them for the speech synthesis apparatus 260 according to this embodiment, is described below.

FIG. 8 shows this joint training method for the DNN and the HMM in flowchart form. Referring to FIG. 8, as in ordinary DNN and HMM training, the training method 360 includes a step of preparing training data 362 containing acoustic features and a label sequence of phonemes and context information corresponding to each acoustic feature, a step 380 of training an initial HMM using the training data 362, and a step 382 of training an initial DNN. In step 384, the HMM parameter set and the DNN parameter set obtained by this initial training are optimized by maximizing the PoE likelihood function given by the following equation.

Here, the posterior probability density function calculated in the E step of the EM algorithm is given by the following equation.

That is, referring to FIG. 8, step 384 includes: an E step 400 of computing the PoE model from the given HMM and DNN and using that model to jointly estimate the posterior probability density functions of the latent variables for the HMM and the DNN; an M step 402 of separately performing maximum-likelihood estimation of the DNN parameters and the HMM parameters, using the posterior probability distributions estimated for the respective models in E step 400 as observation vectors; a step 404 of repeating E step 400 and M step 402, with the model parameters obtained in M step 402 as the new model parameters, until a termination condition is satisfied; and a step of outputting, when step 404 determines that the termination condition is satisfied, the HMM and DNN parameter sets at that time as the HMM acoustic model 364 and the DNN acoustic model 366. The termination condition may be, for example, that the PoE likelihood function has converged, that the HMM and DNN parameters have converged, or that a predetermined number of iterations has been completed. Both the HMM and DNN parameter sets have already been trained to convergence once in the pre-training stage. Consequently, the renewed updates in this process also converge quickly, often after only one or two iterations.
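The iteration control of step 404 can be sketched as a generic loop. This is a control-flow illustration only: the E step, M step, and likelihood are passed in as callables, the toy callables at the bottom are placeholders that are not the PoE model of the specification, and the tolerance and iteration limit are assumed values.

```python
def run_joint_training(params, e_step, m_step, log_likelihood,
                       tol=1e-4, max_iters=50):
    """Repeat the E and M steps until the likelihood change falls below tol
    or max_iters iterations have run. Returns (params, iterations_run)."""
    previous = log_likelihood(params)
    iterations = 0
    while iterations < max_iters:              # termination: iteration limit
        posterior = e_step(params)             # corresponds to E step 400
        params = m_step(params, posterior)     # corresponds to M step 402
        iterations += 1
        current = log_likelihood(params)
        if abs(current - previous) < tol:      # termination: likelihood converged
            break
        previous = current
    return params, iterations


# Toy placeholders (NOT the patent's models): a scalar parameter moving towards 3.0.
TARGET = 3.0

def toy_e_step(p):
    return TARGET - p                  # stand-in for the estimated posterior

def toy_m_step(p, residual):
    return p + 0.5 * residual          # stand-in for the parameter update

def toy_log_likelihood(p):
    return -(p - TARGET) ** 2

final_params, n_iters = run_joint_training(0.0, toy_e_step, toy_m_step, toy_log_likelihood)
print(final_params, n_iters)
```

A parameter-convergence test, the third kind of termination condition mentioned in the text, could be added in the same place as the likelihood test.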

E step 400 includes a step 420 of computing the PoE model from the input HMM and DNN, and a step 422 of jointly estimating and outputting the latent variables for the HMM and the DNN using the PoE model computed in step 420.

M step 402 includes a step 440 of updating the HMM parameter set by maximum-likelihood estimation, using as observation vectors the posterior probability distribution based on the latent variables jointly estimated in step 422, and a step 442 of updating the DNN parameter set by maximum-likelihood estimation in the same way.

A method such as steepest descent is used to update the HMM parameter set. A method such as stochastic gradient descent is used to update the DNN parameter set.

<Operation>
The apparatus shown in FIGS. 5 to 7 operates as follows. The HMM and the DNN are optimized as shown in FIG. 8. The weights used in that optimization are stored in the weight storage unit 288. In addition, the covariance matrix of the probability density function of the DNN output is calculated when the DNN is trained and is stored in the normalization parameter storage unit 286.

Referring to FIG. 5, the text analysis processing unit 130 analyzes the input text 102 and outputs the label sequence 132. Each label in the label sequence 132 contains a phoneme of the utterance and its context information.

The DNN acoustic model 280 receives the label sequence 132, outputs a mean vector for each frame of the utterance, and supplies it to the speech parameter integration unit 284. The HMM state sequence determination unit 138 determines the HMM state sequence corresponding to the input text 102 by searching a decision tree according to each label and reading the durations from the duration model 134, and supplies the mean vector and covariance matrix of the output probability density function of each state, as the HMM state sequence, to the speech parameter integration unit 284.

The speech parameter integration unit 284 repeats the following processing (step 332 in FIG. 7) for each frame. First, it reads the HMM parameters (mean vector and covariance matrix) of the state that contains the frame (step 340). Next, it reads the DNN parameter (mean vector) for the frame together with the global mean vector and covariance matrix stored in the normalization parameter storage unit 286 (step 342). In step 343, it determines whether the voiced/unvoiced parameters of the two models match. If they match, in step 344 it calculates the mean vector and covariance matrix of the probability density function of the PoE-based model using equations (7) and (8). Otherwise, in step 345, it calculates the mean vector and covariance matrix of the probability density function of the PoE-based model using equations (7) and (8), respectively, with the voiced/unvoiced parameter excluded, and outputs the DNN's voiced/unvoiced parameter unchanged. In the following step 346, the mean vector and covariance matrix calculated in this way are output as the probability density function for the current frame.
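The branch in steps 343 to 346 can be sketched as below. Equations (7) and (8) are not reproduced in this text, so the sketch substitutes the standard weighted product of Gaussians and assumes diagonal covariances. The position of the voiced/unvoiced element (`vuv_index`), the 0.5 decision threshold, and all names are hypothetical.

```python
def integrate_frame(hmm_mean, hmm_var, dnn_mean, dnn_var,
                    w_hmm, w_dnn, vuv_index, threshold=0.5):
    """Combine HMM and DNN Gaussians for one frame, element by element."""
    # Step 343: do the two models agree on voiced/unvoiced?
    hmm_voiced = hmm_mean[vuv_index] >= threshold
    dnn_voiced = dnn_mean[vuv_index] >= threshold
    vuv_match = hmm_voiced == dnn_voiced

    mean, var = [], []
    for i in range(len(hmm_mean)):
        if i == vuv_index and not vuv_match:
            # Step 345: leave V/UV out of the product and pass the DNN value through.
            mean.append(dnn_mean[i])
            var.append(dnn_var[i])
            continue
        # Steps 344 and 345: weighted product of the two Gaussians (assumed form).
        precision = w_hmm / hmm_var[i] + w_dnn / dnn_var[i]
        var_i = 1.0 / precision
        mean_i = var_i * (w_hmm * hmm_mean[i] / hmm_var[i]
                          + w_dnn * dnn_mean[i] / dnn_var[i])
        mean.append(mean_i)
        var.append(var_i)
    # Step 346: the result defines the density for the current frame.
    return mean, var


# Hypothetical frame: element 0 is a spectral value, element 1 is the V/UV value.
frame_mean, frame_var = integrate_frame(
    hmm_mean=[1.0, 0.0], hmm_var=[1.0, 1.0],
    dnn_mean=[3.0, 1.0], dnn_var=[1.0, 1.0],
    w_hmm=0.1, w_dnn=0.9, vuv_index=1)
print(frame_mean, frame_var)
```

In this example the two models disagree on voicing, so the second element is taken from the DNN while the first is the weighted combination of both.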

Speech synthesis based on the input text 102 is performed by executing the processing of step 332 for every frame of the utterance.

We varied the weights of the DNN and the HMM in the PoE model combination and examined the quality of the speech synthesized with the resulting PoE model. The results are shown in FIG. 9.

Referring to FIG. 9, the left end corresponds to an HMM weight of 1 and a DNN weight of 0, and the right end to an HMM weight of 0 and a DNN weight of 1. The graph in FIG. 9 shows that, for DNN weights from about 0.75 to about 0.98, the PoE system clearly outperforms both the DNN-only and the HMM-only systems. Performance is particularly high when the DNN weight is within ±0.05 of 0.9.

In the above embodiment, as shown in FIG. 7, the parameters from the DNN and the parameters from the HMM corresponding to a frame are read for every frame. However, the present invention is not limited to such an embodiment. Because the HMM parameters change only from state to state, they may be read only when the state changes and need not be read during the processing of each frame.
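A minimal sketch of this variation follows: the HMM parameters are fetched once per state and reused for every frame of that state, while the DNN parameters are still fetched per frame. The data layout and names are illustrative assumptions.

```python
def pair_parameters(durations, hmm_states, dnn_means):
    """Pair each frame's DNN mean with the HMM parameters of its state,
    reading the HMM parameters only when the state changes."""
    pairs = []
    hmm_reads = 0
    frame = 0
    for state_index, n_frames in enumerate(durations):
        hmm_params = hmm_states[state_index]   # read once per state
        hmm_reads += 1
        for _ in range(n_frames):
            pairs.append((hmm_params, dnn_means[frame]))  # DNN read per frame
            frame += 1
    return pairs, hmm_reads


# Hypothetical data: two states lasting 2 and 3 frames.
pairs, hmm_reads = pair_parameters(
    durations=[2, 3],
    hmm_states=["state0-params", "state1-params"],
    dnn_means=[0.1, 0.2, 4.8, 5.1, 5.3])
print(hmm_reads, len(pairs))
```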

In the above embodiment, two models, a DNN and an HMM, are combined by the PoE framework. However, the present invention is not limited to such an embodiment. The present invention can also be applied to three or more models, provided that the weaknesses of one model can be compensated for by the others, that is, provided that the acoustic models are built on mutually different principles.
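For Gaussian experts, the extension to more than two models has a simple closed form: the weighted precisions add. The sketch below illustrates this for one-dimensional Gaussians; the function name, the third expert, and the numbers are hypothetical and are not taken from the specification.

```python
def product_of_experts(means, variances, weights):
    """Weighted product of any number of 1-D Gaussian experts.
    Returns (mean, variance) of the resulting Gaussian."""
    # Weighted precisions add across experts.
    precision = sum(w / v for w, v in zip(weights, variances))
    var = 1.0 / precision
    # The mean is the weighted, precision-weighted average of the expert means.
    mean = var * sum(w * m / v for w, m, v in zip(weights, means, variances))
    return mean, var


# Hypothetical example with three experts of equal weight and unit variance.
mean3, var3 = product_of_experts([0.0, 2.0, 4.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
print(mean3, var3)
```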

[Implementation by computer]
The speech synthesis apparatus 260, which includes the speech synthesis parameter generation device according to the embodiment of the present invention, can be implemented by computer hardware and a computer program executed on that hardware. FIG. 10 shows the external appearance of this computer system 530, and FIG. 11 shows its internal configuration.

Referring to FIG. 10, the computer system 530 includes a computer 540 having a memory port 552 and a DVD (Digital Versatile Disc) drive 550, a keyboard 546, a mouse 548, and a monitor 542.

Referring to FIG. 11, in addition to the memory port 552 and the DVD drive 550, the computer 540 includes a CPU (central processing unit) 556; a bus 566 connected to the CPU 556, the memory port 552, and the DVD drive 550; a read-only memory (ROM) 558 that stores a boot program and the like; a random access memory (RAM) 560 that is connected to the bus 566 and stores program instructions, a system program, working data, and the like; and a hard disk 554. The computer system 530 further includes a network interface (I/F) 544 that provides a connection to a network 568 enabling communication with other terminals.

The computer program that causes the computer system 530 to function as the functional units of the speech synthesis apparatus 260 according to the above embodiment is stored on a DVD 562 or a removable memory 564 loaded into the DVD drive 550 or the memory port 552, and is then transferred to the hard disk 554. Alternatively, the program may be sent to the computer 540 over the network 568 and stored on the hard disk 554. The program is loaded into the RAM 560 at execution time. The program may also be loaded into the RAM 560 directly from the DVD 562, from the removable memory 564, or via the network 568.

This program includes an instruction sequence made up of a plurality of instructions that cause the computer 540 to function as the functional units of the speech synthesis apparatus 260 according to the above embodiment. Some of the basic functions needed for the computer 540 to perform this operation are provided by the operating system or third-party programs running on the computer 540, or by dynamically linkable programming toolkits or program libraries installed on the computer 540. The program itself therefore need not contain all the functions required to implement the system, apparatus, and method of this embodiment. It need only contain those instructions that implement the functions of the system, apparatus, or method described above by dynamically calling, at run time and in a controlled manner that yields the desired result, the appropriate functions or the appropriate programs in a programming toolkit or program library. Of course, the program alone may provide all the necessary functions.

<Operation and effects of the embodiment>
In the present invention, the models of HMM speech synthesis and DNN speech synthesis are integrated using the Product-of-Experts framework, which can satisfy the constraints of both. In other words, while satisfying the constraints of both the HMM and the DNN, it becomes possible to take the mean vector from the more accurate DNN and the variance from the HMM. Speech synthesis parameters can thus be generated that estimate the mean vector with the accuracy of DNN speech synthesis while taking the variance of HMM speech synthesis into account, yielding synthesized speech of higher quality. That is, when the parameters generated by one of several different types of speech synthesis model, such as DNN and HMM speech synthesis, would cause quality degradation, the parameters from the other model can compensate. More than two models can be integrated on the same principle. As long as the models to be integrated are of mutually different types whose strengths can be combined, the parameters obtained from them can be integrated to raise the quality of the speech synthesis parameters. Furthermore, as in the above embodiment, introducing latent variables into the PoE framework when optimizing the models to be integrated creates conditional independence among the source models and enables training by the EM algorithm. With the EM algorithm, convergence to a maximum-likelihood solution is guaranteed. The speech parameters obtained from the multiple models can therefore be integrated stably. Because a speech parameter sequence that satisfies the constraints of multiple models can be generated, quality degradation of the synthesized speech is prevented and high-quality synthesized speech can be produced.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

S11, S12, S13 State
60 HMM
100, 200, 260 Speech synthesis apparatus
102 Input text
104, 204, 262 Speech signal
110, 210, 270 Parameter generation unit
112 Speech synthesis unit
130 Text analysis processing unit
132 Label sequence
134 Duration model
136, 364 HMM acoustic model
138 HMM state sequence determination unit
140, 232, 290 Speech synthesis parameter calculation unit
142, 242, 292 F0 parameter
144, 244, 294 Voiced/unvoiced parameter
146, 246, 296 Spectral envelope parameter
150 Sound source signal generation unit
152 Speech synthesis filter
170 DNN
172 Input layer
174, 176 Hidden layer
178 Output layer
230, 280, 366 DNN acoustic model
234, 286 Normalization parameter storage unit
284 Speech parameter integration unit
288 Weight storage unit
332 Model combination step
362 Training data
400 E step
402 M step

Claims

A speech synthesis parameter generation device that receives a label sequence consisting of labels each representing a phoneme to be uttered and its context information, and that generates speech synthesis parameters for speech synthesis from a speech parameter sequence indicating features of the speech for each frame, a frame being the time unit of speech synthesis, the device comprising:
a memory that stores a program, a first acoustic model comprising a hidden Markov model (HMM), and a second acoustic model comprising a neural network (NN); and
a processor that outputs the speech parameter sequence for each frame by executing the program stored in the memory,
wherein the processor is programmed by the program to execute a method including:
a state sequence determination step of determining and outputting an HMM state sequence corresponding to the label sequence, with reference to a duration model that outputs an HMM state sequence corresponding to the label sequence;
a first model parameter generation step of referring to the first acoustic model and outputting, for each state, an HMM parameter sequence estimated from the HMM, using the state sequence as an input;
a second model parameter generation step of referring to the second acoustic model and outputting, for each frame, an NN parameter sequence estimated from the NN, using the state sequence and the label sequence as inputs;
a model integration step of integrating, for each frame and according to a product-of-experts (PoE) framework, the HMM parameter sequence output in the first model parameter generation step and the NN parameter sequence output in the second model parameter generation step, and outputting the result as integrated model parameters; and
a speech synthesis parameter generation step of generating and outputting the speech synthesis parameter sequence based on the integrated model parameters output in the model integration step.

The speech synthesis parameter generation device according to claim 1, wherein each of the speech parameters constituting the speech parameter sequence is a vector having a plurality of elements, and
the probability density function of the output of each state of the HMM and the second probability density function modeled by the NN each include a Gaussian distribution having the same number of dimensions as the speech parameter vector.

The speech synthesis parameter generation device according to claim 2, wherein each element of the HMM parameter sequence output in the first model parameter generation step includes a mean vector and a covariance matrix that define a Gaussian distribution serving as the probability density function of the output for each state,
each of the NN parameters constituting the NN parameter sequence generated in the second model parameter generation step includes a mean vector defining a Gaussian distribution for each frame, and
the model integration step includes a step of multiplying, for each frame according to the PoE framework, the Gaussian distribution defined by the mean vector and the covariance matrix output in the first model parameter generation step by the Gaussian distribution defined by the mean vector output in the second model parameter generation step and a global mean and covariance matrix determined in advance, and outputting the result as the integrated model parameters.

The speech synthesis parameter generation device according to claim 2, wherein each of the HMM parameters output in the first model parameter generation step includes a mean vector and a covariance matrix of a Gaussian distribution defined as the probability density function for each state,
the NN parameter generated in the second model parameter generation step includes a mean vector of a Gaussian distribution defined for each frame, and
the model integration step includes a step of adding, with weighting and for each frame according to the PoE framework, the Gaussian distribution defined by the mean vector and the covariance matrix output in the first model parameter generation step and the Gaussian distribution defined by the mean vector output in the second model parameter generation step and a fixed covariance matrix calculated in advance, and outputting the result as the integrated model parameters.

A computer program for causing a computer to function as the speech synthesis parameter generation device according to any one of claims 1 to 4.