JP6618453B2 - Database generation apparatus, generation method, speech synthesis apparatus, and program for speech synthesis

Info

Publication number: JP6618453B2
Application number: JP2016223331A
Authority: JP
Inventors: 信行西澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2016-11-16
Filing date: 2016-11-16
Publication date: 2019-12-11
Anticipated expiration: 2036-11-16
Also published as: JP2018081200A

Description

The present invention relates to a technique for generating a database used by a speech synthesizer.

Non-Patent Document 1 discloses speech synthesis based on hidden Markov models (HMMs). In HMM-based speech synthesis, each phoneme that serves as a unit of synthesis (which differs from a phoneme in the phonetic sense) is modeled in advance by an HMM trained on training data, and a database is created. Hereinafter, an HMM model of a phoneme is simply called a phoneme model. FIG. 1 shows the HMM of one phoneme. In FIG. 1 the phoneme is modeled as a five-state HMM with states S1 to S5, and each state outputs speech synthesis parameters determined in advance by training. At synthesis time, the speech synthesizer determines, for example, a phoneme sequence from the input text and connects the HMMs so that the state following the final state of a preceding phoneme model is the first state of the phoneme model that follows it, thereby forming an HMM for one utterance. It then generates a speech signal by controlling filter parameters, the fundamental frequency of the speech, and so on according to the speech synthesis parameters output by each state of the utterance HMM. The speech synthesis parameters output by an HMM state are given in the form of a probability density function. Usually a suitable distribution (for example, a normal distribution) is assumed for this probability density function, and the HMM state stores the values of the parameters of that distribution (for example, the mean and variance for a one-dimensional normal distribution, or the mean vector and variance-covariance matrix for a multidimensional normal distribution). Hereinafter these are called speech synthesis distribution parameters. In other words, the speech synthesis parameters output by an HMM state are determined by a function whose variables are the speech synthesis distribution parameters. Hereinafter this function is called the speech synthesis parameter output function.
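The way phoneme models are chained into an utterance HMM can be sketched as follows. This is a minimal illustration only, assuming one-dimensional normal output distributions; the names `State`, `phoneme_hmm`, `utterance_hmm`, and the contents of `database` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    phoneme: str  # phoneme model this state belongs to
    index: int    # position inside the phoneme HMM (1..5)
    mean: float   # mean of the assumed 1-D normal output distribution
    var: float    # variance of the assumed 1-D normal output distribution


def phoneme_hmm(phoneme, params):
    """Build a left-to-right phoneme HMM as an ordered list of states."""
    return [State(phoneme, i + 1, m, v) for i, (m, v) in enumerate(params)]


def utterance_hmm(phoneme_sequence, database):
    """Chain phoneme HMMs so that the last state of one model is
    followed by the first state of the next model."""
    states = []
    for p in phoneme_sequence:
        states.extend(phoneme_hmm(p, database[p]))
    return states


# Hypothetical distribution parameters: five (mean, variance) pairs per phoneme.
database = {
    "i": [(2.0, 0.2)] * 5,
    "a": [(1.0, 0.1)] * 5,
}
utt = utterance_hmm(["i", "a"], database)
```

Here `utt` holds ten states: state 5 of "i" is immediately followed by state 1 of "a", which mirrors the connection rule described above.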

The database contains decision trees. FIG. 2 shows a binary decision tree contained in the database. Each circle in FIG. 2 (a node of the decision tree) corresponds to a rule that partitions the phoneme space; such a rule is called a "question". A question has the answer "Yes" or "No", for example whether the central phoneme is /a/, or whether the state is the first state of the HMM. A decision tree is built from, for example, several hundred to several thousand questions, which the person building the database prepares in advance. This set of many prepared questions is called a "question set". Each leaf of the decision tree (a triangle in FIG. 2) corresponds to speech synthesis distribution parameters and is associated with states of the HMM in FIG. 1. Thus, for a phoneme of any type, applying the node questions in order from the root of the decision tree down to a leaf determines the corresponding set of speech synthesis distribution parameters, and these parameters determine the speech synthesis parameter output function for each state of the phoneme model. In other words, the database yields a phoneme model for any type of phoneme. The number of leaves of the decision tree is usually smaller than the total number of HMM states of the phoneme models to be trained. For example, if the training data contains 50,000 phonemes and each is modeled by a five-state HMM, the phoneme models to be trained have 250,000 states in total, but the number of leaves is usually less than 250,000. That is, the correspondence between a leaf of the decision tree and the states of the phoneme models to be trained is generally one-to-many: multiple states of the phoneme models share one probability density function for the speech synthesis parameters.
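The lookup from a phoneme context to a leaf can be sketched as below. The classes `Node` and `Leaf`, the context dictionary keys `center` and `state`, and the numeric parameters are assumptions made for illustration, not structures defined by the patent.

```python
class Leaf:
    """A leaf holding (assumed 1-D normal) speech synthesis distribution parameters."""

    def __init__(self, name, mean, var):
        self.name, self.mean, self.var = name, mean, var


class Node:
    """An internal node holding a yes/no question about the context."""

    def __init__(self, question, yes, no):
        self.question, self.yes, self.no = question, yes, no


def find_leaf(node, context):
    """Apply the questions from the root down until a leaf is reached."""
    while isinstance(node, Node):
        node = node.yes if node.question(context) else node.no
    return node


# Hypothetical tree: "is the central phoneme /a/?" then "is this the first state?"
tree = Node(
    lambda c: c["center"] == "a",
    yes=Node(
        lambda c: c["state"] == 1,
        yes=Leaf("L0", 1.0, 0.1),
        no=Leaf("L1", 1.2, 0.1),
    ),
    no=Leaf("L2", 3.0, 0.5),
)

leaf = find_leaf(tree, {"center": "a", "state": 1})
```

In this toy tree every state of every non-/a/ phoneme reaches the same leaf `L2`, which is the one-to-many sharing of a probability density function described above.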

Database generation in the database generation apparatus consists of two processes: a decision tree construction process and an HMM training process. The apparatus generates the database by repeating, a predetermined number of times, the decision tree construction process followed by the HMM training process based on the decision tree determined in that construction process.

The decision tree construction process selects, from the questions in the question set, the question corresponding to each node of the decision tree in FIG. 2. Specifically, the generation apparatus starts from a single cluster corresponding to the entire phoneme space, successively splits clusters using questions in the question set, and finally determines the tree structure and the clusters corresponding to its leaves. Which question in the question set is applied at which position is decided by an evaluation value. The evaluation value is the log-likelihood, with respect to the training data, of the speech synthesis parameter output function associated with the states of the phoneme models belonging to the cluster. Specifically, the generation apparatus takes the cluster with the smallest evaluation value at that point, tentatively splits it in two with each question in the question set, and computes the sum of the evaluation values of the two resulting clusters. The question that maximizes this sum is the one applied to the split. This process is repeated and stops, for example, when the change in the total evaluation value before and after a split falls below a threshold. This determines a decision tree whose nodes are questions, together with the speech synthesis distribution parameters that describe the speech synthesis parameter output function for each leaf. Instead of stopping on the change in total evaluation value, an upper limit may be placed on the depth from the root of the tree (the number of splits, counted from the entire phoneme space, needed to produce the cluster), with construction stopping when the limit is reached.
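One splitting step of this procedure might look like the sketch below. It assumes one-dimensional data with a normal distribution fitted per cluster; `VAR_FLOOR`, `log_likelihood`, `best_split`, and the sample data are illustrative choices that the patent does not specify.

```python
import math

VAR_FLOOR = 1e-4  # assumed variance floor so that log() stays finite


def log_likelihood(values):
    """Log-likelihood of the values under a normal distribution fitted to them."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    sq = sum((x - mean) ** 2 for x in values)
    var = max(sq / n, VAR_FLOOR)
    return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * sq / var


def best_split(cluster, questions):
    """Try every question on one cluster and keep the split whose two
    children have the largest total log-likelihood.

    cluster   -- list of (context, value) pairs
    questions -- dict mapping a question name to a yes/no predicate
    Returns (name, gain, yes_part, no_part), or None if nothing splits.
    """
    base = log_likelihood([v for _, v in cluster])
    best = None
    for name, q in questions.items():
        yes = [(c, v) for c, v in cluster if q(c)]
        no = [(c, v) for c, v in cluster if not q(c)]
        if not yes or not no:
            continue  # the question does not separate this cluster
        total = log_likelihood([v for _, v in yes]) + log_likelihood([v for _, v in no])
        if best is None or total > best[1]:
            best = (name, total, yes, no)
    if best is None:
        return None
    name, total, yes, no = best
    return name, total - base, yes, no


cluster = [
    ({"center": "a", "state": 1}, 1.0),
    ({"center": "a", "state": 2}, 1.2),
    ({"center": "i", "state": 1}, 5.0),
    ({"center": "i", "state": 2}, 5.2),
]
questions = {
    "center==a": lambda c: c["center"] == "a",
    "state==1": lambda c: c["state"] == 1,
}
name, gain, yes_part, no_part = best_split(cluster, questions)
```

A full builder would repeat this step on the cluster with the smallest evaluation value and stop once `gain` falls below a threshold or a depth limit is reached, as described above.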

The HMM training process adjusts, based on the training data, the state transition probabilities between the HMM states of the phoneme models (250,000 states if there are 50,000 phoneme models to be trained, each a five-state HMM) and the speech synthesis distribution parameters corresponding to each HMM state; the Baum-Welch algorithm is used, for example. Here too, log-likelihood serves as the evaluation criterion for the speech synthesis distribution parameters. However, because probability density functions are shared among different HMM states as described above, one set of speech synthesis distribution parameters is in general determined from the multiple HMM states that correspond to it.
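The last point, that one set of distribution parameters is estimated from several tied HMM states, can be illustrated with pooled sufficient statistics. This is a simplified sketch and not the Baum-Welch algorithm itself: it assumes the per-state occupancy statistics have already been accumulated (for example by a forward-backward pass), and the names `reestimate_tied`, `stats`, and `state_to_leaf` are hypothetical.

```python
from collections import defaultdict


def reestimate_tied(stats, state_to_leaf):
    """Re-estimate one (mean, variance) pair per leaf from all states tied to it.

    stats         -- {state_id: (occupancy, sum_x, sum_x_squared)}
    state_to_leaf -- {state_id: leaf_id}
    """
    pooled = defaultdict(lambda: [0.0, 0.0, 0.0])
    for state, (occ, sx, sxx) in stats.items():
        acc = pooled[state_to_leaf[state]]
        acc[0] += occ
        acc[1] += sx
        acc[2] += sxx
    params = {}
    for leaf, (occ, sx, sxx) in pooled.items():
        mean = sx / occ
        params[leaf] = (mean, sxx / occ - mean ** 2)
    return params


# Two states of different phoneme models share the same leaf.
stats = {
    "i-a/2": (2.0, 2.0, 2.5),
    "u-a/2": (2.0, 6.0, 18.5),
}
state_to_leaf = {"i-a/2": "leaf0", "u-a/2": "leaf0"}
params = reestimate_tied(stats, state_to_leaf)
```

Because both states map to `leaf0`, their statistics are summed before the mean and variance are computed, so neither state is optimized independently of the other.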

Non-Patent Document 1: Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura, "Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis", IEICE Transactions (D-II), J83-D-II, 11, pp. 2099-2107, November 2000

In general, natural-sounding speech synthesis requires that the database used for synthesis contain finely classified phoneme models. For example, the speech signal corresponding to the Japanese kana "あ" ("a") is affected by the speech signals before and after it. That is, even where the text shows the same "あ", the speech waveform corresponding to it differs depending on the preceding and following signals (phonemes). Therefore, the phonemes in a database used for speech synthesis do not correspond one-to-one to the characters of Japanese writing: even when the written character is the same, it is managed as a different type of phoneme depending on, among other things, the phonemes before and after it.

In database generation, a technique called concatenated training is used, in which continuously uttered speech signals serve as the training speech. In concatenated training, the part of the input speech signal that corresponds to a given phoneme is determined, without using the start and end times of each phoneme, as the probabilistically most likely positions of the HMM states in the HMMs concatenated in phoneme order. A temporal misalignment can therefore arise between the phoneme that the HMM is meant to model and the corresponding part of the input speech waveform. FIG. 3(A) shows the relationship between a speech signal uttered as /i/, /a/ ("i" followed by "a"), the interval that the generation apparatus recognized as phoneme /i+a/ (meaning the phoneme /i/ that precedes /a/), and the interval it recognized as phoneme /i-a/ (meaning the phoneme /a/ that follows /i/). The number of HMM states is five. In FIG. 3(A), the trailing part of the /i/ waveform is recognized as the beginning of phoneme /i-a/. Similarly, FIG. 3(B) shows the relationship between a speech signal uttered as /u/, /a/ ("u" followed by "a"), the interval that the generation apparatus recognized as phoneme /u+a/ (meaning the phoneme /u/ that precedes /a/), and the interval it recognized as phoneme /u-a/ (meaning the phoneme /a/ that follows /u/). In FIG. 3(B), the leading part of the /a/ waveform is recognized as the end of phoneme /u+a/. This happens because, for speech uttered as /i/, /a/ or /u/, /a/, the position of the boundary between the two phonemes is not given at all as information (that is, as a constraint on HMM training). In general, concatenated training is advantageous from the standpoint of modeling the training speech with HMMs, because it imposes fewer constraints than determining the temporal phoneme positions in the speech signal entirely separately from HMM training (by visual inspection, listening, or another statistical method) and then training the HMMs under those constraints. However, speech synthesis usually synthesizes speech whose phoneme order differs from that of the training speech: phoneme-model HMMs that embody temporal misalignments, as judged by another criterion (for example, visual inspection or listening), are concatenated, and the speech of one utterance is synthesized from the concatenated HMM. Consequently, if concatenated training does not keep the manner of temporal misalignment consistent across phoneme types, the quality of the synthesized speech degrades.

The above problem becomes more likely as the number of partitions of the phoneme space in the decision tree grows relative to the amount of training data used for HMM training. This is because fine partitioning eliminates the sharing of probability density functions between phonemes that, for example, are written with the same kana in Japanese but are distinct under the phoneme classification used for speech synthesis. Each phoneme model is then optimized independently, without HMM training taking into account the features common to those phonemes (that is, the macroscopic distribution, within the phoneme space, of phonemes written with the same kana). However, reducing the number of partitions to avoid this problem prevents the HMMs from representing the fine features of each phoneme, and the quality of the synthesized speech again degrades.

The present invention provides a technique for generating a database that improves the quality of synthesized speech.

According to one aspect of the present invention, there is provided a generation apparatus for a database used by a speech synthesizer that synthesizes speech based on a plurality of phoneme models, wherein each phoneme model has one or more internal states and each internal state is associated with a speech synthesis parameter output function. The generation apparatus comprises: construction means for constructing, based on training data and the questions of a question set prepared in advance, a first decision tree whose nodes each correspond to one of the questions of the question set, each leaf of the first decision tree corresponding to a speech synthesis parameter output function generated based on the training data; selection means for selecting one of a plurality of speech feature parameter output functions based on the training data; and adjustment means for performing an adjustment process that adjusts, based on the training data, the speech synthesis parameter output function corresponding to each leaf of the first decision tree and the speech feature parameter output function selected by the selection means. In the adjustment process, the adjustment means adjusts the speech synthesis parameter output function corresponding to each leaf of the first decision tree and the speech feature parameter output function selected by the selection means based on a weighted sum of the likelihood of that speech synthesis parameter output function with respect to the training data and the likelihood of that speech feature parameter output function with respect to the training data.

According to the present invention, a database that improves the quality of synthesized speech can be generated.

FIG. 1 is a diagram showing an HMM.
FIG. 2 is a diagram showing a decision tree.
FIG. 3 is a diagram explaining the problem caused by concatenated training.
FIG. 4 is a diagram showing question sets according to one embodiment.
FIG. 5 is a diagram showing an HMM according to one embodiment.
FIG. 6 is a configuration diagram of a generation apparatus according to one embodiment.

Exemplary embodiments of the present invention are described below with reference to the drawings. The following embodiments are examples and do not limit the present invention to their content. In each of the following figures, components not needed to describe the embodiments are omitted.

Like a conventional generation apparatus, the generation apparatus according to the present invention generates the database by repeating the decision tree construction process and the HMM training process a predetermined number of times. In this embodiment, however, two question sets are provided, as shown in FIG. 4. The second question set is prepared so that it contains fewer questions that finely partition the phoneme space than the first question set does. Specifically, it contains fewer questions of the kind that, during decision tree construction, keep splitting only the leaves under a particular node. In the decision tree construction process, the generation apparatus constructs a decision tree based on the first question set (the first decision tree) and a decision tree based on the second question set (the second decision tree). Because the second question set cannot partition the phoneme space finely, the number of partitions of the phoneme space produced by the second decision tree is usually smaller than the number produced by the first decision tree. That is, the second decision tree has fewer leaves than the first decision tree. As in FIG. 2, each leaf of the first decision tree and each leaf of the second decision tree corresponds to a distribution of speech synthesis parameters. In the following description, the speech synthesis parameters corresponding to the leaves of the second decision tree are called speech feature parameters for distinction, and the parameters of the distribution of the speech feature parameters are called "adjustment distribution parameters" to distinguish them from the speech synthesis distribution parameters corresponding to the leaves of the first decision tree.

Therefore, as shown in FIG. 5, in this embodiment each HMM state is associated with both speech synthesis distribution parameters and adjustment distribution parameters. In the HMM training process, the generation apparatus of this embodiment trains the HMMs so as to maximize the weighted sum nA + mB, where A is the log-likelihood, with respect to the training data, of the speech synthesis parameter output function determined by the speech synthesis distribution parameters, and B is the log-likelihood, with respect to the training data, of the speech feature parameter output function determined by the adjustment distribution parameters. Here n and m are weights with n + m = 1, and m is greater than 0 and less than 1. Setting m to 0 would make the process identical to conventional HMM training using only the first decision tree, and setting n to 0 would make it identical to conventional HMM training using only the second decision tree.
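The objective nA + mB can be written out as a small sketch. It assumes one-dimensional normal distributions and, for simplicity, frames already assigned to a single HMM state, whereas actual HMM training sums over state alignments; `gaussian_logpdf` and `weighted_objective` are hypothetical names.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def weighted_objective(synth_frames, feature_frames, synth_params, adjust_params, m):
    """Return n*A + m*B with n = 1 - m and 0 < m < 1.

    A -- log-likelihood of the speech synthesis parameters under the
         speech synthesis distribution parameters (first decision tree)
    B -- log-likelihood of the speech feature parameters under the
         adjustment distribution parameters (second decision tree)
    """
    if not 0.0 < m < 1.0:
        raise ValueError("m must be greater than 0 and less than 1")
    n = 1.0 - m
    a = sum(gaussian_logpdf(x, *synth_params) for x in synth_frames)
    b = sum(gaussian_logpdf(y, *adjust_params) for y in feature_frames)
    return n * a + m * b


score = weighted_objective(
    synth_frames=[0.9, 1.1],
    feature_frames=[0.9, 1.1],
    synth_params=(1.0, 0.1),   # fine-grained distribution (first tree leaf)
    adjust_params=(1.5, 1.0),  # coarse distribution (second tree leaf)
    m=0.3,
)
```

Raising `m` gives the coarse, widely shared distribution more influence on training, while the parameters actually stored for synthesis remain those of the first decision tree.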

After performing the HMM training process in this way, the generation apparatus stores in the database the first decision tree and the speech synthesis distribution parameters corresponding to each leaf of the first decision tree. The speech synthesizer then performs speech synthesis in the conventional manner. In other words, in the present invention the second decision tree and the adjustment distribution parameters are not used for speech synthesis.

As described above, accurate speech synthesis requires that the HMM of each phoneme model represent the fine features of the corresponding phoneme, which in turn requires a large first decision tree that partitions the phoneme space into many parts. However, using a large first decision tree causes the problem described with reference to FIG. 3. Conversely, performing HMM training with only a small second decision tree reduces the number of partitions of the phoneme space and degrades the quality of the synthesized speech. This embodiment uses a large first decision tree and a second decision tree smaller than the first, and the HMM training process determines the speech synthesis distribution parameters used for synthesis (the parameters of the speech synthesis parameter distributions corresponding to the leaves of the first decision tree) while taking into account the distributions corresponding to the leaves of both decision trees. As a result, concatenated training of the HMMs also takes into account the sharing structure of probability density functions determined by the smaller second decision tree, which suppresses the temporal misalignment of the segments modeled by the phoneme HMMs that arose in the conventional method using only a large first decision tree. Moreover, because the number of partitions of the phoneme space is determined by the first decision tree, degradation of the synthesized speech quality is also suppressed.

In the above description the first and second question sets differ, but a configuration using only the first question set is also possible. In that case, the stopping condition for constructing the first decision tree from the first question set, that is, the threshold on the change in log-likelihood before and after a split, is made smaller than the stopping condition for constructing the second decision tree from the first question set, so that construction of the second decision tree stops earlier than construction of the first decision tree. The first decision tree is then the second decision tree with its leaves replaced by nodes and grown further.

The second decision tree need not be constructed from the second question set; it may instead be determined in advance by the user. For example, all phonemes may be classified into six classes, namely the five vowels (/a/, /i/, /u/, /e/, /o/) and all other phonemes, and a second decision tree of six leaves may be used, each leaf corresponding to all HMM states of the phoneme models in one class. This prevents the beginning and end of the HMMs for vowels from being influenced by adjacent phonemes and so developing a different tendency for each phoneme model. Alternatively, when a five-state HMM is used per phoneme, a second decision tree of six leaves in total may be used: five leaves corresponding to the second through fourth HMM states of the five vowels respectively (that is, one leaf for the second through fourth states of /a/, one leaf for the second through fourth states of /i/, and so on), plus one leaf corresponding to all other HMM states. This concentrates the influence of adjacent phonemes in the first and fifth states and keeps it from affecting the second through fourth states. Furthermore, HMM-based speech synthesis treats silence as a type of phoneme, and a second decision tree with two leaves may be used, one leaf for all HMM states of the silence phoneme model and the other for all HMM states of the remaining phoneme models. This prevents the phenomenon in which concatenated training shifts the speech boundaries before and after an utterance in time so that the silence HMM comes to correspond to a non-silent speech interval.
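The three user-defined second decision trees described above amount to fixed mappings from a (phoneme, state) pair to a leaf, which can be sketched as follows. The function names, the leaf labels, the silence label `sil`, and the small phoneme inventory are all assumptions for illustration.

```python
VOWELS = ("a", "i", "u", "e", "o")


def leaf_by_vowel(phoneme, state):
    """Six leaves: one per vowel (all states) plus one for everything else."""
    return phoneme if phoneme in VOWELS else "other"


def leaf_by_vowel_core(phoneme, state):
    """Six leaves: states 2 to 4 of each vowel, plus one for all other states."""
    if phoneme in VOWELS and 2 <= state <= 4:
        return phoneme
    return "other"


def leaf_by_silence(phoneme, state, silence="sil"):
    """Two leaves: all states of the silence model versus all other states."""
    return "silence" if phoneme == silence else "speech"


# Hypothetical inventory of base phonemes, each with a five-state HMM.
inventory = ["a", "i", "u", "e", "o", "k", "s", "sil"]
all_states = [(p, s) for p in inventory for s in range(1, 6)]

leaf_counts = {
    f.__name__: len({f(p, s) for p, s in all_states})
    for f in (leaf_by_vowel, leaf_by_vowel_core, leaf_by_silence)
}
```

In the second mapping, states 1 and 5 of a vowel fall into the shared `other` leaf, so only the central states of each vowel are tied across contexts.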

In the description so far, a decision tree determines, from a phoneme, the parameters of the speech synthesis parameter distribution for each state of the phoneme model's HMM, but the present invention is not limited to this. It suffices that the probability density function of the speech synthesis parameters can be determined for each state of the HMM corresponding to each phoneme, and the method of determining it is not limited to decision trees. For example, when there are only six distributions in total, for the five vowels and everything else as described above, the distribution for each state of the HMM of the corresponding phoneme model can easily be determined for any type of phoneme without using a second decision tree. The same applies to the first decision tree.

The parameters actually output by the speech synthesis parameter output function (determined by the speech synthesis distribution parameters of the first decision tree) and by the speech feature parameter output function (determined by the adjustment distribution parameters of the second decision tree) may be of the same type or of different types. As an example of different types, in modeling the spectral features of speech, the distribution determined by the speech synthesis distribution parameters may cover line spectral pair (LSP) coefficients, which are widely used in speech synthesis and speech coding, while the distribution determined by the adjustment distribution parameters covers cepstral coefficients, which are widely used in speech recognition. Because the speech feature parameter output function determined by the adjustment distribution parameters is not used at synthesis time, HMM training can take into account types of parameters, or parameter distribution functions (for example, Gaussian mixture distributions), that are difficult to use for speech synthesis.

Furthermore, the present invention is not directed only to speech synthesizers based on HMMs in the strict sense; speech synthesis based on models similar to HMMs is also within its scope. For example, the hidden semi-Markov model (HSMM) is a model that adds to the HMM, as a constraint, a distribution over the state occupancy duration associated with the self-transition probability, and the series of techniques described so far applies equally to speech synthesis based on the HSMM. More generally, the present invention can be applied to any speech synthesis scheme based on a model in which each phoneme model has one or more internal states and each internal state is associated with a speech synthesis parameter output function.
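The general model structure named above can be sketched as a small data structure: a phoneme model holds one or more states, and each state carries an output function. The class names, the `mean_duration` field standing in for an HSMM-style duration model, and the helper `utterance_states` are hypothetical and serve only to illustrate the structure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class State:
    # Speech synthesis parameter output function, here a log-density over one value.
    output_log_density: Callable[[float], float]
    # Hypothetical explicit duration (in frames) for an HSMM-like model; None for a plain HMM.
    mean_duration: Optional[float] = None


@dataclass
class PhonemeModel:
    name: str
    states: List[State]

    def __post_init__(self) -> None:
        if not self.states:
            raise ValueError("a phoneme model needs at least one internal state")


def utterance_states(models: List[PhonemeModel]) -> List[State]:
    """Concatenate models so each model's last state is followed by the next model's first."""
    return [state for model in models for state in model.states]
```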

FIG. 6 is a configuration diagram of the generation apparatus according to the present embodiment. The question set holding unit 3 holds a question set containing a plurality of questions prepared in advance. As described above, the question set for constructing the first decision tree and the question set for constructing the second decision tree may be the same or different. The two question sets may also share some of their questions.

The decision tree construction unit 1 performs the decision tree construction process described above and constructs the first decision tree and the second decision tree. The learning processing unit 2 performs the HMM learning process described above and adjusts the speech synthesis parameters corresponding to each leaf of the first decision tree. The decision tree construction process and the subsequent HMM learning process are repeated a predetermined number of times. After the final HMM learning process, the learning processing unit 2 stores the first decision tree at that time and the speech synthesis parameters corresponding to its leaves in the database 4. A speech synthesizer (not shown) then performs speech synthesis using the information stored in the database 4.
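The control flow of the generation apparatus can be sketched as a loop that alternates the two processes and then stores the result. The function name, the callables `build_trees` and `train_hmm`, their signatures, and the dictionary standing in for the database 4 are all assumptions; in particular, whether tree construction receives the previous iteration's parameters is not stated in this section.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def generate_database(
    learning_data: Any,
    question_set: Any,
    build_trees: Callable[[Any, Any, Optional[Any]], Tuple[Any, Any]],
    train_hmm: Callable[[Any, Any, Any], Any],
    num_iterations: int,
) -> Dict[str, Any]:
    """Alternate tree construction and HMM learning, then return what is stored."""
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    leaf_params: Optional[Any] = None
    first_tree: Any = None
    for _ in range(num_iterations):
        # Decision tree construction unit 1: build the first and second trees.
        first_tree, second_tree = build_trees(learning_data, question_set, leaf_params)
        # Learning processing unit 2: adjust the parameters at the first tree's leaves.
        leaf_params = train_hmm(learning_data, first_tree, second_tree)
    # Database 4: the first tree and its leaf parameters from the final iteration.
    return {"first_tree": first_tree, "leaf_parameters": leaf_params}
```

The second decision tree is a local variable here because the text stores the first decision tree and its leaf parameters, which is what the synthesizer reads.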

Alternatively, the decision tree construction unit 1 may construct only the first decision tree, with the second decision tree determined by the user and set in the decision tree construction unit 1.

The generation apparatus according to the present invention can be realized by a program that causes a computer to operate as the generation apparatus. Such a computer program can be stored in a computer-readable storage medium or distributed via a network.

1: Decision tree construction unit, 2: Learning processing unit, 3: Question set holding unit, 4: Database

Claims

A generation apparatus for generating a database used in a speech synthesizer that synthesizes speech based on a plurality of phoneme models,
each phoneme model having one or more internal states, each internal state being associated with a speech synthesis parameter output function,
the generation apparatus comprising:
construction means for constructing, based on learning data and the questions of a question set prepared in advance, a first decision tree whose nodes each correspond to one of the questions of the question set, each leaf of the first decision tree corresponding to a speech synthesis parameter output function generated based on the learning data;
selection means for selecting one of a plurality of speech feature parameter output functions based on the learning data; and
adjustment means for performing adjustment processing that adjusts the speech synthesis parameter output function corresponding to each leaf of the first decision tree based on the learning data and the speech feature parameter output function selected by the selection means,
wherein, in the adjustment processing, the adjustment means adjusts the speech synthesis parameter output function corresponding to each leaf of the first decision tree and the speech feature parameter output function selected by the selection means based on a weighted sum of the likelihood of the speech synthesis parameter output function for the learning data and the likelihood of the speech feature parameter output function for the learning data.

2. The generation apparatus according to claim 1, wherein each of the speech synthesis parameter output functions corresponding to the leaves of the first decision tree corresponds to one or more internal states of each phoneme model, and
each of the speech feature parameter output functions selected by the selection means corresponds to one or more internal states of each phoneme model.

The generation apparatus according to claim 1 or 2, wherein the selection means selects the speech feature parameter output function based on a second decision tree having fewer leaves than the first decision tree, and
the construction means constructs, based on the learning data and the questions of the question set prepared in advance, the second decision tree used by the selection means, whose nodes each correspond to one of the questions of the question set.

The selection means selects the speech feature parameter output functions of the phoneme models of each of a plurality of vowels and of one phoneme model corresponding to everything other than the plurality of vowels. The generating device described in 1.

4. The generating apparatus according to claim 1, wherein the selection means selects the speech feature parameter output function of a phoneme model of silence and the speech feature parameter output function of a phoneme model of everything other than silence. 5.

The adjustment means stores, in a database, the first decision tree and the adjusted speech synthesis parameter output function corresponding to each leaf of the first decision tree. The generation device according to any one of 5.

A generation apparatus for generating a database used in a speech synthesizer that synthesizes speech based on a plurality of phoneme models,
each phoneme model having one or more internal states, each internal state being associated with a speech synthesis parameter output function,
the generation apparatus comprising:
first selection means for selecting one of a plurality of speech synthesis parameter output functions based on learning data;
second selection means for selecting one of a plurality of speech feature parameter output functions based on the learning data; and
adjustment means for performing adjustment processing that adjusts the speech synthesis parameter output function selected by the first selection means based on the learning data and the speech feature parameter output function selected by the second selection means,
wherein, in the adjustment processing, the adjustment means adjusts the speech synthesis parameter output function selected by the first selection means and the speech feature parameter output function selected by the second selection means based on a weighted sum of the likelihood of the speech synthesis parameter output function for the learning data and the likelihood of the speech feature parameter output function for the learning data.

8. The generation apparatus according to claim 7, wherein the number of speech synthesis parameter output functions selected by the first selection means is greater than the number of speech feature parameter output functions selected by the second selection means.

9. The generation apparatus according to claim 7, wherein the adjustment means stores, in a database, the speech synthesis parameter output function selected by the first selection means as adjusted by the adjustment processing.

A speech synthesizer that performs speech synthesis based on the first decision tree and the adjusted speech synthesis parameter output function corresponding to each of the leaves of the first decision tree, which are stored in the database by the generation apparatus according to claim 6.

A speech synthesizer that performs speech synthesis based on the speech synthesis parameter output function selected by the first selection means as adjusted by the adjustment processing, which is stored in the database by the generation apparatus according to claim 9.

A program that causes a computer to function as the generation device according to claim 1.

A generation method for generating, on a computer, a database used in a speech synthesizer that synthesizes speech based on a plurality of phoneme models,
each phoneme model having one or more internal states, each internal state being associated with a speech synthesis parameter output function,
the generation method comprising:
a construction step of constructing, based on learning data and the questions of a question set prepared in advance, a first decision tree whose nodes each correspond to one of the questions of the question set, each leaf of the first decision tree corresponding to a speech synthesis parameter output function generated based on the learning data;
a selection step of selecting one of a plurality of speech feature parameter output functions based on the learning data; and
an adjustment step of adjusting the speech synthesis parameter output function corresponding to each leaf of the first decision tree based on the learning data and the speech feature parameter output function selected in the selection step,
wherein, in the adjustment step, the speech synthesis parameter output function corresponding to each leaf of the first decision tree and the speech feature parameter output function selected in the selection step are adjusted based on a weighted sum of the likelihood of the speech synthesis parameter output function for the learning data and the likelihood of the speech feature parameter output function for the learning data.