JPWO2014061230A1

JPWO2014061230A1 - Prosody model learning device, prosody model learning method, speech synthesis system, and prosody model learning program

Info

Publication number: JPWO2014061230A1
Application number: JP2014541930A
Authority: JP
Inventors: 康行三井; 玲史近藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2012-10-16
Filing date: 2013-10-08
Publication date: 2016-09-05
Anticipated expiration: 2033-10-08
Also published as: WO2014061230A1; JP6314828B2

Abstract

[Problem] To provide a prosody model learning device, a prosody model learning method, a speech synthesis system, and a prosody model learning program that generate highly stable prosody in a statistical method. [Solution] The prosody model learning device of the present invention includes: first clustering means for clustering data using a first condition set that includes one or more conditions for dividing the data that have a large influence on prosody generation; second clustering means for clustering the data using the clustering result of the first clustering means and a second condition set that includes one or more conditions different from those in the first condition set; and learning means for learning a prosody model based on the clustering result of the second clustering means.

Description

The present invention relates to a prosody model learning device, a prosody model learning method, a speech synthesis system, and a prosody model learning program.

A typical text-to-speech synthesis system synthesizes speech as follows. First, it performs language analysis processing, which analyzes the linguistic structure of the input text by morphological analysis or the like. Next, based on the result, it generates phonological information annotated with accents and the like. It then performs prosody generation processing, which generates prosody information by estimating the fundamental frequency (F0) pattern and the phoneme durations from the pronunciation information. Finally, it performs waveform generation processing, which generates a speech waveform based on the generated prosody information and the phonological information.

One example of a method for generating prosody information is the speech synthesis method described in Non-Patent Document 1, which uses a hidden Markov model (HMM) as the statistical method. A speech synthesis system using a statistical method generates speech using a prosody model and a speech synthesis unit (parameter) model that were learned (generated) from a large amount of learning data.

One example of a method for learning a prosody model is to cluster the learning data and learn a prosody model for each cluster. Patent Documents 1 and 2 disclose methods that generate a prosody model (representative pattern) for each cluster and generate prosody based on that representative pattern.

Patent Document 1: JP-A-11-95783
Patent Document 2: JP 2006-189723 A

Non-Patent Document 1: Keiichi Tokuda, "Application of Hidden Markov Models to Speech Synthesis," IEICE Technical Report SP99-61, pp. 47-54, 1999

In a statistical method that clusters learning data to generate representative patterns, a small amount of learning data leads to shortages and biases in the learning data. This is called the data sparseness problem. As a result, highly stable prosody cannot be generated.

[Object of the invention]
One object of the present invention, made in view of the above problem, is to provide a prosody model learning device, a prosody model learning method, a speech synthesis system, and a program that generate highly stable prosody in a statistical method.

The prosody model learning device of the present invention includes: first clustering means for clustering data using a first condition set that includes one or more conditions for dividing the data that have a large influence on prosody generation; second clustering means for clustering the data using the clustering result of the first clustering means and a second condition set that includes one or more conditions different from those in the first condition set; and learning means for learning a prosody model based on the clustering result of the second clustering means.

The prosody model learning method of the present invention performs first clustering on data using a first condition set that includes one or more conditions for dividing the data that have a large influence on prosody generation; performs second clustering on the data using the result of the first clustering and a second condition set that includes one or more conditions different from those in the first condition set; and learns a prosody model using the result of the second clustering.

The prosody model learning program of the present invention causes a computer to execute: a first clustering step of clustering data using a first condition set that includes one or more conditions for dividing the data that have a large influence on prosody generation; a second clustering step of clustering the data using the clustering result of the first clustering step and a second condition set that includes one or more conditions different from those in the first condition set; and a learning step of learning a prosody model using the clustering result of the second clustering step.

The speech synthesis system of the present invention includes: first clustering means for clustering data using a first condition set that includes one or more first conditions, which are conditions for dividing the data that have a large influence on prosody generation; second clustering means for clustering the data using the clustering result of the first clustering means and a second condition set that includes one or more conditions different from those in the first condition set; learning means for learning a prosody model using the clustering result of the second clustering means; and synthesis means for generating a synthesized speech waveform corresponding to input text based on the prosody model learned by the learning means.

The present invention can also be realized by a computer-readable non-volatile recording medium that stores the above prosody model learning program.

According to the present invention, a prosody model capable of generating highly stable prosody can be generated.

FIG. 1 is a diagram illustrating an example of a hardware configuration according to each embodiment of the present invention.
FIG. 2 is a block diagram according to the first embodiment of the present invention.
FIG. 3 is a flowchart according to the first embodiment of the present invention.
FIG. 4 is a block diagram according to the second embodiment of the present invention.
FIG. 5 is a flowchart according to the second embodiment of the present invention.
FIG. 6 is a block diagram according to the third embodiment of the present invention.
FIG. 7 is a flowchart according to the third embodiment of the present invention.
FIG. 8 is a block diagram according to the fourth embodiment of the present invention.
FIG. 9 is a first diagram for explaining the fourth embodiment of the present invention.
FIG. 10 is a second diagram for explaining the fourth embodiment of the present invention.
FIG. 11 is a second block diagram according to the first embodiment of the present invention.
FIG. 12 is a second block diagram according to the second embodiment of the present invention.
FIG. 13 is a third diagram for explaining the fourth embodiment of the present invention.

Next, embodiments of the present invention will be described in detail with reference to the drawings. In each embodiment, similar components are given the same reference numerals, and their description is omitted as appropriate.

(First embodiment)
FIG. 1 is a diagram illustrating an example of the hardware configuration of a computer that realizes the prosody model learning device 1 according to the first embodiment of the present invention.

As shown in FIG. 1, a computer 1000 capable of realizing the prosody model learning device 1 has a CPU (Central Processing Unit) 2, a memory 3, a storage device 4, a communication IF (Interface) 5, a display device 6, and an input device 7. The storage device 4 is, for example, an HDD (Hard Disk Drive). The communication IF 5 communicates data via a network (not shown). The display device 6 is a display or the like. The input device 7 includes a keyboard and a pointing device such as a mouse. These components are connected to one another through a bus 8 and exchange data with one another. The hardware configuration of the prosody model learning device 1 is not limited to this configuration and can be changed as appropriate.

The prosody model learning device 1B according to the first embodiment, the prosody model learning devices 1A and 1C according to the second embodiment, the speech synthesis system 100 according to the third embodiment, and the speech synthesis system 101 according to the fourth embodiment, all described later, can likewise be realized by the computer 1000 having the hardware configuration shown in FIG. 1. The prosody model learning device or speech synthesis system of each embodiment can also be realized by a dedicated device having the functions shown in whichever of FIGS. 2, 4, 6, 8, 11, and 12 corresponds to that device or system.

FIG. 2 is a block diagram illustrating an example of the functional configuration of the prosody model learning device 1 according to the first embodiment of the present invention.

Referring to FIG. 2, the prosody model learning device 1 according to this embodiment has a first clustering unit 110, a second clustering unit 120, and a first learning unit 130.

The first clustering unit 110 clusters data using at least some of the conditions in the first condition set. Here, the data is either learning data or a provisionally created prosody model. Clustering of prosody models is described later in the description of the second embodiment. In this embodiment, the first clustering unit 110 clusters learning data.

Here, the first condition set is a condition set that includes one or more conditions for dividing the data. In the following description, a data-dividing condition included in the first condition set is referred to as a first condition. A first condition is a condition of high importance, that is, one that has a large influence on prosody generation. A first condition relates to a linguistically or acoustically important feature. One example of a first condition is a condition on the accent position.
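One way to picture a data-dividing condition is as a yes/no question about the context of a training sample. The following is a minimal sketch under that assumption. The source names only the accent position as an example of a first condition, so the `Condition` type, the feature names (`accent_position`, `mora_count`), and the specific questions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]  # context features of one sample (hypothetical fields)


@dataclass(frozen=True)
class Condition:
    name: str
    test: Callable[[Sample], bool]  # True -> "yes" side, False -> "no" side


# First condition set: conditions assumed to strongly influence prosody.
first_condition_set: List[Condition] = [
    Condition("accent_on_first_mora", lambda s: s["accent_position"] == 1),
    Condition("unaccented", lambda s: s["accent_position"] == 0),
]

# Second condition set: other conditions (hypothetical example).
second_condition_set: List[Condition] = [
    Condition("long_phrase", lambda s: s["mora_count"] >= 5),
]


def split(samples: List[Sample], cond: Condition) -> Tuple[List[Sample], List[Sample]]:
    """Divide samples into those that satisfy the condition and those that do not."""
    yes = [s for s in samples if cond.test(s)]
    no = [s for s in samples if not cond.test(s)]
    return yes, no
```

Keeping the two sets as separate lists makes it straightforward to apply them in a fixed order of priority.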

The first clustering unit 110 may use only some of the conditions in the first condition set, or it may use all of them. When all of the conditions are used, every condition of high importance takes part in the clustering, so the first learning unit 130 described later can learn a prosody model with higher stability.

One clustering method is tree-structured clustering. In that case, the first clustering unit 110 builds a tree structure in which each node holds a condition included in the first condition set. Other clustering methods, such as the K-means method or Ward's method, may also be used. Quantification theory, such as quantification theory type I, can also be applied as the clustering method of the first clustering unit 110.
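The tree-structured case can be sketched as follows. This is an illustration, not the patented procedure: the source does not say how the order of conditions within the tree is chosen, so this sketch simply assumes the conditions are tried in list order and that a condition is skipped when it does not actually divide the samples at a node. The `Node` class and `build_tree` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Sample = Dict[str, int]
Condition = Tuple[str, Callable[[Sample], bool]]  # (name, yes/no test)


@dataclass
class Node:
    samples: List[Sample]
    condition: Optional[Condition] = None  # None means this node is a leaf (a cluster)
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def build_tree(samples: List[Sample], conditions: List[Condition]) -> Node:
    """Build a binary tree whose internal nodes each hold one condition."""
    node = Node(samples)
    for i, cond in enumerate(conditions):
        yes = [s for s in samples if cond[1](s)]
        no = [s for s in samples if not cond[1](s)]
        if yes and no:  # use the first condition that really divides the samples
            node.condition = cond
            node.yes = build_tree(yes, conditions[i + 1:])
            node.no = build_tree(no, conditions[i + 1:])
            break
    return node


def leaves(node: Node) -> List[Node]:
    """Return the leaf nodes, i.e. the resulting clusters, in yes-before-no order."""
    if node.condition is None:
        return [node]
    return leaves(node.yes) + leaves(node.no)
```

Each leaf of the resulting tree is one cluster, and the path from the root to the leaf records which conditions its samples satisfy.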

The second clustering unit 120 clusters the learning data using the clustering result of the first clustering unit 110 and a second condition set that includes conditions different from those in the first condition set. The second condition set may also include all or some of the conditions in the first condition set.

The second clustering unit 120 performs clustering so that, in the clustering structure, the first condition set is superior to the second condition set. Being superior means ranking higher in the order of the division conditions used in the clustering. In a tree structure, for example, it means that the condition is located in the upper part of the tree.

For example, when tree-structured clustering is used, the second clustering unit 120 keeps the tree structure built by the first clustering unit 110 intact and adds nodes based on the conditions of the second condition set to its lower structure.
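A minimal sketch of this two-stage growth is shown below, assuming binary yes/no conditions. The `Node` class, the `grow` function, and the feature names `accent` and `morae` are hypothetical. The key point is that `grow` only ever splits leaves, so nodes created in the first stage stay where they are.

```python
class Node:
    def __init__(self, samples):
        self.samples = samples
        self.name = None   # name of the condition held by this node (None for a leaf)
        self.test = None   # yes/no test of that condition
        self.yes = None
        self.no = None


def grow(node, conditions):
    """Split the leaves under `node`; existing internal nodes are left untouched."""
    if node.test is not None:  # internal node from an earlier stage: keep it as is
        grow(node.yes, conditions)
        grow(node.no, conditions)
        return
    for i, (name, test) in enumerate(conditions):
        yes = [s for s in node.samples if test(s)]
        no = [s for s in node.samples if not test(s)]
        if yes and no:
            node.name, node.test = name, test
            node.yes, node.no = Node(yes), Node(no)
            grow(node.yes, conditions[i + 1:])
            grow(node.no, conditions[i + 1:])
            return


def depths_of(node, name, depth=0):
    """Return the depths at which the condition `name` appears in the tree."""
    if node.test is None:
        return []
    here = [depth] if node.name == name else []
    return here + depths_of(node.yes, name, depth + 1) + depths_of(node.no, name, depth + 1)


data = [{"accent": a, "morae": m} for a in (0, 1) for m in (3, 6)]
root = Node(data)
grow(root, [("accent_is_0", lambda s: s["accent"] == 0)])   # first clustering
grow(root, [("long_phrase", lambda s: s["morae"] >= 5)])    # second clustering
```

After both calls, the accent condition sits at the root and the phrase-length condition appears only below it, which is the ordering the text describes.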

Alternatively, the second clustering unit 120 may add nodes based on the conditions of the second condition set between the nodes of the tree structure built by the first clustering unit 110. Even in this case, it is desirable to add the nodes so that the resulting clustering structure keeps the first condition set superior to the second condition set.

The first learning unit 130 generates a prosody model by learning based on the clustering result of the second clustering unit 120. For example, for each cluster, the first learning unit 130 generates a prosody model from the learning data belonging to that cluster.
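As an illustration of per-cluster learning, the sketch below estimates a per-frame mean and variance of the F0 contours in each cluster. This simple Gaussian-style summary is an assumption made for the example; the source does not specify the form of the prosody model. The cluster names and F0 values are illustrative, and all contours in a cluster are assumed to have the same length.

```python
from statistics import fmean, pvariance
from typing import Dict, List, Tuple


def learn_models(
    clusters: Dict[str, List[List[float]]]
) -> Dict[str, List[Tuple[float, float]]]:
    """For each cluster, compute a per-frame (mean, variance) over its F0 contours."""
    models = {}
    for cluster_id, contours in clusters.items():
        frames = zip(*contours)  # regroup the values frame by frame
        models[cluster_id] = [(fmean(f), pvariance(f)) for f in frames]
    return models


clusters = {
    "accent_1": [[220.0, 180.0, 150.0], [240.0, 200.0, 170.0]],
    "unaccented": [[180.0, 200.0, 200.0]],
}
models = learn_models(clusters)
```

Note that the `unaccented` cluster holds a single contour, so its variance is zero in every frame. A cluster with so little data yields a model that says little about variability, which is the kind of situation the data sparseness problem refers to.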

In the configuration described above, the first clustering unit 110 and the second clustering unit 120 are separate units, but the configuration of the prosody model learning device 1 is not limited to this. For example, a single clustering unit may build a clustering structure in which the first condition set is superior to the second condition set and perform clustering based on that structure.

The prosody model learning device 1 of this embodiment, described above, performs two-stage clustering using the first clustering unit 110 and the second clustering unit 120. The device may instead perform clustering in three or more stages. Let N denote the number of clustering stages. In N-stage clustering, for example, a first clustering unit, a second clustering unit, ..., and an N-th clustering unit perform clustering in that order. The importance of the data-dividing conditions used decreases in the same order: the first clustering unit uses the most important conditions, followed by the second clustering unit, and so on down to the N-th clustering unit.
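The N-stage generalization can be sketched as repeated refinement of a partition. This sketch makes a simplifying assumption: every condition of every stage is applied to every cluster, so the result corresponds to the leaves of a full tree. The function name `staged_clustering` and the feature names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, int]
Condition = Callable[[Sample], bool]


def staged_clustering(
    samples: Sequence[Sample], stages: Sequence[Sequence[Condition]]
) -> Dict[Tuple[bool, ...], List[Sample]]:
    """Refine a partition stage by stage; stage 1 holds the most important conditions."""
    clusters: Dict[Tuple[bool, ...], List[Sample]] = {(): list(samples)}
    for conditions in stages:  # stage 1, stage 2, ..., stage N
        for cond in conditions:
            refined: Dict[Tuple[bool, ...], List[Sample]] = {}
            for key, members in clusters.items():
                for s in members:
                    # Append this condition's answer to the cluster key.
                    refined.setdefault(key + (cond(s),), []).append(s)
            clusters = refined
    return clusters


data = [{"accent": a, "morae": m, "pos": p} for a in (0, 1) for m in (3, 6) for p in (0, 1)]
stages = [
    [lambda s: s["accent"] == 0],  # stage 1: highest importance
    [lambda s: s["morae"] >= 5],   # stage 2
    [lambda s: s["pos"] == 1],     # stage 3: lowest importance
]
result = staged_clustering(data, stages)
```

Each cluster key is a tuple of answers ordered by importance, so the first element always reflects a stage 1 condition.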

The first condition set and the second condition set are stored in a storage unit, which is not shown in FIG. 2. The first clustering unit 110 and the second clustering unit 120 perform clustering with reference to the first condition set or the second condition set stored in the storage unit.

FIG. 11 is a block diagram showing the configuration of a prosody model learning device 1B according to this embodiment, with the above storage unit illustrated. In FIG. 11, the condition set storage unit 150 is the storage unit in which the first condition set and the second condition set are stored. The prosody model learning device 1B is the same as the prosody model learning device 1 shown in FIG. 2, except that the condition set storage unit 150 is illustrated.

Next, the operation of the first embodiment of the present invention will be described in detail.

FIG. 3 is a flowchart illustrating an example of the operation of the prosody model learning device 1 according to the first embodiment.

The first clustering unit 110 clusters the learning data using at least some of the conditions in the first condition set (step S101). The second clustering unit 120 clusters the learning data using the clustering result of the first clustering unit 110 and a second condition set composed of conditions different from those in the first condition set (step S102). The first learning unit 130 learns a prosody model based on the clustering result of the second clustering unit 120 (step S103).

The prosody model learning device 1 of this embodiment can generate a prosody model capable of generating highly stable prosody. In clustering by a statistical method, the more important a data-dividing condition is, the higher it sits in the clustering structure. For important conditions to end up in the upper positions, however, sufficient data must be available. According to this embodiment, clustering can be based on a clustering structure in which the important conditions are in the upper positions even when little data is available.

In clustering by a statistical method, the clustering structure is in principle determined from statistics. Conditions on linguistically or acoustically important features may therefore go unused. For example, in a language such as Japanese, where accent is expressed by the pitch of the voice, the accent of an utterance is largely determined by the shape of the pitch pattern. If the pitch pattern shape is unnatural, the synthesized speech sounds as if it were spoken with a regional accent. Consequently, when generating prosody information represented by pitch patterns, state durations, and the like, conditions on the rough shape of the pitch pattern are very important. If such conditions are not used, a pitch pattern that expresses the correct accent may not be generated.
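The following toy sketch shows how a purely statistical criterion can pass over an important condition when data is scarce. It assumes a sum-of-squared-errors criterion as a stand-in for whatever statistic a given system uses. The feature names (`accented`, `other_context`) and the four F0 values are invented for illustration and are not taken from the source.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_condition(samples, condition_names):
    """Pick the condition whose split leaves the smallest within-cluster error."""
    def cost(name):
        yes = [s["f0"] for s in samples if s[name]]
        no = [s["f0"] for s in samples if not s[name]]
        return sse(yes) + sse(no) if yes and no else float("inf")
    return min(condition_names, key=cost)


# A tiny, biased data set in which F0 happens to follow `other_context`.
samples = [
    {"accented": True, "other_context": True, "f0": 200.0},
    {"accented": True, "other_context": False, "f0": 150.0},
    {"accented": False, "other_context": True, "f0": 190.0},
    {"accented": False, "other_context": False, "f0": 140.0},
]
chosen = best_condition(samples, ["accented", "other_context"])
```

On this data the criterion selects `other_context` for the top split (error 100 versus 2500 for `accented`). Placing a fixed first condition set at the top of the structure avoids depending on such a selection.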

The prosody model learning device 1 of this embodiment gives priority in clustering to conditions on linguistically or acoustically important features, such as the rough shape of the pitch pattern. The device can therefore generate a model capable of generating more stable prosody.

(Second embodiment)
FIG. 4 is a block diagram showing a configuration example of a prosody model learning device 1A according to the second embodiment of the present invention.

Referring to FIG. 4, in the prosody model learning device 1A according to this embodiment, the first clustering unit 110, the second clustering unit 120, and the first learning unit 130 of the first embodiment are replaced by a first clustering unit 111, a second clustering unit 121, and a first learning unit 131, respectively. The prosody model learning device 1A further has a second learning unit 140.

The second learning unit 140 provisionally creates prosody models from the learning data.

The first clustering unit 111 and the second clustering unit 121 cluster the prosody models. The first learning unit 131 re-learns the prosody model based on the clustering result of the second clustering unit 121. In other respects, the operations of the first clustering unit 111, the second clustering unit 121, and the first learning unit 131 are the same as those of the first clustering unit 110, the second clustering unit 120, and the first learning unit 130 of the first embodiment, respectively, so their description is omitted.

Like the prosody model learning device 1 according to the first embodiment, the prosody model learning device 1A according to this embodiment includes the condition set storage unit 150. However, the condition set storage unit 150, which stores the first condition set and the second condition set, is not shown in FIG. 4.

FIG. 12 is a block diagram showing the configuration of a prosody model learning device 1C according to this embodiment, with the above storage unit illustrated. In FIG. 12, the condition set storage unit 150 is the storage unit in which the first condition set and the second condition set are stored. The prosody model learning device 1C is the same as the prosody model learning device 1A shown in FIG. 4, except that the condition set storage unit 150 is illustrated.

Next, the operation of the second embodiment of the present invention will be described in detail.

FIG. 5 is a flowchart illustrating an example of the operation of the prosody model learning device 1A according to the second embodiment.

The second learning unit 140 creates prosody models from the learning data (step S114). The first clustering unit 111 clusters the prosody models using at least some of the conditions in the first condition set (step S111). The second clustering unit 121 clusters the prosody models using at least some of the conditions in the second condition set (step S112). The first learning unit 131 re-learns the prosody model based on the clustering result of the second clustering unit 121 (step S113).
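The flow of steps S114 and S111 to S113 can be sketched as follows. This is a simplified illustration with several assumptions: a provisional model is just the mean F0 of one context, a context is a hypothetical `(accent_position, mora_count)` tuple, and re-learning pools the original data of all contexts in a cluster. None of these details are specified in the source.

```python
from collections import defaultdict
from statistics import fmean


def provisional_models(data):
    """S114: one provisional model (here, a mean F0) per distinct context."""
    by_context = defaultdict(list)
    for context, f0 in data:
        by_context[context].append(f0)
    return {ctx: fmean(values) for ctx, values in by_context.items()}


def cluster_models(models, conditions):
    """S111/S112: group the provisional models by their answers to the ordered conditions."""
    clusters = defaultdict(list)
    for ctx in models:
        clusters[tuple(cond(ctx) for cond in conditions)].append(ctx)
    return dict(clusters)


def relearn(data, clusters):
    """S113: re-estimate one model per cluster from the data of its member contexts."""
    return {
        key: fmean(f0 for ctx, f0 in data if ctx in contexts)
        for key, contexts in clusters.items()
    }


first_conditions = [lambda c: c[0] == 0]    # accent position (applied first)
second_conditions = [lambda c: c[1] >= 5]   # mora count (applied second)

data = [((0, 3), 180.0), ((0, 3), 190.0), ((0, 4), 200.0), ((1, 6), 150.0)]
models = provisional_models(data)
clusters = cluster_models(models, first_conditions + second_conditions)
final_models = relearn(data, clusters)
```

In this toy data the contexts `(0, 3)` and `(0, 4)` fall into the same cluster, and the re-learned model for that cluster is estimated from all three of their samples rather than from the two provisional means.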

The prosody model learning device 1A of this embodiment can generate a model capable of generating more stable prosody, because re-learning the prosody model improves the accuracy with which the model is learned.

(Third embodiment)
FIG. 6 is a block diagram illustrating a configuration example of a speech synthesis system 100 according to the third embodiment of the present invention. Referring to FIG. 6, the speech synthesis system 100 according to this embodiment consists of a learning unit 10 and a speech synthesis unit 20. The learning unit 10 has a first clustering unit 110, a second clustering unit 120, a first learning unit 130, and a prosody model storage unit 310. The speech synthesis unit 20 has a language analysis unit 210, a prosody generation unit 220, and a waveform generation unit 230.

The prosody model storage unit 310 stores the prosody model generated by the first learning unit 130.

The speech synthesis unit 20 generates a synthesized speech waveform corresponding to the input text.

The language analysis unit 210 performs language analysis on the input text and outputs phonological information.

The prosody generation unit 220 refers to the clustering structure information included in the prosody model stored in the prosody model storage unit 310 and determines the cluster to which the phonological information belongs. The prosody generation unit 220 then generates prosody information based on the prosody model of that cluster.
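The cluster lookup can be pictured as a walk down the stored tree, answering one condition per node. The sketch below assumes a nested-dictionary tree and per-cluster F0 patterns; the tree layout, cluster names, feature names, and values are all hypothetical.

```python
def find_cluster(tree, features):
    """Walk from the root to a leaf by answering each node's condition."""
    node = tree
    while "condition" in node:
        node = node["yes"] if node["condition"](features) else node["no"]
    return node["cluster"]


tree = {
    "condition": lambda f: f["accent_position"] == 0,  # first condition set (top)
    "yes": {"cluster": "unaccented"},
    "no": {
        "condition": lambda f: f["mora_count"] >= 5,   # second condition set (below)
        "yes": {"cluster": "accented_long"},
        "no": {"cluster": "accented_short"},
    },
}

# One representative F0 pattern per cluster (illustrative values in Hz).
models = {
    "unaccented": [180.0, 200.0, 200.0],
    "accented_long": [220.0, 200.0, 150.0],
    "accented_short": [230.0, 160.0, 140.0],
}


def generate_prosody(features):
    """Return the prosody model of the cluster that the features fall into."""
    return models[find_cluster(tree, features)]
```

For example, `generate_prosody({"accent_position": 1, "mora_count": 6})` follows the "no" branch at the root and the "yes" branch below it, and returns the pattern stored for `accented_long`.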

The waveform generation unit 230 generates a synthesized speech waveform based on the generated prosody information. Waveform generation methods include, for example, the waveform concatenation method, the waveform editing method, and the parametric method.

The learning unit 10 of this embodiment is a prosody model learning device in which the prosody model storage unit 310 is added to the prosody model learning device 1 of the first embodiment shown in FIG. 2. The learning unit 10 may be realized by the prosody model learning device 1 of the first embodiment together with the prosody model storage unit 310. As with the prosody model learning device 1 of the first embodiment, the learning unit 10 also includes the condition set storage unit 150 described above, which is not shown in FIG. 6. In other words, the learning unit 10 of this embodiment is a prosody model learning device in which the prosody model storage unit 310 is added to the prosody model learning device 1B of the first embodiment shown in FIG. 11.

The speech synthesis unit 20 of this embodiment may be realized by a speech synthesis device that has the language analysis unit 210, the prosody generation unit 220, and the waveform generation unit 230. The speech synthesis device only needs to be able to acquire the prosody model stored in the prosody model storage unit 310. For example, it suffices that the speech synthesis device is connected to the above-described prosody model learning device including the prosody model storage unit 310 and can receive the stored prosody model from that prosody model learning device.

Next, the operation of the third embodiment of the present invention will be described in detail.

FIG. 7 is a flowchart illustrating an example of the operation of the speech synthesis system 100 according to the third embodiment.

Steps S101 to S103 are the same as in the first embodiment, and their description is omitted.

The language analysis unit 210 performs language analysis on the input text and outputs phonological information (step S201). The prosody generation unit 220 determines the cluster to which the phonological information belongs and generates prosody information (step S202). The waveform generation unit 230 generates a synthesized speech waveform based on the generated prosody information (step S203).
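Steps S201 to S203 form a simple three-stage pipeline. The following Python sketch illustrates only that data flow. The function names, the `Prosody` record, and the toy stand-in stages are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Prosody:
    """Hypothetical per-phoneme prosody record (pitch and duration)."""
    phoneme: str
    f0_hz: float
    duration_ms: float


def synthesize(
    text: str,
    analyze: Callable[[str], List[str]],
    generate_prosody: Callable[[List[str]], List[Prosody]],
    generate_waveform: Callable[[List[Prosody]], List[float]],
) -> List[float]:
    """Chain the three stages in the order of steps S201 to S203."""
    phonemes = analyze(text)              # step S201: language analysis
    prosody = generate_prosody(phonemes)  # step S202: cluster lookup and prosody
    return generate_waveform(prosody)     # step S203: waveform generation


# Toy stand-ins so that the sketch runs; real stages would be far richer.
def toy_analyze(text: str) -> List[str]:
    return list(text)


def toy_prosody(phonemes: List[str]) -> List[Prosody]:
    return [Prosody(p, 120.0, 80.0) for p in phonemes]


def toy_waveform(prosody: List[Prosody]) -> List[float]:
    # One placeholder value per phoneme instead of real audio samples.
    return [p.f0_hz for p in prosody]


samples = synthesize("aka", toy_analyze, toy_prosody, toy_waveform)
```

Passing the stages as callables keeps the sketch independent of any particular analysis or waveform generation method.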

As described above, the speech synthesis system 100 of this embodiment can generate a synthesized speech waveform with highly stable prosody.

(Fourth embodiment)
Next, a fourth embodiment of the present invention will be described. FIG. 8 is a block diagram illustrating a configuration example of the speech synthesis system 101 according to the fourth embodiment of the present invention.

The speech synthesis system 101 according to this embodiment has a learning unit 11 and a speech synthesis unit 20. The learning unit 11 has a second learning unit 140, a first clustering unit 111, a second clustering unit 121, and a first learning unit 131. The speech synthesis unit 20 has a language analysis unit 210, a prosody generation unit 220, and a waveform generation unit 230. The speech synthesis system 101 further has a prosody model storage unit 310.

The speech synthesis system 101 of this embodiment uses hidden Markov models (HMMs) that depend on context information. It models speech by connecting left-to-right continuous-density HMMs, with one or more states per phoneme. Context information is information that is considered to affect acoustic parameters such as spectrum, pitch, and duration (that is, factors of variation).

The speech synthesis system 101 of this embodiment synthesizes Japanese speech. Japanese is a pitch-accent language, in which accent is expressed by the rise and fall of voice pitch. Accent is therefore governed mainly by the pitch pattern and the phoneme duration. For this reason, in this embodiment the prosody information is information on features of the pitch pattern and the phoneme duration. The prosody information may further include power and the like. In this embodiment, binary tree-structured clustering is used as the clustering method, so each condition for dividing the data is a question that splits a node in two.

The learning data is prepared in advance. The learning data includes at least speech waveform data recording the voice of the speaker to be reproduced by speech synthesis. The learning data further includes additional information generated by analyzing the speech waveform data. The additional information includes text information of the utterance content, context information of each phoneme in the speech waveform data, the duration of each phoneme in the speech waveform data, fundamental frequency information at regular intervals (pitch pattern information), and cepstrum information at regular intervals (spectrum information of the speech waveform data). The context information includes at least information on the outline of the pitch pattern of the accent phrase, and also includes information on the preceding, current, and following phonemes, information on the number of morae in the sentence, accent phrase, and breath group, information on the accent position, information on whether the sentence is a question, and the like.

The second learning unit 140 performs learning to create a prosody model using the learning data. This prosody model is a provisional model created for clustering and re-learning, and its accuracy is often low.

The first clustering unit 111 clusters the prosody model using the first condition set. The first condition set consists only of questions about the outline of the pitch pattern in the accent phrase. Clustering is performed based on the context information of each phoneme constituting the speech waveform data. Questions about the outline of the pitch pattern in the accent phrase are therefore questions such as "Is this the second syllable of a type-3 accent phrase?" or "Is this the third or a later syllable of a flat (heiban) accent phrase?".
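As an illustration, each question in a condition set can be treated as a yes/no predicate over a phoneme's context, and applying it splits a node's data in two. The following is a minimal sketch under that assumption. The `Context` fields, the question names, and the encoding of accent types (0 for a flat phrase) are hypothetical choices made for this example only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Context:
    """Hypothetical subset of a phoneme's context information."""
    phoneme: str
    accent_type: int     # assumed encoding: 0 = flat (heiban), n = type-n
    syllable_index: int  # 1-based position within the accent phrase


Question = Callable[[Context], bool]

# Two questions modeled on the examples given in the text.
FIRST_CONDITION_SET: Dict[str, Question] = {
    "type3_second_syllable":
        lambda c: c.accent_type == 3 and c.syllable_index == 2,
    "flat_third_or_later":
        lambda c: c.accent_type == 0 and c.syllable_index >= 3,
}


def split(data: List[Context],
          question: Question) -> Tuple[List[Context], List[Context]]:
    """Divide a node's data into the 'yes' and 'no' branches of a question."""
    yes = [c for c in data if question(c)]
    no = [c for c in data if not question(c)]
    return yes, no


data = [Context("a", 3, 2), Context("k", 0, 4), Context("i", 1, 1)]
yes_branch, no_branch = split(
    data, FIRST_CONDITION_SET["type3_second_syllable"])
```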

The first clustering unit 111 constructs a tree structure (first-stage tree structure) whose nodes contain only questions about the outline of the pitch pattern in the accent phrase. The first condition set is smaller than the second condition set described later. The first-stage tree structure is therefore small compared with the tree structure that is finally constructed. FIG. 9 shows an example of the first-stage tree structure.

The second clustering unit 121 performs clustering to further refine the first-stage tree structure using the second condition set. Specifically, the second clustering unit 121 adds nodes using the questions of the second condition set while keeping the first-stage tree structure intact. The second condition set includes, for example, questions about the current phoneme, such as "Is the current phoneme 'a'?" or "Is this the syllable of the fifth mora?", and questions about the preceding and following environment, such as "Is the preceding phoneme unvoiced?" or "Is the following phoneme a pause?".

In this way, the second clustering unit 121 constructs a detailed tree structure (second-stage tree structure). FIG. 10 shows an example of the second-stage tree structure. As shown in FIG. 10, the second-stage tree structure branches further from the terminal nodes of the first-stage tree structure.

Note that the first-stage portion of the tree is omitted in FIG. 10. FIG. 13 shows the first-stage tree structure portion omitted from FIG. 10.

In this way, the first clustering unit 111 and the second clustering unit 121 construct a tree structure in which the questions about the shape of the pitch pattern in the accent phrase occupy the upper levels.
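The two-stage construction described above can be sketched as follows: a tree is first grown using only the first condition set, and each resulting leaf is then refined using the second condition set, so the first-stage questions stay at the top. This passage does not state the split criterion, so the sketch assumes a simple one (reduction in squared error of a scalar feature such as log F0). All names and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[dict, float]            # (context, scalar feature such as log F0)
Question = Callable[[dict], bool]


@dataclass
class Node:
    samples: List[Sample]
    question: Optional[str] = None     # None marks a leaf (a cluster)
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def sse(samples: List[Sample]) -> float:
    """Sum of squared deviations from the mean (assumed impurity measure)."""
    if not samples:
        return 0.0
    values = [v for _, v in samples]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def grow(node: Node, questions: Dict[str, Question],
         min_gain: float = 1e-9) -> None:
    """Recursively split a node with the question giving the largest gain."""
    best = None
    for name, q in questions.items():
        yes = [s for s in node.samples if q(s[0])]
        no = [s for s in node.samples if not q(s[0])]
        if not yes or not no:
            continue                   # question does not divide this node
        gain = sse(node.samples) - sse(yes) - sse(no)
        if gain > min_gain and (best is None or gain > best[0]):
            best = (gain, name, yes, no)
    if best is None:
        return
    _, node.question, yes, no = best
    node.yes, node.no = Node(yes), Node(no)
    grow(node.yes, questions, min_gain)
    grow(node.no, questions, min_gain)


def leaves(node: Node) -> List[Node]:
    if node.question is None:
        return [node]
    return leaves(node.yes) + leaves(node.no)


def two_stage_cluster(samples: List[Sample],
                      first_set: Dict[str, Question],
                      second_set: Dict[str, Question]) -> Node:
    root = Node(list(samples))
    grow(root, first_set)              # first-stage tree
    for leaf in leaves(root):          # first-stage structure is kept as is
        grow(leaf, second_set)         # second stage only extends the leaves
    return root


samples = [({"accent": 1, "ph": "a"}, 5.0), ({"accent": 1, "ph": "i"}, 5.2),
           ({"accent": 0, "ph": "a"}, 4.0), ({"accent": 0, "ph": "i"}, 4.3)]
first = {"accent_type_1": lambda c: c["accent"] == 1}
second = {"is_a": lambda c: c["ph"] == "a"}
root = two_stage_cluster(samples, first, second)
```

Because the second stage only extends leaves of the first-stage tree, a second-set question can never appear above a first-set question, whatever gain it would have had at the root.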

The first learning unit 131 re-learns the prosody model for each cluster using the clustering result of the second clustering unit 121. The prosody model also includes the structure information of the tree-structured clustering.

The first learning unit 131 stores the prosody model generated by re-learning in the prosody model storage unit 310.

The speech synthesis unit 20 generates a synthesized speech waveform based on the input text. The language analysis unit 210 performs language analysis on the input text and generates its phonological information. The prosody generation unit 220 determines the cluster to which each piece of phonological information belongs, based on the tree structure information included in the prosody model. The prosody generation unit 220 then generates prosody information (for example, a pitch pattern and phoneme durations) using the prosody model of that cluster. The waveform generation unit 230 generates a synthesized speech waveform based on the generated prosody information.
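The cluster lookup performed by the prosody generation unit 220 amounts to walking the stored tree from the root, answering one question per node, until a leaf is reached. The sketch below assumes a nested-dictionary tree and per-cluster mean values. The tree layout, question names, and numeric values are invented for illustration and are not taken from the patent.

```python
from typing import Callable, Dict, List

Question = Callable[[dict], bool]

# Hypothetical questions and tree structure stored with the prosody model.
QUESTIONS: Dict[str, Question] = {
    "accent_type_1": lambda c: c["accent_type"] == 1,
    "is_vowel_a": lambda c: c["phoneme"] == "a",
}

TREE = {
    "question": "accent_type_1",
    "yes": {
        "question": "is_vowel_a",
        "yes": {"cluster": "c0"},
        "no": {"cluster": "c1"},
    },
    "no": {"cluster": "c2"},
}

# Hypothetical per-cluster models (illustrative numbers only).
CLUSTER_MODELS = {
    "c0": {"f0_hz": 180.0, "duration_ms": 90.0},
    "c1": {"f0_hz": 170.0, "duration_ms": 70.0},
    "c2": {"f0_hz": 140.0, "duration_ms": 80.0},
}


def find_cluster(tree: dict, context: dict,
                 questions: Dict[str, Question]) -> str:
    """Descend from the root, answering each node's question, to a leaf."""
    node = tree
    while "question" in node:
        answer = questions[node["question"]](context)
        node = node["yes"] if answer else node["no"]
    return node["cluster"]


def generate_prosody(contexts: List[dict]) -> List[dict]:
    """Look up the cluster of each phoneme and return its model parameters."""
    return [CLUSTER_MODELS[find_cluster(TREE, c, QUESTIONS)] for c in contexts]


prosody = generate_prosody([
    {"accent_type": 1, "phoneme": "a"},
    {"accent_type": 0, "phoneme": "k"},
])
```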

In the description above, the first condition set of this embodiment contains only questions about the outline of the accent phrase. However, the first condition set is not limited to this. For example, the first condition set may include at least the question "Is the current phoneme voiced?". Whether a phoneme is voiced or unvoiced is an important condition when generating prosody: an unvoiced sound has no pitch frequency, so no pitch needs to be generated for it, whereas pitch must be generated for a voiced sound.

In the description above, this embodiment targets Japanese, a pitch-accent language, so the prosody information is the pitch pattern and the phoneme duration. For a stress-accent language, of which English is a typical example and in which accent is expressed by vocal intensity, the prosody information may be power and phoneme duration. Of course, regardless of whether the language is a pitch-accent language or a stress-accent language, the prosody information may include all of the pitch pattern, phoneme duration, power, and other features.

The prosody model stored in the prosody model storage unit 310 may be the actual data in each cluster. In that case, the prosody generation unit 220 generates prosody information by selecting actual data from the cluster. For example, the prosody model storage unit 310 stores, for each cluster, multiple pitch pattern data items, one per accent phrase. The representative pitch pattern of each cluster is the data item closest to the centroid (that is, the center of gravity). The prosody generation unit 220 generates prosody information based on the representative pitch pattern of the cluster.
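Selecting the representative pitch pattern can be sketched as follows. The sketch assumes that all pitch patterns in a cluster have been resampled to the same length and that closeness is measured by Euclidean distance. Neither assumption is stated in the text, and the function name is hypothetical.

```python
from typing import List, Sequence


def representative_pattern(
        patterns: Sequence[Sequence[float]]) -> Sequence[float]:
    """Return the stored pattern closest to the cluster centroid.

    Assumes every pattern has the same number of points.
    """
    n = len(patterns)
    dim = len(patterns[0])
    # Centroid: point-wise mean over all patterns in the cluster.
    centroid: List[float] = [
        sum(p[i] for p in patterns) / n for i in range(dim)
    ]

    def squared_distance(p: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(p, centroid))

    # An actual data item is returned, not the centroid itself.
    return min(patterns, key=squared_distance)


cluster = [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
representative = representative_pattern(cluster)
```

Returning a stored pattern rather than the averaged centroid means the generated contour is always one that a speaker actually produced.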

Note that the first clustering unit 111 and the second clustering unit 121 may perform clustering again on the prosody model generated by the first learning unit 131. Repeating learning and clustering multiple times in this way improves the accuracy of model learning. As a result, a model capable of generating more stable prosody is obtained.
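The alternation between learning and clustering can be expressed as a short control loop. The sketch below shows only that control flow. `learn` and `cluster` are hypothetical callables standing in for the learning units and the two clustering units, and the number of rounds is an arbitrary parameter.

```python
from typing import Any, Callable, Optional


def train_iteratively(
    data: Any,
    learn: Callable[[Any, Optional[Any]], Any],
    cluster: Callable[[Any, Any], Any],
    rounds: int = 2,
) -> Any:
    """Alternate clustering and re-learning for a fixed number of rounds."""
    # Provisional model learned before any clustering (second learning unit).
    model = learn(data, None)
    for _ in range(rounds):
        # First-stage and second-stage clustering based on the current model.
        clusters = cluster(data, model)
        # Per-cluster re-learning (first learning unit).
        model = learn(data, clusters)
    return model
```

In practice the loop might instead stop when the clustering no longer changes; the fixed round count here is only a simplification.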

The learning unit 11 of this embodiment is the prosody model learning device 1A according to the second embodiment shown in FIG. 4. The learning unit 11 may further include the prosody model storage unit 310. In that case, the learning unit 11 is a prosody model learning device in which the prosody model storage unit 310 is added to the prosody model learning device 1A according to the second embodiment. Like the prosody model learning device 1A according to the second embodiment, the learning unit 11 also includes the above-described condition set storage unit 150, which is not shown in FIG. 8. In other words, the learning unit 11 of this embodiment is a prosody model learning device in which the prosody model storage unit 310 is added to the prosody model learning device 1C according to the second embodiment shown in FIG. 12.

The speech synthesis unit 20 of this embodiment may be a speech synthesis device including the language analysis unit 210, the prosody generation unit 220, and the waveform generation unit 230. The speech synthesis device only needs to be able to acquire the prosody model stored in the prosody model storage unit 310.

Although the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments.

Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope, for example with respect to the type of statistical method, the type of clustering, the prosody generation method, and the speech synthesis method.

In the flowcharts used in the above description, multiple processes are described in order, but the order in which the processes are executed in each embodiment is not limited to the order described. In each embodiment, the order of the illustrated steps can be changed to the extent that doing so does not interfere with their content. Each of the above-described embodiments and the fourth embodiment can also be combined to the extent that their contents do not conflict.

Each of the prosody model learning device 1, the prosody model learning device 1A, the prosody model learning device 1B, the prosody model learning device 1C, the speech synthesis system 100, the speech synthesis system 101, the learning unit 10, the learning unit 11, and the speech synthesis unit 20 can be realized by a computer and a program that controls the computer, by dedicated hardware, or by a combination of a computer and a program that controls the computer with dedicated hardware.

As mentioned above, FIG. 1 is a diagram illustrating an example of the hardware configuration of a computer 1000 used to realize the prosody model learning device 1, the prosody model learning device 1A, the prosody model learning device 1B, the prosody model learning device 1C, the speech synthesis system 100, the speech synthesis system 101, the learning unit 10, the learning unit 11, and the speech synthesis unit 20. Referring to FIG. 1, the computer 1000 can further access a recording medium 9. The memory 3 and the storage device 4 are storage devices such as a RAM (Random Access Memory) or a hard disk. The recording medium 9 is, for example, a storage device such as a RAM or a hard disk, a ROM (Read Only Memory), or a portable recording medium. The storage device 4 may be the recording medium 9. The CPU 2 can read and write data and programs from and to the memory 3 and the storage device 4. Via the communication IF 5, the CPU 2 can access, for example, a device that inputs learning data, a device that inputs text, a device that outputs a prosody model, and a device that outputs a speech waveform. The CPU 2 can access the recording medium 9. The recording medium 9 stores a program that causes the computer 1000 to operate as the prosody model learning device 1, the prosody model learning device 1A, the prosody model learning device 1B, the prosody model learning device 1C, the speech synthesis system 100, the speech synthesis system 101, the learning unit 10, the learning unit 11, or the speech synthesis unit 20.

The CPU 2 loads into the memory 3 the program, stored in the recording medium 9, that causes the computer 1000 to operate as the prosody model learning device 1, the prosody model learning device 1A, the prosody model learning device 1B, the prosody model learning device 1C, the speech synthesis system 100, the speech synthesis system 101, the learning unit 10, the learning unit 11, or the speech synthesis unit 20. When the CPU 2 executes the program loaded into the memory 3, the computer 1000 operates as the prosody model learning device 1, the prosody model learning device 1A, the prosody model learning device 1B, the prosody model learning device 1C, the speech synthesis system 100, the speech synthesis system 101, the learning unit 10, the learning unit 11, or the speech synthesis unit 20.

The first clustering unit 110, the first clustering unit 111, the second clustering unit 120, the second clustering unit 121, the first learning unit 130, the first learning unit 131, the second learning unit 140, the language analysis unit 210, the prosody generation unit 220, and the waveform generation unit 230 can be realized, for example, by dedicated programs that implement the functions of the respective units, read into the memory 3 from the recording medium 9 storing the programs, and by the CPU 2 that executes those programs. The condition set storage unit 150 and the prosody model storage unit 310 can be realized by the memory 3 included in the computer or by the storage device 4 such as a hard disk device. Alternatively, some or all of the first clustering unit 110, the first clustering unit 111, the second clustering unit 120, the second clustering unit 121, the first learning unit 130, the first learning unit 131, the second learning unit 140, the condition set storage unit 150, the language analysis unit 210, the prosody generation unit 220, the waveform generation unit 230, and the prosody model storage unit 310 can be realized by dedicated circuits that implement the functions of the respective units.

Some or all of the above embodiments can also be described as in the following appendices, but are not limited to them.

(Appendix 1)
A prosody model learning device comprising:
first clustering means for clustering data using a first condition set that includes one or more conditions which are conditions for dividing the data and which have a large influence on the generation of prosody;
second clustering means for clustering the data using the clustering result of the first clustering means and a second condition set that includes one or more conditions different from the conditions included in the first condition set; and
learning means for learning a prosody model based on the clustering result of the second clustering means.

(Appendix 2)
The prosody model learning device according to Appendix 1, wherein
the first clustering means performs clustering using all of the conditions included in the first condition set.

(Appendix 3)
The prosody model learning device according to Appendix 1 or 2, wherein
the first condition set includes at least a condition relating to an accent position.

(Appendix 4)
The prosody model learning device according to any one of Appendices 1 to 3, wherein
the second clustering means takes the clustering result of the first clustering means as an upper structure and clusters a lower structure using the second condition set.

(Appendix 5)
The prosody model learning device according to any one of Appendices 1 to 4, wherein
the first condition set includes at least a question as to whether the current phoneme is a voiced sound.

(Appendix 6)
A prosody model learning method comprising:
performing first clustering on data using a first condition set that includes one or more conditions which are conditions for dividing the data and which have a large influence on the generation of prosody;
performing second clustering on the data using the result of the first clustering and a second condition set that includes one or more conditions different from the conditions included in the first condition set; and
learning a prosody model using the result of the second clustering.

(Appendix 7)
A prosody model learning program that causes a computer to execute:
a first clustering step of clustering data using a first condition set that includes one or more conditions which are conditions for dividing the data and which have a large influence on the generation of prosody;
a second clustering step of clustering the data using the clustering result of the first clustering step and a second condition set that includes one or more conditions different from the conditions included in the first condition set; and
a learning step of learning a prosody model using the clustering result of the second clustering step.

(Appendix 8)
A speech synthesis system comprising:
first clustering means for clustering data using a first condition set that includes one or more first conditions which are conditions for dividing the data and which have a large influence on the generation of prosody;
second clustering means for clustering the data using the clustering result of the first clustering means and a second condition set that includes one or more conditions different from the conditions included in the first condition set;
learning means for learning a prosody model using the clustering result of the second clustering means; and
synthesis means for generating a synthesized speech waveform corresponding to input text based on the prosody model learned by the learning means.

This application claims priority based on Japanese Patent Application No. 2012-228663, filed on October 16, 2012, the entire disclosure of which is incorporated herein.

1, 1A, 1B, 1C Prosody model learning device
2 CPU
3 Memory
4 HDD
5 Communication IF
6 Display device
7 Input device
8 Bus
10, 11 Learning unit
20 Speech synthesis unit
100, 101 Speech synthesis system
110, 111 First clustering unit
120, 121 Second clustering unit
130, 131 First learning unit
140 Second learning unit
160 Condition set storage unit
210 Language analysis unit
220 Prosody generation unit
230 Waveform generation unit
310 Prosody model storage unit
1000 Computer

Claims

A prosody model learning device comprising:
first clustering means for clustering data using a first condition set that includes one or more conditions which are conditions for dividing the data and which have a large influence on the generation of prosody;
second clustering means for clustering the data using the clustering result of the first clustering means and a second condition set that includes one or more conditions different from the conditions included in the first condition set; and
learning means for learning a prosody model based on the clustering result of the second clustering means.

The prosody model learning device according to claim 1, wherein the first clustering means performs clustering using all of the conditions included in the first condition set.

The prosody model learning device according to claim 1, wherein the first condition set includes at least a condition relating to an accent position.

The prosody model learning device according to claim 1, wherein the second clustering means takes the clustering result of the first clustering means as an upper structure and clusters a lower structure using the second condition set.

The prosody model learning device according to any one of claims 1 to 4, wherein the first condition set includes at least a question regarding whether or not the phoneme is a voiced sound.

A prosody model learning method comprising:
performing first clustering on data using a first condition set that includes one or more conditions which are conditions for dividing the data and which have a large influence on the generation of prosody;
performing second clustering on the data using the result of the first clustering and a second condition set that includes one or more conditions different from the conditions included in the first condition set; and
learning a prosody model using the result of the second clustering.

A prosody model learning program that causes a computer to execute:
a first clustering step of clustering data using a first condition set that includes one or more conditions which are conditions for dividing the data and which have a large influence on the generation of prosody;
a second clustering step of clustering the data using the clustering result of the first clustering step and a second condition set that includes one or more conditions different from the conditions included in the first condition set; and
a learning step of learning a prosody model using the clustering result of the second clustering step.

A speech synthesis system comprising:
first clustering means for clustering data using a first condition set that includes one or more first conditions which are conditions for dividing the data and which have a large influence on the generation of prosody;
second clustering means for clustering the data using the clustering result of the first clustering means and a second condition set that includes one or more conditions different from the conditions included in the first condition set;
learning means for learning a prosody model using the clustering result of the second clustering means; and
synthesis means for generating a synthesized speech waveform corresponding to input text based on the prosody model learned by the learning means.