JPH08263520A

JPH08263520A - System and method for speech file constitution

Info

Publication number: JPH08263520A
Application number: JP7091616A
Authority: JP
Inventors: Takashi Horie; 高志保理江; Noriya Murakami; 憲也村上
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1995-03-24
Filing date: 1995-03-24
Publication date: 1996-10-11

Abstract

PURPOSE: To construct a speech file with high coverage of both context and prosody, in a speech file construction system applied to waveform synthesis, which obtains synthesized speech by concatenating phoneme waveforms. CONSTITUTION: First, waveform data having the same phoneme label are extracted from the speech waveform data stored in a speech database to form an initial cluster 110. Next, context-based clustering is performed on the initial cluster 110 in a predetermined feature parameter space. Then, clustering in a predetermined prosodic parameter space is performed on each of the clusters 130, 140, and 150 obtained by the context clustering. Finally, from each of the fine clusters 131-133, 141-143, and 151-153 obtained by the prosody clustering, the waveform data closest to the centroid of that fine cluster are extracted and registered in the speech file 160 for synthesis.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a technique for constructing a speech file applied to waveform synthesis, in which synthesized speech is obtained by selecting and concatenating appropriate speech waveform segments from a set of speech waveform segments cut out in advance from natural utterances and stored in a memory, an external storage device, or the like.

[0002]

[Prior Art] Research on waveform synthesis techniques capable of producing highly natural synthesized speech has been widely conducted. However, no method of constructing the speech file used for waveform synthesis has been established, and in practice the file has been constructed manually, mainly on the basis of the experimenter's subjective judgment.

[0003] Against this background, one known automatic method of constructing a speech file uses Context Oriented Clustering (hereinafter, COC) (Ito, Nakajima, Hirokawa: IEICE Technical Report SP93-121 (1994-01)). This method clusters speech waveforms by context (the phoneme sequence preceding and following the phoneme of interest) on the basis of acoustic features, and automatically constructs a speech file from the waveform data nearest the centroid (the center of gravity of the cluster) of each resulting cluster.

[0004] In waveform synthesis, good synthesized speech is obtained only if appropriate waveform segments are selected and used according to the context in which each phoneme occurs and its prosody (features such as the pitch contour, duration, and loudness of the speech). The speech file referred to when selecting waveform segments must therefore guarantee sufficient coverage of the various speech events, so that a situation in which no appropriate waveform segment can be found does not arise.

[0005] Viewed from this standpoint, COC was originally positioned as a speech file construction method for LSP (line spectrum pair) speech synthesis, and it stores individual speech waveform segments in the form of LSP parameters. Because LSP parameters interpolate well and allow a high degree of freedom in modification, LSP speech synthesis stores the average spectral features for each context in the speech file and, at synthesis time, modifies the prosodic features of the speech waveform segments to fit the desired speech. COC therefore emphasizes the classification of spectral features by context, and prosodic features need not be considered when constructing the speech file.

[0006]

[Problems to Be Solved by the Invention] In contrast, when the speech file stores the speech waveforms themselves as the speech units, excessive waveform modification reduces the naturalness of the synthesized speech. To keep waveform modification to a minimum, the speech file must cover not only spectral features but also prosodic features.

[0007] COC, however, provides sufficient coverage of context but does not take prosody into account. If the dictionary (speech file) is constructed only by classifying spectral features by context, speech synthesis depends more heavily on waveform modification. Excessive waveform modification then becomes unavoidable in some cases during prosody control, which impairs the naturalness of the waveform and degrades the quality of the synthesized speech.

[0008] As described above, conventional speech file construction methods cannot construct a speech file with high coverage of both context and prosody.

[0009] Accordingly, an object of the present invention is to provide a speech file construction system capable of constructing a speech file with high coverage of both context and prosody.

[0010]

[Means for Solving the Problems] The present invention is a system for constructing a speech file used for waveform synthesis, in which synthesized speech is obtained by combining speech waveform segments cut out from natural utterances. The system comprises: a speech database storing a large number of speech waveform segments cut out from natural utterances; context clustering means for performing context-based clustering (hereinafter, context clustering) in a predetermined feature parameter space on a set of speech waveform segments in the speech database that have the same phoneme; prosody clustering means for performing clustering in a predetermined prosodic parameter space (hereinafter, prosody clustering) on each cluster obtained by the context clustering; and speech segment registration means for obtaining the centroid of each cluster obtained by the prosody clustering, extracting the speech waveform segment near that centroid, and registering it in the speech file.

[0011] The present invention also provides a speech file construction method carried out by the above system.

[0012]

[Operation] According to the present invention, a set of speech waveform segments of the same phoneme is extracted from a speech database prepared in advance. First, context clustering is applied to this same-phoneme waveform set, yielding a plurality of context-based clusters. Next, prosody clustering is applied to each cluster obtained by the context clustering, dividing each cluster into a plurality of small clusters with different prosodic environments. Then, from each of these small clusters, the waveform segment near its centroid is extracted and registered in the speech file. The result is a speech file with high coverage of both context and prosody.

[0013] With this speech file, waveform synthesis can obtain, for a desired phoneme and context, a waveform segment whose prosody is close to the desired prosody. Prosody control therefore depends less on waveform modification, and degradation of the synthesized speech quality can be prevented.

[0014] In a preferred embodiment, the context clustering means and the prosody clustering means each have termination determination means that determines the end of clustering on the basis of the gain of cluster splitting. This termination determination means makes it possible to end the clustering at an appropriate stage.

[0015]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings. The embodiment can be implemented on a well-known general-purpose computer, and a person skilled in the art can readily carry out the invention from an understanding of the processing steps alone, without a description of the hardware configuration. Only the processing steps of the embodiment are therefore described below.

[0016] [1] Speech Database

FIG. 1 shows the configuration of the speech database prepared as the source material for creating a speech file in an embodiment of the present invention.

[0017] The speech database 100 shown in FIG. 1 is obtained by performing phoneme labeling on naturally uttered speech, that is, by associating each section of the speech waveform with the phoneme it corresponds to. The speech database 100 comprises: a waveform data file 101 storing the speech waveform data (shown in FIG. 1 separated into the waveforms of the individual phoneme sections); a phoneme labeling file 102 describing, for each phoneme section of the waveform data, the corresponding phonetic symbol (hereinafter, phoneme label) and the phonetic symbols of its context (for example, one phoneme before and one after); and a prosody data file 103 describing the prosodic parameters obtained by waveform analysis of each phoneme section (for example, pitch [Hz], pitch slope [Hz/ms], power [dB], and duration [ms]).

[0018] [2] Overall Processing

FIG. 2 schematically shows the overall processing flow according to an embodiment of the present invention.

[0019] In FIG. 2, the waveform data of the phoneme sections having the same phoneme label are first cut out from the source speech database 100 (see FIG. 1). In the illustrated example, the waveform data with the phoneme label /a/ are extracted. The resulting set of waveform data is taken as the initial cluster 110.
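The database layout of FIG. 1 and the extraction of the initial cluster can be sketched as follows. This is only an illustration: the class names, field names, and the function `initial_cluster` are hypothetical and do not appear in the patent, and the context is assumed to be one phoneme on each side.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prosody:
    """Prosodic parameters of one phoneme section (prosody data file 103)."""
    pitch_hz: float
    pitch_slope_hz_per_ms: float
    power_db: float
    duration_ms: float


@dataclass
class Segment:
    """One labeled phoneme section of the speech database 100."""
    label: str            # phoneme label, e.g. "a" (labeling file 102)
    prev_phoneme: str     # context: preceding phoneme (assumed width of one)
    next_phoneme: str     # context: following phoneme (assumed width of one)
    samples: List[float]  # waveform samples (waveform data file 101)
    prosody: Prosody


def initial_cluster(database, label):
    """Collect every segment carrying the given phoneme label."""
    return [seg for seg in database if seg.label == label]
```

Calling `initial_cluster(database, "a")` would return the set that the text calls the initial cluster 110 for the phoneme /a/.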

[0020] Next, context-based clustering (context clustering) is performed on the initial cluster 110 in the space of feature parameters (hereinafter, feature parameter space). Here, a feature parameter is a parameter representing the spectral characteristics of a phoneme section, such as the cepstrum or LSP.

[0021] This context clustering divides the initial cluster 110 into several clusters with different contexts. In the illustrated example, it is divided into a cluster 120 in which the following phoneme is voiced and a cluster 130 in which it is unvoiced, and cluster 120 is further divided into a cluster 140 of word-initial phonemes and a cluster 150 of word-medial phonemes. Clusters 130, 140, and 150 thus consist of the waveform segment sets 139, 149, and 159, respectively, each having contexts with similar feature parameters.

[0022] Next, further clustering (prosody clustering) is performed on each of the clusters 130, 140, and 150 in the space of prosodic parameters (for example, pitch, pitch slope, power, and duration), hereinafter called the prosodic parameter space. For example, suppose the phoneme waveform data belonging to each of the clusters 130, 140, and 150 (hereinafter, the elements of the cluster) are plotted as the points shown in the prosodic parameter spaces P1, P2, and P3 of the respective clusters. Prosody clustering then draws boundaries such as the broken lines 134 to 136, 144 to 146, and 154 to 156 in the figure, dividing the clusters into the finer clusters 131 to 133, 141 to 143, and 151 to 153.

[0023] Next, the centroid of each of these subdivided clusters 131 to 133, 141 to 143, and 151 to 153 is obtained, and the waveform data nearest that centroid are extracted and registered in the speech file 160 for synthesis. At the same time, the context and prosodic parameters of the extracted waveform data are also stored in the speech file 160 for synthesis.
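The selection of a representative element can be sketched as below. The sketch assumes each element is already a numeric vector and uses a plain squared Euclidean distance; the function names are hypothetical, and the distance measure actually used in the prosodic parameter space may differ.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def representative(points):
    """Index of the element nearest the cluster centroid."""
    c = centroid(points)
    return min(range(len(points)), key=lambda i: sq_dist(points[i], c))
```

The element at the returned index is the one whose waveform, context, and prosodic parameters would be registered in the speech file for that fine cluster.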

[0024] The processing described above constructs the speech file for one phoneme, /a/. The speech file for the other phonemes is constructed through the same processing.

[0025] As described above, this embodiment applies prosody clustering to the clusters obtained by context clustering, so that a plurality of waveform data are extracted with the distribution of the elements in the prosodic parameter space taken into account. This improves the coverage of speech events, including not only articulatory variation (the variation in the acoustic features of a phoneme caused by the preceding and following context) but also differences in prosodic features.

[0026] [3] Context Clustering

FIG. 3 is a flowchart showing the context clustering process in this embodiment.

[0027] In FIG. 3, all waveform data given the same phoneme label are first taken from the phoneme-labeled waveform data in the speech database 100 and set as the initial cluster 110 (step 201). Next, feature analysis is performed on each waveform datum (element) in the initial cluster 110 (step 202). In this feature analysis, the order of the feature parameters, such as the LPC (linear predictive coding) cepstrum, is set to n, and the frame period of the analysis window function is made variable so that the analysis yields m frames. An n × m feature parameter matrix is thus obtained for each element.
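One way to obtain a fixed number of frames from segments of different lengths is to stretch or shrink the frame period per segment. The sketch below assumes a fixed window length and evenly spaced frames that span the whole segment; the function name and this particular spacing rule are illustrative assumptions, not details given in the text.

```python
def frame_starts(num_samples, window, m):
    """Start indices of m analysis frames covering a segment.

    The frame period (hop) is chosen per segment so that exactly m
    frames of length `window` fit between the first and last sample.
    """
    if m < 1 or window > num_samples:
        raise ValueError("segment too short or invalid frame count")
    if m == 1:
        return [0]
    hop = (num_samples - window) / (m - 1)  # variable frame period
    return [round(k * hop) for k in range(m)]
```

Computing n feature parameters on each of the m frames then gives an n × m matrix regardless of the segment's duration.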

[0028] Next, the cluster distortion of the initial cluster 110 is obtained from the result of this feature analysis (step 203). This is done as follows.

[0029] First, the feature parameters of each element are converted from the n × m matrix form into an (n × m) × 1 vector form. To show this with a simplified example,

[Equation 1]

when the original form is, for example, a 4 × 3 matrix, its 12 components are arranged in one dimension to convert it into a 12-dimensional vector.

[0030] Next, in the vector space of these feature parameters, the sum of the squared distances between every element of the initial cluster 110 and the centroid obtained beforehand is calculated, and this sum is defined as the cluster distortion of the initial cluster 110.
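The flattening and the cluster distortion can be sketched as follows. The row-by-row ordering of the flattened components and the use of squared Euclidean distance are assumptions made for illustration; the function names are hypothetical.

```python
def flatten(matrix):
    """Arrange an n x m matrix (list of rows) into one vector.

    Row-major order is an assumption; any fixed order works as long as
    it is the same for every element.
    """
    return [value for row in matrix for value in row]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cluster_distortion(vectors):
    """Sum of squared distances between each element and the centroid."""
    c = centroid(vectors)
    return sum(sum((x - y) ** 2 for x, y in zip(v, c)) for v in vectors)
```

A 4 × 3 matrix passed to `flatten` yields a 12-component vector, matching the simplified example in the text.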

[0031] Once the cluster distortion of the initial cluster 110 has been obtained in this way, it is registered in the context cluster table 208. As illustrated, the context cluster table 208 holds, for each cluster, the contexts belonging to it, its centroid, its cluster distortion, and the set of element waveforms it contains. At the stage where the cluster distortion of the initial cluster 110 has just been obtained, only the initial cluster 110 is registered in the context cluster table 208.

[0032] Next, the cluster with the largest cluster distortion is found in the context cluster table 208 (step 204). This cluster is taken out of the context cluster table 208 and split by context into two clusters (step 205). At the first stage, only the initial cluster 110 is registered in the context cluster table 208, so the cluster split is applied to the initial cluster 110. The cluster is split as follows.

[0033] Suppose, for example, that the cluster to be split contains the five contexts {A, B, C, D, E}. There are many possible ways to split it, such as {A, B, C} and {D, E}, or {A, C, D, E} and {B}, or {A, B, E} and {C, D}. From these, the one split that minimizes the sum of the cluster distortions of the two resulting clusters is selected. That is, the gain Q of the cluster split is calculated for every possible split by the following equation (1).

[0034] Q = VAR − (Va + Vb) … (1)

Here, VAR is the cluster distortion of the cluster to be split, and Va and Vb are the cluster distortions of the two clusters after the split. The cluster distortion of each cluster after the split is calculated in the same way as described above. The gain Q represents the reduction in cluster distortion achieved by the split.

[0035] Once the gain Q has been obtained for every split, the split that maximizes the gain Q, in other words the split that minimizes the sum (Va + Vb) of the cluster distortions after the split, is selected, and the target cluster is divided into two clusters by that split.
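The exhaustive search over context splits can be sketched as below. The sketch assumes the elements are grouped by context in a dictionary and that the cluster holds at least two contexts; the function names and the data layout are hypothetical. Pinning the first context to one side avoids evaluating each split twice.

```python
from itertools import combinations


def distortion(vectors):
    """Sum of squared distances from each vector to the cluster centroid."""
    n = len(vectors)
    c = [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]
    return sum(sum((x - y) ** 2 for x, y in zip(v, c)) for v in vectors)


def best_context_split(by_context):
    """Return (Q, contexts_a, contexts_b) for the split with the largest gain.

    by_context maps a context name to the list of feature vectors
    observed in that context. Returns None if fewer than two contexts.
    """
    contexts = sorted(by_context)
    var = distortion([v for c in contexts for v in by_context[c]])
    first, rest = contexts[0], contexts[1:]
    best = None
    for r in range(len(rest)):  # r < len(rest) keeps side b non-empty
        for extra in combinations(rest, r):
            side_a = {first, *extra}
            side_b = set(contexts) - side_a
            va = distortion([v for c in side_a for v in by_context[c]])
            vb = distortion([v for c in side_b for v in by_context[c]])
            q = var - (va + vb)  # equation (1)
            if best is None or q > best[0]:
                best = (q, side_a, side_b)
    return best
```

For k contexts this evaluates 2^(k-1) − 1 splits, so a practical implementation would likely restrict the candidate splits when k is large.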

[0036] After the initial cluster 110 has been split in this way, the initial cluster 110 is deleted from the context cluster table 208 and the two clusters produced by the split are newly registered (step 206).

[0037] The process then returns to step 203. The cluster with the largest cluster distortion is selected from the context cluster table 208, the selected cluster is split in two in the same way as above, and the context cluster table 208 is rewritten (steps 204 to 206).

[0038] By repeating the above processing (steps 203 to 206), the initial cluster 110 is progressively subdivided into smaller clusters. On each pass through this loop, a check is made as to whether context clustering should end (step 207). Clustering is judged to have ended when the number of generated clusters reaches a predetermined upper limit or when the gain Q falls below a predetermined threshold.
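The loop of steps 204 to 207 can be sketched generically as follows. The function and parameter names are hypothetical; `distortion` and `split` are supplied by the caller, and stopping when the selected cluster cannot be split is an added assumption that the text does not discuss.

```python
def cluster_until_done(initial, distortion, split, max_clusters, min_gain):
    """Repeatedly split the cluster with the largest distortion.

    distortion(cluster) -> float
    split(cluster) -> (gain, part_a, part_b), or None if it cannot be split
    """
    table = [initial]
    while True:
        target = max(table, key=distortion)       # step 204
        result = split(target)                    # step 205
        if result is None:                        # assumption: nothing left to split
            break
        gain, part_a, part_b = result
        table.remove(target)                      # step 206: update the table
        table.extend([part_a, part_b])
        if len(table) >= max_clusters or gain < min_gain:  # step 207
            break
    return table
```

Both termination conditions are checked after the table has been updated, following the order of the steps in the text.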

[0039] The termination condition on the gain Q is explained further with reference to FIG. 4. As clustering proceeds, the total cluster distortion gradually decreases, but the amount by which it decreases also becomes gradually smaller, so the gain Q of each cluster split gradually decreases as well. When the gain Q falls below a threshold determined empirically in advance, the cluster splitting is judged to have gone far enough that the total cluster distortion cannot be reduced much further, and clustering can be ended.

[0040] Using a termination condition based on the gain Q allows clustering to be ended at an appropriate stage, independently of the value of the initial cluster distortion.

[0041] When step 207 determines that clustering has ended, the context clustering process ends and processing moves on to the prosody clustering process shown in FIG. 5.

[0042] In principle, the above context clustering is executed for every phoneme label in the speech database 100.

[0043] [4] Prosody Clustering

FIG. 5 is a flowchart of prosody clustering.

[0044] This prosody clustering is performed in turn on each cluster in the context cluster table 208 obtained by the context clustering described above.

[0045] In FIG. 5, the cluster to be subjected to prosody clustering is first selected by referring to the context cluster table 208, and the prosodic parameters (pitch, pitch slope, duration, power, and so on) of all elements belonging to that cluster are read from the speech database 100 (step 301).

[0046] Next, the prosodic parameters of all elements read in step 301 are scanned, and the maximum and minimum values within the cluster are obtained for each individual parameter (that is, for each of pitch, pitch slope, duration, power, and so on) (step 302). When step 302 is complete, clustering in the prosodic parameter space is performed with the LBG algorithm by the procedure described below.

[0047] The LBG algorithm is now described with reference to FIGS. 6 to 9. The black dots in the figures represent individual elements.

[0048] First, as shown in FIG. 6, two centroids 401 and 402 are chosen at random in a given parameter space (drawn as a two-dimensional space for convenience) and are provisionally taken as the centroids of the two clusters A and B that will result from the split. Then, for each element 403 of cluster C, the distance to each of the two centroids 401 and 402 is measured according to a predetermined distance measure, and the element 403 is assigned to cluster A or B, whichever has the nearer centroid. Doing this for all elements divides the original cluster C into cluster A and cluster B along the boundary line 404, as shown in FIG. 7 (first split).

[0049] Next, new centroids 405 and 406 are obtained for the clusters A and B produced by the split, as shown in FIG. 8 (centroid update). Then, on the basis of the new centroids 405 and 406, it is decided again whether each element 403 belongs to cluster A or B, and the boundary between clusters A and B changes from the boundary line 404 of the first split to the boundary line 407 shown in FIG. 9 (second split).

[0050] This cluster splitting and centroid updating are repeated until the centroids converge. This is the LBG algorithm.
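The two-way split described above can be sketched as follows. The sketch assumes squared Euclidean distance, picks two existing elements as the initial centroids, and stops when the centroids no longer change; the function names, the `seed` and `max_iter` parameters, and the handling of an empty side are illustrative assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def two_way_split(points, seed=0, max_iter=100):
    """Split one cluster into two by alternating assignment and update."""
    rng = random.Random(seed)
    c_a, c_b = rng.sample(points, 2)       # provisional centroids
    part_a, part_b = list(points), []
    for _ in range(max_iter):
        # Assign each element to the nearer centroid (ties go to A).
        part_a = [p for p in points if sq_dist(p, c_a) <= sq_dist(p, c_b)]
        part_b = [p for p in points if sq_dist(p, c_a) > sq_dist(p, c_b)]
        if not part_a or not part_b:       # degenerate split, give up
            break
        new_a, new_b = mean(part_a), mean(part_b)   # centroid update
        if new_a == c_a and new_b == c_b:  # centroids have converged
            break
        c_a, c_b = new_a, new_b
    return part_a, part_b
```

If the two initial elements happen to have identical coordinates, one side comes back empty; choosing elements with different coordinates avoids that case.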

[0051] The prosody clustering of this embodiment performs clustering by this LBG algorithm in the prosodic parameter space consisting of pitch, pitch slope, power, duration, and so on.

[0052] In this prosody clustering, the distance D between two elements W1 and W2 is obtained by the following equation (2).

[0053]

[Equation 2]

Here, F1, V1, A1, and T1 are the pitch slope, fundamental frequency, power, and duration of one element W1, and F2, V2, A2, and T2 are the pitch slope, fundamental frequency, power, and duration of the other element W2. The symbol ‖…‖ denotes normalization by the difference between the maximum and minimum values obtained for each parameter in step 302.

[0054] In short, equation (2) means that, for the two elements W1 and W2, the difference between parameters of the same kind is taken, each difference is normalized by the difference between the maximum and minimum values of that parameter, and the squares of the results are summed. The distance D obtained in this way (strictly speaking, a squared distance) indicates the degree of prosodic similarity between the two elements W1 and W2.
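The normalized distance described in words above can be sketched as follows. Each element is assumed to be a tuple of prosodic parameters in a fixed order; the function names are hypothetical, and skipping a parameter whose range is zero is an added assumption that the text does not address.

```python
def parameter_ranges(elements):
    """Per-parameter (max - min) over all elements of the cluster (step 302)."""
    return [max(column) - min(column) for column in zip(*elements)]


def prosodic_distance(w1, w2, ranges):
    """Sum of squared, range-normalized parameter differences."""
    return sum(
        ((x - y) / r) ** 2
        for x, y, r in zip(w1, w2, ranges)
        if r > 0  # assumption: ignore parameters that do not vary
    )
```

Because every difference is divided by its parameter's range within the cluster, parameters with different units (Hz, Hz/ms, dB, ms) contribute on a comparable scale.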

[0055] Alternatively, the distance D may be calculated with weights reflecting the importance of each parameter, as in the following equation (3).

[0056]

[Equation 3]

Here, ωf, ωv, ωa, and ωt are the weighting coefficients for the pitch slope, fundamental frequency, power, and duration, respectively.
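Equations (2) and (3) appear only as placeholders in this text. Based on the verbal description in [0054] and the symbol definitions in [0053] and [0056], they plausibly take the form below. This is a reconstruction under assumptions: in particular, applying each weight to its squared term, rather than inside the square, is a guess.

```latex
\begin{align}
D &= \lVert F_1 - F_2 \rVert^2 + \lVert V_1 - V_2 \rVert^2
   + \lVert A_1 - A_2 \rVert^2 + \lVert T_1 - T_2 \rVert^2 \tag{2} \\
D &= \omega_f \lVert F_1 - F_2 \rVert^2 + \omega_v \lVert V_1 - V_2 \rVert^2
   + \omega_a \lVert A_1 - A_2 \rVert^2 + \omega_t \lVert T_1 - T_2 \rVert^2 \tag{3} \\
\lVert X_1 - X_2 \rVert &= \frac{X_1 - X_2}{X_{\max} - X_{\min}},
   \qquad X \in \{F, V, A, T\}
\end{align}
```

Here X_max and X_min stand for the maximum and minimum of parameter X within the cluster, as obtained in step 302. With all four weights equal to 1, the assumed form of (3) reduces to (2).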

[0057] Referring again to FIG. 5, prosody clustering begins by setting two elements with different coordinates in the prosodic parameter space, chosen from the cluster to be clustered, as the initial centroids (step 303). The initial centroids can be set, for example, by choosing two elements at random, or by obtaining the centroid of the target cluster beforehand and choosing two elements near it.

[0058] The initial centroids chosen in this way are provisionally registered in the prosody cluster table 309. As illustrated, the prosody cluster table 309 holds, for each target cluster, the contexts it contains and the cluster distortion and centroid of each subdivided cluster obtained by splitting it. Here, the cluster distortion is defined as the sum of the distances (the distance D obtained by the method described above) between the centroid and all elements of the target cluster in the prosodic parameter space. For example, as shown in FIG. 10, it is the sum of the squares of the distances 501 between the centroid and all elements in each cluster.

[0059] The initial centroids are provisionally registered in the centroid column of the corresponding target cluster in the prosody cluster table 309. At this stage, however, the target cluster has not yet been divided into subdivided clusters.

[0060] After the initial centroids have been provisionally registered in this way, the cluster with the largest cluster distortion is extracted from the cluster distortion column of the target cluster in the prosody cluster table 309 (step 304). At the stage where the initial centroids have only just been provisionally registered, however, the target cluster has not yet been divided, and the prosody cluster table 309 holds the cluster distortion of the target cluster itself. The target cluster itself is therefore extracted as the cluster with the largest cluster distortion.

[0061] Next, the cluster extracted in step 304 is clustered by the LBG algorithm described earlier (step 305), which divides it into two. When this two-way division is complete, the cluster distortions and centroids of the two newly created subdivided clusters are registered in the prosody cluster table 309, in the column corresponding to the original target cluster (step 306).

[0062] The process then returns to step 304: the subdivided cluster with the largest cluster distortion is extracted, that subdivided cluster is clustered by the LBG algorithm (step 305), and the prosody cluster table 309 is rewritten (step 306).

[0063] By repeating steps 304 to 306, the original target cluster is progressively subdivided. FIGS. 11 to 13 show this process. The original target cluster is first divided into two subdivided clusters 500 and 502, as shown in FIG. 11. Next, the subdivided cluster 502, which has the larger cluster distortion, is divided into two subdivided clusters 503 and 504, as shown in FIG. 12. Then, of the three subdivided clusters in FIG. 12, the cluster 503 with the largest cluster distortion is divided into two subdivided clusters 505 and 506, as shown in FIG. 13.

[0064] In this way, clustering is repeated, each time targeting the cluster with the largest cluster distortion.
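The loop of steps 304 to 306 can be sketched as below. This is an illustration under stated assumptions rather than the embodiment itself: `split_in_two` is a simple two-centroid iterative refinement used as a stand-in for the LBG split, its seeds are chosen deterministically for reproducibility, distances are unweighted squared Euclidean, and the loop stops only on a maximum cluster count (`max_clusters`, a hypothetical parameter).

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distortion(points):
    """Sum of squared distances to the centroid (cluster distortion)."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def split_in_two(points, max_iter=50):
    """Stand-in for the LBG split: refine two centroids until stable."""
    distinct = sorted(set(points))
    c0, c1 = distinct[0], distinct[-1]  # deterministic seeds (assumption)
    groups = None
    for _ in range(max_iter):
        g0 = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        g1 = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not g0 or not g1 or (g0, g1) == groups:
            break
        groups = (g0, g1)
        c0, c1 = centroid(g0), centroid(g1)
    return groups


def greedy_split(points, max_clusters):
    """Repeatedly divide the cluster with the largest distortion."""
    clusters = [list(points)]
    while len(clusters) < max_clusters:
        # Step 304: find the cluster with the largest distortion.
        worst = max(range(len(clusters)), key=lambda k: distortion(clusters[k]))
        if distortion(clusters[worst]) == 0.0:
            break  # nothing left that can be divided
        target = clusters.pop(worst)
        # Steps 305-306: divide it in two and record the new clusters.
        clusters.extend(split_in_two(target))
    return clusters
```

Because only the worst cluster is divided in each pass, the number of clusters grows by one per iteration, which matches the progression from two to three to four clusters in FIGS. 11 to 13.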

[0065] In each iteration of this clustering loop, a termination decision for the prosodic clustering is made (step 307). Clustering is judged to be finished when either of two conditions holds: the number of generated subdivided clusters has reached a predetermined upper limit, or the following expression (4) is satisfied.

[0066] (SD - SD') / SD < ε ... (4)
Here, SD denotes the sum of the cluster distortions before clustering, SD' denotes the sum of the cluster distortions after clustering, and ε denotes the threshold for the termination decision.

[0067] Satisfying expression (4) means that the value obtained by dividing the difference between the sum of the cluster distortions before clustering and the sum after clustering by the sum before clustering (that is, the reduction rate of the cluster distortion, a kind of normalized gain) falls below the termination threshold ε. This reduction rate tends to decrease gradually as clustering progresses and the number of clusters grows. Once it falls below the predetermined threshold ε, clustering can be judged to have been carried far enough that the cluster distortion can hardly be reduced any further.

[0068] Therefore, using the termination condition of expression (4) allows clustering to be terminated at an appropriate stage, independently of the initial cluster distortion.
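The termination decision of step 307 can be written as a small predicate. The sketch below takes SD and SD' to be the total cluster distortion before and after the latest division, which is one reading of the text; the function name, the argument names, and the handling of a zero SD are assumptions.

```python
def should_stop(sd_before, sd_after, num_clusters, max_clusters, epsilon):
    """Return True when prosodic clustering should end.

    Ends when the cluster count reaches its upper limit, or when the
    normalized gain (sd_before - sd_after) / sd_before drops below epsilon.
    """
    if num_clusters >= max_clusters:
        return True
    if sd_before == 0:
        return True  # assumption: no distortion left to reduce
    return (sd_before - sd_after) / sd_before < epsilon
```

Because the gain is divided by SD, the same ε can be used for clusters whose initial distortions differ by orders of magnitude, which is the independence from the initial cluster distortion noted above.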

[0069] When termination is decided in step 307, then, as shown in FIG. 14, for each of the subdivided clusters 500 to 506 obtained by the prosodic clustering, the element (waveform data) closest to the corresponding centroid registered in the prosody cluster table 309 is extracted from the speech database 100 and registered in the synthesis speech file 160 (step 308).
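Step 308 amounts to choosing, for each subdivided cluster, the stored element nearest to its centroid. A minimal sketch follows, assuming each cluster is a list of (segment identifier, prosodic vector) pairs and that distance is unweighted squared Euclidean; the data layout and function names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def representative(cluster):
    """Return the identifier of the element closest to the cluster centroid.

    `cluster` is a list of (segment_id, prosodic_vector) pairs.
    """
    c = centroid([vec for _, vec in cluster])

    def dist_to_centroid(item):
        return sum((x - m) ** 2 for x, m in zip(item[1], c))

    return min(cluster, key=dist_to_centroid)[0]


def build_speech_file(clusters):
    """Collect one representative segment identifier per cluster."""
    return [representative(cluster) for cluster in clusters]
```

Registering an actual element rather than the centroid itself keeps every entry of the speech file a real waveform cut from natural speech.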

[0070] The prosodic clustering process described above is executed for every cluster obtained by the context clustering, which completes the synthesis speech file 160. As can be understood from the above description, the completed synthesis speech file 160 contains speech waveform data that are highly comprehensive with respect to both context and prosody.

[0071] The above description relates only to one embodiment of the present invention and, needless to say, does not mean that the present invention is limited to it. For example, although the above embodiment uses the phoneme as the segment unit, this is not essential; the present invention is also applicable to various other types of segment unit, such as CV syllables.

[0072]

[Effects of the Invention] As described above, according to the present invention, a speech file that is highly comprehensive with respect to both context and prosody can be constructed.

[Brief description of drawings]

FIG. 1 is a diagram showing the speech database used in an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing the overall process of constructing the synthesis speech file in the embodiment.

FIG. 3 is a flowchart showing the context clustering process in the embodiment.

FIG. 4 is a diagram showing the relationship between the progress of context clustering and the change in the sum of the cluster distortions.

FIG. 5 is a flowchart showing the prosodic clustering process in the embodiment.

FIG. 6 is an explanatory diagram of prosodic clustering by the LBG algorithm.

FIG. 7 is an explanatory diagram of prosodic clustering by the LBG algorithm.

FIG. 8 is an explanatory diagram of prosodic clustering by the LBG algorithm.

FIG. 9 is an explanatory diagram of prosodic clustering by the LBG algorithm.

FIG. 10 is an explanatory diagram of the cluster distortion used in prosodic clustering.

FIG. 11 is an explanatory diagram of prosodic clustering.

FIG. 12 is an explanatory diagram of prosodic clustering.

FIG. 13 is an explanatory diagram of prosodic clustering.

FIG. 14 is an explanatory diagram of prosodic clustering.

[Explanation of symbols]

100 speech database
110 initial cluster
160 synthesis speech file
208 context cluster table
309 prosody cluster table

Claims

[Claims]

1. A speech file construction system for constructing a speech file used in waveform synthesis, in which synthesized speech is obtained by combining speech waveform segments cut out from natural utterances, the system comprising: a speech database in which a large number of speech waveform segments cut out from the natural utterances are stored; context clustering means for performing context-based clustering in a predetermined feature parameter space on a set of speech waveform segments having the same phoneme in the speech database; prosodic clustering means for performing prosodic clustering in a predetermined prosodic parameter space on each cluster obtained by the context clustering; and speech segment registration means for obtaining the centroid of each cluster obtained by the prosodic clustering, extracting a speech waveform segment in the vicinity of the centroid, and registering it in the speech file.

2. The speech file construction system according to claim 1, wherein each of the context clustering means and the prosodic clustering means has end determination means for determining the end of clustering based on the gain of cluster division in the clustering.

3. A speech file construction method for constructing a speech file used in waveform synthesis, in which synthesized speech is obtained by combining speech waveform segments cut out from natural utterances, the method comprising: a step of preparing a speech database in which a large number of speech waveform segments cut out from the natural utterances are stored; a step of performing context-based clustering in a predetermined feature parameter space on a set of speech waveform segments having the same phoneme in the speech database; a step of performing prosodic clustering in a predetermined prosodic parameter space on each cluster obtained by the context clustering; and a step of obtaining the centroid of each cluster obtained by the prosodic clustering, extracting a speech waveform segment in the vicinity of the centroid, and registering it in the speech file.