JPH10254471A

JPH10254471A - Voice synthesizer

Info

Publication number: JPH10254471A
Application number: JP9061037A
Authority: JP
Inventors: Shinko Morita; 眞弘森田; Shigenobu Seto; 重宣瀬戸; Hiroyuki Tsuboi; 宏之坪井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1997-03-14
Filing date: 1997-03-14
Publication date: 1998-09-25

Abstract

PROBLEM TO BE SOLVED: To optimize the evaluation function itself that is used in the evaluation for optimizing prosody parameters. SOLUTION: Using a prescribed evaluation function, an evaluation section 251 successively compares and evaluates two kinds of prosody parameter time series: those that a parameter generating section 23 generates from various learning text information in accordance with prosody rules, while switching among the prosody parameter values applicable to those rules, and those that a parameter analysis section 24 generates from the corresponding natural voice data. Based on the evaluation values, a parameter generating rule learning section 26 determines the optimum prosody parameter value to be used in each prosody rule. In addition, the learning section 26 outputs the synthesized voices corresponding to the prosody parameter time series successively generated by the section 23 for trial listening evaluation by an operator, and corrects the evaluation function in the direction in which the trial listening result and the corresponding evaluation result obtained by the section 251 with the evaluation function no longer contradict each other.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a speech synthesizer that generates information on the phonemes and prosody of synthesized speech from input text information, determines the prosody parameter values and phoneme symbols of each phoneme of the synthesized speech based on that information, generates the speech parameters necessary for speech synthesis based on those prosody parameter values and phoneme symbols, and outputs synthesized speech based on those speech parameters. In particular, it relates to a speech synthesizer suited to optimizing prosody parameters and prosody rules.

[0002]

2. Description of the Related Art Speech synthesis from text information is generally realized by controlling two kinds of features: the phonological aspect and the prosodic aspect. Of these, the naturalness of the phonological aspect has improved dramatically with recent advances in hardware technology.

[0003] In the overall flow of speech synthesis, prosody parameters of a prosody control model are first generated from text information by prosody rules (prosody parameter generation rules), the parameters are applied to the prosody control model to generate a prosody parameter time series, and a speech waveform is then generated. Within this flow, prosody control depends most heavily on the processing from the text information up to the generation of the prosody parameters (the first half of the processing). Control of the phonological aspect, by contrast, depends most heavily on the processing from the prosody parameters to the generation of the speech waveform (the second half of the processing). For this reason, the second half, which handles larger volumes of data, used to be severely constrained by hardware.

[0004] Today, however, the storage capacity and processing speed needed to preserve and resynthesize the naturalness of phonemes are readily available, and synthesized speech with generally natural phonemic quality can be obtained. As a result, improving the quality of prosody control has become a major issue, so that failures in prosody control do not make the speech conspicuously unnatural.

[0005] Prosody control is the process of converting the sequence of linguistic and phonological attributes obtained by analyzing the input text information into the model parameters of a prosody control model, that is, prosody parameters such as pitch, phoneme duration, pause length, and power, using prosody rules (prosody parameter generation rules) prepared in advance. To improve the quality of prosody control, the prosody rules must be optimized (tuned) so that natural synthesized speech is generated.

[0006] Conventionally, quality improvement of prosody control was approached as the problem of defining a correspondence: attention was paid to characteristic sequences of the linguistic and phonological attributes obtained by text analysis, and each such sequence was mapped to prosody parameters obtained by analyzing a small amount of natural speech data. However, this correspondence was defined merely on the assumption that the attribute sequence of interest is the main factor that brings out the prosodic feature, and the fewer the analyzed examples of natural speech data on which it rests, the weaker its validity.

[0007] For this reason, approaches that use a speech corpus, a relatively large collection of natural speech and its analysis data, and treat the statistical tendencies of the corpus as rules have recently been increasing. Because this method judges which factors mainly bring out prosodic features from a statistical analysis of a large corpus, the validity of the resulting rules can be said to be relatively high. However, in order to handle a large amount of data statistically, evaluation values are calculated (that is, scored) on the basis of an objective measure defined in advance, and it is not clear how this measure relates directly to subjective listening impressions. In addition, the performance of rules created by this method depends on how much data the corpus contains, yet building a corpus that is large in scale, rich in variation, and uniform in quality requires so much effort that it is difficult to achieve.

[0008] Motivated by these problems, projects that build large-scale corpora are being run and their results are provided for research use. However, the topics and writing styles they cover consist of relatively "well-formed" sentences, and when the styles of the many kinds of text whose circulation over computer networks keeps growing are also taken into account, such corpora can be said to be somewhat lacking in balance.

[0009] From the above, the problem of improving the quality of prosody control needs to be solved with the following in mind: collecting a large number of sentence examples of the same kind and considering them simultaneously; having the flexibility to take in various writing styles and new sentence examples; deciding how to reflect the effect on subjective listening impressions; and judging from a statistical viewpoint, so that the choice of the main factors that bring out prosodic features is not arbitrary.

[0010]

PROBLEMS TO BE SOLVED BY THE INVENTION However, because the prior art uses prosody parameter values determined uniquely by analyzing a small amount of natural speech data, it often determines inappropriate parameters for sentence examples that were not anticipated. Moreover, because the values are uniquely determined, there is no way of knowing whether they are optimal for every sentence.

[0011] There are also examples of prior art that optimize prosody parameters automatically. In this type of prior art, however, the evaluation compares parameters with parameters: the prosody parameters (model parameters, consisting of accent command values, phrase command values, and the like) estimated from natural speech data are compared with the prosody parameters that the speech synthesizer determines for synthesis from the text information corresponding to that natural speech data. Consequently, the temporal variation of the prosody parameters is not taken into account. Furthermore, for some parameters, such as fundamental frequency control parameters, converting natural speech data into model parameters is not easy, so collecting a large amount of data is not easy either.

[0012] There are also a few examples of prior art whose evaluation compares time series with time series: a prosody parameter time series, which can be analyzed relatively easily from natural speech data, is compared with the prosody parameter time series determined in the synthesizer. Taking optimization of the fundamental frequency parameters as an example, however, such prior art simply uses the sum of squared errors as the evaluation function, and is therefore easily affected by missing data in the unvoiced sections of natural speech data, by local fluctuations caused by consonants and the like, and by errors in portions that have little perceptual influence.
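The sensitivity of a plain squared-error sum to unvoiced sections can be seen in a small sketch. The contours, the convention that 0.0 marks an unvoiced frame, and the voiced-only variant are all assumptions made for illustration; the voiced-only function is not the evaluation function of this invention.

```python
def sse(natural, synthetic):
    """Plain sum of squared errors over every frame."""
    return sum((n - s) ** 2 for n, s in zip(natural, synthetic))


def sse_voiced_only(natural, synthetic):
    """Sum of squared errors restricted to frames where the natural
    contour is voiced (assumed here to mean a non-zero F0 value)."""
    return sum((n - s) ** 2 for n, s in zip(natural, synthetic) if n > 0)


# Hypothetical F0 contours in Hz; 0.0 marks unvoiced frames in the natural data.
natural = [120.0, 125.0, 0.0, 0.0, 130.0]
synthetic = [121.0, 124.0, 126.0, 128.0, 131.0]

plain = sse(natural, synthetic)               # dominated by the two unvoiced frames
voiced = sse_voiced_only(natural, synthetic)  # reflects only the voiced frames
```

In this toy data the synthetic contour is within 1 Hz of the natural one wherever F0 was measured, yet the plain sum is several orders of magnitude larger because of the two frames with missing data.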

[0013] The prior art also has the problem that inappropriate prosody rules are not easy to find and improve. A further problem is that prosody rules for which natural speech data is insufficient cannot be optimized.

[0014] The present invention has been made in view of the above circumstances, and its object is to provide a speech synthesizer that can optimize the evaluation function itself that is used in the evaluation for optimizing prosody parameters.

[0015] Another object of the present invention is to provide a speech synthesizer that can extract and improve inappropriate prosody rules. Still another object of the present invention is to provide a speech synthesizer that performs clustering using elements of prosody parameter determining factors not considered in a prosody rule, and that splits the prosody rule when splitting it yields a better evaluation value.

[0016] Still another object of the present invention is to provide a speech synthesizer that can optimize a prosody rule for which natural speech data is insufficient by making use of trial listening evaluation by an operator.

[0017]

MEANS FOR SOLVING THE PROBLEMS A speech synthesizer according to a first aspect of the present invention comprises: text analysis means for sequentially analyzing various pieces of learning text information and generating information representing the phonemes and prosody of synthesized speech; prosody rule storage means in which various prosody rules for generating prosody parameters are registered in advance; parameter time series generation means for sequentially generating first prosody parameter time series for speech synthesis, based on the information generated by the text analysis means and the corresponding prosody rule registered in the prosody rule storage means, while sequentially selecting the prosody parameter value to be applied in that prosody rule from among a plurality of predetermined candidates; parameter analysis means for analyzing natural speech data corresponding to the learning text information and generating a second prosody parameter time series; evaluation means for comparing and evaluating, using a predetermined evaluation function, the second prosody parameter time series generated by the parameter analysis means against the first prosody parameter time series generated in turn by the parameter time series generation means while the prosody parameter value is switched; prosody parameter value learning means for optimizing the prosody parameter value applied in the corresponding prosody rule based on the evaluation results of the evaluation means; trial listening evaluation request means for causing the synthesized speech corresponding to the first prosody parameter time series generated by the parameter time series generation means to be output for trial listening evaluation by an operator; operator input means through which the operator's trial listening evaluation result is input; and evaluation function learning means for correcting the evaluation function in the direction in which the operator's trial listening evaluation result input from the operator input means and the corresponding evaluation result of the evaluation means become consistent with each other.

[0018] Here, the prosody parameter value can be optimized by computing, for each candidate prosody parameter value, for example the average (or simply the sum) of the evaluation function values, and selecting the prosody parameter value for which that figure is best.
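The candidate selection described in [0018] can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the score table, and the `higher_is_better` flag are hypothetical, and the text does not state whether a lower or a higher evaluation value counts as "best", so that choice is left as a parameter.

```python
from statistics import mean


def select_best_candidate(scores_by_candidate, higher_is_better=False):
    """Average the evaluation values of each candidate parameter value
    and return the candidate whose average is best.

    scores_by_candidate maps a candidate value to the list of evaluation
    function values it obtained over the learning sentences.
    """
    averages = {cand: mean(vals) for cand, vals in scores_by_candidate.items()}
    pick = max if higher_is_better else min
    best = pick(averages, key=averages.get)
    return best, averages


# Hypothetical evaluation values (treated here as errors, so lower is better).
scores = {
    0.3: [4.0, 6.0, 5.0],
    0.5: [2.0, 3.0, 1.0],
    0.7: [3.0, 5.0, 4.0],
}
best_value, averages = select_best_candidate(scores)
```

Using the sum instead of the average, as the text allows, gives the same ranking whenever every candidate is evaluated on the same number of sentences.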

[0019] In this configuration, prosody parameters are evaluated by comparing the prosody parameter time series obtained from natural speech data with the prosody parameter time series obtained for speech synthesis from the text information corresponding to that natural speech data. In addition, the correlation between the result of this evaluation (objective evaluation) and the result of the operator's trial listening evaluation (subjective evaluation) is taken into account. As a result, an evaluation function that is unlikely to produce contradictions between the two results is obtained, and the evaluation function itself can be optimized. Here, the evaluation function can be optimized as follows: a weight (with an initial value of, for example, 1) is attached to a reference evaluation function; the objective evaluation results are divided (by a predetermined threshold) into a group whose corresponding subjective evaluation results are good and a group whose corresponding subjective evaluation results are bad; a weight is selected such that the statistical difference between the distributions of the objective scores of the two groups exceeds a predetermined threshold; and the weighted evaluation function having the selected weight is adopted.
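One way to picture the weight selection in [0019] is the sketch below. Several details are assumptions made only for illustration: the objective score is modeled as a two-component error (`major + weight * minor`), the "statistical difference" is taken as the difference of group means divided by the average population standard deviation, and the weight with the largest separation above the threshold is kept. The text itself only requires that the difference exceed a predetermined threshold.

```python
from statistics import mean, pstdev


def weighted_score(components, weight):
    """Hypothetical weighted evaluation function: two error components,
    the second one scaled by the weight being tuned."""
    major, minor = components
    return major + weight * minor


def separation(good_scores, bad_scores):
    """Assumed separation statistic: how far the mean error of the 'bad'
    group lies above that of the 'good' group, in units of average spread."""
    spread = (pstdev(good_scores) + pstdev(bad_scores)) / 2 or 1e-9
    return (mean(bad_scores) - mean(good_scores)) / spread


def choose_weight(items, candidate_weights, subjective_threshold,
                  separation_threshold, initial_weight=1.0):
    """items: list of (components, subjective_rating) pairs."""
    best_weight, best_sep = initial_weight, None
    for w in candidate_weights:
        good = [weighted_score(c, w) for c, r in items if r >= subjective_threshold]
        bad = [weighted_score(c, w) for c, r in items if r < subjective_threshold]
        if len(good) < 2 or len(bad) < 2:
            continue  # too few samples to compare the two distributions
        sep = separation(good, bad)
        if sep > separation_threshold and (best_sep is None or sep > best_sep):
            best_weight, best_sep = w, sep
    return best_weight, best_sep


# Hypothetical data: (error components, operator rating on a 1-5 scale).
items = [((1.0, 9.0), 5), ((2.0, 8.0), 4), ((6.0, 1.0), 2), ((7.0, 2.0), 1)]
best_weight, best_sep = choose_weight(items, [1.0, 0.5, 0.1],
                                      subjective_threshold=3,
                                      separation_threshold=2.0)
```

With the initial weight of 1 the well-rated utterances actually receive the larger error in this toy data, which is the kind of contradiction between objective and subjective results that the weight adjustment is meant to remove.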

[0020] A speech synthesizer according to a second aspect of the present invention differs from the speech synthesizer according to the first aspect in two respects. In place of the trial listening evaluation request means of the first aspect, it has trial listening evaluation request means that statistically processes the evaluation results of the evaluation means to detect an inappropriate prosody rule, modifies that rule, causes the parameter time series generation means to generate the first prosody parameter time series for speech synthesis with the modified rule applied, and causes the corresponding synthesized speech to be output for trial listening evaluation by the operator. In place of the evaluation function learning means of the first aspect, it has prosody rule learning means that modifies and optimizes the inappropriate prosody rule based on the operator's trial listening evaluation result input from the operator input means.

[0021] Here, an inappropriate prosody rule can be detected by computing the variance of the optimum parameter values obtained for the individual pieces of text information (sentence examples) to which the rule was applied, and determining whether that variance is greater than or equal to a predetermined threshold.
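The variance test in [0021] can be sketched as below. The rule identifiers, the sample values, and the threshold are hypothetical, and the choice of population variance (rather than sample variance) is an assumption, since the text does not say which is meant.

```python
from statistics import pvariance


def find_inappropriate_rules(optimal_values_by_rule, variance_threshold):
    """Flag rules whose per-sentence optimum parameter values vary too much.

    optimal_values_by_rule maps a rule id to the optimum parameter value
    found for each sentence example the rule was applied to.
    Returns {rule_id: variance} for rules at or above the threshold.
    """
    flagged = {}
    for rule_id, values in optimal_values_by_rule.items():
        if len(values) < 2:
            continue  # a single sentence gives no information about spread
        variance = pvariance(values)
        if variance >= variance_threshold:
            flagged[rule_id] = variance
    return flagged


# Hypothetical optimum values per sentence for two rules.
optimal = {
    "rule_A": [0.50, 0.52, 0.48, 0.50],  # sentences agree: rule looks adequate
    "rule_B": [0.20, 0.80, 0.30, 0.90],  # sentences disagree: rule is suspect
}
flagged = find_inappropriate_rules(optimal, variance_threshold=0.05)
```

A large variance means that no single parameter value suits all the sentences covered by the rule, which is why such a rule becomes a candidate for modification or splitting.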

[0022] With this configuration, inappropriate prosody rules can be extracted and improved. A speech synthesizer according to a third aspect of the present invention adds, to the speech synthesizer according to the second aspect, evaluation result storage means for storing, for each corresponding prosody rule, the evaluation results of the evaluation means and the operator's trial listening evaluation results input from the operator input means, paired with the elements of the prosody parameter determining factors considered in that rule and the elements of the other prosody parameter determining factors that are determined by the relevant text information but not considered in that rule. In place of the prosody rule learning means of the second aspect, it has prosody rule learning means that statistically processes the evaluation results of the evaluation means stored in the evaluation result storage means to detect an inappropriate prosody rule, performs clustering using the elements of the prosody parameter determining factors not considered in that rule, and splits the rule based on the clustering result.

[0023] Here, the prosody rule learning means is preferably composed of clustering means and prosody rule splitting means. The clustering means statistically processes the evaluation results of the evaluation means stored in the evaluation result storage means to detect an inappropriate prosody rule, and performs clustering using the elements of the prosody parameter determining factors not considered in that rule. When the clustering result of the clustering means is divided into a plurality of distributions, the prosody rule splitting means selects, for each distribution, the text information closest to the centroid of that distribution, and splits the corresponding prosody rule to generate a plurality of new prosody rules. The first prosody parameter time series corresponding to each piece of text information selected by the prosody rule splitting means is then generated by the parameter time series generation means in accordance with each prosody rule generated by the prosody rule splitting means, and the corresponding synthesized speech is output for trial listening evaluation by the operator. The prosody rule splitting means decides whether to adopt the generated prosody rules based on the operator's trial listening evaluation results for each piece of text information that it selected.
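The step in [0023] that picks, for each distribution, the text closest to its centroid can be sketched as follows. The clustering itself is assumed to have been done already (the text does not name an algorithm), and the sentence ids and two-dimensional feature vectors standing in for the unconsidered determining factors are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def representative_texts(clusters):
    """For each cluster, return the id of the text nearest to the centroid.

    clusters maps a cluster id to a list of (text_id, feature_vector) pairs.
    """
    representatives = {}
    for cluster_id, members in clusters.items():
        center = centroid([vec for _, vec in members])
        nearest = min(members, key=lambda m: math.dist(m[1], center))
        representatives[cluster_id] = nearest[0]
    return representatives


# Hypothetical clustering result with two separate distributions.
clusters = {
    0: [("s1", (1.0, 1.0)), ("s2", (2.0, 1.0)), ("s3", (6.0, 1.0))],
    1: [("s4", (10.0, 5.0)), ("s5", (12.0, 5.0)), ("s6", (11.0, 5.5))],
}
reps = representative_texts(clusters)
```

The selected sentences are the ones that would then be synthesized under the newly split rules and presented to the operator, so only one trial listening per new rule is needed to decide whether to adopt the split.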

[0024] With this configuration, clustering is performed using the elements of prosody parameter determining factors that are not considered in a prosody rule (factors that it might be better to consider), and the prosody rule is split when splitting it raises the evaluation value, so that the prosody rules can be optimized.

[0025] Further, the invention is characterized in that the trial listening evaluation request means of any of the speech synthesizers according to the first to fourth aspects is given a function of causing the synthesized speech corresponding to the learning text information to be output for trial listening evaluation by the operator when the number of natural speech data items corresponding to that learning text information is less than or equal to a predetermined threshold, and in that the prosody parameter value learning means is given a function of optimizing the prosody parameter value applied in the corresponding prosody rule based on the operator's trial listening evaluation results and the evaluation results of the evaluation means.

[0026] With this configuration, a prosody rule for which natural speech data is insufficient can be optimized by making use of trial listening evaluation by the operator.

[0027]

DESCRIPTION OF THE PREFERRED EMBODIMENTS Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to one embodiment of the present invention.

[0028] The speech synthesizer shown in FIG. 1 comprises a text input unit 1, a parameter generation/evaluation unit 2, a synthesizer 3, a speech synthesis unit dictionary 4, and a speech output unit 5, and outputs synthesized speech from arbitrary text information. The speech synthesizer outputs synthesized speech from text information as follows.

[0029] First, the text input unit 1 handles the input of arbitrary text information (hereinafter simply called text) to be synthesized. The text input unit 1 is realized using, for example, a keyboard with which an operator (user) can enter text (for example, sentences written in mixed kanji and kana), an external storage device such as a hard disk device or CD-ROM device that can store text and read the stored text back, or a communication interface that can receive text transferred over a network or the like.

[0030] When arbitrary text is input through the text input unit 1, the input text is passed to a text analysis unit 231 in a parameter generation unit 23 provided in the parameter generation/evaluation unit 2. The text analysis unit 231 performs morphological analysis, syntactic analysis, and the like on the input text to generate information on the phonemes and prosody of the synthesized speech. This information is passed to a parameter time series generation unit 232.

[0031] The parameter time series generation unit 232 has a prosody parameter generation function and a phoneme symbol generation function. Based on the phoneme and prosody information supplied by the text analysis unit 231 and the prosody rules (prosody parameter generation rules) stored in a prosody rule storage unit 233, it generates a time series of prosody parameters such as the fundamental frequency (pitch frequency), duration, power, and pauses of each phoneme, and also generates a phoneme symbol string. The prosody parameter time series and the phoneme symbol string are passed to the synthesizer 3.

[0032] The synthesizer 3 has a speech parameter generation function and a speech synthesis function. In accordance with the phoneme symbol string supplied by the parameter time series generation unit 232, it selects the speech synthesis units needed for synthesis from the speech synthesis unit dictionary 4, and connects the selected units in accordance with the prosody parameter time series supplied by the parameter time series generation unit 232 to generate a time series of speech parameters. The speech synthesis units stored (registered) in the speech synthesis unit dictionary 4 are speech segments created, for example, by analyzing speech uttered by an announcer or the like to obtain predetermined speech feature parameters, and then cutting out from those feature parameters every syllable contained in Japanese speech, in a predetermined synthesis unit such as the Japanese syllable. The speech synthesis unit dictionary 4 is realized using a hard disk device, a ROM, or the like.

[0033] The synthesizer 3 synthesizes speech based on the generated time series of speech parameters. The speech synthesized by the synthesizer 3 is passed to the speech output unit 5, subjected to D/A (digital-to-analog) conversion, and then output as synthesized speech from a speaker or the like.

[0034] The foregoing has described the well-known operation by which the speech synthesizer of FIG. 1 generates synthesized speech from input text. Next, the prosody rule evaluation and learning function, which is the distinguishing feature of the speech synthesizer of FIG. 1, will be described.

[0035] First, the parameter generation/evaluation unit 2 includes a text data storage unit 21 and a voice data storage unit 22. The text data storage unit 21 stores the text data used for learning, and the voice data storage unit 22 stores voice data (natural speech data), annotated with phoneme label information, that corresponds to the text data stored in the text data storage unit 21.

[0036] In this embodiment, the operator can select between a mode in which synthesized speech is generated from input text as described above (normal mode) and a learning mode in which the prosody rules stored in the prosody rule storage unit 233 are learned. The normal mode is set automatically when the operator has given no particular instruction (for example, at system start-up). Switching from the normal mode to the learning mode, and from the learning mode back to the normal mode, requires a selection from an operator input unit 28. Within the learning mode, two further modes can be selected: a rule specialization automatic mode, in which prosody rules are specialized automatically, and a rule specialization operator intervention mode, in which trial listening evaluation by the operator is requested and prosody rules are specialized taking the evaluation results into account. These two modes are described later.

【００３７】学習モードでは、パラメータ生成部２３内
のテキスト解析部２３１は、パラメータ生成規則学習部
２６の制御のもとで、テキストデータ記憶部２１に格納
されている学習用テキストデータを読み込んで周知の形
態素解析、構文解析等を行ない、テキストデータ（文）
中の各アクセント単位に対して、読み情報、文法属性
（品詞等）、アクセント型、モーラ数などの音韻、韻律
に関する情報を付与する。テキスト解析部２３１は、学
習用テキストデータについての音韻、韻律に関する情報
を取得すると、その情報をパラメータ時系列生成部２３
２に与える。In the learning mode, the text analysis unit 231 in the parameter generation unit 23, under the control of the parameter generation rule learning unit 26, reads the learning text data stored in the text data storage unit 21 and performs well-known morphological analysis, syntactic analysis, and the like. It then assigns to each accent unit in the text data (sentence) information on phonemes and prosody, such as reading, grammatical attributes (part of speech, etc.), accent type, and mora count. Having obtained this information for the learning text data, the text analysis unit 231 passes it to the parameter time series generation unit 232.

【００３８】パラメータ時系列生成部２３２は、テキス
ト解析部２３１から音韻、韻律に関する情報が与えられ
ると、ハードディスク装置等の外部記憶装置を用いて実
現される韻律規則記憶部２３３をアクセスする。この韻
律規則記憶部２３３には、基本周波数、音韻継続時間
長、ポーズ長、パワーなどの韻律に関する韻律規則（韻
律パラメータ生成規則）の集合が格納されているパラメータ時系列生成部２３２は、テキスト解析部２３
１から与えられた音韻、韻律に関する情報をもとに、韻
律規則記憶部２３３の中から対応する韻律規則を選択
し、その韻律規則を当該情報に適用して、基本周波数、
音韻継続時間長、ポーズ長、パワーといった韻律パラメ
ータの時系列を生成する。ここで、基本周波数に関して
は、基本周波数パターンを話調成分とアクセント成分の
和であると仮定する重畳モデルなどの韻律制御モデル、
例えば藤崎モデル（電子情報通信学会論文誌 A Vol.J7
2-A No.1 pp32-40 1989-01）で記述され、韻律規則によ
ってフレーズ指令値（例えばフレーズ指令の大きさ、フ
レーズ指令の時点）、アクセント指令値（アクセント指
令の大きさ、アクセント指令の始点並びに終点）など、
当該モデルにおけるモデルパラメータ（韻律パラメー
タ）が決定され、しかる後に韻律規則によって決定され
た韻律パラメータをパラメータ時系列生成部２３２にお
いて当該モデルに適用することで、韻律パラメータ時系
列が生成される。When the parameter time series generation unit 232 receives the information on phonemes and prosody from the text analysis unit 231, it accesses the prosody rule storage unit 233, which is implemented using an external storage device such as a hard disk device. The prosody rule storage unit 233 stores a set of prosody rules (prosody parameter generation rules) concerning fundamental frequency, phoneme duration, pause length, power, and so on. Based on the information given from the text analysis unit 231, the parameter time series generation unit 232 selects the corresponding prosody rules from the prosody rule storage unit 233 and applies them to that information to generate time series of prosody parameters such as fundamental frequency, phoneme duration, pause length, and power. The fundamental frequency is described by a prosody control model such as a superposition model, which assumes that the fundamental frequency pattern is the sum of a phrase component and an accent component, for example the Fujisaki model (Transactions of the IEICE A, Vol. J72-A, No. 1, pp. 32-40, 1989-01). The prosody rules determine the model parameters (prosody parameters) of that model, such as the phrase command values (for example, the magnitude and timing of a phrase command) and the accent command values (the magnitude, start point, and end point of an accent command). The parameter time series generation unit 232 then applies the prosody parameters so determined to the model to generate the prosody parameter time series.
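
As an illustration only, the superposition of a phrase component and an accent component can be sketched numerically. The code below follows the commonly cited form of the Fujisaki model, in which the logarithm of the fundamental frequency is a baseline plus the two components. The function names, the constants `alpha`, `beta`, and `gamma`, and the command values in the usage note are assumptions made for this sketch and do not come from this specification.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def log_f0(t, base_hz, phrase_cmds, accent_cmds):
    """ln F0(t) = ln(base) + phrase component + accent component.

    phrase_cmds: list of (magnitude, onset_time)        -- assumed layout
    accent_cmds: list of (magnitude, start_time, end_time) -- assumed layout
    """
    phrase = sum(ap * phrase_response(t - t0) for ap, t0 in phrase_cmds)
    accent = sum(aa * (accent_response(t - t1) - accent_response(t - t2))
                 for aa, t1, t2 in accent_cmds)
    return math.log(base_hz) + phrase + accent

def f0_series(times, base_hz, phrase_cmds, accent_cmds):
    """Fundamental frequency (Hz) sampled at the given times (seconds)."""
    return [math.exp(log_f0(t, base_hz, phrase_cmds, accent_cmds)) for t in times]
```

For example, `f0_series([0.0, 0.3], 100.0, [(0.5, 0.1)], [(0.4, 0.2, 0.5)])` yields the baseline before any command takes effect and a raised value once the phrase and accent commands are active. Switching the accent magnitude (0.4 here) among candidates corresponds to switching a prosody parameter value of a rule.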

【００３９】パラメータ生成・評価部２はまた、パラメ
ータ分析部２４、パラメータ生成規則評価部２５、及び
パラメータ生成規則学習部２６を含んでいる。パラメー
タ分析部２４は、学習モードにおいてパラメータ生成規
則学習部２６により起動され、テキスト解析部２３１に
より解析されてパラメータ時系列生成部２３２により韻
律パラメータ時系列に変換された学習用テキストデータ
に対応する音韻ラベル情報付きの音声データを読み込ん
で分析することで、対応する分析パラメータ時系列（韻
律パラメータ時系列）を生成する。The parameter generation / evaluation unit 2 also includes a parameter analysis unit 24, a parameter generation rule evaluation unit 25, and a parameter generation rule learning unit 26. The parameter analysis unit 24 is activated by the parameter generation rule learning unit 26 in the learning mode. It reads the voice data with phoneme label information that corresponds to the learning text data analyzed by the text analysis unit 231 and converted into a prosody parameter time series by the parameter time series generation unit 232, and analyzes that voice data to generate the corresponding analysis parameter time series (prosody parameter time series).

【００４０】学習モードにおいて、パラメータ時系列生
成部２３２により生成された韻律パラメータ時系列は、
パラメータ生成規則評価部２５内の評価部２５１に与え
られる。この評価部２５１には、パラメータ分析部２４
での分析により生成された（上記パラメータ時系列生成
部２３２からの韻律パラメータ時系列に対応する）韻律
パラメータ時系列も与えられる。In the learning mode, the prosody parameter time series generated by the parameter time series generation unit 232 is given to the evaluation unit 251 in the parameter generation rule evaluation unit 25. The evaluation unit 251 is also given the prosody parameter time series generated by the analysis in the parameter analysis unit 24 (the one corresponding to the prosody parameter time series from the parameter time series generation unit 232).

【００４１】評価部２５１は、パラメータ時系列生成部
２３２からの韻律パラメータ時系列とパラメータ分析部
２４からの韻律パラメータ時系列（分析パラメータ時系
列）とを、着目する韻律規則の韻律パラメータの影響が
及ぶ範囲内で比較して、評価スコア（評価値）を算出す
る。ここで、着目する韻律規則の韻律パラメータの影響
が及ぶ範囲とは、その韻律パラメータを何通りか変化さ
せたとき、その結果生成される合成音声に人間が聞き分
けられる大きさ以上の変化が生じるような時間範囲のこ
とである。例えば基本周波数の場合では、数Ｈｚ以上、
あるいは数％以上の変化が生じる範囲などと定義するこ
とができる。また、評価スコアは、（継続時間長、パワ
ー等の）韻律パラメータ種別ごとに用意された評価関数
によって与えられる。評価部２５１で算出された評価ス
コアは、パラメータ生成規則評価部２５内の評価結果記
憶部２５２に、適用韻律規則、対象テキストデータ、適
用韻律パラメータ値に対応させて格納される。The evaluation unit 251 compares the prosody parameter time series from the parameter time series generation unit 232 with the prosody parameter time series (analysis parameter time series) from the parameter analysis unit 24 within the range affected by the prosody parameter of the prosody rule of interest, and calculates an evaluation score (evaluation value). Here, the range affected by the prosody parameter of the rule of interest is the time range in which, when that prosody parameter is varied over several values, the resulting synthesized speech changes by at least an amount that a human listener can distinguish. For the fundamental frequency, for example, it can be defined as the range in which a change of several Hz or more, or of several percent or more, occurs. The evaluation score is given by an evaluation function prepared for each prosody parameter type (duration, power, etc.). The evaluation score calculated by the evaluation unit 251 is stored in the evaluation result storage unit 252 in the parameter generation rule evaluation unit 25, in association with the applied prosody rule, the target text data, and the applied prosody parameter value.

【００４２】以上の動作は、着目する韻律規則ごとに、
パラメータ時系列生成部２３２で適用する韻律パラメー
タ値を予め用意されている複数の候補の中からパラメー
タ生成規則学習部２６の制御により順次切り替えなが
ら、繰り返し行なわれる。The above operation is repeated for each prosody rule of interest while the prosody parameter value applied by the parameter time series generation unit 232 is switched in turn, under the control of the parameter generation rule learning unit 26, among a plurality of candidates prepared in advance.

【００４３】パラメータ生成規則評価部２５には、統計
的規則評価部２５３が設けられている。統計的規則評価
部２５３は、評価結果記憶部２５２に格納されている評
価スコアを各韻律規則ごとに統計的に分析する。The parameter generation rule evaluation unit 25 is provided with a statistical rule evaluation unit 253. The statistical rule evaluation unit 253 statistically analyzes, for each prosody rule, the evaluation scores stored in the evaluation result storage unit 252.

【００４４】パラメータ生成規則学習部２６は、統計的
規則評価部２５３の評価結果等に従って、韻律規則記憶
部２３３に格納されている対応する韻律規則（韻律パラ
メータ生成規則）、或いは当該規則で適用する韻律パラ
メータ値を修正する。パラメータ生成規則学習部２６は
また、不適切な韻律規則（これの定義については後述す
る）を検出した場合には、ＣＲＴディスプレイ、液晶デ
ィスプレイ等を用いて構成される情報提示部２７に当該
規則を表示することで、オペレータに当該規則を提示
し、加えてオペレータに試聴評価を促す。The parameter generation rule learning unit 26 modifies the corresponding prosody rule (prosody parameter generation rule) stored in the prosody rule storage unit 233, or the prosody parameter value applied by that rule, according to the evaluation result of the statistical rule evaluation unit 253 and the like. When the parameter generation rule learning unit 26 detects an inappropriate prosody rule (defined later), it also displays that rule on the information presentation unit 27, which is configured using a CRT display, a liquid crystal display, or the like, thereby presenting the rule to the operator and prompting the operator to perform a listening evaluation.

【００４５】この際、パラメータ生成規則学習部２６
は、不適切な（問題のある）韻律規則が適用されるテキ
ストデータをテキスト解析部２３１により順次解析させ
て、パラメータ時系列生成部２３２により対応する韻律
パラメータ時系列を生成させる制御動作を、当該韻律規
則で適用する韻律パラメータ値を複数の候補の中から選
択的に切り替えながら繰り返す。ここで、他の韻律規則
で適用する韻律パラメータ値には、即ち試聴する範囲の
音に影響する、着目していない韻律規則の韻律パラメー
タには、その時点で最適とされている韻律パラメータ値
が用いられる。At this time, the parameter generation rule learning unit 26 repeats the following control operation while selectively switching the prosody parameter value applied by the inappropriate (problematic) prosody rule among a plurality of candidates: it has the text analysis unit 231 analyze, one after another, the text data to which that rule applies, and has the parameter time series generation unit 232 generate the corresponding prosody parameter time series. For the prosody parameter values applied by the other prosody rules, that is, the prosody parameters of rules not of interest that affect the sound in the range being auditioned, the values regarded as optimal at that point are used.

【００４６】なお、試聴評価用の合成音の生成に、自然
音声データを用いるようにしても構わない。即ち自然音
声データの韻律を、着目する韻律パラメータの影響の及
ぶ範囲に関して、パラメータ生成部２３で生成された韻
律パラメータ時系列に従って変更し、それによって生成
される合成音をオペレータに試聴させるようにしても構
わない。Natural voice data may also be used to generate the synthesized sound for listening evaluation. That is, the prosody of the natural voice data may be changed, within the range affected by the prosody parameter of interest, according to the prosody parameter time series generated by the parameter generation unit 23, and the resulting synthesized sound may be presented to the operator for audition.

【００４７】例えば、着目する韻律パラメータが、ある
韻律規則によって決定されるアクセント指令値の大きさ
の場合、まず自然音声データを分析して基本周波数の時
系列を抽出する一方で、パラメータ生成部２３では該当
アクセント指令を含むアクセント句（１アクセントを含
む句）のアクセント成分時系列を生成する。このとき、
アクセント指令の始点及び終点は自然音声データに付与
された音韻ラベルを参照することによって、自然音声デ
ータに対応したものにする。次に、自然音声データから
分析された基本周波数の時系列において、上記パラメー
タ生成部２３で生成されたアクセント成分時系列中のア
クセント成分０の時点に対応する基本周波数を直線で近
似したものを話調成分と見なし、上記アクセント句に対
応する部分に関して話調成分以外を取り除いた後、代わ
りにパラメータ生成部２３で生成されたアクセント成分
時系列を加算する。なおこの際、該当アクセント指令値
を切り替えて同様の基本周波数の時系列を生成したと
き、いずれかのパラメータ値のときに元の自然音声デー
タの基本周波数の時系列との誤差がある閾値以下である
こと、即ち仮定した話調成分が不適切でないことを確認
した方が良い。こうして生成された基本周波数の時系列
に合わせて、音声データの波形を変える（TD-PSOLA法
Speech Commun. 9,453-467,1990 ）ことによって生成さ
れる合成音を、オペレータに試聴させる。For example, when the prosody parameter of interest is the magnitude of the accent command value determined by a certain prosody rule, the natural voice data is first analyzed to extract the fundamental frequency time series, while the parameter generation unit 23 generates the accent component time series of the accent phrase (a phrase containing one accent) that contains the accent command in question. At this time, the start and end points of the accent command are made to correspond to the natural voice data by referring to the phoneme labels attached to it. Next, in the fundamental frequency time series analyzed from the natural voice data, the fundamental frequency at the time points where the accent component in the generated accent component time series is 0 is approximated by a straight line, and this line is regarded as the phrase component. For the portion corresponding to the accent phrase, everything other than the phrase component is removed, and the accent component time series generated by the parameter generation unit 23 is added in its place. It is advisable to confirm that, when the accent command value is switched and fundamental frequency time series are generated in the same way, the error from the fundamental frequency time series of the original natural voice data falls to or below a certain threshold for at least one of the parameter values, that is, that the assumed phrase component is not inappropriate. The waveform of the voice data is then modified to match the fundamental frequency time series generated in this way (TD-PSOLA method, Speech Commun. 9, 453-467, 1990), and the operator auditions the resulting synthesized sound.

【００４８】パラメータ時系列生成部２３２では、韻律
パラメータ時系列と共に音韻記号列が生成される。韻律
パラメータ時系列はパラメータ生成規則評価部２５（内
の評価部２５１）に与えられる他、通常モードと同様に
対応する音韻記号列と共に合成器３にも与えられる。す
ると合成器３では、韻律パラメータ時系列及び音韻記号
列に基づいて音声合成に必要な音声パラメータの時系列
が生成されて音声が合成され、音声出力部５により合成
音声が出力される。これによりオペレータは、問題のあ
った韻律規則を適用して、韻律パラメータ値の候補（着
目する韻律パラメータ値）を順次切り替えながら韻律制
御を行なった場合の、各合成音声を逐次試聴して評価す
ることができる。The parameter time series generation unit 232 generates a phoneme symbol string together with the prosody parameter time series. The prosody parameter time series is given to the parameter generation rule evaluation unit 25 (the evaluation unit 251 within it) and, as in the normal mode, is also given to the synthesizer 3 together with the corresponding phoneme symbol string. The synthesizer 3 then generates the time series of speech parameters needed for speech synthesis from the prosody parameter time series and the phoneme symbol string and synthesizes speech, and the voice output unit 5 outputs the synthesized speech. The operator can thus listen to and evaluate, one after another, each synthesized speech produced when prosody control is performed by applying the problematic prosody rule while switching among the prosody parameter value candidates (the prosody parameter values of interest).

【００４９】オペレータによる試聴評価の結果（主観ス
コア）はオペレータ入力部２８を介して入力され、評価
結果記憶部２５２に格納される。また、試聴評価の結果
はパラメータ生成規則学習部２６に渡され、当該学習部
２６での韻律規則の修正に用いられる。The result (subjective score) of the trial listening evaluation by the operator is input via the operator input unit 28 and stored in the evaluation result storage unit 252. Further, the result of the trial listening evaluation is passed to the parameter generation rule learning unit 26, and is used for correcting the prosody rule in the learning unit 26.

【００５０】図２は、上記評価部２５１の構成を示すブ
ロック図である。評価部２５１は、図２に示すように、
分析パラメータ評価用データ生成部２５１ａ、韻律パラ
メータ評価用データ生成部２５１ｂ、及び比較部２５１
ｃから構成される。FIG. 2 is a block diagram showing the configuration of the evaluation unit 251. As shown in FIG. 2, the evaluation unit 251 consists of an analysis parameter evaluation data generation unit 251a, a prosody parameter evaluation data generation unit 251b, and a comparison unit 251c.

【００５１】分析パラメータ評価用データ生成部２５１
ａには、パラメータ分析部２４での音声データ分析結果
である分析パラメータ時系列（韻律パラメータ時系列）
が、対応する音声データに付されている音韻ラベル情報
と共に、パラメータ分析部２４から入力される。分析パ
ラメータ評価用データ生成部２５１ａにはまた、着目し
ている韻律パラメータに関する情報がパラメータ時系列
生成部２３２から入力される。これにより分析パラメー
タ評価用データ生成部２５１ａは、着目している韻律パ
ラメータの影響する範囲に（予め定められた数の）前後
数単語を加えた範囲の分析パラメータ時系列（韻律パラ
メータ時系列）を音韻ラベル情報をもとに評価用データ
として抽出する。The analysis parameter evaluation data generation unit 251a receives from the parameter analysis unit 24 the analysis parameter time series (prosody parameter time series) that results from the voice data analysis, together with the phoneme label information attached to the corresponding voice data. It also receives information on the prosody parameter of interest from the parameter time series generation unit 232. Using the phoneme label information, the analysis parameter evaluation data generation unit 251a extracts as evaluation data the analysis parameter time series (prosody parameter time series) covering the range affected by the prosody parameter of interest plus a predetermined number of words before and after it.

【００５２】さて、上記着目している韻律パラメータに
関する情報は、韻律パラメータ評価用データ生成部２５
１ｂにも入力される。韻律パラメータ評価用データ生成
部２５１ｂにはまた、パラメータ時系列生成部２３２で
生成された韻律パラメータ時系列も入力される。これに
より韻律パラメータ評価用データ生成部２５１ｂは、着
目している韻律パラメータの影響する範囲に（予め定め
られた数の）前後数単語を加えた範囲の韻律パラメータ
時系列を評価用データとして抽出する。The information on the prosody parameter of interest is also input to the prosody parameter evaluation data generation unit 251b, as is the prosody parameter time series generated by the parameter time series generation unit 232. The prosody parameter evaluation data generation unit 251b extracts as evaluation data the prosody parameter time series covering the range affected by the prosody parameter of interest plus a predetermined number of words before and after it.

【００５３】比較部２５１ｃは、分析パラメータ評価用
データ生成部２５１ａと韻律パラメータ評価用データ生
成部２５１ｂでそれぞれ抽出された評価用データ（評価
用の韻律パラメータ時系列）を、二乗誤差総和に代表さ
れる評価関数を用いて比較し、その類似性を示す評価ス
コア（評価値）を算出する。この比較部２５１ｃで算出
された評価スコアは、適用韻律規則に対応して確保され
る評価結果記憶部２５２内領域（テーブル領域）に格納
される。The comparison unit 251c compares the evaluation data (prosody parameter time series for evaluation) extracted by the analysis parameter evaluation data generation unit 251a and by the prosody parameter evaluation data generation unit 251b, using an evaluation function typified by the sum of squared errors, and calculates an evaluation score (evaluation value) indicating their similarity. The evaluation score calculated by the comparison unit 251c is stored in the area (table area) of the evaluation result storage unit 252 reserved for the applied prosody rule.
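
The comparison by sum of squared errors can be sketched as follows. This is a minimal illustration that assumes the two evaluation data series have already been cut to the same range and sampled at the same frames; the function name and the alignment assumption are not taken from this specification.

```python
def squared_error_score(generated, analyzed):
    """Sum of squared differences between two aligned prosody parameter series.

    A smaller value means the rule-generated series is closer to the series
    analyzed from natural speech (i.e. a better objective score).
    """
    if len(generated) != len(analyzed):
        raise ValueError("series must be aligned to the same number of frames")
    return sum((g - a) ** 2 for g, a in zip(generated, analyzed))
```

For instance, `squared_error_score([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])` evaluates to 1.25, the sum of 0.0, 0.25, and 1.0.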

【００５４】図３は、評価結果記憶部２５２での評価ス
コアの格納形式、更に具体的に述べるならば、適用韻律
規則に対応して確保される評価結果記憶部２５２内領域
における評価スコアの格納形式の一例を示す。FIG. 3 shows an example of the storage format of evaluation scores in the evaluation result storage unit 252, more specifically in the area of the evaluation result storage unit 252 reserved for the applied prosody rule.

【００５５】ここでは、該当する韻律規則が適用された
データのＩＤ（識別子）、韻律パラメータの決定要因の
要素、種々の韻律パラメータ値（図ではＡ0 〜Ａ5 の６
種）を与えたときのスコア（評価スコア）がセットで格
納される。韻律パラメータ決定要因としては、該当する
韻律規則中で予め考慮されている（記述されている）も
のと、考慮されていない（記述されていない）ものとが
あり、後述する規則の特殊化、一般化の際に用いられ
る。また、韻律パラメータ値ごとのスコア（評価スコ
ア）には、評価部２５１（内の比較部２５１ｃ）での比
較・評価結果（客観スコア）の他、オペレータが試聴評
価を行なったデータに関しては主観スコアがある。ここ
で主観スコアには４段階評価値が使用され、二重丸が最
も良く、以下、丸（○）、三角（△）、ばつ（×）の順
となる。なお、客観スコアは、誤差関数によるスコアを
想定しており、値が小さいほど評価が良いことを表す。Here, each entry stores as a set the ID (identifier) of the data to which the prosody rule was applied, the elements of the prosody parameter determinants, and the scores (evaluation scores) obtained when various prosody parameter values (six values, A0 to A5, in the figure) are given. The prosody parameter determinants include those already considered (described) in the prosody rule and those not considered (not described); they are used in the rule specialization and generalization described later. The score for each prosody parameter value comprises the comparison and evaluation result from the evaluation unit 251 (the comparison unit 251c within it), called the objective score, and, for data on which the operator performed a listening evaluation, a subjective score. The subjective score uses a four-level rating: a double circle (◎) is best, followed by a circle (○), a triangle (△), and a cross (×). The objective score is assumed to be a score from an error function, so a smaller value indicates a better evaluation.

【００５６】次に、パラメータ生成規則学習部２６での
制御のもとでの評価処理の詳細を、図４及び図５のフロ
ーチャートを参照して説明する。まずパラメータ生成規
則学習部２６は、韻律規則記憶部２３３の中から着目す
る規則ｉを決めて選択する（ステップ４０１，４０
２）。次にパラメータ生成規則学習部２６は、選択した
規則ｉに対応して韻律規則記憶部２３３内に予め用意さ
れている韻律パラメータ値の候補の１つＡj を選択し、
当該規則ｉの韻律パラメータ値としてセットする（ステ
ップ４０３，４０４）。Next, the evaluation processing performed under the control of the parameter generation rule learning unit 26 is described in detail with reference to the flowcharts of FIGS. 4 and 5. First, the parameter generation rule learning unit 26 determines and selects the rule i of interest from the prosody rule storage unit 233 (steps 401 and 402). It then selects one candidate Aj from the prosody parameter value candidates prepared in advance in the prosody rule storage unit 233 for the selected rule i, and sets it as the prosody parameter value of rule i (steps 403 and 404).

【００５７】次にパラメータ生成規則学習部２６は、着
目する規則ｉを適用する学習用テキストデータｋ（ここ
では、例えば１文単位）を決めて選択し（ステップ４０
５，４０６）、パラメータ生成部２３及びパラメータ分
析部２４を起動する。Next, the parameter generation rule learning unit 26 determines and selects the learning text data k (here, for example, one sentence) to which the rule i of interest is to be applied (steps 405 and 406), and activates the parameter generation unit 23 and the parameter analysis unit 24.

【００５８】これによりパラメータ生成部２３内では、
テキストデータ記憶部２１内のテキストデータｋに対す
るテキスト解析部２３１による解析処理と、その解析結
果に基づくパラメータ時系列生成部２３２による韻律パ
ラメータ時系列生成とが行なわれる。このパラメータ時
系列生成部２３２での韻律パラメータ時系列生成処理に
おいて、規則ｉが適用可能な部分については、パラメー
タ生成規則学習部２６によりセットされた韻律パラメー
タ値Ａj が用いられ、他の規則が適用可能な部分につい
ては、その規則に関してその時点において最適であると
されている韻律パラメータ値が用いられる。パラメータ
時系列生成部２３２により生成された韻律パラメータ時
系列はパラメータ生成規則評価部２５内の評価部２５１
に与えられる。In the parameter generation unit 23, the text analysis unit 231 then analyzes the text data k in the text data storage unit 21, and the parameter time series generation unit 232 generates a prosody parameter time series from the analysis result. In this generation, the prosody parameter value Aj set by the parameter generation rule learning unit 26 is used for the portions to which rule i applies, and for portions to which other rules apply, the prosody parameter value regarded as optimal for that rule at that point is used. The prosody parameter time series generated by the parameter time series generation unit 232 is given to the evaluation unit 251 in the parameter generation rule evaluation unit 25.

【００５９】一方、パラメータ分析部２４では、テキス
トデータｋに対応する音声データ記憶部２２内の音韻ラ
ベル情報付き学習用音声データを分析して、対応する分
析パラメータ時系列（韻律パラメータ時系列）を生成す
る処理が行なわれる。パラメータ分析部２４により生成
された韻律パラメータ時系列も評価部２５１に与えられ
る。Meanwhile, the parameter analysis unit 24 analyzes the learning voice data with phoneme label information in the voice data storage unit 22 that corresponds to the text data k, and generates the corresponding analysis parameter time series (prosody parameter time series). The prosody parameter time series generated by the parameter analysis unit 24 is also given to the evaluation unit 251.

【００６０】評価部２５１は、パラメータ生成規則学習
部２６の制御のもとで、パラメータ時系列生成部２３２
からの韻律パラメータ時系列中に規則ｉが適用された部
分があるか否かを調べる（ステップ４０７）。もし、規
則ｉが適用された部分があれば、その部分について、パ
ラメータ分析部２４からの分析パラメータ時系列との比
較・評価を行ない、その評価結果（評価スコア）を規則
ｉ用の評価結果記憶部２５２内領域に、該当するデータ
部分（データｋ内の当該データ部分の位置）及びパラメ
ータ値Ａj に対応付けて格納する（ステップ４０８）。Under the control of the parameter generation rule learning unit 26, the evaluation unit 251 checks whether the prosody parameter time series from the parameter time series generation unit 232 contains a portion to which rule i was applied (step 407). If it does, the evaluation unit 251 compares and evaluates that portion against the analysis parameter time series from the parameter analysis unit 24, and stores the result (evaluation score) in the area of the evaluation result storage unit 252 for rule i, in association with the data portion concerned (its position within data k) and the parameter value Aj (step 408).

【００６１】このようにしてデータｋの最終部分に対応
する韻律パラメータ時系列部分まで進むと、パラメータ
生成規則学習部２６は未処理のテキストデータがあるか
否かをチェックし（ステップ５０１）、あるならば、未
処理のテキストデータを１つ選択して（ステップ５０
２，４０６）、そのデータについて上記と同様の処理
（ステップ４０７，４０８）を行なわせる。When processing has reached the prosody parameter time series portion corresponding to the end of data k, the parameter generation rule learning unit 26 checks whether any unprocessed text data remains (step 501). If so, it selects one item of unprocessed text data (steps 502 and 406) and has the same processing as above (steps 407 and 408) performed on it.

【００６２】このようにして、すべてのテキストデータ
について規則ｉを適用した評価が終了したならば、パラ
メータ生成規則学習部２６は、規則ｉについて、すべて
の韻律パラメータ値の候補で評価したか否かをチェック
し（ステップ５０３）、未評価の韻律パラメータ値の候
補があるならば、その未評価の韻律パラメータ値の候補
を１つ選択して規則ｉのパラメータとしてセットし（ス
テップ５０４，４０４）、上記ステップ４０５以降の処
理に進む。When evaluation with rule i has been completed for all the text data, the parameter generation rule learning unit 26 checks whether rule i has been evaluated with all the prosody parameter value candidates (step 503). If an unevaluated candidate remains, it selects one such candidate, sets it as the parameter of rule i (steps 504 and 404), and proceeds to the processing from step 405 onward.

【００６３】やがて、規則ｉについて、すべての韻律パ
ラメータ値の候補で評価が行なわれたならば、パラメー
タ生成規則学習部２６は、規則ｉ用の評価結果記憶部２
５２内領域の評価スコアをもとに、各候補別（図３の例
ではＡ0 〜Ａ5 別）に評価スコアの平均値（総和でも
可）を求め、その平均値（総和）を最良にする（最も小
さくする）候補Ａx を、規則ｉの最適パラメータに決定
する（ステップ５０６）。図３の例では、Ａ2 が上記Ａ
x に相当する。Once rule i has been evaluated with all the prosody parameter value candidates, the parameter generation rule learning unit 26 uses the evaluation scores in the area of the evaluation result storage unit 252 for rule i to compute the average (the sum may be used instead) of the evaluation scores for each candidate (A0 to A5 in the example of FIG. 3), and determines the candidate Ax that gives the best (smallest) average or sum as the optimal parameter of rule i (step 506). In the example of FIG. 3, A2 corresponds to Ax.
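
The selection of the optimal candidate from the stored scores can be sketched as below. The dictionary layout (candidate name mapped to the list of objective scores collected over all data portions) and the function name are assumptions made for illustration; the score values in the usage note are invented and are not those of FIG. 3.

```python
from statistics import mean

def best_candidate(scores_by_candidate):
    """Return the candidate whose mean objective score is smallest.

    scores_by_candidate: dict mapping a candidate label (e.g. "A2") to the
    list of objective scores recorded for it; candidates with no scores
    are ignored.
    """
    means = {c: mean(s) for c, s in scores_by_candidate.items() if s}
    return min(means, key=means.get)
```

With `{"A0": [4.0, 6.0], "A2": [1.0, 2.0], "A5": [3.0, 3.0]}` the function returns `"A2"`, since its mean error of 1.5 is the smallest.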

【００６４】次にパラメータ生成規則学習部２６は、す
べての規則について評価を行なったか否かをチェックし
（ステップ５０６）、未評価（未適用）の規則があるな
らば、その未評価の規則を１つ選択して（ステップ５０
７，４０２）、上記ステップ４０３以降の処理に進
む。Next, the parameter generation rule learning unit 26 checks whether all the rules have been evaluated (step 506). If an unevaluated (unapplied) rule remains, it selects one such rule (steps 507 and 402) and proceeds to the processing from step 403 onward.

【００６５】やがて、すべての規則について評価が行な
われたならば、パラメータ生成規則学習部２６は、その
うちのいずれかの規則で最適パラメータの変更がなされ
たか否かをチェックし（ステップ５０８）、最適パラメ
ータの変更がなされたならば、パラメータ間での相互作
用が見られたものと判断して先頭ステップ４０１に戻
り、すべての規則の最適パラメータが変更されなくなる
までステップ４０１以降の処理を繰り返す。Once all the rules have been evaluated, the parameter generation rule learning unit 26 checks whether the optimal parameter of any rule was changed (step 508). If so, it judges that an interaction between parameters has been observed and returns to the first step 401; the processing from step 401 onward is repeated until the optimal parameters of all the rules no longer change.
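
The overall flow of FIGS. 4 and 5 amounts to optimizing one rule at a time while the other rules keep their current best values, and repeating the sweep until nothing changes. The sketch below illustrates that flow under stated assumptions: `score` is a hypothetical callback that returns the objective score for one text under a full parameter setting (or `None` when the rule does not apply to that text), and `max_sweeps` is a safeguard added here that the specification does not mention.

```python
def optimize_rules(rules, candidates, texts, score, initial, max_sweeps=100):
    """Sweep over rules until no rule's optimal parameter value changes.

    rules:      list of rule identifiers
    candidates: dict rule -> list of candidate parameter values
    texts:      list of learning text items
    score:      callable(rule, text, params) -> float or None (lower is better)
    initial:    dict rule -> starting parameter value
    """
    best = dict(initial)
    for _ in range(max_sweeps):
        changed = False
        for rule in rules:
            averages = {}
            for value in candidates[rule]:
                trial = dict(best)      # other rules keep their current optimum
                trial[rule] = value
                scores = [s for s in (score(rule, t, trial) for t in texts)
                          if s is not None]
                if scores:
                    averages[value] = sum(scores) / len(scores)
            if not averages:
                continue                # rule never applied to any text
            winner = min(averages, key=averages.get)
            if winner != best[rule]:
                best[rule] = winner
                changed = True
        if not changed:                 # no interaction left to resolve
            break
    return best
```

Because each rule is scored with the other rules fixed, a change in one rule can shift the optimum of another, which is why the sweep is repeated until it is stable.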

【００６６】次に、パラメータ生成規則学習部２６によ
るパラメータ生成規則学習処理の１つであるパラメータ
生成規則特殊化処理について図６及び図７のフローチャ
ートを参照して説明する。Next, the parameter generation rule specialization process, one of the parameter generation rule learning processes performed by the parameter generation rule learning unit 26, is described with reference to the flowcharts of FIGS. 6 and 7.

【００６７】まずパラメータ生成規則学習部２６は、着
目する規則ｉを決めて選択する（ステップ６０１，６０
２）。すると統計的規則評価部２５３は、パラメータ生
成規則学習部２６により選択された規則ｉが適用された
それぞれの文例（テキストデータ部分）に対する最適な
パラメータ値を規則ｉ用の評価結果記憶部２５２内領域
を参照して調べて、その分散を計算する（ステップ６０
３）。First, the parameter generation rule learning unit 26 determines and selects the rule i of interest (steps 601 and 602). The statistical rule evaluation unit 253 then looks up, in the area of the evaluation result storage unit 252 for rule i, the optimal parameter value for each sentence example (text data portion) to which the selected rule i was applied, and calculates the variance of those values (step 603).

【００６８】パラメータ生成規則学習部２６は、統計的
規則評価部２５３で算出された分散が予め定められた閾
値以上であるか否かをチェックする（ステップ６０
４）。もし、算出された分散が閾値を下回った場合に
は、パラメータ生成規則学習部２６は、該当する規則ｉ
は適切であると判断して、次の規則について（ステップ
７０５，７０６，６０２）、同様の処理（ステップ６０
３以降の処理）を行なう。The parameter generation rule learning unit 26 checks whether the variance calculated by the statistical rule evaluation unit 253 is equal to or greater than a predetermined threshold (step 604). If the variance is below the threshold, the parameter generation rule learning unit 26 judges rule i to be appropriate and performs the same processing (step 603 onward) for the next rule (steps 705, 706, and 602).
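
The variance test of steps 603 and 604 can be sketched as follows. The sketch assumes numeric candidate values and uses the population variance; the specification does not say which variance estimator is used, and the data layout and function names are illustrative.

```python
from statistics import pvariance

def per_example_optimum(example_scores):
    """For each sentence example, pick the candidate value with the lowest score.

    example_scores: dict example_id -> dict {candidate_value: objective_score}
    """
    return {ex: min(s, key=s.get) for ex, s in example_scores.items()}

def rule_is_inappropriate(example_scores, threshold):
    """True when the per-example optimal values vary by at least the threshold."""
    optima = list(per_example_optimum(example_scores).values())
    return pvariance(optima) >= threshold
```

If every sentence example prefers the same value, the variance is zero and the rule is judged appropriate; widely scattered optima indicate that a single parameter value cannot serve all the examples the rule covers.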

【００６９】これに対し、算出された分散が閾値以上の
場合には、パラメータ生成規則学習部２６は該当する規
則ｉは不適切であると判断し、対応するテキストデータ
（に対するテキスト解析部２３１での解析結果）で決ま
る、当該規則ｉ中で考慮していないパラメータ決定要因
（アクセント型など）の要素（アクセント型０、アクセ
ント型１、アクセント型２など）を用いてクラスタリン
グを行ない（ステップ６０５）、そのクラスタリングの
結果から、規則ｉを特殊化すると評価スコアの総和が良
くなるか否かを調べる（ステップ６０６）。If, on the other hand, the calculated variance is equal to or greater than the threshold, the parameter generation rule learning unit 26 judges rule i to be inappropriate. It then performs clustering (step 605) using the elements (accent type 0, accent type 1, accent type 2, etc.) of the parameter determinants (accent type, etc.) that are not considered in rule i, as determined from the corresponding text data (from the analysis result of the text analysis unit 231 for that data), and examines from the clustering result whether specializing rule i would improve the total evaluation score (step 606).

【００７０】もし、良くなると判断できた場合、規則特
殊化自動モードが設定されているならば（ステップ７０
１）、パラメータ生成規則学習部２６は、該当する規則
ｉを特殊化、例えば分割して２つの規則を生成して、韻
律規則記憶部２３３に格納する（ステップ７０２）。If an improvement is expected and the rule specialization automatic mode is set (step 701), the parameter generation rule learning unit 26 specializes rule i, for example by dividing it into two rules, and stores the result in the prosody rule storage unit 233 (step 702).

【００７１】一方、規則特殊化オペレータ介在モードが
設定されているならば、パラメータ生成規則学習部２６
は、パラメータ生成部２３及び合成器３を動かして、規
則ｉを特殊化した場合について、以下に述べるようにし
て選択したテキストデータを対象に、その特殊化した規
則を適用した際の合成音声を音声出力部５から出力させ
ると共に、情報提示部２７に対して試聴評価を要求する
案内情報を表示して、オペレータによる試聴評価を行な
わせる（ステップ７０３）。If, on the other hand, the rule specialization operator intervention mode is set, the parameter generation rule learning unit 26 operates the parameter generation unit 23 and the synthesizer 3 so that, for the case where rule i is specialized, the synthesized speech obtained by applying the specialized rules to text data selected as described below is output from the voice output unit 5. It also displays guidance requesting a listening evaluation on the information presentation unit 27 and has the operator perform the evaluation (step 703).

【００７２】この場合、オペレータは情報提示部２７に
表示されている案内に従って、試聴評価の結果をオペレ
ータ入力部２８から入力する。パラメータ生成規則学習
部２６は、オペレータ入力部２８から入力されたオペレ
ータの試聴評価結果が、規則ｉを特殊化すると良くなる
か否かをチェックし（ステップ７０４）、良くなるなら
ば、該当する規則ｉを特殊化、例えば分割して２つの規
則を生成する（ステップ７０２）。上記ステップ７０
３，７０４の実現例としては、規則ｉが分割されたと仮
定した場合に、それぞれの典型例となる数テキストデー
タについてオペレータに試聴評価を行なわせ、その結果
両者のスコアにある閾値以上の差が生じた場合に、分割
を決定するという方法が適用可能である。In this case, the operator enters the result of the listening evaluation from the operator input unit 28, following the guidance displayed on the information presentation unit 27. The parameter generation rule learning unit 26 checks whether the operator's evaluation result indicates that specializing rule i is an improvement (step 704); if so, it specializes rule i, for example by dividing it into two rules (step 702). As one way of implementing steps 703 and 704, the operator can be asked to audition several items of text data typical of each of the rules that would result from dividing rule i, and the division is adopted when the scores for the two differ by at least a certain threshold.

【００７３】例えば、規則Ｒを規則Ｒ1 と規則Ｒ2 に分
割した場合に、規則Ｒ1 ，Ｒ2 をそれぞれ適用するデー
タのうちの典型データとして、図８（ａ）に示すよう
に、規則Ｒ1 ，Ｒ2 の各々について、その規則中で考慮
していないパラメータ決定要因で表される空間（例えば
アクセント型を１つの座標軸として、その要素であるア
クセント型０，１，２をその座標軸上の値とし、モーラ
数を別の座標軸として、とり得るモーラ数の値をその座
標軸上の値とするというように、規則中で考慮していな
い各パラメータ決定要因をそれぞれ座標軸とするパラメ
ータ決定要因空間）における、その規則を適用するデー
タ（文例）群の分布の重心に最も近いデータＸ1 ，Ｘ2
を選ぶ。For example, when a rule R is divided into rules R1 and R2, typical data among the data to which R1 and R2 respectively apply is chosen as shown in FIG. 8(a). For each of R1 and R2, consider the space spanned by the parameter determinants not considered in the rule (the parameter determinant space), in which each such determinant is a coordinate axis: for example, accent type is one axis with its elements, accent types 0, 1, and 2, as values on that axis, and mora count is another axis with the possible mora counts as its values. The data X1 and X2 closest to the centroid of the distribution of the data (sentence examples) to which each rule applies are selected in that space.
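
Choosing the typical data item nearest the centroid can be sketched as below. Each data item is represented as a tuple of determinant values (for example, accent type and mora count); the use of Euclidean distance and the function name are assumptions, since the specification does not name a distance measure.

```python
import math

def typical_example(points):
    """Return the id of the data item closest to the centroid of its group.

    points: dict example_id -> tuple of numeric determinant values,
            e.g. (accent_type, mora_count)
    """
    n = len(points)
    dims = len(next(iter(points.values())))
    centroid = tuple(sum(p[d] for p in points.values()) / n for d in range(dims))
    return min(points, key=lambda k: math.dist(points[k], centroid))
```

For the group `{"a": (0, 3), "b": (1, 4), "c": (2, 8)}` the centroid is (1, 5), and `"b"` is the nearest item. Applying the function once to the data of R1 and once to the data of R2 gives X1 and X2.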

[0074] Synthesized speech obtained by applying rules R1 and R2 to the selected data X1 and X2 is then output, and the operator performs a trial listening evaluation. As shown in FIG. 8(b), the distribution of the trial listening scores (subjective scores) over the prosody parameter values is compared between X1 and X2. If the two distributions are judged to differ by a certain threshold or more, the rule is divided.
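The split decision can be sketched as follows. The source only says that the two subjective score distributions must differ by at least a threshold; the mean absolute difference used here is an assumed measure, and the score scale, candidate labels, and function names are hypothetical.

```python
def score_gap(scores_x1, scores_x2):
    """Mean absolute difference between two subjective score curves.

    Each argument maps a candidate prosody parameter value (or its label)
    to the operator's score for that candidate. Only shared candidates count.
    """
    common = sorted(set(scores_x1) & set(scores_x2))
    if not common:
        raise ValueError("no shared candidate parameter values")
    return sum(abs(scores_x1[v] - scores_x2[v]) for v in common) / len(common)

def should_split(scores_x1, scores_x2, threshold):
    """Adopt the division when the score curves differ by the threshold or more."""
    return score_gap(scores_x1, scores_x2) >= threshold

# Hypothetical scores (1 = bad ... 4 = good) for three candidate values.
x1_scores = {"A1": 4, "A2": 3, "A3": 1}
x2_scores = {"A1": 1, "A2": 3, "A3": 4}
print(score_gap(x1_scores, x2_scores), should_split(x1_scores, x2_scores, 1.5))
```

In this example X1 sounds best with A1 while X2 sounds best with A3, so a single shared parameter value cannot satisfy both and the division is worth adopting.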

[0075] Next, a method of generalizing parameter generation rules, which is one form of parameter generation rule learning, will be described. As shown in FIG. 9, suppose that all of the following hold: the number of data to which rule p was applied and the number of data to which rule q was applied are both small (the data are drawn as black dots); the distance d between the centers of gravity Gp and Gq of the two groups of data (example sentences), measured in the space spanned by the parameter determinants that the rules take into account (the parameter determinant space), is shorter than a certain threshold; and the optimum parameter values assigned to rules p and q are identical or very close. In that case, a new rule r that includes both rules p and q can be created.
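The three conditions for generalization can be written as a single predicate. This is a sketch under stated assumptions: data are numeric points in the parameter determinant space, distance is Euclidean, the optimum parameter value is a single number, and all threshold names and values are hypothetical.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def can_generalize(p_points, q_points, p_value, q_value, *,
                   max_count, max_distance, max_value_gap):
    """Return True when rules p and q may be merged into a new rule r."""
    # Both rules were applied to only a few data.
    few = len(p_points) <= max_count and len(q_points) <= max_count
    # The centers of gravity Gp and Gq are closer than the distance threshold.
    close = dist(centroid(p_points), centroid(q_points)) < max_distance
    # The optimum parameter values are identical or very close.
    similar = abs(p_value - q_value) <= max_value_gap
    return few and close and similar

# Hypothetical data for rules p and q as (accent type, mora count).
p_points = [(0, 3), (0, 4)]
q_points = [(1, 3), (1, 4)]
print(can_generalize(p_points, q_points, 0.50, 0.52,
                     max_count=5, max_distance=1.5, max_value_gap=0.05))
```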

[0076] Next, a method of optimizing the evaluation function in the evaluation unit 251 will be described with reference to FIGS. 10 and 11. The method uses the objective score calculated by the comparison and evaluation in the evaluation unit 251 (specifically, its comparison unit 251c) and the subjective score that the operator supplies through the operator input unit 28 after trial listening.

[0077] FIG. 10 shows an example in which a certain rule is applied to sentence i and sentence j and the objective score and the subjective score disagree, particularly for sentence j. In this case, the optimum prosody parameter value A3 determined from the objective score produces prosody that sounds very poor for sentence j. This can happen when a simple squared error or the like is used as the evaluation function. It is therefore necessary to determine an evaluation function for which the objective score and the subjective score rarely contradict each other.

[0078] In the present embodiment, therefore, the function $f_i$ given by the following equation is applied as the evaluation function for, for example, the fundamental frequency control parameter:

$$f_i = \sum_j \alpha_j d_j \tag{1}$$

Here, $j$ is the category number when phonemes are classified into a plurality of types, $d_j$ is the squared error of the phonemes in category $j$, $\alpha_j$ is the weight for category $j$, and the sum is taken over all categories $j$. One applicable way of classifying phoneme types is, for example, to separate consonants from vowels.
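Equation (1) can be illustrated with a short Python sketch. It assumes that the squared error of a category is the sum of squared differences between natural and synthesized fundamental frequency values over the phonemes of that category (the source does not say whether a sum or a mean is meant); the function name, category labels, and sample values are hypothetical.

```python
def objective_score(natural, synthetic, categories, weights):
    """Evaluate f = sum_j alpha_j * d_j for one sentence.

    natural, synthetic: per-phoneme parameter values (e.g. F0 in Hz).
    categories: per-phoneme category label j.
    weights: mapping from category label j to its weight alpha_j.
    """
    if not (len(natural) == len(synthetic) == len(categories)):
        raise ValueError("inputs must have the same length")
    d = {}  # squared error d_j accumulated per category
    for nat, syn, cat in zip(natural, synthetic, categories):
        d[cat] = d.get(cat, 0.0) + (nat - syn) ** 2
    return sum(weights[cat] * err for cat, err in d.items())

# Hypothetical three-phoneme example with consonant/vowel categories.
natural_f0 = [120.0, 130.0, 125.0]
synthetic_f0 = [118.0, 133.0, 125.0]
labels = ["consonant", "vowel", "vowel"]
alpha = {"consonant": 0.5, "vowel": 1.0}
print(objective_score(natural_f0, synthetic_f0, labels, alpha))
```

With all weights equal to 1 this reduces to the plain squared error, which is the case the description identifies as prone to disagreeing with the subjective score.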

[0079] In the present embodiment, the optimum evaluation function $f_i$ is determined by varying, for each category $j$, the weight $\alpha_j$ applied to the squared error $d_j$ of that category. The graph in FIG. 11(a) shows, for data that have both an objective score and a subjective score, the distribution of objective score values against the subjective score when $f_i$ is used as the evaluation function. The objective scores are divided into those of data rated good (double circle and ○) and those of data rated bad (△ and ×). If the weight $\alpha_j$ for each category $j$ is chosen so that the statistical difference between these two objective score distributions is as large as possible (larger than a predetermined threshold), an evaluation function $f_i$ is obtained for which the subjective score and the objective score are unlikely to contradict each other.

[0080] One example of an index of the statistical difference between the distributions is the significance level obtained from a t-test. As shown in FIG. 11(b), the t-test is performed on the difference between the distribution of objective scores (frequency distribution) 111a for data rated bad in the subjective score and the distribution of objective scores (frequency distribution) 111b for data rated good in the subjective score. The smaller the significance level, the larger the statistical difference is considered to be.
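The weight selection can be sketched as a small grid search. This is an illustration, not the patent's procedure: it uses the absolute value of Welch's t statistic as a stand-in for the significance level (a larger |t| generally corresponds to a smaller significance level, though the relation also depends on the degrees of freedom), searches only a fixed list of candidate weights, and relies on hypothetical data and function names. Each group needs at least two samples.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples; 0.0 if both have zero variance."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def pick_weights(samples, candidates):
    """Grid-search the per-category weights alpha_j.

    samples: list of (errors, rated_good), where errors maps a category to
    its squared error d_j and rated_good is the operator's verdict.
    Returns (weights, |t|) for the grid point that best separates the
    objective scores of bad-rated data from those of good-rated data.
    """
    cats = sorted(samples[0][0])
    best_w, best_t = None, -1.0
    for combo in product(candidates, repeat=len(cats)):
        w = dict(zip(cats, combo))
        good, bad = [], []
        for errors, rated_good in samples:
            f = sum(w[c] * errors[c] for c in cats)
            (good if rated_good else bad).append(f)
        t = abs(welch_t(bad, good))
        if t > best_t:
            best_w, best_t = w, t
    return best_w, best_t

# Hypothetical data in which only vowel errors track the operator's rating.
samples = [
    ({"consonant": 9.0, "vowel": 1.0}, True),
    ({"consonant": 1.0, "vowel": 2.0}, True),
    ({"consonant": 8.0, "vowel": 1.5}, True),
    ({"consonant": 2.0, "vowel": 8.0}, False),
    ({"consonant": 7.0, "vowel": 9.0}, False),
    ({"consonant": 1.0, "vowel": 7.5}, False),
]
best_w, best_t = pick_weights(samples, (0.0, 0.5, 1.0))
print(best_w, round(best_t, 2))
```

Because the t statistic is unchanged when all weights are scaled by the same positive factor, only the ratios between weights matter; a practical search would fix one weight or normalize them.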

[0081] For rules that lack natural speech data, for example rules for which the number of applicable natural speech data is at or below a certain threshold, it is also possible to have the operator perform a trial listening evaluation using only the text data stored in the text data storage unit 21, and to select the optimum prosody parameter by using the resulting trial listening score (subjective score) as an objective score together with the original objective score.

[0082] The speech synthesizer configured as shown in FIG. 1 and described above is realized with a computer, for example a personal computer 120 with a built-in speaker 121 as shown in FIG. 12, and a recording medium, for example a floppy disk 122, on which is recorded a program that causes the computer to function as the text input unit 1, the parameter generation/evaluation unit 2, the synthesizer 3, the speech synthesis unit dictionary 4, and the speech output unit 5. The floppy disk 122 is loaded into the personal computer 120, which reads and executes the program recorded on it. Here, the speaker 121 is used as part of the speech output unit 5. Besides the floppy disk 122, a CD-ROM, a memory card, or the like can be used as the recording medium on which the program is recorded.

【００８３】[0083]

[Effects of the Invention] As described in detail above, according to the present invention, prosody parameters are evaluated (objective evaluation) by comparing the prosody parameter time series obtained from natural speech data with the prosody parameter time series obtained for speech synthesis from the text information corresponding to that natural speech data. In addition, the correlation between the results of this objective evaluation (comparative evaluation) and of the subjective evaluation (evaluation by operator trial listening) is taken into account, so that the evaluation function itself, which is used in the evaluation for optimizing the prosody parameters, can be optimized.

[0084] Further, according to the present invention, an inappropriate prosody rule is detected by statistically processing the results of the comparative evaluation between natural speech data and the corresponding synthesized speech data, and the rule is modified. The operator then performs a trial listening evaluation of the synthesized speech generated by applying the modified rule, so that the inappropriate prosody rule can be corrected and optimized on the basis of the trial listening evaluation result.

[0085] Further, according to the present invention, clustering is performed using the elements of the prosody parameter determinants that a prosody rule does not take into account, which makes it possible to judge whether the prosody rule should be divided and to divide it in the direction that improves the evaluation value.

[0086] Further, according to the present invention, a prosody rule that lacks natural speech data can be optimized by using the operator's trial listening evaluation for that rule.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the evaluation unit 251 in FIG. 1.

FIG. 3 is a diagram showing an example of the storage format of evaluation scores in the evaluation result storage unit 252 in FIG. 1.

FIG. 4 is a diagram showing part of a flowchart for explaining the evaluation processing in the embodiment.

FIG. 5 is a diagram showing the remainder of the flowchart for explaining the evaluation processing in the embodiment.

FIG. 6 is a diagram showing part of a flowchart for explaining the parameter generation rule specialization processing performed by the parameter generation rule learning unit 26 in the embodiment.

FIG. 7 is a diagram showing the remainder of the flowchart for explaining the parameter generation rule specialization processing performed by the parameter generation rule learning unit 26 in the embodiment.

FIG. 8 is a diagram for explaining the conditions for dividing a parameter generation rule in the embodiment.

FIG. 9 is a diagram for explaining the generalization of parameter generation rules in the embodiment.

FIG. 10 is a diagram showing an example in which the objective score and the subjective score disagree when a certain parameter generation rule is applied to sentence i and sentence j.

FIG. 11 is a diagram for explaining the optimization of the evaluation function in the embodiment.

FIG. 12 is a diagram showing the external appearance of a personal computer that implements the speech synthesizer of FIG. 1.

[Explanation of symbols]

1: text input unit
2: parameter generation/evaluation unit
3: synthesizer
4: speech synthesis unit dictionary
5: speech output unit
21: text data storage unit
22: speech data storage unit
23: parameter generation unit
24: parameter analysis unit
25: parameter generation rule evaluation unit
26: parameter generation rule learning unit (prosody parameter value learning means, trial listening evaluation requesting means, evaluation function learning means, prosody rule learning means, clustering means, prosody rule dividing means)
27: information presentation unit
28: operator input unit
231: text analysis unit
232: parameter time series generation unit
233: prosody rule storage unit
251: evaluation unit
252: evaluation result storage unit
253: statistical rule evaluation unit

Claims

[Claims]

1. A speech synthesizer that generates information on the phonemes and prosody of synthesized speech from input text information, determines a prosody parameter value and a phoneme symbol for each phoneme of the synthesized speech based on that information, generates the speech parameters necessary for speech synthesis based on the prosody parameter values and phoneme symbols, and outputs synthesized speech based on the speech parameters, the speech synthesizer comprising: text analysis means for sequentially analyzing various learning text information and generating information representing the phonemes and prosody of the synthesized speech; prosody rule storage means in which various prosody rules for generating prosody parameters are registered in advance; parameter time series generating means for sequentially generating a first prosody parameter time series for speech synthesis, based on the information generated by the text analysis means and the corresponding prosody rule registered in the prosody rule storage means, while sequentially selecting the prosody parameter value to be applied in that prosody rule from among a plurality of predetermined candidates; parameter analysis means for generating a second prosody parameter time series by analyzing natural speech data corresponding to the learning text information; evaluation means for comparing and evaluating, using a predetermined evaluation function, the second prosody parameter time series generated by the parameter analysis means and the first prosody parameter time series generated in order by the parameter time series generating means while the prosody parameter value is switched; prosody parameter value learning means for optimizing the prosody parameter value to be applied in the corresponding prosody rule based on the evaluation result of the evaluation means; trial listening evaluation requesting means for outputting the synthesized speech corresponding to the first prosody parameter time series generated by the parameter time series generating means for trial listening evaluation by an operator; operator input means to which the result of the trial listening evaluation by the operator is input; and evaluation function learning means for correcting the evaluation function in a direction in which the trial listening evaluation result input from the operator input means and the corresponding evaluation result of the evaluation means do not contradict each other.

2. A speech synthesizer that generates information on the phonemes and prosody of synthesized speech from input text information, determines a prosody parameter value and a phoneme symbol for each phoneme of the synthesized speech based on that information, generates the speech parameters necessary for speech synthesis based on the prosody parameter values and phoneme symbols, and outputs synthesized speech based on the speech parameters, the speech synthesizer comprising: text analysis means for sequentially analyzing various learning text information and generating information representing the phonemes and prosody of the synthesized speech; prosody rule storage means in which various prosody rules for generating prosody parameters are registered in advance; parameter time series generating means for sequentially generating a first prosody parameter time series for speech synthesis, based on the information generated by the text analysis means and the corresponding prosody rule registered in the prosody rule storage means, while sequentially selecting the prosody parameter value to be applied in that prosody rule from among a plurality of predetermined candidates; parameter analysis means for generating a second prosody parameter time series by analyzing natural speech data corresponding to the learning text information; evaluation means for comparing and evaluating the second prosody parameter time series generated by the parameter analysis means and the first prosody parameter time series generated in order by the parameter time series generating means while the prosody parameter value is switched; prosody parameter value learning means for optimizing the prosody parameter value to be applied in the corresponding prosody rule based on the evaluation result of the evaluation means; trial listening evaluation requesting means for detecting an inappropriate prosody rule by statistically processing the evaluation result of the evaluation means, modifying that rule, causing the parameter time series generating means to generate the first prosody parameter time series for speech synthesis to which the modified rule is applied, and outputting the corresponding synthesized speech for trial listening evaluation by an operator; operator input means to which the result of the trial listening evaluation by the operator is input; and prosody rule learning means for correcting and optimizing the inappropriate prosody rule based on the trial listening evaluation result input from the operator input means.

3. A speech synthesizer that generates information on the phonemes and prosody of synthesized speech from input text information, determines a prosody parameter value and a phoneme symbol for each phoneme of the synthesized speech based on that information, generates the speech parameters necessary for speech synthesis based on the prosody parameter values and phoneme symbols, and outputs synthesized speech based on the speech parameters, the speech synthesizer comprising: text analysis means for sequentially analyzing various learning text information and generating information representing the phonemes and prosody of the synthesized speech; prosody rule storage means in which various prosody rules for generating prosody parameters are registered in advance; parameter time series generating means for sequentially generating a first prosody parameter time series for speech synthesis, based on the information generated by the text analysis means and the corresponding prosody rule registered in the prosody rule storage means, while sequentially selecting the prosody parameter value to be applied in that prosody rule from among a plurality of predetermined candidates; parameter analysis means for generating a second prosody parameter time series by analyzing natural speech data corresponding to the learning text information; evaluation means for comparing and evaluating the second prosody parameter time series generated by the parameter analysis means and the first prosody parameter time series generated in order by the parameter time series generating means while the prosody parameter value is switched; trial listening evaluation requesting means for detecting an inappropriate prosody rule by statistically processing the evaluation result of the evaluation means, modifying that rule, causing the parameter time series generating means to generate the first prosody parameter time series to which the modified rule is applied, and outputting the corresponding synthesized speech for trial listening evaluation by an operator; operator input means to which the result of the trial listening evaluation by the operator is input; evaluation result storage means for storing the evaluation result of the evaluation means and the trial listening evaluation result input from the operator input means, for each prosody rule and for each combination of the elements of the prosody parameter determinants considered in that rule and of the other prosody parameter determinants not considered in that rule, as determined by the corresponding text information; prosody parameter value learning means for optimizing the prosody parameter value to be applied in the corresponding prosody rule based on the evaluation result of the evaluation means stored in the evaluation result storage means; and prosody rule learning means for detecting an inappropriate prosody rule by statistically processing the evaluation result of the evaluation means stored in the evaluation result storage means, performing clustering using the elements of the prosody parameter determinants not considered in that rule, and dividing the rule based on the clustering result.

4. The speech synthesizer according to claim 3, wherein the prosody rule learning means comprises: clustering means for detecting an inappropriate prosody rule by statistically processing the evaluation result of the evaluation means stored in the evaluation result storage means, and performing clustering using the elements of the prosody parameter determinants not considered in that rule; and prosody rule dividing means for, when the distribution of the clustering result of the clustering means is divided into a plurality of distributions, selecting for each distribution the text information closest to the center of gravity of that distribution and generating a plurality of new prosody rules by dividing the corresponding prosody rule; wherein the trial listening evaluation requesting means causes the first prosody parameter time series corresponding to the text information selected by the prosody rule dividing means to be generated in accordance with each of the prosody rules generated by the prosody rule dividing means, and outputs the corresponding synthesized speech for trial listening evaluation by the operator; and wherein the prosody rule dividing means determines, based on the operator's trial listening evaluation result for the text information it selected, whether or not to adopt the plurality of generated prosody rules.

5. The speech synthesizer according to claim 1, wherein, when the number of natural speech data corresponding to the learning text information is equal to or less than a predetermined threshold, the trial listening evaluation requesting means causes the operator to perform a trial listening evaluation of the synthesized speech corresponding to the learning text information, and the prosody parameter value learning means optimizes the prosody parameter value to be applied in the corresponding prosody rule based on the operator's trial listening evaluation result and the evaluation result of the evaluation means.

6. An evaluation function optimization method applied to a speech synthesizer that generates information on the phonemes and prosody of synthesized speech from input text information, determines a prosody parameter value and a phoneme symbol for each phoneme of the synthesized speech based on that information, generates the speech parameters necessary for speech synthesis based on the prosody parameter values and phoneme symbols, and outputs synthesized speech based on the speech parameters, the method comprising: sequentially analyzing learning text information to generate information representing the phonemes and prosody of the synthesized speech, and, based on the generated information and the corresponding one of various prosody rules prepared in advance, sequentially generating a first prosody parameter time series for speech synthesis while sequentially selecting the prosody parameter value to be applied in that prosody rule from among a plurality of predetermined candidates, while also generating a second prosody parameter time series by analyzing natural speech data corresponding to the learning text information; comparing and evaluating, using a predetermined evaluation function, the second prosody parameter time series and the first prosody parameter time series generated in order while the prosody parameter value is switched, and optimizing, based on the evaluation result, the prosody parameter value to be applied in the corresponding prosody rule, while outputting the synthesized speech corresponding to the first prosody parameter time series generated in order while the prosody parameter value is switched for trial listening evaluation by an operator; inputting the result of the trial listening evaluation by the operator; and correcting the evaluation function in a direction in which the input trial listening evaluation result and the corresponding evaluation result obtained with the evaluation function do not contradict each other.

7. A prosody rule optimization method applied to a speech synthesizer that generates information on the phonemes and prosody of synthesized speech from input text information, determines a prosody parameter value and a phoneme symbol for each phoneme of the synthesized speech based on that information, generates the speech parameters necessary for speech synthesis based on the prosody parameter values and phoneme symbols, and outputs synthesized speech based on the speech parameters, the method comprising: sequentially analyzing learning text information to generate information representing the phonemes and prosody of the synthesized speech, and, based on the generated information and the corresponding one of various prosody rules prepared in advance, sequentially generating a first prosody parameter time series for speech synthesis while sequentially selecting the prosody parameter value to be applied in that prosody rule from among a plurality of predetermined candidates, while also generating a second prosody parameter time series by analyzing natural speech data corresponding to the learning text information; comparing and evaluating, using a predetermined evaluation function, the second prosody parameter time series and the first prosody parameter time series generated in order while the prosody parameter value is switched, and optimizing, based on the evaluation result, the prosody parameter value to be applied in the corresponding prosody rule; detecting an inappropriate prosody rule by statistically processing the evaluation result, modifying that rule, generating the first prosody parameter time series for speech synthesis to which the modified rule is applied, and outputting the corresponding synthesized speech for trial listening evaluation by an operator; inputting the result of the trial listening evaluation by the operator; and correcting and optimizing the inappropriate prosody rule based on the input trial listening evaluation result.