JP3346671B2 - Speech unit selection method and speech synthesis device

Info

Publication number: JP3346671B2
Application number: JP06096295A
Authority: JP
Inventors: 貴夫小山; 憲也村上; 文徳吉谷
Original assignee: NTT Data Corp
Current assignee: NTT Data Corp
Priority date: 1995-03-20
Filing date: 1995-03-20
Publication date: 2002-11-18
Anticipated expiration: 2017-11-18
Also published as: JPH08263095A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesizer, and in particular to a technique for selecting synthesis units (speech waveform segments).

[0002]

[Prior Art] Conventionally, in a rule-based synthesizer using speech waveform segments, a large volume of speech waveforms is generally converted into digital speech waveform data (PCM data) by PCM (pulse code modulation) and stored, and the waveform data is annotated, phoneme by phoneme, with position information, prosodic information such as pitch contour, duration, and power, and the types of the preceding and following phonemes. At synthesis time, a prosody pattern is set by rule for the input text, speech segments are selected so as to be closest to that prosody pattern, and the prosody of the selected segments is then modified to bring it close to the target prosody pattern. This makes possible rule-based speech synthesis of higher quality than synthesis using LSP (line spectrum pair) parameters. For a rule-based synthesizer using such speech waveform segments, see Hirokawa, "Rule synthesis method using a waveform dictionary," IEICE Technical Report SP88-9 (1988-5).

[0003]

[Problem to Be Solved by the Invention] In the speech synthesis method described above, speech waveforms are concatenated directly as PCM data, so the spectral shapes of adjacent segments can differ greatly, which may cause abnormal sounds or noise at the segment joins. In a rule-based synthesizer using speech waveform segments, the quality of the synthesized speech depends heavily on whether noise is introduced at these joins. However, because the conventional method does not store natural-speech parameters such as spectral shape, only prosodic smoothness and agreement of the phoneme sequence can serve as criteria for waveform selection, and noise at the segment joins cannot be predicted. Consequently, the quality of the synthesized speech could not be judged unless a person actually listened to it or some form of acoustic analysis was performed.

[0004] In view of the above problems, an object of the present invention is to provide a technique that, when synthesis units are selected in a rule-based synthesizer using speech waveform segments, enables segment selection that takes the continuity between synthesis units into account, and further enables evaluation of whether noise is introduced at the segment joins and of the quality of the synthesized speech as a whole.

[0005]

[Means for Solving the Problem] The present invention first provides a speech segment selection method. In selecting synthesis units for speech synthesis from a segment dictionary storing a plurality of speech segments, the method performs the following steps in this order: a step of storing in the segment dictionary, for each synthesis unit, information specifying the inter-segment cepstrum distance, for example a representative value of that distance; a step of extracting a plurality of synthesis units similar to the prosody pattern of the speech to be synthesized, in descending order of similarity; and a step of selecting, from the extracted synthesis units, the set of synthesis units that minimizes a connection cost based on the inter-segment cepstrum distance. When the quality value of the synthesized speech based on the selected set of synthesis units is larger than a predetermined reference value, the synthesis unit that is raising the connection cost is replaced with a substitute segment that minimizes the cepstrum distance to the adjacent segments. The quality of the synthesized speech in this case can be evaluated, for example, by the accumulated value of the inter-segment cepstrum distances, and the evaluation threshold can be determined experimentally.

[0006] The present invention also provides a speech synthesizer suited to carrying out the above method. The synthesizer comprises a segment dictionary storing a plurality of speech segments together with information specifying the inter-segment cepstrum distance, a speech segment selection unit that selects from the segment dictionary the synthesis units to be used for speech synthesis, and means for generating synthesized speech from the selected synthesis units. The speech segment selection unit comprises a primary selection unit that extracts a plurality of synthesis units similar to the prosody pattern of the speech to be synthesized, in descending order of similarity, and a secondary selection unit that selects, from the extracted synthesis units, the set of synthesis units that minimizes the connection cost based on the inter-segment cepstrum distance.

[0007] In the speech synthesizer configured as above, the speech segment selection unit further comprises a synthesis quality determination unit that quantifies the quality of the synthesized speech based on the set of synthesis units selected by the secondary selection unit and judges whether the resulting value is acceptable, and a substitute segment generation unit that, when the value is judged unacceptable, generates a substitute segment that minimizes the cepstrum distance between adjacent segments in that set of synthesis units. The synthesis unit that is raising the connection cost is replaced with the substitute segment. The substitute segment generation unit comprises first means for selecting a plurality of preceding-sound replacement candidates and following-sound replacement candidates whose prosody patterns are close to the target values, and second means for specifying, as the substitute segment, the set of preceding-sound replacement candidates whose connection cost to the adjacent segments is smallest.

[0008]

[Operation] In the present invention, as in the conventional method, segments with high similarity to the prosody pattern are first extracted as synthesis unit candidates. The cepstrum distance between the segment joins is then taken as the connection cost, and the set of synthesis units that minimizes the connection cost is selected, for example by dynamic programming. This makes it possible to select, from the given synthesis units, those with the highest continuity of vocal tract characteristics. By referring to the connection cost at the time the synthesis units were selected, the degree to which noise is introduced into the synthesized speech can be estimated. Furthermore, in portions that degrade synthesis quality the connection cost is high and abnormal sounds or noise are likely to occur, so any synthesis unit with a high connection cost is replaced with a substitute segment.

[0009]

[Embodiment] A preferred embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a configuration diagram of a speech synthesizer according to one embodiment of the present invention. The overall processing of the speech synthesizer of this embodiment is first outlined with reference to FIG. 1. The speech synthesizer 1 sends Japanese text in mixed kanji and kana, input from an input terminal IN, to a text analysis unit 2. The text analysis unit 2 uses a dictionary (not shown) to divide the input text into phrases, assigns each phrase a reading in romaji, and further assigns an accent type to each phrase. The reading and accent type obtained for each phrase are sent to a prosody generation unit 3.

[0010] The prosody generation unit 3 generates three kinds of prosody pattern, namely a pitch pattern, a phoneme duration pattern, and a power pattern, from the phrase-level reading information and accent type information. The generated prosody patterns and the phrase-level romaji readings are sent to a segment selection unit 4. The segment selection unit 4 selects from a segment dictionary 5 the synthesis units suited to the synthesis, taking into account the reading information and prosody patterns of the given text. This is described in detail later. The synthesis units selected by the segment selection unit 4 are sent to a segment modification and concatenation unit 6. The segment modification and concatenation unit 6 modifies the prosody pattern obtained by combining the selected synthesis units so that it approaches the prosody pattern generated by the prosody generation unit 3, concatenates the modified segments, and sends the result to an output terminal OUT. The speech synthesizer 1 of this embodiment uses VCV (vowel-consonant-vowel) speech waveform segments as synthesis units and, to improve synthesis quality, is assumed to hold a plurality of synthesis units for each VCV unit type.

[0011] Next, an example configuration of the segment dictionary 5 used in this embodiment is described with reference to FIG. 2. The speech synthesizer 1 converts a large volume of speech waveforms into PCM data and stores them in the segment dictionary 5, with phonemic information, prosodic information, and spectral information associated with each waveform. Based on the prosody pattern of the input text estimated by the text analysis unit 2, the optimal synthesis units are selected from the segment dictionary 5, and synthesized speech is obtained by concatenating the selected synthesis units smoothly.

[0012] Comparable conventional speech synthesis systems hold no spectral information in the dictionary, so only prosodic smoothness and agreement of the phoneme sequence could serve as criteria for waveform selection. In this embodiment, therefore, cepstrum analysis is performed when the segment dictionary 5 is created, the low-order cepstral coefficients obtained are used as feature vectors, and clustering is performed over all vectors except those from silent intervals. After clustering, an identification number is assigned to each cluster, and the cluster to which the cepstrum at the start and end positions of each synthesis unit belongs is registered as dictionary information. This makes it possible to monitor segment continuity at the level of the spectral envelope and to predict abnormal sounds and noise in advance.
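The registration step described above can be sketched as follows. This is a minimal illustration, not the patent's procedure: it assumes the clustering has already produced one centroid per cluster, that membership is decided by Euclidean distance to the nearest centroid, and that the names `nearest_cluster` and `register_boundary_clusters`, the centroid values, and the segment numbers are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length cepstral vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_cluster(vector, centroids):
    """Return the identification number (index) of the closest centroid."""
    return min(range(len(centroids)), key=lambda i: euclidean(vector, centroids[i]))


def register_boundary_clusters(segments, centroids):
    """Map each segment number to (start cluster id, end cluster id).

    `segments` maps a segment number to the low-order cepstral vectors
    measured at the segment start and end positions.
    """
    return {
        seg_no: (nearest_cluster(start, centroids), nearest_cluster(end, centroids))
        for seg_no, (start, end) in segments.items()
    }


# Hypothetical 2-dimensional centroids and boundary cepstra.
centroids = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0)]
segments = {
    101: ((0.1, -0.1), (0.9, 1.2)),
    102: ((2.8, 0.1), (0.2, 0.1)),
}
boundary_ids = register_boundary_clusters(segments, centroids)
```

Storing only the two cluster numbers per segment is what later allows the connection cost between two segments to be read from a precomputed cluster-to-cluster distance table.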

[0013] As shown in FIG. 2(a), for example, the segment dictionary 5 stores, in a segment dictionary table 201, a segment number for identifying each segment uniquely and a phoneme type indicating the phonemic environment of the segment. As prosodic information, it stores the pitch frequency at the segment boundaries, segment duration information, and power. The pitch frequency is analyzed at the segment start position and the segment end position of each VCV unit. As duration information, the segment start position, the consonant start position, and the segment end position are used; in addition, information from which the overall length of the segment and the boundary between the preceding vowel and the following syllable can be determined is stored, for example the representative distance values (representative values of the inter-segment cepstrum distance) held in the cepstrum distance table 701 described later. The zeroth-order cepstral coefficient is used as the power information. When clustering is applied, the low-order cepstral terms are stored in table form so that they correspond one-to-one with category numbers; this is described later. If storage capacity permits, the cepstrum information need not be clustered. The cepstrum is likewise analyzed at the segment start and end positions.

[0014] In the example of FIG. 2(b), in which synthesis units are created from "aka" /aka/, the synthesis unit "aka" 210 is created using the pitch frequencies analyzed at the corresponding segment start position 206 and segment end position 207. However, for units taken from the beginning or end of a word, such as "#a" 209 or "a#" 211, where there is no preceding or following vowel, the pitch frequency is uniformly set to 0 Hz. In synthesis unit notation, /#/ denotes a silent interval such as a pause. The start and end positions are the same positions at which the pitch frequency was analyzed, and the consonant start position of the synthesis unit "aka" 210 uses the information of the syllable boundary 203. When vowels are adjacent and there is no consonant, the start position of the syllable is taken as the consonant start position.

[0015] Next, the detailed configuration of the segment selection unit 4 is described with reference to FIGS. 3 to 9. The segment selection unit 4 receives the romaji transcription of the Japanese text input to the speech synthesizer 1, together with the prosody patterns. First, a search section extraction unit 41 extracts from the romaji transcription all VCV unit types to be used in the synthesis. Based on the VCV unit types thus determined, the search sections within the segment dictionary 5 are determined by referring to a search section lookup table 401, an example of whose contents is shown in FIG. 4. Processing then moves to a primary selection unit 42.
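The extraction of VCV unit types from a romaji string can be illustrated with a short sketch. It is a simplification under stated assumptions: only the five vowels a, i, u, e, o are recognized, special morae (moraic n, geminate consonants, long vowels) are not handled, and the function name `split_vcv` and the contents of `search_table` are hypothetical; the actual layout of table 401 is not given in the text.

```python
VOWELS = set("aiueo")


def split_vcv(romaji):
    """Split a romaji string into VCV-style units, with '#' marking silence.

    Each unit runs from one vowel to the next vowel inclusive; the first
    unit starts from silence and the last unit ends in silence.
    """
    positions = [i for i, ch in enumerate(romaji) if ch in VOWELS]
    if not positions:
        return []
    units = ["#" + romaji[: positions[0] + 1]]
    for left, right in zip(positions, positions[1:]):
        units.append(romaji[left : right + 1])
    units.append(romaji[positions[-1]:] + "#")
    return units


# Hypothetical lookup: unit type -> (first, last) segment number in the dictionary.
search_table = {"#a": (0, 12), "aka": (340, 371), "a#": (980, 1002)}

units = split_vcv("aka")
sections = [search_table[u] for u in units]
```

For "aka" this yields the three unit types /#a/, /aka/, and /a#/ used in the examples of this embodiment.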

[0016] For each synthesis unit, the primary selection unit 42 selects, within the search section already determined, the segments suited to the synthesis on the basis of the prosodic information of pitch, duration, and power. To decide which synthesis units are suitable, the prosody pattern determined by the prosody generation unit 3 is compared with the prosody pattern of each speech segment and an evaluation value is calculated. The evaluation uses a formula in which the evaluation value becomes smaller the more closely a segment resembles the target prosody pattern. An example of a formula for the evaluation value Hev is shown below.

[0017]

[Equation 1]

\[ H_{ev} = P + D^{2} + E^{2} \tag{1} \]

where

\[ P = (\text{preceding-vowel pitch target} - \text{segment preceding-vowel pitch})^{2} + (\text{following-vowel pitch target} - \text{segment following-vowel pitch})^{2} \]

\[ D = \text{duration target} - \text{segment duration} \]

\[ E = \text{power target} - \text{segment power} \]
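A direct transcription of Equation (1) into code might look like the sketch below. It assumes that the two pitch terms of P refer to the preceding-vowel (segment start) and following-vowel (segment end) pitch, and that D and E are plain differences between target and segment values; the `Prosody` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    pitch_start: float  # pitch at the preceding vowel (segment start), Hz
    pitch_end: float    # pitch at the following vowel (segment end), Hz
    duration: float     # segment duration
    power: float        # segment power


def hev(target, segment):
    """Evaluation value of Equation (1): smaller means closer to the target."""
    p = (target.pitch_start - segment.pitch_start) ** 2 \
        + (target.pitch_end - segment.pitch_end) ** 2
    d = target.duration - segment.duration
    e = target.power - segment.power
    return p + d ** 2 + e ** 2


target = Prosody(pitch_start=120.0, pitch_end=110.0, duration=200.0, power=60.0)
candidate = Prosody(pitch_start=118.0, pitch_end=113.0, duration=190.0, power=61.0)
score = hev(target, candidate)  # 4 + 9 + 100 + 1 = 114.0
```

As in Equation (1), the terms are summed without weights, so the units chosen for pitch, duration, and power determine how strongly each one influences the ranking.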

[0018] The parameters of the evaluation formula Hev are obtained from the prosody generation unit 3 and the segment dictionary 5. Using this formula, all synthesis units in the search section are evaluated and registered, as higher-ranked candidates in ascending order of evaluation value, in a candidate table 501 whose contents are shown in FIG. 5. In this embodiment, Equation (1) is used to extract, from synthesis units of the same type that have many prosodic variations, those close to the prosody pattern of the speech to be synthesized. The evaluation value of Equation (1) allows the synthesis units in the dictionary to be ordered by closeness: the smaller the evaluation value, the closer the unit is to the target prosody pattern and the higher it is ranked as a candidate. This processing is performed for each synthesis unit used in the synthesis.

[0019] The candidate table 501 shown in FIG. 5 illustrates the case of synthesizing "aka" /aka/, for which the required synthesis units are /#a/, /aka/, and /a#/. For each synthesis unit, the segments of the corresponding type in the dictionary are evaluated against the target prosody pattern using Equation (1); the segment with the smallest evaluation value becomes the first candidate, and the others are registered as candidates in ascending order of evaluation value. Each entry is recorded using the segment number that uniquely identifies the segment. The number of candidates can also be limited, and the same number of candidates need not be prepared for every synthesis unit type. Setting an upper limit on the number of candidates reduces the processing load of the secondary selection unit 43 in the next stage. The number of candidates can also be limited by setting an upper threshold on the evaluation value Hev. The above processing is applied to all synthesis units used in the synthesis to generate the information to be stored in the search section table 401, which is sent to the secondary selection unit 43.
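The ranking and the two optional limits described above can be sketched as follows. The function name `build_candidate_table`, its parameters `max_candidates` and `max_score`, and the sample scores are hypothetical; the evaluation values are assumed to have been computed beforehand with Equation (1).

```python
def build_candidate_table(scores_by_unit, max_candidates=None, max_score=None):
    """Rank segment numbers per synthesis unit by ascending evaluation value.

    scores_by_unit maps a unit type such as "aka" to {segment number: Hev}.
    max_score drops candidates whose Hev exceeds the threshold;
    max_candidates keeps only the best few of those that remain.
    """
    table = {}
    for unit, scores in scores_by_unit.items():
        ranked = sorted(scores.items(), key=lambda item: item[1])
        if max_score is not None:
            ranked = [(seg, s) for seg, s in ranked if s <= max_score]
        if max_candidates is not None:
            ranked = ranked[:max_candidates]
        table[unit] = [seg for seg, _ in ranked]
    return table


# Hypothetical evaluation values for the units of "aka".
scores = {
    "#a": {11: 30.0, 12: 5.0, 13: 80.0},
    "aka": {21: 9.0, 22: 2.0},
    "a#": {31: 20.0},
}
candidate_table = build_candidate_table(scores, max_candidates=2, max_score=40.0)
```

Because each unit type is filtered independently, the resulting candidate lists may differ in length, which matches the remark that the same number of candidates is not required for every unit type.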

[0020] The secondary selection unit 43 selects, from the synthesis unit candidates obtained by the primary selection unit 42, the segments (synthesis units) that minimize the distortion at the segment joins. The concept is described with reference to FIGS. 6 and 7. From the candidates determined by the primary selection unit 42, the secondary selection unit 43 selects the synthesis units finally used in the synthesis, taking spectral continuity into account. The cepstrum distance between segments is taken as the connection cost, and the path is chosen so that the connection cost between segments is minimized.

[0021] The minimum-cost path can be found efficiently by, for example, dynamic programming (DP). In the example of FIG. 6, the path drawn with a thick line has the minimum connection cost, and the selection result is the first candidate for /#a/, the second candidate for /aka/, and the first candidate for /a#/. The connection costs along the path are held, from the beginning of the path, in variables CD1, CD2, CD3, ..., CDn (CD: cepstrum distance; the same applies below) and are used in the synthesis quality determination described later. The distances between clusters are held in the form of the cepstrum distance table 701 shown in FIG. 7, which simplifies the connection cost calculation in the secondary selection unit 43. The costs of the minimum-distortion path obtained by this processing are sent to a synthesis quality determination unit 44.
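A minimal sketch of such a dynamic-programming search is given below. It assumes that each segment is described by the cluster numbers of its start and end cepstra, that the connection cost of a join is looked up as `dist[end cluster of left segment][start cluster of right segment]`, and that the total path cost is the plain sum of the join costs. The function name, the table values, and the segment numbers are hypothetical.

```python
def select_min_cost_path(candidates, boundary, dist):
    """Pick one segment per position so that the summed join cost is minimal.

    candidates: list of candidate segment-number lists, one per synthesis unit.
    boundary:   segment number -> (start cluster id, end cluster id).
    dist:       cluster-to-cluster cepstrum distance table.
    Returns (path, step_costs), where step_costs play the role of CD1..CDn.
    """
    # best[s] = (accumulated cost, path ending in s, per-join costs)
    best = {s: (0.0, [s], []) for s in candidates[0]}
    for column in candidates[1:]:
        new_best = {}
        for cur in column:
            options = []
            for prev, (cost, path, steps) in best.items():
                step = dist[boundary[prev][1]][boundary[cur][0]]
                options.append((cost + step, path + [cur], steps + [step]))
            new_best[cur] = min(options, key=lambda o: o[0])
        best = new_best
    _, path, steps = min(best.values(), key=lambda o: o[0])
    return path, steps


# Hypothetical distance table between three clusters.
dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
boundary = {11: (0, 0), 12: (0, 1), 21: (1, 2), 22: (2, 2), 31: (1, 0), 32: (0, 1)}
candidates = [[11, 12], [21, 22], [31, 32]]  # e.g. /#a/, /aka/, /a#/
path, steps = select_min_cost_path(candidates, boundary, dist)
```

Only the best partial path into each candidate is kept at every position, so the work grows with the product of adjacent candidate counts rather than with the number of complete paths.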

[0022] The synthesis quality determination unit 44 calculates the RMS (root mean square) value Rc of the connection costs at the downstream segment modification and concatenation unit 6. If Rc is smaller than a preset value, processing moves to a synthesis unit extraction unit 46; if Rc is larger than the threshold, processing moves to a substitute segment generation unit 45. The synthesis quality determination unit 44 can calculate Rc using, for example, Equation (2).

[0023]

[Equation 2] (equation not reproduced in the source text)
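Since Equation (2) itself is not available in this text, the following is only a sketch under the assumption that Rc is the conventional root mean square of the path connection costs CD1, ..., CDn, that is, the square root of the mean of their squares. The function names and the handling of the case where Rc equals the threshold are hypothetical.

```python
import math


def rms_connection_cost(cds):
    """Root mean square of the connection costs CD1..CDn (assumed form of Rc)."""
    if not cds:
        raise ValueError("at least one connection cost is required")
    return math.sqrt(sum(cd * cd for cd in cds) / len(cds))


def needs_substitution(cds, threshold):
    """True when Rc exceeds the threshold, i.e. substitute segments are needed."""
    return rms_connection_cost(cds) > threshold


rc = rms_connection_cost([1.0, 7.0])  # sqrt((1 + 49) / 2) = 5.0
```

Compared with a plain average, an RMS measure gives more weight to a single large join cost, which fits the aim of detecting individual noisy joins.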

[0024] In general, the connection cost is high, and abnormal sounds or noise are likely to occur, in the portions that degrade synthesis quality. In such cases a segment with a high connection cost must be replaced with a segment created by combining other segments. This processing is performed by the substitute segment generation unit 45, which receives the segment type requiring a substitute, the prosody pattern, and the cepstrum information of the preceding and following segments. The substitute segment generation unit 45 accumulates the segments that are raising the value of Rc, in order, as candidates for replacement. Following this order, it generates substitute segment generation information one candidate at a time and stops the generation process as soon as Rc falls below the threshold. The substitute segment generation information at that point is sent to the synthesis unit extraction unit 46. If no replacement segment can be generated, the segment candidates obtained by the secondary selection unit 43 are output unchanged to the synthesis unit extraction unit 46.

[0025] The processing in the substitute segment generation unit 45 is described in detail with reference to FIGS. 8 and 9. Referring to FIG. 8, when the calculated connection cost exceeds the preset threshold, a substitution determination unit 451 determines, for example, whether the consonant C of the VCV segment to be replaced is voiced or unvoiced, and the substitute segment generation process is performed if it is unvoiced.

[0026] A substitute segment candidate extraction unit 452 treats the preceding vowel V portion 901 and the following syllable CV portion 902 of the VCV unit, shown in the upper part of FIG. 9, as the minimum units. For the preceding vowel V, candidates whose vowel type and following consonant C type both match, and whose prosody pattern is close to the target, are ranked using Equation (1) in the same way as in the primary selection unit 42. For the following syllable CV, candidates whose prosody pattern is close to the target are selected in the same way as in the primary selection unit 42, regardless of the preceding vowel V, and ranked. The resulting candidates for the preceding vowel V and for the following syllable CV are each sent to a substitute segment selection unit 453. For each of the two sets of candidates, the substitute segment selection unit 453 adopts as the replacement the candidate with the smallest connection cost to its adjacent segment. A segment generation unit 454 combines the preceding-vowel replacement and the following-syllable replacement thus obtained, as shown in the lower part of FIG. 9, to generate the substitute segment. The synthesis unit extraction unit 46 extracts the actual segment data (PCM data) from the segment dictionary 5 and outputs it, together with the substitute segment generation information or the segment candidates, to the segment modification and concatenation unit 6.
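The choice made by the substitute segment selection step can be sketched as follows. The sketch assumes that each candidate is described by a segment number and the cluster number of its outward-facing boundary cepstrum, and that connection costs are read from a cluster-to-cluster distance table; the function name, the table, and all values are hypothetical. The join between the chosen V part and CV part is not scored here.

```python
def pick_substitute(v_candidates, cv_candidates, prev_end, next_start, dist):
    """Choose the V part and CV part that join their neighbours most smoothly.

    v_candidates:  list of (segment number, start cluster id of the V part).
    cv_candidates: list of (segment number, end cluster id of the CV part).
    prev_end:      end cluster id of the segment before the one being replaced.
    next_start:    start cluster id of the segment after the one being replaced.
    """
    best_v = min(v_candidates, key=lambda c: dist[prev_end][c[1]])
    best_cv = min(cv_candidates, key=lambda c: dist[c[1]][next_start])
    return best_v[0], best_cv[0]


# Hypothetical distance table between three clusters.
dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
v_part, cv_part = pick_substitute(
    v_candidates=[(41, 2), (42, 0)],
    cv_candidates=[(51, 1), (52, 2)],
    prev_end=0,
    next_start=1,
    dist=dist,
)
```

Here candidate 42 is taken for the V part and candidate 51 for the CV part, because each has zero distance to its neighbouring segment in the hypothetical table.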

[0027] Thus, according to this embodiment, synthesis units can be selected with regard to both the prosody pattern and the continuity of the spectral envelope at the segment joins. In addition, by using the accumulated cepstrum distance from synthesis unit selection as an evaluation criterion, it is possible to predict whether noise that degrades the quality of the synthesized speech will be introduced. Furthermore, when the connection cost is high, the segment causing the degradation is replaced with another, substitute segment from the segment dictionary 5, so synthesized speech of at least a certain quality can always be generated. The segments (synthesis units) and parameter values used in this embodiment are given for convenience in describing the invention, and the invention is of course not limited to these examples.

[0028]

[Effects of the Invention] As is clear from the above description, the present invention makes it possible to select the synthesized speech with the highest continuity of vocal tract characteristics and to estimate the degree to which noise is introduced into the synthesized speech, which resolves the problems of the conventional method. In addition, a synthesis unit that is degrading synthesis quality is replaced with a substitute segment that contributes to better quality, so the quality of the synthesized speech always stays above a certain level. The present invention can be applied to speech synthesizers connected to personal computers and the like and to speech synthesis software, and can readily be applied to a variety of services such as ticket reservation by telephone.

[Brief description of the drawings]

【図１】本発明の一実施例に係る音声合成装置の一実施
例を示すブロック図。FIG. 1 is a block diagram showing an example of a speech synthesizer according to one embodiment of the present invention.

【図２】（ａ）は本実施例で用いる素片辞書内のテーブ
ル内容説明図、（ｂ）は合成単位の説明図。FIG. 2A is an explanatory diagram of table contents in a segment dictionary used in the embodiment, and FIG. 2B is an explanatory diagram of a synthesis unit.

【図３】本実施例による素片選択部のブロック構成図。FIG. 3 is a block configuration diagram of a segment selection unit according to the embodiment.

【図４】図３の素片選択部で用いる候捕テーブルの一例
を示す説明図。FIG. 4 is an explanatory diagram showing an example of a candidate table used in the segment selection unit of FIG. 3.

【図５】本実施例により素片辞書を探索する区間を決定
する際に用いるテーブルの概念説明図。FIG. 5 is a conceptual explanatory diagram of a table used when determining the section of the segment dictionary to be searched according to the embodiment.

【図６】本実施例により素片選択部の二次選択部で素片
選択を行う場合の概念図。FIG. 6 is a conceptual diagram of segment selection performed by the secondary selection unit of the segment selection unit according to the embodiment.

【図７】図３の素片選択部で用いるケプストラム距離テ
ーブルの概念図。FIG. 7 is a conceptual diagram of a cepstrum distance table used in the segment selection unit of FIG. 3.

【図８】本実施例による代替素片生成部のブロック構成
図。FIG. 8 is a block diagram of an alternative unit generation unit according to the embodiment.

【図９】本発明で代替素片を生成する際の概念説明図。FIG. 9 is a conceptual explanatory diagram of the generation of an alternative unit according to the present invention.

[Explanation of symbols]

1: 音声合成装置 (speech synthesizer)
2: テキスト解析部 (text analysis unit)
3: 韻律生成部 (prosody generation unit)
4: 素片選択部 (unit selection unit)
5: 素片辞書 (unit dictionary)
6: 素片変形接続部 (unit transformation and connection unit)

フロントページの続き (56)参考文献 特開平３−119394（ＪＰ，Ａ) 特開平５−73092（ＪＰ，Ａ) 特開平１−284893（ＪＰ，Ａ) (58)調査した分野(Int.Cl.⁷，ＤＢ名) G10L 13/06
Continuation of the front page
(56) References: JP-A-3-119394 (JP, A); JP-A-5-73092 (JP, A); JP-A-1-284893 (JP, A)
(58) Field searched (Int. Cl.⁷, DB name): G10L 13/06

Claims

(57) [Claims]

1. A speech unit selection method for selecting synthesis units to be used for speech synthesis from a unit dictionary in which a plurality of speech units are stored, the method comprising: providing specific information on an inter-unit cepstrum distance for each synthesis unit; extracting a plurality of synthesis units similar to the prosodic pattern of the synthesis target speech in descending order of similarity; and selecting, from the extracted synthesis units, a set of synthesis units that minimizes a connection cost based on the inter-unit cepstrum distance, wherein, in the step of selecting the set of synthesis units that minimizes the connection cost, the set of synthesis units that minimizes the connection cost is selected using the distances between clusters obtained by clustering all vectors except silence with the low-order terms of the cepstrum as feature quantities.

2. The speech unit selection method according to claim 1, further comprising the step of registering information on the cluster to which the cepstrum at the start position and at the end position of each synthesis unit belongs.

3. The speech unit selection method according to claim 1, wherein, when the quality value of a synthesized speech based on the selected set of synthesis units is greater than a predetermined reference value, a synthesis unit having a high connection cost is replaced with an alternative unit that minimizes the cepstrum distance between adjacent units.

4. A speech synthesizer comprising: a unit dictionary in which a plurality of speech units are stored together with specific information on an inter-unit cepstrum distance; a speech unit selection unit for selecting synthesis units used for speech synthesis from the unit dictionary; and means for generating a synthesized speech based on the selected synthesis units, wherein the speech unit selection unit includes a primary selection unit that extracts a plurality of synthesis units similar to the prosodic pattern of the synthesis target speech in descending order of similarity, and a secondary selection unit that selects, from the extracted plurality of synthesis units, a set of synthesis units that minimizes a connection cost based on the inter-unit cepstrum distance, and wherein, in selecting the set of synthesis units that minimizes the connection cost, the secondary selection unit selects the set using the distances between clusters obtained by clustering all vectors except silence with the low-order terms of the cepstrum as feature quantities.

5. The speech synthesizer according to claim 4, wherein the unit dictionary further has registered, as dictionary information, the cluster to which the cepstrum at the start position and at the end position of each synthesis unit belongs.

6. The speech synthesizer according to claim 4, wherein the speech unit selection unit further includes: a synthesis quality determination unit that quantifies the quality of a synthesized speech based on the set of synthesis units selected by the secondary selection unit and determines whether the resulting quantitative value is acceptable; and an alternative unit generation unit that, when the result of the determination is negative, generates an alternative unit that minimizes the cepstrum distance between adjacent units in the set of synthesis units, and wherein a synthesis unit having a high connection cost is replaced with the alternative unit.

7. The speech synthesizer according to claim 6, wherein the alternative unit generation unit includes: first means for selecting a plurality of preceding-sound replacement candidates and succeeding-sound replacement candidates whose prosody patterns are close to a target value; and second means for specifying a set of sound replacement candidates as the alternative unit.