JPH09230893A

JPH09230893A - Regular speech synthesis method and device therefor

Info

Publication number: JPH09230893A
Application number: JP8035291A
Authority: JP
Inventors: Keiji Hayashi; 慶士林; Takashi Horie; 高志保理江
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1996-02-22
Filing date: 1996-02-22
Publication date: 1997-09-05

Abstract

PROBLEM TO BE SOLVED: To synthesize speech close to natural speech by improving how well the phonological environment and similar factors match at synthesis time, in a speech synthesis device that selects and connects the optimal segments belonging to each synthesis unit. SOLUTION: The speech synthesis device comprises a preprocessing part 12 that breaks an input character string into phoneme units; storing means 192 and 193 that store a plurality of waveform segments cut out of natural speech in utterance units and a plurality of single syllables, together with the prosodic parameters of each; a segment selection part 14; and segment connecting means 15, 16, and 17 that generate synthesized speech by connecting the selected waveform segments in the order of the input character string. The segment selection part 14 selects the optimal segment for each phoneme unit by referring to both the environment from which the segment was extracted and its prosodic parameters.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to rule-based speech synthesis, which generates synthetic speech from an arbitrary input character string, and more particularly to a method for selecting the waveform segments used as synthesis parameters and a method for transforming the selected waveform segments.

[0002]

[Prior Art] Speech synthesis is widely used, for example, for announcements in railway stations and for machine reading of text. Synthesized speech is required to be clear and to sound natural.

[0003] FIG. 9 shows an example of the functional block configuration of a conventional, general speech synthesizer. In the speech synthesizer of FIG. 9, an input character string is input from the input terminal 91 and is then divided into phoneme units by the preprocessing unit 92 with reference to the Japanese dictionary 981. The prosody setting unit 93 sets prosodic parameters for each phoneme unit. The segment selection unit 94 refers to the segment dictionary 982 and selects the waveform segment closest to the reference values of the prosodic parameters thus set. The segment transformation unit 95 unconditionally transforms each segment, taken from the segment file 983 that stores speech data of sentence and word texts, so that the selected segment matches the reference values. The segment connection unit 96 connects the transformed segments in the order of the input character string and outputs them through the output terminal 97. Synthetic speech corresponding to the input character string is thereby obtained.

[0004]

[Problems to Be Solved by the Invention] In the conventional speech synthesizer described above, the segment selection unit 94 does not take into account factors such as the phonological environment at the time of synthesis, and its selection result depends strongly on the reference values of the set prosodic parameters alone. As a result, when a waveform segment is selected for a phoneme unit near the end of a word, a waveform segment taken from near the beginning of a word may be selected instead, which often makes the synthetic speech sound unnatural. In addition, the conventional segment transformation unit 95 transforms each segment to match the previously selected reference values regardless of the result of waveform segment selection, which can degrade the quality of the synthesized speech.

[0005] An object of the present invention is to provide a rule-based speech synthesis method that selects the optimal waveform segment and thereby prevents loss of naturalness during synthesis. Another object of the present invention is to provide a speech synthesizer configured to suppress the quality degradation caused by excessive transformation of waveform segments and to produce more natural synthetic speech than conventional devices.

[0006]

[Means for Solving the Problems] To solve the above problems, the rule-based speech synthesis method of the present invention comprises a segment selection process of selecting the optimal segment belonging to a phoneme unit, and a process of connecting the selected optimal segments in a predetermined order to generate synthetic speech. The segment selection process comprises: a step of determining whether the subsequent phoneme of a waveform segment serving as a synthesis parameter matches the phonological environment at the time of synthesis; a step of selecting, from the waveform segments determined to have a matching subsequent phoneme, at least one optimal segment candidate using an extraction error that represents the difference between the utterance environment of the vocabulary to be synthesized and the extraction environment of the waveform segment; and a step of determining the optimal segment based on a selection error that represents the difference between the prosodic information of each selected optimal segment candidate and a predetermined selection reference value.

[0007] In a more specific form of the step of determining the optimal segment, for example, the extraction errors and the selection errors of the individual optimal segment candidates are each ranked, and the candidate that minimizes the sum of a first value, obtained by multiplying the extraction error rank by a predetermined coefficient, and a second value, obtained by multiplying the selection error rank by a predetermined coefficient, is determined to be the optimal segment.

[0008] The rule-based speech synthesis method of the present invention further comprises: a step of deriving a segment transformation rate from the prosodic information of the determined optimal segment and the selection reference value; and a step of comparing the derived segment transformation rate with a predetermined transformation rate threshold, connecting the selected optimal segment as it is when the segment transformation rate is equal to or less than the threshold, and, when the segment transformation rate exceeds the threshold, transforming the optimal segment so that its segment transformation rate becomes equal to or less than the threshold.

[0009] The speech synthesizer of the present invention, which solves the above problems, comprises segment selection means for selecting the optimal segment belonging to a phoneme unit, and generates synthetic speech by connecting the selected optimal segments in a predetermined order. The segment selection means includes: a segment information dictionary that stores, for each phoneme unit, segment information including the prosodic information of the waveform segments serving as synthesis parameters and information on the subsequent phoneme to which each segment connects; a phoneme unit information table that stores distribution statistics for each phoneme unit and information determining the number of segments to read at selection time; and a segment selection unit that, at synthesis time, retrieves information on the optimal segment from the segment information dictionary and the phoneme unit information table. The segment selection unit determines whether the subsequent phoneme of a waveform segment matches the phonological environment at the time of synthesis; selects, from the waveform segments determined to have a matching subsequent phoneme, at least one optimal segment candidate using an extraction error that represents the difference between the utterance environment of the vocabulary to be synthesized and the extraction environment of the waveform segment; and determines the optimal segment based on a selection error that represents the difference between the prosodic information of each selected candidate and a predetermined selection reference value. Here, the distribution statistics stored in the phoneme unit information table are, for example, a statistical representation of the distribution of prosodic information for each phoneme unit, and are used to derive the selection error. The information determining the number of segments to read at selection time indicates the number of candidates scheduled for selection for each waveform segment.

[0010] With a speech synthesizer configured in this way, the criteria for selecting the optimal segment belonging to a phoneme unit include not only the prosodic information but also the environment in which the phoneme unit or waveform segment was extracted, such as the length of the original utterance unit (or vocabulary item) and the position occupied by the waveform segment (or phoneme) within it. This eliminates unnaturalness at synthesis time.

[0011] The speech synthesizer of the present invention further comprises: a segment transformation unit that, before the optimal segments selected by the segment selection means are connected, transforms the prosodic information of each optimal segment so that its difference from the selection reference value approaches zero; and a segment connection determination unit that includes means for deriving a segment transformation rate from the prosodic information of the optimal segment selected by the segment selection means and the selection reference value, and means for routing the selected optimal segment to the segment transformation unit only when the derived segment transformation rate is greater than a predetermined transformation rate threshold. With this configuration, an optimal segment that would need only a small degree of transformation is used as it is, without being transformed, so loss of naturalness caused by excessive segment transformation is prevented.

[0012]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram of a speech synthesizer according to one embodiment of the present invention. The functions of the preprocessing unit 12, the prosody setting unit 13, and the Japanese dictionary 191 are basically the same as those of the conventional device shown in FIG. 9, so their detailed description is omitted.

[0013] FIGS. 2 and 3 show a detailed configuration example of the segment dictionary 192 in this speech synthesizer, FIGS. 4 to 7 are flowcharts showing the operating principle of the segment selection unit 14, and FIG. 8 is a flowchart showing the operating principle of the connection condition determination unit 15. An outline of the speech synthesis processing of this embodiment is described below with reference to these figures.

[0014] In FIG. 1, a character string to be read aloud, for example, is input from the input terminal 11. The preprocessing unit 12 analyzes the input character string using the Japanese dictionary 191 and outputs the phoneme units, which are the units of synthesis, together with accent information and other information about those phoneme units. The Japanese dictionary used in this embodiment is a dictionary customized for speech synthesis that describes the reading and accent type of each word. Various configurations for such a dictionary have been proposed. The phoneme units in this embodiment preferably include, in addition to syllables such as /a/ and /ka/, vowel sequences such as /ai/ and compound syllables such as /aN/.

[0015] Next, the prosody setting unit 13 sets the prosodic information of each phoneme unit, for example the reference values of the prosodic parameters. The prosodic information (hereinafter, prosodic parameters) is used by the segment selection unit 14. Four prosodic parameters are used: average pitch frequency, pitch slope, duration, and average power. New parameters computed from combinations of these four, or a subset of the four, may be used instead as appropriate for the prosodic conditions of the synthetic speech.

[0016] The segment selection unit 14 selects the optimal segment for each phoneme unit using the reference values set by the prosody setting unit 13 and the segment dictionary 192, which stores information on the waveform segments. The detailed operation of the segment selection unit 14 is described with reference to FIGS. 2 to 7.

[0017] As shown in FIG. 2, the segment dictionary 192 used in this embodiment has a segment information dictionary 22, which stores information on each individual waveform segment serving as a synthesis parameter, and a phoneme unit information table 23, which records statistical information on each phoneme unit. Both are connected in parallel to an input unit 21 and an output unit 24 that lead to the segment selection unit 14. The segment information dictionary 22 is configured as shown in FIG. 3(a) and includes a phoneme unit 31, a file number 32, a subsequent phoneme 33, and a plurality of dictionary elements 34 and 35. Dictionary element 34 consists of, for example, the average pitch, pitch slope, RMS power, start sample, and end sample; dictionary element 35 consists of, for example, the utterance unit length and the utterance unit position.

[0018] The phoneme unit information table is configured as shown in FIG. 3(b) and includes the phoneme unit 31, a start index 36, an end index 37, and a phoneme unit element 38. The start index 36 and the end index 37 define the number of waveform segments to read for the phoneme unit. The phoneme unit element 38 defines the distribution statistics for each phoneme unit and consists of, for example, the maximum and minimum values of the average pitch, pitch slope, duration, and RMS power. These maximum and minimum values are fixed, and information within this range is used to compute the selection error. In FIG. 3, "B" denotes bytes; the byte counts shown are examples and are not limiting.

[0019] The operating principle of the segment selection unit 14 is described below with reference to FIGS. 4 to 7. Referring to FIG. 4, the segment selection unit 14 reads the information on the phoneme unit in question from the segment information dictionary 22 (corresponding to the phoneme unit 31 through the dictionary element 35 in FIG. 3) and from the phoneme unit information table 23 (corresponding to the phoneme unit 31 and the start index 36 through the phoneme unit element 38 in FIG. 3) (S101; "S" denotes a processing step, here and below). The number of segments to read is calculated from the start index 36 and the end index 37 of the phoneme unit information table 23.
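The two tables of FIG. 3 and the read in S101 can be pictured as plain records. The sketch below is only an illustration: the Python field names, the use of dataclasses, and the assumption that the start and end indices are inclusive positions into the segment information dictionary are not stated in the source.

```python
from dataclasses import dataclass


@dataclass
class SegmentInfo:
    """One entry of the segment information dictionary 22 (FIG. 3(a))."""
    phoneme_unit: str        # phoneme unit 31, e.g. "ta"
    file_number: int         # file number 32
    next_phoneme: str        # subsequent phoneme 33
    avg_pitch: float         # dictionary element 34
    pitch_slope: float
    rms_power: float
    start_sample: int
    end_sample: int
    utterance_length: int    # dictionary element 35
    utterance_position: int


@dataclass
class PhonemeUnitInfo:
    """One entry of the phoneme unit information table 23 (FIG. 3(b))."""
    phoneme_unit: str
    start_index: int         # start index 36
    end_index: int           # end index 37
    ranges: dict             # phoneme unit element 38: name -> (max, min)


def read_count(info: PhonemeUnitInfo) -> int:
    # Assumption: both indices are inclusive, so the count is end - start + 1.
    return info.end_index - info.start_index + 1


def load_segments(dictionary: list, info: PhonemeUnitInfo) -> list:
    # S101: read every segment registered for this phoneme unit.
    return dictionary[info.start_index:info.end_index + 1]
```

Keeping the index range in the per-unit table means the selection step never has to scan the whole dictionary, only the slice that belongs to the phoneme unit being synthesized.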

[0020] Next, candidates for the optimal segment are chosen from the waveform segments belonging to the phoneme unit, and the extraction error, one of several selection criteria, is calculated (S102 to S104). Specifically, as a suitability check on the phonological environment, it is checked whether the subsequent phoneme of the waveform segment matches the phonological environment at the time of synthesis (S102). Here, a phoneme with the same manner of articulation as the subsequent phoneme at synthesis time is also regarded as a match. For example, if the phonological environment at synthesis time is /k/, then /t/ and /p/, which have the same manner of articulation, are also regarded as a "match". Next, the extraction error is calculated for each waveform segment determined in S102 to have a matching subsequent phoneme (S103). The extraction error is a distance measure between the utterance environment of the vocabulary to be synthesized and the extraction environment of the waveform segment; in this embodiment it is calculated according to equations (1) to (3) below.
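The subsequent-phoneme check of S102 could be sketched as a lookup in a table of articulation classes. This is a minimal illustration, not the patent's implementation: the table `MANNER_CLASS` and both function names are hypothetical, and only the grouping of /k/, /t/, and /p/ is taken from the text.

```python
# Hypothetical manner-of-articulation classes; only the /k/, /t/, /p/ grouping
# comes from the source, the remaining entries are illustrative.
MANNER_CLASS = {
    "k": "voiceless_plosive", "t": "voiceless_plosive", "p": "voiceless_plosive",
    "g": "voiced_plosive", "d": "voiced_plosive", "b": "voiced_plosive",
    "m": "nasal", "n": "nasal",
    "s": "fricative", "h": "fricative",
}


def next_phoneme_matches(segment_next: str, target_next: str) -> bool:
    """S102: True when the segment's subsequent phoneme fits the synthesis context."""
    if segment_next == target_next:
        return True
    seg_class = MANNER_CLASS.get(segment_next)
    # Phonemes missing from the table only match themselves.
    return seg_class is not None and seg_class == MANNER_CLASS.get(target_next)


def filter_candidates(segments, target_next):
    # Each segment is a (name, subsequent phoneme) pair in this sketch.
    return [s for s in segments if next_phoneme_matches(s[1], target_next)]
```

Relaxing the match to a whole articulation class enlarges the candidate pool for rare contexts, at the cost of a slightly less exact coarticulation match.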

[0021]

[Equation 1]
Extraction error = weight coefficient w1 × mora difference within utterance unit + weight coefficient w2 × position difference within unit ... (1)
Mora difference within utterance unit = utterance unit length − number of phoneme units ... (2)
Position difference within unit = utterance unit position − phoneme unit position ... (3)

[0022] Here, the utterance unit length and the utterance unit position relate to the extraction environment of the waveform segment: they are variables indicating how long the utterance unit was from which the segment was extracted, and at which position within it. An utterance unit is the unit of speech uttered within a single breath. Likewise, the number of phoneme units and the phoneme unit position are variables indicating how long the vocabulary item to be synthesized is, and at which position within it the phoneme unit being selected lies. FIG. 5 shows an example of the extraction error calculation.

[0023] FIG. 5 illustrates how the extraction error is calculated when selecting a segment for the second phoneme unit 51 (/ta/) of the word to be synthesized, "温かみ" (atatakami, "warmth"). In FIG. 5, each unit delimited by boundary lines corresponds to one phoneme unit. In this case, the waveform segment 52 has an utterance unit length of 8 and an utterance unit position of 4, and the phoneme unit 51 has a phoneme unit count of 5 and a phoneme unit position of 2. From equations (2) and (3), the mora difference within the utterance unit is 8 − 5 = 3 and the position difference within the unit is 4 − 2 = 2, so from equation (1) the extraction error for the waveform segment 52 is w1 × 3 + w2 × 2.
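The FIG. 5 calculation can be reproduced with a few lines of code. The function below follows equations (1) to (3) literally, including the signed differences; the function name and the default weights of 1.0 are assumptions, since the source gives no values for w1 and w2.

```python
def extraction_error(utterance_length, utterance_position,
                     unit_count, unit_position, w1=1.0, w2=1.0):
    """Extraction error per equations (1) to (3); the weights are placeholders."""
    mora_diff = utterance_length - unit_count            # equation (2)
    position_diff = utterance_position - unit_position   # equation (3)
    # The source uses signed differences; it does not say whether
    # absolute values are intended.
    return w1 * mora_diff + w2 * position_diff           # equation (1)


# FIG. 5 example: segment 52 comes from position 4 of an 8-unit utterance,
# and the target /ta/ is unit 2 of the 5-unit word "atatakami".
error = extraction_error(8, 4, 5, 2)   # 1.0 * 3 + 1.0 * 2 = 5.0
```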

[0024] Returning to FIG. 4, the processing of S102 and S103 is repeated for all of the segments read (S104). Then, for the optimal segment candidates chosen through S104, the selection error, a distance measure from the reference values, is calculated (S105), and the optimal segment is determined from the extraction error and the selection error (S106). The selection error is calculated according to equations (4) to (6) using the reference values and the prosodic parameters of the optimal segment candidates in the phoneme unit information table 23.

[0025]

[Equation 2]
Selection error = average pitch error + pitch slope error + duration error + RMS power error ... (4)
Each parameter error = (normalized error value) × weight coefficient ... (5)
Normalized error value for average pitch = (reference value − segment value) / (segment maximum value − segment minimum value) ... (6)

[0026] In equation (6), if the denominator is 0 when the normalized error value is calculated (for example, when only one waveform segment belongs to the phoneme unit), the normalized error value is set to 0. The method of calculating the selection error is not particularly limited, and various methods other than equations (4) to (6) may be used.
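Equations (4) to (6), together with the zero-denominator rule, might be coded as follows. This is a sketch under stated assumptions: equation (6) is given only for the average pitch, so applying the same normalization to the other three parameters is an assumption, as are the dictionary keys and function names. The differences are kept signed, exactly as written in equation (6).

```python
PARAMS = ("avg_pitch", "pitch_slope", "duration", "rms_power")


def normalized_error(reference, value, vmax, vmin):
    """Equation (6), returning 0 when the max-min range is zero."""
    span = vmax - vmin
    if span == 0:
        return 0.0
    return (reference - value) / span


def selection_error(reference, segment, ranges, weights):
    """Equations (4) and (5): weighted sum of the per-parameter errors."""
    total = 0.0
    for name in PARAMS:
        vmax, vmin = ranges[name]
        total += normalized_error(reference[name], segment[name], vmax, vmin) * weights[name]
    return total
```

Dividing by the range stored in the phoneme unit information table puts pitch, duration, and power on a common scale before the weights are applied, so no single parameter dominates purely because of its units.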

[0027] The operation of S106 is described in detail with reference to FIGS. 6 and 7. FIG. 6 is a flowchart showing the processing procedure of S106. First, the extraction errors of the optimal segment candidates are ranked in ascending order, one rank per candidate number (S201). Next, the selection errors are ranked in the same way to determine the selection error ranks (S202). Then, based on the rank of each error, a combined score serving as the criterion of optimality is calculated by equation (7) below (S203).

[0028]

[Equation 3]
Combined score = selection error rank × weight coefficient w1' + extraction error rank × weight coefficient w2' ... (7)

[0029] The combined scores of the optimal segment candidates are then ranked in ascending order (S204), and the candidate ranked first, that is, the one with the smallest combined score, is determined to be the optimal segment (S205).
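Steps S201 to S205 amount to a rank-and-combine procedure, which the following sketch illustrates. The function names are hypothetical, the default weights 2 and 1 follow the setting quoted for the FIG. 7 example, and the handling of ties (earlier candidates win) is an assumption because the source does not address it.

```python
def ranks(values):
    """Rank values in ascending order; rank 1 is the smallest (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for position, index in enumerate(order, start=1):
        rank[index] = position
    return rank


def choose_optimal(extraction_errors, selection_errors, w1=2.0, w2=1.0):
    """Return (index of the optimal candidate, list of combined scores)."""
    ext_rank = ranks(extraction_errors)                  # S201
    sel_rank = ranks(selection_errors)                   # S202
    scores = [s * w1 + e * w2                            # S203, equation (7)
              for s, e in zip(sel_rank, ext_rank)]
    best = min(range(len(scores)), key=lambda i: scores[i])  # S204 and S205
    return best, scores
```

Combining ranks rather than raw error values sidesteps the fact that the extraction error (counted in morae and positions) and the selection error (a normalized prosodic distance) are on unrelated scales.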

[0030] FIG. 7 shows an example of the ranking results for a particular phoneme unit, for the case in which twelve optimal segment candidates, candidate numbers 0 to 11, have been chosen by the end of S104. In the illustrated example, the weights are set to w1' = 2 and w2' = 1. The processing of S101 to S105, and of S106 as detailed in S201 to S205, is repeated for all phoneme units (S107); in the example, candidate number 1, shown hatched in FIG. 7, is determined to be the optimal segment. As a result, for a phoneme unit near the beginning of a word, a waveform segment located near the beginning of a word is reliably selected.

[0031] Returning to FIG. 1, the connection condition determination unit 15 determines, based on the processing result of the segment selection unit 14, whether to transform the optimal segment, and sets a transformation flag value corresponding to the result. FIG. 8 shows the detailed operating procedure of the connection condition determination unit 15.

[0032] Referring to FIG. 8, the connection condition determination unit 15 reads the reference values set by the prosody setting unit 13 (S301) and then reads the prosodic parameters of the optimal segment (S302). Next, the segment transformation rate is calculated for each prosodic parameter by equation (8) (S303).

[0033]

[Equation 4]
Segment transformation rate = {(reference value / segment value) − 1} × 100 ... (8)

[0034] Next, the segment transformation rate is compared with the transformation rate threshold for each prosodic parameter (S304), and the transformation flag is set according to the result (S305/S306). The transformation rate threshold is a value that can be changed for each phoneme unit. In this example a single method is used to transform waveform segments, but synthetic speech with less degradation can also be produced by changing the transformation method according to the transformation rate of the waveform segment. The processing from S301 to S305 or S306 is performed for all phoneme units (S307). Finally, the optimal segment is cut out of the segment file 193 and written to the synthesized speech buffer (S308).
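The flag-setting logic of S303 to S306 could look like the sketch below. Equation (8) is implemented as written; comparing the magnitude of the rate against the threshold is an assumption (the source does not say how negative rates are treated), and the function names, parameter keys, and threshold values are illustrative only.

```python
def transformation_rate(reference, segment_value):
    """Equation (8): percentage change needed for the segment to reach the reference.

    A segment value of 0 is not handled in this sketch.
    """
    return (reference / segment_value - 1.0) * 100.0


def transformation_flags(reference, segment, thresholds):
    """S304 to S306: flag 1 means 'transform this parameter', 0 means 'use as is'."""
    flags = {}
    for name, ref in reference.items():
        rate = transformation_rate(ref, segment[name])
        # Assumption: the magnitude of the rate is compared with the threshold.
        flags[name] = 1 if abs(rate) > thresholds[name] else 0
    return flags
```

For instance, a reference duration of 150 against a segment duration of 100 gives a rate of 50 percent, which would set the flag under a 30 percent threshold, while a rate of -25 percent would leave the segment untouched.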

[0035] For each prosodic parameter, when the transformation flag value is "1", the segment transformation unit 16 transforms the optimal segment according to the segment transformation rate calculated by equation (8). Any method may be used for the transformation; one example, for duration, is to decimate or interpolate the waveform samples of the segment according to the transformation rate. The transformed waveform samples are stored separately in a waveform transformation buffer (not shown). When the flag value is "0", no transformation is performed.
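As one possible reading of the decimate-or-interpolate example for duration, the sketch below resamples a segment by linear interpolation so that its length changes by the given rate. The source does not specify the interpolation scheme; linear interpolation, the function name `stretch`, and the rounding of the target length are all assumptions.

```python
def stretch(samples, rate_percent):
    """Resample a waveform so its length changes by rate_percent (equation (8) units)."""
    if not samples:
        return []
    n = len(samples)
    # A rate of +50 lengthens the segment by half; -50 halves it.
    m = max(1, round(n * (1 + rate_percent / 100)))
    if n == 1 or m == 1:
        return [samples[0]] * m
    out = []
    for j in range(m):
        pos = j * (n - 1) / (m - 1)      # fractional position in the input
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, n - 1)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out
```

Plain resampling of this kind also shifts the pitch of the segment, which is one reason a practical system might prefer a pitch-synchronous method; the sketch only illustrates how the transformation rate maps to a new sample count.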

[0036] The segment connection unit 17 in FIG. 1 sequentially joins the waveform samples stored in the synthesized speech buffer and the waveform transformation buffer (neither shown) to generate the synthetic speech. Through the processing described above, synthetic speech corresponding to the input character string is output at the output terminal 18.

[0037]

[Effects of the Invention] As is clear from the above description, according to the present invention the optimal segment is selected using a plurality of selection criteria that include the extraction environment of the waveform segments and phoneme units, which improves the naturalness of the synthesized speech. In addition, before the optimal segments are connected, it is determined whether each one needs to be transformed, and transformation is applied only when it is strongly needed, which avoids the degradation of synthetic speech quality caused by excessive transformation.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech synthesizer according to one embodiment of the present invention.

FIG. 2 is an explanatory diagram showing a configuration example of the segment dictionary in the speech synthesizer of this embodiment.

FIG. 3 is an explanatory diagram of the contents of the segment dictionary in the speech synthesizer of this embodiment.

FIG. 4 is a flowchart showing the operating principle of the segment selection unit in the speech synthesizer of this embodiment.

FIG. 5 is an explanatory diagram of an example of calculating the extraction error.

FIG. 6 is a flowchart showing the procedure for determining the optimal segment.

FIG. 7 is an explanatory diagram of an example of the optimal segment selection result in the segment selection unit.

FIG. 8 is a flowchart showing the processing procedure of the connection condition determination unit.

FIG. 9 is a block diagram of a conventional speech synthesizer.

[Explanation of symbols]

12: preprocessing unit
13: prosody setting unit
14: segment selection unit
15: connection condition determination unit
16: segment transformation unit
17: segment connection unit
22: segment information dictionary
23: phoneme unit information table
191: Japanese dictionary
192: segment dictionary
193: segment file

Claims

[Claims]

1. A rule-based speech synthesis method comprising: a segment selection process of selecting an optimum segment belonging to a phoneme unit; and a process of connecting the selected optimum segments in a predetermined order to generate synthesized speech, wherein the segment selection process includes: a step of determining whether the subsequent phoneme of a waveform segment serving as a synthesis parameter matches the phoneme environment at the time of synthesis; a step of selecting, from the waveform segments determined to match in subsequent phoneme, at least one optimum segment candidate using an extraction error that represents the difference between the utterance environment of the vocabulary to be synthesized and the extraction environment of the waveform segment; and a step of determining the optimum segment based on a selection error that represents the difference between the prosodic information of each selected optimum segment candidate and a predetermined selection reference value.

2. The rule-based speech synthesis method according to claim 1, wherein the step of determining the optimum segment ranks the extraction error and the selection error for each optimum segment candidate, and determines as the optimum segment the candidate for which the sum of a first value, obtained by multiplying the rank of the extraction error by a predetermined coefficient, and a second value, obtained by multiplying the rank of the selection error by a predetermined coefficient, is smallest.

3. The rule-based speech synthesis method according to claim 1, further comprising: a step of deriving a segment transformation rate based on the prosodic information of the determined optimum segment and the selection reference value; and a step of comparing the derived segment transformation rate with a predetermined transformation rate threshold, connecting the selected optimum segment as it is when the segment transformation rate is less than or equal to the transformation rate threshold, and, when the segment transformation rate exceeds the transformation rate threshold, transforming the optimum segment so that its segment transformation rate becomes less than or equal to the transformation rate threshold.

4. A speech synthesis device that includes a segment selection unit for selecting an optimum segment belonging to a phoneme unit and that connects the selected optimum segments in a predetermined order to generate synthesized speech, the device comprising: a segment information dictionary that stores, for each phoneme unit, segment information including prosodic information of each waveform segment serving as a synthesis parameter and information on the subsequent phoneme to which it is to be connected; and a phoneme unit information table that stores distribution statistics for each phoneme unit and information that determines the number of items read at the time of selection, wherein the segment selection unit retrieves information on the optimum segment from the segment information dictionary and the phoneme unit information table at the time of synthesis, determines whether the subsequent phoneme of a waveform segment matches the phoneme environment at the time of synthesis, selects, from the waveform segments determined to match in subsequent phoneme, at least one optimum segment candidate using an extraction error that represents the difference between the utterance environment of the vocabulary to be synthesized and the extraction environment of the waveform segment, and determines the optimum segment based on a selection error that represents the difference between the prosodic information of each selected optimum segment candidate and a predetermined selection reference value.

5. The speech synthesis device according to claim 4, further comprising: a segment transformation unit that, before the optimum segments selected by the segment selection unit are connected, transforms the prosodic information of each optimum segment so as to bring its difference from the selection reference value close to zero; and a segment connection determination unit having a unit that derives a segment transformation rate based on the prosodic information of the optimum segment selected by the segment selection unit and the selection reference value, and a unit that passes the selected optimum segment to the segment transformation unit only when the derived segment transformation rate is greater than a predetermined transformation rate threshold.