JPH0990972A - Synthesis unit generating method for voice synthesis - Google Patents

Synthesis unit generating method for voice synthesis

Info

Publication number
JPH0990972A
JPH0990972A (application JP7248143A / priority JP24814395A)
Authority
JP
Japan
Prior art keywords
phoneme
synthesis
environment
synthesis unit
speech
Prior art date
Legal status
Granted
Application number
JP7248143A
Other languages
Japanese (ja)
Other versions
JP3275940B2 (en)
Inventor
Yuki Yoshida
由紀 吉田
Kazuo Hakoda
和雄 箱田
Tomohisa Hirokawa
智久 広川
Kenzo Ito
憲三 伊藤
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP24814395A (granted as JP3275940B2)
Publication of JPH0990972A
Application granted
Publication of JP3275940B2
Anticipated expiration
Current status: Expired - Lifetime

Links

Abstract

PROBLEM TO BE SOLVED: To obtain smooth and stable synthesized speech by creating speech segments in consideration of the influence of the articulation style based on the phonological context, and merging phonetically equivalent phonological contexts in the created segments when generating synthesis units. SOLUTION: In step S1, phonological contexts, such as coarticulation patterns, that can be regarded as identical are merged in light of phonemic features so as to reduce the number of synthesis units. In step S2, if the phoneme immediately following the phoneme under consideration is a long vowel, its phonological context is treated as one in which the following phoneme is a short vowel; in other words, a long vowel is identified with a short vowel in the context of the following phoneme. In step S3, if the phoneme is a voiceless plosive, its preceding phoneme is represented by a single short vowel. In step S4, if the phoneme following the phoneme under consideration is a voiceless plosive, that following phoneme is represented by a single voiceless plosive. Grouping by all of steps S1 through S4 thus reduces the number of phoneme units.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a method of creating synthesis units for use in speech synthesis, that is, the artificial generation of speech.

[0002]

[Description of the Related Art] Conventionally, rule-based speech synthesis systems have most often used units such as VCV, CV, and CVC (where "C" denotes a consonant and "V" a vowel) as their synthesis units. Compared with systems based on such units, rule-based synthesis that takes the phoneme, the most basic unit of speech, as its synthesis unit has seen little use: although the phoneme is the natural choice of unit, the concatenation rules required at synthesis time become complicated.

[0003] Conventionally, when phonemes are used as the synthesis units for rule-based speech synthesis, the number of units becomes enormous. One must therefore either ignore the preceding and following phonological contexts entirely, or apply the COC (context-oriented clustering) method (for details, see the specification and drawings attached to Japanese Patent Application No. 62-234358).

[0004]

[Problems to be Solved by the Invention] When phonemes that ignore the preceding and following phonological contexts are used as the synthesis units for rule-based synthesis, the small number of units and the small differences in representation across languages give the advantage that the method can be used for many languages. However, because even the transition from consonant (C) to vowel (V), where the acoustic characteristics change especially rapidly, must be produced by concatenation rules, it has been difficult to obtain high-quality synthesized speech.

[0005] The COC method, on the other hand, uses a statistical procedure to cluster phonemes, taking their phonological context (coarticulation) into account, from a large population of speech data, and stores the centroid of each cluster as a synthesis unit. Because the clustering is performed automatically, the COC method has the advantage that synthesis units are generated according to an objective criterion. However, preparing the clustering population takes a great deal of labor, and because the text (speech data) from which the units are created is finite, some phonological contexts may never appear in the population.
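The centroid idea behind the COC method can be illustrated with a deliberately simplified sketch: phoneme instances are grouped by a context label, and the element-wise mean of each group's feature vectors is stored as that context's synthesis unit. The label format, the two-dimensional "features", and the plain group-by are all illustrative assumptions — the actual COC method clusters statistically over a large speech-data population.

```python
from collections import defaultdict

def build_centroid_table(instances):
    """Store, per phonological context, the centroid (element-wise mean)
    of the feature vectors of all instances observed in that context.

    `instances` is an iterable of (context_label, feature_vector) pairs.
    """
    groups = defaultdict(list)
    for context, vec in instances:
        groups[context].append(vec)
    # Centroid of each group = element-wise mean of its feature vectors.
    return {
        context: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for context, vecs in groups.items()
    }

# Toy data: two instances of a hypothetical /a-k-a/ context, one of /i-k-a/.
toy = [("a-k-a", [1.0, 2.0]), ("a-k-a", [3.0, 4.0]), ("i-k-a", [5.0, 6.0])]
print(build_centroid_table(toy)["a-k-a"])  # → [2.0, 3.0]
```

Note how a finite corpus leaves some contexts absent from the table entirely, and how lookup in a very large table is itself costly — precisely the drawbacks discussed here.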

[0006] Furthermore, because the optimal synthesis unit must be retrieved from a large table of units at synthesis time, table lookup takes time. In addition, since the population is built by generating and accumulating units from natural utterances, the centroids vary widely with their frequency of occurrence, and at synthesis time a candidate may be selected as the best unit even though its fit to the phonemic context is poor. The quality of the synthesized speech is therefore unstable. To resolve these drawbacks, one could consider phoneme units that account for the entire preceding and following phonological context, but this is of course impractical, since the number of synthesis units would be enormous.

[0007] In summary, when the synthesis unit in rule-based speech synthesis is the phoneme, using context-independent phonemes reduces the number of units but cannot produce smooth synthesized speech, while the COC method cannot be relied upon to generate stable synthesized speech at all times. Using phonemes that account for the entire surrounding phonological context might ensure smoothness and stability, but the number of units would be enormous and the approach impractical.

[0008] The present invention has been made in view of the above circumstances. Its first object is to provide a synthesis-unit creation method for speech synthesis that creates units covering, almost uniformly, all phonemes together with their preceding and following phonological contexts, so that smooth and stable synthesized speech can be obtained. Its second object is to provide a synthesis-unit creation method for speech synthesis that can reduce the number of synthesis units without degrading the quality of the resulting synthesized speech.

[0009]

[Means for Solving the Problems] The synthesis-unit creation method for speech synthesis according to the present invention, a method in which speech segments containing phonemes are created as synthesis units, is characterized in that speech segments are created in consideration of the influence of the articulation style based on the phonological context, and synthesis units are then created by merging phonetically equivalent phonological contexts within the created segments.

[0010]

[Embodiments of the Invention] Before describing embodiments of the present invention, its basic idea is explained. The invention considers all speech segments (in practice, phonemes in context, i.e., triphones) that occur, with their preceding and following phonological contexts, in a language such as Japanese, and reduces the number of units by grouping together those whose phonological contexts can be treated as identical on the basis of phonemic features. It therefore differs clearly from conventional techniques that either ignore the phonological context or generate synthesis units statistically from a population.

[0011] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 illustrates a synthesis-unit creation method for speech synthesis according to one embodiment; as the figure shows, the embodiment is carried out by executing steps S1 through S4 in order. The method of FIG. 1 is intended for Japanese speech synthesis, and each step is described in turn below. In the following, c, c', C, C', Q, v, v', V, V', and V″ each denote a phoneme. Specifically, v and v' are long vowels; C, C', c, and c' are consonants, C' being the devoiced counterpart of C and c, c' voiceless plosives; Q is the geminate consonant (sokuon); and V, V', and V″ are vowels, V and V' being the short counterparts of v and v', respectively.

[0012] [Step S1] In step S1, when the phoneme preceding the phoneme under consideration is a long vowel, the phonological context of that phoneme is treated as if the preceding phoneme (the phoneme immediately before it) were the corresponding short vowel. That is, in the context of the preceding phoneme, a long vowel is identified with a short vowel. If every triphone occurring in Japanese were used as a synthesis unit, the number of units, including geminates, would be about 15,000 — an enormous figure. Step S1 therefore exploits phonemic features to merge (identify) phonological contexts, such as coarticulation patterns, that can be treated as identical, reducing the number of units. Examples of the identifications made in step S1:
Example 1: The C of vCV' is treated as the same synthesis unit as the C of VCV'.
Example 2: The C of vCC' is treated as the same synthesis unit as the C of VCC'.
Example 3: The v of v'vC is treated as the same synthesis unit as the v of V'vC.
Rules such as Examples 1-3 eliminate about 3,000 triphones.

[0013] [Step S2] In step S2, when the phoneme immediately following the phoneme under consideration is a long vowel, the phonological context is treated as if the following phoneme were the corresponding short vowel. That is, in the context of the following phoneme, a long vowel is identified with a short vowel. Examples of the identifications made in step S2:
Example 4: The C of V'Cv is treated as the same synthesis unit as the C of V'CV.
Example 5: The V of CVv' is treated as the same synthesis unit as the V of CVV'.
Example 6: The v of V″vv' is treated as the same synthesis unit as the v of V″vV'.
Rules such as Examples 4-6 eliminate about 2,000 triphones.

[0014] [Step S3] In step S3, when the phoneme under consideration is a voiceless plosive, its preceding phoneme is represented by a single short vowel. Examples of the identifications made in step S3:
Example 7: The c of VcV″ is treated as the same synthesis unit as the c of V'cV″.
Example 8: The c of VcC (with C devoiced) is treated as the same synthesis unit as the c of V'cC.
Example 9: The c of QcV is treated as the same synthesis unit as the c of V'cV.
In Examples 7-9, V' denotes the representative short vowel. Rules such as Examples 7-9 eliminate about 1,000 triphones.

[0015] [Step S4] In step S4, when the phoneme following the phoneme under consideration is a voiceless plosive, that following phoneme is represented by a single voiceless plosive.
Example 10: The V of CVc is treated as the same synthesis unit as the V of CVc'.
Example 11: The C of VCc (with C devoiced) is treated as the same synthesis unit as the C of VCc'.
In Examples 10 and 11, c' denotes the representative voiceless plosive. Rules such as Examples 10 and 11 eliminate about 3,000 triphones.

[0016] In the end, when grouping is performed using all of steps S1 through S4, about 9,000 triphones are eliminated, leaving about 6,000 triphones as synthesis units. Although embodiments of the present invention have been described in detail with reference to the drawings, the specific identification rules are not limited to those illustrated above; modifications that do not depart from the gist of the invention are also included in the present invention.
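Read operationally, steps S1 through S4 map every triphone label to a canonical representative, and two triphones share a synthesis unit exactly when they map to the same representative. The sketch below is a hypothetical rendering of those rules on string labels; the symbol inventory, the label format, and the choices of representative short vowel and representative voiceless plosive are assumptions for illustration, not details fixed by this embodiment.

```python
# Illustrative symbol sets (assumptions; the embodiment fixes no inventory).
LONG_TO_SHORT = {"a:": "a", "i:": "i", "u:": "u", "e:": "e", "o:": "o"}
SHORT_VOWELS = set(LONG_TO_SHORT.values())
VOICELESS_PLOSIVES = {"p", "t", "k"}
REP_VOWEL = "a"    # single representative short vowel for step S3 (assumed)
REP_PLOSIVE = "t"  # single representative voiceless plosive for S4 (assumed)

def canonical_triphone(prev, cur, nxt):
    """Map a triphone (prev, cur, nxt) to its group representative, S1-S4."""
    # S1: a long vowel in the preceding context counts as its short vowel.
    prev = LONG_TO_SHORT.get(prev, prev)
    # S2: a long vowel in the following context counts as its short vowel.
    nxt = LONG_TO_SHORT.get(nxt, nxt)
    # S3: before a voiceless plosive, the preceding short vowel (or the
    #     geminate Q) is represented by one short vowel.
    if cur in VOICELESS_PLOSIVES and (prev in SHORT_VOWELS or prev == "Q"):
        prev = REP_VOWEL
    # S4: a voiceless plosive in the following context is represented by
    #     one voiceless plosive.
    if nxt in VOICELESS_PLOSIVES:
        nxt = REP_PLOSIVE
    return (prev, cur, nxt)

# Example 1 (step S1): the C of vCV' and the C of VCV' share one unit.
print(canonical_triphone("a:", "s", "i") == canonical_triphone("a", "s", "i"))
# → True
```

Counting the distinct values of such a mapping over all triphones of the language is what yields the reduction from roughly 15,000 to roughly 6,000 units described above.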

[0017]

[Effects of the Invention] As described above, according to the present invention, all triphones occurring in a given language are taken into account, so uniform and stable synthesis units can be accumulated and organized regardless of their frequency of occurrence. Grouping adjacent (preceding and following) phonological contexts also greatly reduces the number of synthesis units: in Japanese, for example, phonemic features allow the number of units to be cut to about two-fifths while still covering essentially all triphones of the language almost uniformly. Moreover, because the synthesis units are speech segments (phonemes) that account for the adjacent phonological contexts, they can be concatenated smoothly at synthesis time, yielding highly natural synthesized speech.

[Brief Description of the Drawings]

[FIG. 1] A diagram illustrating a synthesis-unit creation method for speech synthesis according to an embodiment of the present invention.

Continuation of front page: (72) Inventor Kenzo Ito, 1-1-6 Uchisaiwaicho, Chiyoda-ku, Tokyo, within Nippon Telegraph and Telephone Corporation

Claims (1)

[Claims]

[Claim 1] A synthesis-unit creation method for speech synthesis in which speech segments containing phonemes are created as synthesis units, characterized in that speech segments are created in consideration of the influence of the articulation style based on the phonological context, and synthesis units are created by merging phonetically equivalent phonological contexts within the created speech segments.
JP24814395A 1995-09-26 1995-09-26 Creating synthesis units for speech synthesis Expired - Lifetime JP3275940B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP24814395A JP3275940B2 (en) 1995-09-26 1995-09-26 Creating synthesis units for speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP24814395A JP3275940B2 (en) 1995-09-26 1995-09-26 Creating synthesis units for speech synthesis

Publications (2)

Publication Number Publication Date
JPH0990972A true JPH0990972A (en) 1997-04-04
JP3275940B2 JP3275940B2 (en) 2002-04-22

Family

ID=17173871

Family Applications (1)

Application Number Title Priority Date Filing Date
JP24814395A Expired - Lifetime JP3275940B2 (en) 1995-09-26 1995-09-26 Creating synthesis units for speech synthesis

Country Status (1)

Country Link
JP (1) JP3275940B2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004139033A (en) * 2002-09-25 2004-05-13 Nippon Hoso Kyokai <Nhk> Voice synthesizing method, voice synthesizer, and voice synthesis program
JP4532862B2 (en) * 2002-09-25 2010-08-25 日本放送協会 Speech synthesis method, speech synthesizer, and speech synthesis program
US7603278B2 (en) 2004-09-15 2009-10-13 Canon Kabushiki Kaisha Segment set creating method and apparatus

Also Published As

Publication number Publication date
JP3275940B2 (en) 2002-04-22

Similar Documents

Publication Publication Date Title
US7418389B2 (en) Defining atom units between phone and syllable for TTS systems
JP4302788B2 (en) Prosodic database containing fundamental frequency templates for speech synthesis
US8566099B2 (en) Tabulating triphone sequences by 5-phoneme contexts for speech synthesis
US6778962B1 (en) Speech synthesis with prosodic model data and accent type
US6978239B2 (en) Method and apparatus for speech synthesis without prosody modification
Donovan et al. A hidden Markov-model-based trainable speech synthesizer
EP1721311A1 (en) Text-to-speech method and system, computer program product therefor
Black et al. Optimal data selection for unit selection synthesis.
Balyan et al. Speech synthesis: a review
Nose et al. Sentence selection based on extended entropy using phonetic and prosodic contexts for statistical parametric speech synthesis
Olive A new algorithm for a concatenative speech synthesis system using an augmented acoustic inventory of speech sounds.
Pradhan et al. Building speech synthesis systems for Indian languages
Mullah A comparative study of different text-to-speech synthesis techniques
JP2583074B2 (en) Voice synthesis method
US20080077407A1 (en) Phonetically enriched labeling in unit selection speech synthesis
Mustafa et al. Developing an HMM-based speech synthesis system for Malay: a comparison of iterative and isolated unit training
JP3275940B2 (en) Creating synthesis units for speech synthesis
US6847932B1 (en) Speech synthesis device handling phoneme units of extended CV
Anushiya Rachel et al. A small-footprint context-independent HMM-based synthesizer for Tamil
JP5175422B2 (en) Method for controlling time width in speech synthesis
Lobanov et al. Development of multi-voice and multi-language TTS synthesizer (languages: Belarussian, Polish, Russian)
Suzić et al. Style-code method for multi-style parametric text-to-speech synthesis
JP3318290B2 (en) Voice synthesis method and apparatus
Mustafa et al. A Cross-Lingual Approach to the Development of an HMM-Based Speech Synthesis System for Malay.
EP1777697B1 (en) Method for speech synthesis without prosody modification

Legal Events

Date Code Title Description
FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090208

Year of fee payment: 7


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100208

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110208

Year of fee payment: 9


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120208

Year of fee payment: 10

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130208

Year of fee payment: 11

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

EXPY Cancellation because of completion of term