JP2894447B2

JP2894447B2 - Speech synthesizer using complex speech units

Info

Publication number: JP2894447B2
Application number: JP62202286A
Authority: JP
Inventors: Yoshinori Sagisaka (匂坂 芳典)
Original assignee: EI TEI AARU JIDO HONYAKU DENWA KENKYUSHO KK
Current assignee: EI TEI AARU JIDO HONYAKU DENWA KENKYUSHO KK
Priority date: 1987-08-12
Filing date: 1987-08-12
Publication date: 1999-05-24
Anticipated expiration: 2014-05-24
Also published as: JPS6444498A

Description

[Detailed Description of the Invention]

[Industrial Field of Application]
The present invention relates to a speech synthesizer using complex speech units, and more particularly to a speech synthesizer that performs speech synthesis by editing, concatenating, and modifying speech units.

[Prior Art]
Various speech units have been proposed for conventional editing-type speech synthesis methods and rule-based speech synthesis methods. In order from shortest to longest, these speech units include phonemes, CV (consonant-vowel) syllables, demisyllables (consonant-vowel sequences and vowel-consonant sequences), VCV (vowel-consonant-vowel) sequences, CVC (consonant-vowel-consonant) sequences, CVCV (consonant-vowel-consonant-vowel) sequences, words, and phrases.

Conventional speech synthesis methods take a set of speech units of one of these types as their basis. Even when one or two other types of speech units are used as a supplement, speech is generated by editing, concatenating, and modifying at most two or three types of speech units.

[Problems to be Solved by the Invention]
FIG. 4 broadly divides the conventional speech units into two classes by unit length, short and long, and compares their advantages and disadvantages in configuring a speech synthesis system. In FIG. 4, the storage capacity required to store the units, which is a concern when configuring a speech synthesis system, should be small, so shorter speech units are advantageous. The man-hours required to create the synthesis units likewise favor shorter speech units. The scope for reusing the same unit in several places, and flexibility in general, also increase as the speech unit becomes shorter. On the other hand, for the unit modification method and the unit connection method, which determine speech quality, longer speech units cause fewer problems. As is clear from this comparison, it has been extremely difficult to configure a well-balanced speech synthesis system with only two types of speech units.

Therefore, the main object of the present invention is to provide a speech synthesizer using complex speech units that achieves more efficient speech synthesis by using many types of speech units with various structures, together with the partial units contained in them.

[Means for Solving the Problems]
The present invention is a speech synthesizer using complex speech units that performs speech synthesis by editing speech units, or synthesizes speech for an arbitrary vocabulary by rule. It comprises:
a speech unit file that stores in advance a large number of speech units of various lengths and structures, ranging from the phoneme level through syllables, morphemes, words, and phrases up to sentences;
a speech control file that stores in advance rules, and tables serving as parameters for applying those rules, which give specific conditions such as the extraction environment for selecting part of a speech unit, the fundamental frequency of a phoneme, and the phoneme duration, on the basis of prosodic information and the phonemic environment in which the partial unit at the synthesis location is placed;
speech unit selection means that, in response to input of a prosody control signal indicating the prosody (including pitch, speed, and loudness) corresponding to the speech content to be output, selects a speech unit sequence suitable for synthesis from the speech unit file on the basis of the rules and tables given from the speech control file;
combining means for combining the selected speech unit sequence; and
synthesized waveform generation means for generating a synthesized waveform on the basis of the combined synthesis unit sequence.

[Operation]
In response to input of a prosody control signal, the speech synthesizer using complex speech units according to the present invention selects a speech unit sequence suitable for synthesis from the speech unit file on the basis of the rules and tables given from the speech control file, combines the selected units, and generates a synthesized waveform from the combined synthesis unit sequence, thereby performing speech synthesis.

[Embodiment of the Invention]
An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a schematic block diagram of the speech synthesizer of the present invention. In FIG. 1, prosody control signals corresponding to the speech content to be output, such as a phoneme sequence signal, accent, and breath pauses, are input to input terminal 1.
The prosody control signal input to input terminal 1 is supplied to speech unit selection unit 2. Based on this input information, speech unit selection unit 2 selects a speech unit sequence suitable for synthesis from the many speech units of various structures stored in advance in speech unit file 3. Speech control knowledge file 4 stores rules and tables that give specific synthesis conditions such as the extraction environment, fundamental frequency, and phoneme duration.

When selecting a speech unit sequence, speech unit selection unit 2 receives from speech control knowledge file 4, on the basis of the phonemic environment and prosodic information at the synthesis location, specific synthesis conditions such as the desired extraction environment of the synthesis unit, the fundamental frequency, and the phoneme duration, and uses these conditions as the criteria for unit selection. The speech unit sequence selected by speech unit selection unit 2 is supplied to unit combining unit 5, where the selected units are joined to one another. The combined speech units are supplied to synthesized waveform generation unit 6, which generates a synthesized waveform from the synthesis unit sequence obtained by unit combining unit 5. The generated speech is output to output terminal 7.

The method according to the present invention is characterized by its use of many types of units with various structures. It places no restriction on the acoustic parameters that represent the speech units, the criteria used for unit selection, the speech control knowledge, or the synthesizer that generates the speech waveform, and it is applicable to all of them.

FIG. 2 shows a specific example of a speech unit file used in the present invention. In particular, it shows a speech unit set that includes, as a subset, the 50 most frequent phoneme sequences (sequence lengths 1 to 4) found in the headwords of a Japanese-language dictionary. In FIG. 2, the number to the right of each phoneme sequence is its frequency of occurrence, and the sequences are listed in descending order of frequency for each sequence length. The 108 circled phoneme sequences in FIG. 2 constitute the unit file, and their partial sequences cover these top-50 sequences of lengths 1 to 4.

More specifically, the length-1 sequence "u" is contained in the length-2 sequence "ru", which is contained in the length-3 sequence "eru", which in turn is contained in the length-4 sequence "keru". The sequences "u", "ru", and "eru" therefore do not need to be stored in the speech unit file; they are obtained by extracting them as partial units from "keru".

FIG. 3 shows an example of the speech unit sequence selected when the set of FIG. 2 is used as the speech unit set and, for example, 「桜が咲く」 /sakuragasaku/ ("the cherry blossoms bloom") is input.

As shown in FIG. 3, the sentence-initial /sa/ is given by a partial sequence taken from the length-4 unit /saku/, and /kura/ is given by the length-4 unit /kura/, which together realize the speech unit sequence for "sakura". Further, "ga" is realized by the /ga/ within the length-4 unit /gaku/, and "saku" is likewise realized by the length-4 unit /saku/.

[Effects of the Invention]
As described above, according to the present invention, diversifying the speech units makes it possible to define a unit set freely and efficiently according to the requirements of the build, such as the storage capacity allowed for the units and the man-hours allowed for creating them. It also allows units to be used effectively and flexibly, for example by partially using the short sequences contained in long speech units. Furthermore, because longer units are used, there is little degradation in speech quality caused by the connections between units, and good speech quality can be expected.
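
The storage saving described for FIG. 2 can be illustrated with a short sketch. It is not part of the patent: the function names are hypothetical, and it assumes each phoneme is written as one Latin letter, so that a partial sequence of a unit is simply a substring. Under those assumptions, a candidate unit is dropped from the file whenever it already occurs inside a longer candidate.

```python
def prune_contained_units(units):
    """Keep only the units that are not a contiguous part of another unit."""
    unique = set(units)
    return sorted(
        u for u in unique
        if not any(u != other and u in other for other in unique)
    )


def covered_subunits(stored, max_len=4):
    """Return every contiguous part (length 1..max_len) of the stored units."""
    covered = set()
    for unit in stored:
        for start in range(len(unit)):
            for end in range(start + 1, min(start + max_len, len(unit)) + 1):
                covered.add(unit[start:end])
    return covered


# The example from the description: "u", "ru" and "eru" all occur inside "keru".
candidates = ["u", "ru", "eru", "keru"]
stored = prune_contained_units(candidates)
print(stored)  # ['keru']

# Every dropped candidate can still be extracted from what was stored.
assert set(candidates) <= covered_subunits(stored)
```

Only "keru" needs to be stored here; the three shorter sequences are recovered by extraction, which is the relationship between the 108 circled units and the top-50 lists that the description reports for FIG. 2.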

[Brief Description of the Drawings]
FIG. 1 is a schematic block diagram for implementing the method of the present invention. FIG. 2 shows a specific example of the contents of a speech unit file. FIG. 3 shows an example of the speech unit sequence selected for an input phoneme sequence. FIG. 4 shows the relationship between the length of the speech synthesis unit and the problems that arise in configuring a speech synthesis system.

In the figures, 1 denotes the input terminal, 2 the speech unit selection unit, 3 the speech unit file, 4 the speech control knowledge file, 5 the unit combining unit, 6 the synthesized waveform generation unit, and 7 the output terminal.
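
The unit selection illustrated in FIG. 3 can be sketched in code as follows. The patent deliberately leaves the selection criteria open, so everything here is an assumption made for illustration: the function names are hypothetical, the unit file is reduced to the three length-4 units named in the description, phonemes are written as single letters, and the criterion is simply "cover the input with as few parts as possible, preferring a longer final part on ties". Prosodic conditions from the speech control knowledge file are not modeled.

```python
def build_index(stored_units, max_len=4):
    """Map each contiguous part (length 1..max_len) to the first unit containing it."""
    index = {}
    for unit in stored_units:
        for start in range(len(unit)):
            for end in range(start + 1, min(start + max_len, len(unit)) + 1):
                index.setdefault(unit[start:end], unit)
    return index


def select_units(text, stored_units, max_len=4):
    """Cover `text` with the fewest parts; on ties, prefer the longer final part."""
    index = build_index(stored_units, max_len)
    n = len(text)
    inf = float("inf")
    best = [0] + [inf] * n   # best[i]: fewest parts that cover text[:i]
    back = [0] * (n + 1)     # back[i]: start position of the last part in that cover
    for i in range(1, n + 1):
        # Try longer parts first so that ties keep the longer final part.
        for length in range(min(max_len, i), 0, -1):
            j = i - length
            if text[j:i] in index and best[j] + 1 < best[i]:
                best[i], back[i] = best[j] + 1, j
    if best[n] == inf:
        raise ValueError("input cannot be covered by the stored units")
    parts, i = [], n
    while i > 0:
        j = back[i]
        parts.append((text[j:i], index[text[j:i]]))
        i = j
    return parts[::-1]


# Hypothetical three-unit excerpt of the unit file, taken from the FIG. 3 example.
stored = ["saku", "kura", "gaku"]
for part, source in select_units("sakuragasaku", stored):
    print(f"/{part}/ taken from /{source}/")
```

With this excerpt and this tie-breaking rule the sketch yields /sa/ from /saku/, /kura/, /ga/ from /gaku/, and /saku/, the same sequence the description gives for FIG. 3. A real selector would also weigh the extraction environment, fundamental frequency, and phoneme duration supplied by the speech control knowledge file.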

Claims

(57) [Claims]
A speech synthesizer using complex speech units, which performs speech synthesis by editing speech units or synthesizes speech for an arbitrary vocabulary by rule, comprising:
a speech unit file that stores in advance a large number of speech units of various lengths and structures, ranging from the phoneme level through syllables, morphemes, words, and phrases up to sentences;
a speech control file that stores in advance rules, and tables serving as parameters for applying those rules, which give specific conditions such as the extraction environment for selecting part of a speech unit, the fundamental frequency of a phoneme, and the phoneme duration, on the basis of prosodic information and the phonemic environment in which the partial unit at the synthesis location is placed;
speech unit selection means for selecting, in response to input of a prosody control signal indicating the prosody (including pitch, speed, and loudness) corresponding to the speech content to be output, a speech unit sequence suitable for synthesis from the speech unit file on the basis of the rules and tables given from the speech control file;
combining means for combining the speech unit sequence selected by the speech unit selection means; and
synthesized waveform generation means for generating a synthesized waveform on the basis of the synthesis unit sequence combined by the combining means.