JPH07152396A - Voice synthesizer

Info

Publication number: JPH07152396A
Application number: JP6050890A
Authority: JP
Inventors: Takahiro Kamai; 孝浩釜井; Kenji Matsui; 謙二松井
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-03-24
Filing date: 1994-03-22
Publication date: 1995-06-16
Anticipated expiration: 2015-09-18
Also published as: JP3089940B2

Abstract

PURPOSE: To provide a high-quality speech synthesizer that can flexibly handle a variety of voice qualities. CONSTITUTION: The device comprises a voiced sound source unit 38, a serial formant synthesis unit 32, a consonant generation unit 39 having a consonant waveform storage unit 33, and a synthesis unit 35. Vowels and nasals are formed by the voiced sound source unit 38 and the serial formant synthesis unit 32, consonants such as plosives and fricatives are generated from waveforms stored in the consonant waveform storage unit 33, and the two are combined by the synthesis unit 35. Both waveforms are phase-synchronized by a pitch synchronization signal generation unit 24. For vowels, the formant synthesis method provides flexible and varied voice quality and intonation; for consonants, the method using waveform segments provides high-quality speech that cannot be achieved with an ordinary formant synthesis method. Because only consonants, which have short durations, are stored as waveform segments, a small-capacity storage device suffices.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesizer that converts arbitrary text into speech.

[0002]

[Prior Art] Speech synthesis methods for converting arbitrary text into speech fall broadly into two types. The first is based on understanding the speech production mechanism, that is, the movements of the vocal cords, mouth, and throat, and turning that knowledge into rules that control electrical circuits or the like. The second requires little knowledge of speech; it prepares a large number of speech segments and concatenates the segments that suit the input. A well-known example of the former is the combination of a formant synthesis method with formant control rules. FIG. 4 shows an example configuration of this combination. In the figure, the formant synthesizer control rule storage unit 9 stores a plurality of rules for controlling the formant synthesizer; the formant synthesizer control coefficient generation unit 8 generates coefficients for controlling the formant synthesizer on the basis of those control rules; the formant synthesizer 10 actually synthesizes the speech; the voiced sound source unit 1 simulates vocal cord vibration; the serial formant synthesis unit 2 has formant resonators connected in series and synthesizes voiced sounds such as vowels and nasals; the unvoiced sound source unit 6 is the turbulent noise source needed to synthesize fricatives, plosives, and the like; and the parallel formant synthesis unit 7 has resonators connected in parallel and synthesizes unvoiced consonant portions such as fricatives and plosives. The synthesis unit 5 combines the output of the serial formant synthesis unit 2 with the output of the parallel formant synthesis unit 7 and outputs the synthesized sound.

[0003] When the phonetic symbols, accent positions, intonation information, and other data needed for speech synthesis are input to the formant synthesizer control coefficient generation unit 8, that unit refers to the necessary rules in the formant synthesizer control rule storage unit 9 and outputs formant synthesizer control coefficients to the formant synthesizer 10. The formant synthesizer 10 operates internally as follows. The voiced sound source unit 1 simulates the pulse-like source waveform produced at the human vocal cords when a voiced sound such as a vowel is synthesized. This pulse-like signal is input to the serial formant synthesis unit 2, which uses a plurality of formant resonators to give the source waveform the characteristics appropriate to a vowel or nasal and outputs the result to the synthesis unit 5. Meanwhile, the unvoiced sound source unit 6 sends a noise-like waveform, the source for fricatives and plosives, to the parallel formant synthesis unit 7, which uses a plurality of resonators to form instantaneously the spectral characteristics required for each consonant and outputs the result to the synthesis unit 5. The synthesis unit 5 combines the vowels and nasals from the serial formant synthesis unit 2 with the consonants from the parallel formant synthesis unit 7 and outputs them as synthesized speech.
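As a purely illustrative sketch of the serial (cascade) formant synthesis described above, the following Python code passes a pulse-train source through second-order digital resonators connected in series. The resonator difference equation and its coefficient formulas follow the commonly used Klatt-style form, which is an assumption here: the specification does not state how the resonators are realized. The sampling rate, formant values, and all function names are likewise hypothetical.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2], with unity gain at DC."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs_hz):
    """Filter `signal` through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def cascade_formant_synthesis(source, formants, fs_hz):
    """Apply the resonators one after another (series connection)."""
    signal = list(source)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs_hz)
    return signal

def pulse_train(n_samples, period_samples):
    """Crude stand-in for the voiced source: one unit pulse per pitch period."""
    return [1.0 if n % period_samples == 0 else 0.0 for n in range(n_samples)]

FS = 10000  # hypothetical sampling rate in Hz
source = pulse_train(400, 100)  # 100 Hz pitch
# Hypothetical (frequency, bandwidth) pairs, roughly vowel-like.
vowel = cascade_formant_synthesis(source, [(700, 60), (1200, 90), (2600, 120)], FS)
```

In a series connection the relative formant amplitudes follow from the frequencies and bandwidths alone, which is one reason this arrangement is commonly used for vowels.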

[0004] Next, another conventional method, which uses waveform segments (speech segments), is described. FIG. 5 shows the configuration of this method. The waveform segment selection unit 11 selects, from the waveform segment database storage unit 12, the waveform segments needed for synthesis according to the input phonetic symbol string and accent information. In this case the waveform segments are usually stored in compressed form as coefficients such as linear prediction coefficients. The selected waveform segments are concatenated in the segment concatenation synthesis unit 13 and synthesized into a speech waveform at an appropriate fundamental frequency.

[0005]

[Problems to Be Solved by the Invention] Having studied the two methods above, the inventors found that they have the following different characteristics.

[0006] The advantage of the former method is that, because every sound is built up by rule, it is highly flexible and can synthesize speech with a variety of voice qualities and intonations. Its disadvantage is that the synthesis rules for sounds with complex production mechanisms, consonants in particular, have not yet been studied sufficiently, so it is difficult to generate consonants of natural quality.

[0007] The advantage of the latter method is that, because the waveform segments are essentially cut out of the natural speech that serves as the model, the synthesis quality is extremely high provided the segments can be joined smoothly. Its disadvantage is that storing the waveform segments requires a large-capacity storage device, which raises cost. It also can synthesize only the voice quality of the model speech and therefore lacks flexibility.

[0008] In summary, the method that builds every sound by rule is flexible and can synthesize speech with a variety of voice qualities and intonations, but sounds with complex production mechanisms, such as consonants, are difficult to synthesize because their synthesis rules are not yet well established. The method that uses waveform segments achieves extremely high synthesis quality, but it requires a large-capacity storage device for the waveform segments, and it lacks flexibility because it can synthesize only the voice quality of the model speech.

[0009] An object of the present invention is to provide a speech synthesizer with high synthesis quality that offers flexible voice quality and requires far less storage capacity than the conventional method using waveform segments.

[0010]

[Means for Solving the Problems] The invention of claim 1 is a speech synthesizer comprising: a voiced sound source unit that outputs a voiced sound source signal; a serial formant synthesis unit that receives the voiced sound source signal from the voiced sound source unit, has a plurality of formant resonators connected in series, and synthesizes predetermined sounds such as vowels; a waveform storage unit that stores waveforms of a plurality of predetermined sounds such as consonants; a waveform reading unit that reads a required waveform from the waveform storage unit; and a waveform combining unit that superimposes, or switches between, the output of the serial formant synthesis unit and the waveform read by the waveform reading unit and outputs the result as synthesized speech.

[0011] The invention of claim 3 is a speech synthesizer comprising: sound generation means for generating a speech signal on the basis of feature parameters extracted from speech; waveform segment storage means for storing waveform segments cut out of speech; waveform segment feature storage means for storing predetermined feature quantities of the stored waveform segments; and control means for causing the speech signal generated by the sound generation means and the waveform segment signal obtained from the waveform segment storage means to be combined on the basis of the stored feature quantities of the waveform segments.

[0012] The invention of claim 17 is a speech synthesizer comprising voiced sound source waveform generation means for generating voiced sound, a serial formant synthesis unit, consonant waveform generation means for generating consonants, waveform connection means for connecting waveforms, and pitch synchronization signal generation means, wherein: the pitch synchronization signal generation means outputs a pitch synchronization signal corresponding to a desired pitch period; the voiced sound source waveform generation means and the consonant waveform generation means both generate waveforms whose phase is synchronized with the pitch synchronization signal; the serial formant synthesis unit modifies the frequency characteristics of the output waveform of the voiced sound source waveform generation means with a transfer function that simulates vocal tract characteristics; and the waveform connection means generates a speech waveform by connecting or mixing the output waveform of the serial formant synthesis unit and the output waveform of the consonant waveform generation means.

[0013]

[Operation] In the invention of claim 1, the voiced sound source unit outputs a voiced sound source signal; the serial formant synthesis unit, which has a plurality of formant resonators connected in series, receives the voiced sound source signal from the voiced sound source unit and synthesizes predetermined sounds such as vowels; the waveform storage unit stores waveforms of a plurality of predetermined sounds such as consonants; and the waveform reading unit reads a required waveform from the waveform storage unit. The waveform combining unit superimposes, or switches between, the output of the serial formant synthesis unit and the waveform read by the waveform reading unit and outputs the result as synthesized speech.

[0014] In the invention of claim 3, the sound generation means generates a speech signal on the basis of feature parameters extracted from speech; the waveform segment storage means stores waveform segments cut out of speech; the waveform segment feature storage means stores predetermined feature quantities of the stored waveform segments; and the control means, on the basis of the stored feature quantities of the waveform segments, causes the speech signal generated by the sound generation means and the waveform segment signal obtained from the waveform segment storage means to be combined.

[0015] In the invention of claim 17, the pitch synchronization signal generation means outputs a pitch synchronization signal corresponding to a desired pitch period; the voiced sound source waveform generation means and the consonant waveform generation means both generate waveforms whose phase is synchronized with the pitch synchronization signal; the serial formant synthesis unit modifies the frequency characteristics of the output waveform of the voiced sound source waveform generation means with a transfer function that simulates vocal tract characteristics; and the waveform connection means generates a speech waveform by connecting or mixing the output waveform of the serial formant synthesis unit and the output waveform of the consonant waveform generation means.
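The role of the pitch synchronization signal can be pictured with the minimal Python sketch below. It assumes, purely for illustration, that the signal is represented as a list of sample indices, one per pitch period, and that the voiced source emits a pulse at each index; this passage of the specification does not fix either representation, and the function names are hypothetical.

```python
def pitch_sync_marks(n_samples, period_at):
    """Return the sample indices of the pitch synchronization pulses.

    `period_at(t)` gives the desired pitch period (in samples) at sample t.
    """
    marks = []
    t = 0
    while t < n_samples:
        marks.append(t)
        period = period_at(t)
        if period <= 0:
            raise ValueError("pitch period must be positive")
        t += period
    return marks

def pulses_at(marks, n_samples):
    """Hypothetical voiced source: a unit pulse at every synchronization mark."""
    out = [0.0] * n_samples
    for m in marks:
        out[m] = 1.0
    return out

marks = pitch_sync_marks(400, lambda t: 80)  # constant 80-sample period
source = pulses_at(marks, 400)
print(marks)  # [0, 80, 160, 240, 320]
```

Because both waveform generators would read the same list of marks, their periodic components start each period at the same instants, which is the sense in which their phases are synchronized in this sketch.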

[0016]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings.

[0017] FIG. 1 shows the configuration of a speech synthesizer according to a first embodiment of the present invention. The embodiments below are described mainly for Japanese, but they are applicable to other languages such as English insofar as no difficulty arises. In the figure, the voiced sound source unit 1 simulates vocal cord vibration and generates a source signal. The serial formant synthesis unit 2 synthesizes voiced sounds such as vowels. The consonant waveform storage unit 3 stores consonant waveform segments cut out of natural speech; the consonant waveform reading unit 4 selects and retrieves the required waveform segment; and the synthesis unit 5 combines the output of the serial formant synthesis unit 2 with the output of the consonant waveform reading unit 4 and outputs the result as synthesized speech.

[0018] The operation of the speech synthesizer of this embodiment, configured as above, is described below.

[0019] As in the conventional example, formant synthesizer control coefficients are first supplied to the synthesizer. The voiced sound source unit 1 generates the desired source signal from the fundamental frequency information, source amplitude information, and the like contained in the formant synthesizer control coefficients and inputs it to the serial formant synthesis unit 2. No source signal is output during consonant or unvoiced intervals. The serial formant synthesis unit 2 determines the characteristics of its series-connected resonators from the formant frequency information, the formant bandwidth information, and the like in the control coefficients, and converts the source signal into a speech signal such as a vowel. The output of the serial formant synthesis unit 2 is sent to the synthesis unit 5. Meanwhile, the consonant waveform reading unit 4 checks, from the phoneme information in the formant synthesizer control coefficients, whether the phoneme exists in the consonant waveform storage unit 3; if it does, the unit retrieves that waveform segment from the consonant waveform storage unit 3 and sends it to the synthesis unit 5. For example, as shown in FIG. 2, when the phoneme to be synthesized is "k" and the following vowel is "a", the consonant waveform reading unit 4 searches the consonant waveform storage unit 3 for a waveform segment of the consonant "k" cut out of "ka". The synthesis unit 5 combines the vowel signal from the serial formant synthesis unit 2 and the consonant signal from the consonant waveform reading unit 4 by addition, superposition, or the like. With this configuration, the formant synthesis method gives vowels flexible and varied voice quality and intonation, while the waveform segment method gives consonants a high quality that formant synthesis cannot achieve. Because only consonants, which have short durations, are stored as waveform segments, a small-capacity storage device suffices.
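The lookup and combination steps of the first embodiment can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the dictionary keyed by (consonant, following vowel), the sample values, and the function names are all assumptions, and plain sample-wise addition is used as one of the combination methods the text mentions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical store: (consonant, following vowel) -> waveform samples.
ConsonantStore = Dict[Tuple[str, str], List[float]]

def read_consonant(store: ConsonantStore, phoneme: str,
                   next_vowel: str) -> Optional[List[float]]:
    """Return the stored segment, or None if the store has no such segment."""
    return store.get((phoneme, next_vowel))

def combine(vowel_signal: List[float], consonant_signal: List[float],
            consonant_start: int) -> List[float]:
    """Add the consonant segment into the vowel signal at `consonant_start`."""
    out = list(vowel_signal)
    needed = consonant_start + len(consonant_signal)
    if needed > len(out):
        out.extend([0.0] * (needed - len(out)))
    for i, sample in enumerate(consonant_signal):
        out[consonant_start + i] += sample
    return out

store: ConsonantStore = {("k", "a"): [0.5, -0.5]}  # toy data
vowel = [0.0, 0.0, 0.0, 1.0, 1.0]  # silent during the consonant, then the vowel
segment = read_consonant(store, "k", "a")
speech = combine(vowel, segment, 1) if segment is not None else vowel
print(speech)  # [0.0, 0.5, -0.5, 1.0, 1.0]
```

Because the voiced source is silent during the consonant interval, simple addition already places the stored consonant ahead of the synthesized vowel in this toy case.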

[0020] Next, with reference to FIG. 3, a speech synthesizer according to a second embodiment of the present invention is described. It retains the advantages of the above method while reducing the number of waveform segment types and hence the required storage capacity.

[0021] In the figure, the unvoiced sound source unit 6 is the sound source for consonants, and the parallel formant synthesis unit 7 synthesizes consonants such as plosives and fricatives from the signal supplied by the unvoiced sound source unit 6 by means of a plurality of resonators connected in parallel. The other means are the same as in the first embodiment.

[0022] The operation of the speech synthesizer of this embodiment, configured as above, is described below.

[0023] As in the first embodiment, formant synthesizer control coefficients are first supplied to the synthesizer; the voiced sound source unit 1 and the serial formant synthesis unit 2 convert them into a vowel-like signal, which is sent to the synthesis unit 5. In addition, those consonants for which formant synthesis achieves sufficiently high quality are handled by the unvoiced sound source unit 6 and the parallel formant synthesis unit 7. That is, on the basis of the unvoiced source information in the supplied control coefficients, the unvoiced sound source unit 6 adjusts the amplitude, timing, and so on of a noise-like signal and sends it to the parallel formant synthesis unit 7. There, on the basis of information such as the spectral characteristics of the consonant to be synthesized, the resonators arranged in parallel convert the noise-like signal into the desired consonant-like signal, which is passed to the synthesis unit 5. As in the first embodiment, the consonant waveform reading unit 4 and the consonant waveform storage unit 3 search the waveform segment database for consonants not handled by the parallel formant synthesis unit 7 and send them to the synthesis unit 5. As in the first embodiment, the synthesis unit 5 combines the vowel-like signal from the serial formant synthesis unit 2 with the consonant-like signal from the consonant waveform reading unit 4 or from the parallel formant synthesis unit 7 by addition, superposition, or the like. With this configuration, consonants for which formant synthesis achieves sufficiently high quality can be produced by the unvoiced sound source unit and the parallel formant synthesis unit, so the storage capacity needed for waveform segments can be reduced.
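The division of labor in the second embodiment amounts to a routing decision per consonant, sketched below in Python. The specification does not say which consonants are synthesized well enough by the formant method, so the set `FORMANT_CONSONANTS` is an arbitrary placeholder, and the function and path names are hypothetical.

```python
# Placeholder only: which consonants the parallel formant path handles is
# not stated in the text.
FORMANT_CONSONANTS = {"h"}

def choose_consonant_path(phoneme, next_vowel, formant_consonants, stored_keys):
    """Decide which unit produces the consonant.

    Returns "parallel_formant", "waveform", or "none".
    """
    if phoneme in formant_consonants:
        return "parallel_formant"  # unvoiced source 6 + parallel formant unit 7
    if (phoneme, next_vowel) in stored_keys:
        return "waveform"          # consonant waveform storage 3 / reading unit 4
    return "none"

stored = {("k", "a"), ("s", "a")}
print(choose_consonant_path("h", "a", FORMANT_CONSONANTS, stored))  # parallel_formant
print(choose_consonant_path("k", "a", FORMANT_CONSONANTS, stored))  # waveform
```

Every consonant moved into the formant set is one fewer family of segments to keep in the store, which is where the memory saving comes from.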

[0024] Furthermore, by driving the parallel formant synthesis unit 7 and the consonant waveform reading unit 4 simultaneously, it is possible, for example, to further emphasize the burst portion of a waveform segment with the signal from the parallel formant synthesis unit 7, and thereby to raise intelligibility above that of natural speech in noisy environments and the like.

[0025] Next, another embodiment of the present invention is described.

[0026] The consonant waveform storage unit stores consonant portions cut out of natural speech waveforms. For an unvoiced consonant, the consonant portion, such as the burst or the frication, can be separated from the voiced portion, that is, the portion after vocal cord vibration has begun; by storing only the consonant portion, the same segment can be used for synthesis at any pitch. For a voiced consonant, however, the consonant portion cannot be separated from the voiced portion, so the waveform after the onset of vocal cord vibration must be included in the segment.

[0027] Moreover, cues for perceiving a consonant are generally also contained in the following vowel. Sound quality can therefore be improved by including the beginning of the following vowel in the consonant waveform segment.

[0028] Consequently, the consonant waveform segment and the serial formant synthesis waveform must be joined within the following vowel. If, for example, the output is switched instantaneously to the serial formant synthesis waveform partway through the consonant waveform segment, a waveform discontinuity arises and produces impulsive noise.

[0029] One conceivable approach is to perform a smooth overlap over a predetermined interval: the consonant waveform segment is attenuated smoothly while the serial formant synthesis waveform is raised smoothly. If the first one or two pitch periods of the following vowel are included in the consonant waveform segment and the overlap is performed over an interval of about one pitch period, the consonant waveform segment can be used without regard to pitch.
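The smooth overlap described above can be sketched as a crossfade. The Python code below is a minimal illustration that assumes linear fade ramps and list-of-samples signals; the specification only requires the transition to be smooth, so the ramp shape and the function name are assumptions.

```python
def crossfade_join(consonant, vowel, overlap):
    """Join two sample lists, fading `consonant` out and `vowel` in.

    The last `overlap` samples of `consonant` are mixed with the first
    `overlap` samples of `vowel` using complementary linear weights.
    """
    if overlap < 1 or overlap > min(len(consonant), len(vowel)):
        raise ValueError("overlap must fit inside both signals")
    head = consonant[:-overlap]
    tail = vowel[overlap:]
    offset = len(consonant) - overlap
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # vowel weight, rising towards 1
        mixed.append((1.0 - w) * consonant[offset + i] + w * vowel[i])
    return head + mixed + tail

joined = crossfade_join([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], 3)
print(joined)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

With `overlap` set to roughly one pitch period, as the text suggests, the joined signal has length `len(consonant) + len(vowel) - overlap` and contains no abrupt step at the junction.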

[0030] Even with this connection method, however, a phase discontinuity occurs and sound quality deteriorates unless the timing of the two waveforms is controlled. For example, when a consonant waveform segment and a serial formant synthesis waveform with the same pitch are joined, the pitch period changes momentarily near the junction unless the timing of the two is controlled precisely. In other words, this happens because the two waveforms differ in phase.
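The momentary change in pitch period can be checked with simple arithmetic. The Python sketch below assumes, for illustration only, that each waveform is summarized by the sample indices of its excitation instants (glottal peaks in the segment, source pulses in the synthesized vowel); the numbers and names are hypothetical.

```python
def intervals_across_junction(segment_peaks, synth_pulses):
    """Intervals between consecutive excitation instants over the junction."""
    instants = sorted(segment_peaks + synth_pulses)
    return [later - earlier for earlier, later in zip(instants, instants[1:])]

PERIOD = 80                  # both waveforms have the same pitch period
segment_peaks = [100, 180]   # last two peaks inside the consonant segment
in_phase = [260, 340]        # synthetic pulses continue the same grid
out_of_phase = [230, 310]    # same period, but started 30 samples early

print(intervals_across_junction(segment_peaks, in_phase))      # [80, 80, 80]
print(intervals_across_junction(segment_peaks, out_of_phase))  # [80, 50, 80]
```

Both signals have an 80-sample period, yet in the second case the interval at the junction is 50 samples, which corresponds to the momentary pitch change described in the text.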

[0031] In addition, unless the output timing of the consonant is controlled precisely, its phonemic identity is impaired; for example, "sa" may turn into "tsa".

[0032] In the next embodiment, therefore, labels are attached to the consonant waveform segments to solve the above problems, and the waveform timing at the junction is controlled on the basis of those labels.

[0033] FIG. 6 is a block diagram of a speech synthesizer according to a third embodiment of the present invention. The speech synthesizer has a voiced sound generation unit 14 and a consonant waveform generation unit 17, both of which are connected to a control unit 21 that controls the generation of speech waveforms. Connected to the control unit 21 is a consonant waveform label storage unit 18, which stores the labels attached to each consonant segment stored in the consonant waveform storage unit 19 of the consonant waveform generation unit 17. The outputs of the voiced sound generation unit 14 and the consonant waveform generation unit 17 are connected in parallel to an output unit 20 via a synthesis unit 22. Inside the voiced sound generation unit 14 are a voiced sound source unit 15 and a serial formant synthesis unit 16; the output of the voiced sound source unit 15 is connected to the serial formant synthesis unit 16, and the output of the serial formant synthesis unit 16 is connected to the synthesis unit 22 as the output of the voiced sound generation unit 14. Here the voiced sound generation unit 14 is the sound generation means, the consonant waveform storage unit 19 is the waveform segment storage means, and the consonant waveform label storage unit 18 is the waveform segment feature storage means.

[0034] For every required consonant segment, the consonant waveform label storage unit 18 stores labels that indicate waveform timing as feature quantities, as shown in FIG. 7. FIG. 7 illustrates how labels are assigned to an unvoiced consonant segment. In FIG. 7, strt is the "start label", brst the "burst label", sov the "voicing start label", peak the "peak label", and end the "end label". The values gain and magn are also stored as feature quantities; gain is "gain information" and magn is "peak value information".
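One way to picture a record in the consonant waveform label storage unit is the Python sketch below. The field names follow the labels in the text (strt, brst, sov, peak, end, gain, magn); representing the timing labels as sample indices and the two remaining values as floats is an assumption, as are the class name and the example numbers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConsonantLabels:
    """Labels for one consonant segment (units are assumed, see above)."""
    strt: int    # start label: where output of the segment begins
    brst: int    # burst label: instant of the burst event
    sov: int     # voicing start label (unvoiced consonants)
    peak: int    # peak label: last waveform peak before `end`
    end: int     # end label: where output of the segment stops
    gain: float  # gain information
    magn: float  # peak value information

    def within_segment(self) -> bool:
        """Check that every timing label lies between strt and end."""
        return all(self.strt <= t <= self.end
                   for t in (self.brst, self.sov, self.peak))

labels = ConsonantLabels(strt=0, brst=120, sov=300, peak=430, end=460,
                         gain=1.0, magn=0.8)
print(labels.within_segment())  # True
```

Keeping the labels apart from the waveform samples mirrors the separation between the label storage unit 18 and the waveform storage unit 19.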

[0035] The start and end labels mark, literally, the points at which output of the consonant segment starts and ends. The end label is placed at the zero-crossing point two pitch periods after the onset of glottal source vibration. This is so that the consonant features contained in the following vowel are included in the consonant segment. One would like to use a large number of pitch periods in order to include as many features as possible, but the pitch of the consonant segment itself then becomes strongly perceptible. When the pitch at synthesis time differs from it, the resulting pitch discontinuity degrades sound quality. Taking this into account, the smallest number of pitch periods that still captures the consonant's features sufficiently is chosen for each consonant segment. A consonant segment with one or two pitch periods gives only a weak sensation of pitch (called pitchness here) and may be used as is, without regard to the pitch at synthesis time. Consonant segments with more pitch periods, voiced consonants, and the like have strong pitchness, so the pitch at synthesis time must be taken into account. For these, one may prepare consonant segments at several pitches and select the one whose pitch is closest at synthesis time, or apply a pitch modification operation (linear stretching or pitch-synchronous overlap-add) to the consonant segment.
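The first of the two remedies, choosing the stored variant whose pitch is closest to the target, can be sketched in Python as follows. The dictionary keyed by pitch period in samples, the toy waveforms, and the function names are assumptions; the rule in `needs_pitch_handling` simply restates the criterion given in the text.

```python
def needs_pitch_handling(n_pitch_periods, voiced):
    """Segments with one or two pitch periods may be used as is; segments
    with more periods, and voiced consonants, need pitch handling."""
    return voiced or n_pitch_periods > 2

def select_nearest_pitch(variants, target_period):
    """Pick the variant whose pitch period is closest to `target_period`.

    `variants` maps a pitch period in samples to the waveform recorded at
    that pitch (a hypothetical layout). Returns (period, waveform).
    """
    if not variants:
        raise ValueError("no variants stored for this consonant")
    best = min(variants, key=lambda period: abs(period - target_period))
    return best, variants[best]

variants = {70: [0.1, 0.2], 90: [0.3, 0.4], 110: [0.5, 0.6]}  # toy data
period, waveform = select_nearest_pitch(variants, 95)
print(period)  # 90
```

The alternative remedy, modifying the pitch of a single stored segment, trades this extra storage for additional signal processing at synthesis time.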

【００３６】バーストラベルは破裂子音の破裂部、摩擦
子音の摩擦部などそれぞれの子音を特徴づける調音が行
われる瞬間（ここではそれらを総称して破裂事象とす
る）に付けられるラベルで、合成時に子音素片の発音タ
イミングを決定するために使用する。The burst label is attached to the instant at which the articulation that characterizes each consonant takes place, such as the burst of a plosive or the frication of a fricative (collectively called burst events here). It is used during synthesis to determine when the consonant segment is sounded.

【００３７】ボイシング開始ラベルは子音素片が無声子
音の場合に付けられるラベルである。このラベルは無声
化した子音を合成するために用いられる。無声化とは語
尾や後続の音韻環境によって無声子音の後続母音が消滅
する現象である。無声化した子音を合成するときは子音
素片の発音をこのラベルで終了する。無声化は、本質的
には子音部分の後に声帯が振動するかしないかの差であ
るので、このように声帯振動開始点で発音を停止すれば
再現できる。The voicing start label is attached when the consonant segment is an unvoiced consonant. It is used to synthesize devoiced consonants. Devoicing is the phenomenon in which the vowel following an unvoiced consonant disappears at the end of a word or because of the following phonemic environment. When a devoiced consonant is synthesized, output of the consonant segment ends at this label. Because devoicing is essentially a matter of whether or not the vocal cords vibrate after the consonant portion, it can be reproduced by stopping output at the point where vocal-cord vibration begins.
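
A minimal sketch of this truncation, assuming the labels are stored as sample indices in a dictionary. The key `sov` for the voicing start label, the function name, and the index values are assumptions made for illustration.

```python
def consonant_output(wave, labels, devoiced):
    """Return the samples of an unvoiced consonant segment to be output.

    For a devoiced consonant, output stops at the voicing start label,
    otherwise it runs to the end label.
    """
    stop = labels["sov"] if devoiced else labels["end"]
    return wave[labels["strt"]:stop]


# Hypothetical segment: sample values equal their index for easy checking.
wave = list(range(100))
labels = {"strt": 5, "brst": 30, "sov": 60, "peak": 85, "end": 90}

normal = consonant_output(wave, labels, devoiced=False)
devoiced = consonant_output(wave, labels, devoiced=True)
```

The same stored segment serves both cases, which is why no dedicated devoiced recording is needed.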

【００３８】ピークラベルは終了ラベル直前の波形上の
ピークに付与され、後述する有声音生成部１４と子音波
形生成部１７の出力の同期に用いられる。このピークは
声帯が閉じた瞬間に発生する。The peak label is attached to the waveform peak immediately before the end label and is used to synchronize the outputs of the voiced sound generation unit 14 and the consonant waveform generation unit 17, described later. This peak occurs at the instant the vocal cords close.

【００３９】子音素片が有声子音の場合にはボイシング
開始ラベルの代わりに音韻性開始ラベルが付与される。
図８は有声子音素片に対するラベル付与方法の説明図で
ある。strt、brst、peak、endは無声子音と同様に付与
されるがボイシング開始ラベルは付与されない。ここで
はsovは音韻性開始ラベルとして使用されている。音韻
性開始ラベルは発音開始位置を開始ラベルから徐々に遅
らせて行ったときに音韻性が変化する直前に付与する。
この位置は一般にバーストラベル以前にあり、破裂音で
は閉鎖区間の中、その他の音韻では閉鎖区間に相当する
区間内にある。閉鎖区間とは破裂音の発音の際に声道の
ある箇所が閉鎖し、声道内圧力を高めている間の波形で
ある。有声子音素片の発音は文の先頭、または休止の直
後では開始ラベルから行い、それ以外（文の途中など、
直前が無音や休止でない場合）では音韻性開始ラベルか
ら行うように制御する。このようにして、文中で閉鎖区
間などが短縮する現象を再現し、文頭と文中の子音素片
を共通にすることを可能にする。When the consonant segment is a voiced consonant, a phonological start label is attached instead of the voicing start label. FIG. 8 illustrates how labels are attached to a voiced consonant segment. strt, brst, peak, and end are attached in the same way as for unvoiced consonants, but no voicing start label is attached; here sov is used as the phonological start label. The phonological start label is placed immediately before the point at which the phonemic identity changes as the output start position is delayed gradually from the start label. This position generally lies before the burst label: within the closure interval for plosives, and within the interval corresponding to the closure for other phonemes. The closure interval is the part of the waveform during which, in producing a plosive, some point in the vocal tract is closed and the pressure inside the vocal tract is building up. Output of a voiced consonant segment starts from the start label at the beginning of a sentence or immediately after a pause, and from the phonological start label otherwise (for example in mid-sentence, when the segment is not preceded by silence or a pause). This reproduces the shortening of closure intervals and the like within a sentence, and allows the same consonant segment to be shared between sentence-initial and sentence-internal positions.
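
The start-position rule for voiced consonant segments can be sketched as below. The dictionary layout, function names, and index values are hypothetical; only the label names strt, sov, and end come from the text.

```python
def voiced_consonant_start(labels, after_pause):
    """Pick the output start position of a voiced consonant segment.

    At the beginning of a sentence or right after a pause, output starts at
    the start label; otherwise it starts at the phonological start label
    (stored here under the key "sov").
    """
    return labels["strt"] if after_pause else labels["sov"]


def voiced_consonant_output(wave, labels, after_pause):
    return wave[voiced_consonant_start(labels, after_pause):labels["end"]]


wave = list(range(200))
labels = {"strt": 0, "sov": 70, "brst": 95, "peak": 170, "end": 180}

sentence_initial = voiced_consonant_output(wave, labels, after_pause=True)
mid_sentence = voiced_consonant_output(wave, labels, after_pause=False)
```

Skipping the part of the closure interval before `sov` is what shortens the closure in mid-sentence.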

【００４０】利得情報は個々の子音素片が持つ音量の違
いを吸収し、合成時に適切な音量で発音させるための値
である。Gain information is a value used to absorb the differences in loudness among individual consonant segments so that each is output at an appropriate level during synthesis.

【００４１】ピーク値情報はピークラベルを付与された
ピーク波形の振幅を示し、子音素片の振幅包絡と有声音
生成部１４の出力波形の振幅包絡をなめらかに接続する
ために用いる。Peak value information indicates the amplitude of the waveform peak that carries the peak label. It is used to connect the amplitude envelope of the consonant segment smoothly to that of the output waveform of the voiced sound generation unit 14.

【００４２】有声音源部１５は声帯音源波形を発生す
る。この波形は実音声から逆フィルタ法で抽出されたも
のである。逆フィルタ法とは実音声波形に含まれる声道
の影響すなわちホルマントを、声道の逆特性を持ったフ
ィルタ（逆フィルタ）で除去することによって声帯音源
波形を抽出する方法である。こうして得られる波形は微
分声門体積流波形と呼ばれ、声道に加わる音響振動波形
を微分した波形に相当する。従って、この波形は急速に
声帯が閉じた瞬間に上向きの鋭いパルスを発生する。こ
の波形の上向きの鋭いパルスは急速に声帯が閉じること
により発生したものである。The voiced sound source unit 15 generates a vocal-cord source waveform. This waveform is extracted from real speech by inverse filtering, a method that extracts the vocal-cord source waveform by removing the influence of the vocal tract contained in the real speech waveform, that is, the formants, with a filter having the inverse characteristic of the vocal tract (an inverse filter). The waveform obtained in this way is called the differentiated glottal volume flow waveform and corresponds to the derivative of the acoustic vibration waveform applied to the vocal tract. Consequently, it exhibits a sharp upward pulse at the instant the vocal cords close rapidly; this pulse is produced by that rapid closure.

【００４３】次に、上記実施例の音声合成装置の動作に
ついて、図面を参照しながら説明する。Next, the operation of the speech synthesizer of the above embodiment will be described with reference to the drawings.

【００４４】まず、合成したい音声が母音の場合、有声
音源部１５はピッチ周期に対応した声帯音源波形を生成
する。自然な音声では母音開始部分でパワーがなだらか
に立ち上がるので、有声音源部１５は出力の振幅を適当
な時定数で立ち上げるように制御する。この音源波形に
直列型ホルマント合成部１６がホルマントを付加するこ
とにより母音となって出力される。First, when the speech to be synthesized is a vowel, the voiced sound source unit 15 generates a vocal-cord source waveform corresponding to the pitch period. Because in natural speech the power rises gradually at the start of a vowel, the voiced sound source unit 15 controls its output amplitude so that it rises with a suitable time constant. The series formant synthesis unit 16 adds formants to this source waveform, and the result is output as a vowel.

【００４５】次に、合成したい音声が子音の場合につい
て説明する。子音の合成には子音波形生成部１７の出力
と有声音生成部１４の出力を合わせて用いる。まず、
子音素片の発音タイミングを決定する。音声合成装置に
は刻々と変化する合成パラメータが伝送されてくるが、
この中には音素セグメントの切り替わりに関する情報が
含まれている。たとえば、「ｋａ」という音節の場合は
「／ｋ／」のセグメントと「／ａ／」のセグメントに分
かれる。それらのセグメントの切り替わりをパラメータ
列から取り出し、そこにバーストラベルが一致するよう
に子音波形生成部１７があらかじめ子音素片の発音を開
始する。このようにすることで子音の自然な発音タイミ
ングが生成される。また、子音波形生成部１７は利得情
報を用いて子音素片の出力レベルを制御する。Next, the case where the speech to be synthesized is a consonant is described. A consonant is synthesized by combining the output of the consonant waveform generation unit 17 with that of the voiced sound generation unit 14. First, the output timing of the consonant segment is determined. Synthesis parameters that change from moment to moment are sent to the speech synthesizer, and these include information on the transitions between phoneme segments. For example, the syllable "ka" is divided into a /k/ segment and an /a/ segment. The segment transition is extracted from the parameter sequence, and the consonant waveform generation unit 17 begins outputting the consonant segment ahead of time so that the burst label coincides with the transition. This produces natural consonant timing. The consonant waveform generation unit 17 also uses the gain information to control the output level of the consonant segment.
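
The timing and level control described here can be sketched as follows, assuming times are counted in samples and the gain information is a single multiplier per segment. The function names and numbers are hypothetical.

```python
def segment_start_time(boundary, labels, start_key="strt"):
    """Time at which playback must begin so that the burst label lands on
    the phoneme-segment boundary taken from the parameter sequence."""
    lead = labels["brst"] - labels[start_key]
    return boundary - lead


def apply_gain(wave, gain):
    """Scale a consonant segment by its stored gain information."""
    return [gain * s for s in wave]


labels = {"strt": 0, "brst": 240, "end": 400}
start = segment_start_time(boundary=1000, labels=labels)
scaled = apply_gain([0.5, -1.0, 0.25], gain=0.8)
```

Here playback begins 240 samples before the /k/ to /a/ boundary, so the burst falls exactly on it.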

【００４６】バーストラベルが発音された後、制御部２
１は終了ラベルが訪れるまでの間に、有声音生成部１４
の発音を開始する。このときに、ピークラベルと有声音
源部１５の出力のピークが一致するように有声音源部１
５の発音開始タイミングを制御する。前述したように有
声音源部１５の声帯閉鎖に伴う上向きの鋭いパルスは直
列型ホルマント合成部１６の出力波形上に上向きのピー
クを発生させるので、結果的にピークラベルと直列型ホ
ルマント合成部１６の出力波形のピークは一致する。After the burst label has been output and before the end label is reached, the control unit 21 starts output from the voiced sound generation unit 14. In doing so, it controls the output start timing of the voiced sound source unit 15 so that a peak of that unit's output coincides with the peak label. As noted above, the sharp upward pulse that accompanies vocal-cord closure in the output of the voiced sound source unit 15 produces an upward peak in the output waveform of the series formant synthesis unit 16, so the peak label also ends up coinciding with a peak of the output waveform of the series formant synthesis unit 16.

【００４７】終了ラベルの１ピッチ周期手前に来た時点
で有声音生成部１４と子音波形生成部１７の出力の重ね
合わせを開始する。すなわち、子音波形生成部１７の出
力を余弦特性で終了ラベルまでの区間で減衰させるとと
もに、有声音生成部１４の出力をその逆の特性で立ち上
げる。この操作により、波形上の不連続は除去される
が、ピークマークによる子音波形生成部１７と有声音生
成部１４の同期が行われているので、ピッチ周期の変動
がない極めてスムーズな波形接続が実現される。Overlapping of the outputs of the voiced sound generation unit 14 and the consonant waveform generation unit 17 begins one pitch period before the end label. That is, the output of the consonant waveform generation unit 17 is attenuated with a cosine characteristic over the interval up to the end label, while the output of the voiced sound generation unit 14 is raised with the complementary characteristic. This removes the waveform discontinuity, and because the consonant waveform generation unit 17 and the voiced sound generation unit 14 have been synchronized by the peak label, an extremely smooth waveform connection with no fluctuation in pitch period is obtained.
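
A minimal sketch of this overlap, assuming both outputs are already aligned on a common sample timeline and that "cosine characteristic" means a raised-cosine weight falling from 1 to 0 over one pitch period. The function name and signal values are hypothetical.

```python
import math


def crossfade_at_end_label(consonant, voiced, end_label, pitch_period):
    """Fade the consonant output out, and the voiced output in, over the
    one pitch period that precedes the end label."""
    fade_start = end_label - pitch_period
    out = []
    for i in range(len(voiced)):
        c = consonant[i] if i < len(consonant) else 0.0
        if i < fade_start:
            w = 1.0                      # consonant only
        elif i < end_label:
            phase = math.pi * (i - fade_start) / pitch_period
            w = 0.5 * (1.0 + math.cos(phase))   # falls from 1 towards 0
        else:
            w = 0.0                      # voiced only
        out.append(w * c + (1.0 - w) * voiced[i])
    return out


faded = crossfade_at_end_label([1.0] * 40, [0.0] * 40, end_label=30, pitch_period=10)
flat = crossfade_at_end_label([1.0] * 40, [1.0] * 40, end_label=30, pitch_period=10)
```

Because the two weights always sum to 1, two identical, phase-aligned inputs pass through unchanged, which is the reason the peak synchronization matters.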

【００４８】また、同時に有声音源部１５の出力振幅立
ち上げの時定数を制御することで、有声音生成部１４と
子音波形生成部１７の出力の振幅包絡をなめらかに接続
する。この制御にはピーク値情報を用いる。すなわち、
ピークラベル時点での有声音生成部１４の振幅が、ピー
ク値情報の表す値になるように時定数を決定すれば良
い。なお、ピーク値は子音素片のピークラベル時点での
値を読みだすことでも得られるので、子音波形ラベル記
憶部１８に記憶しておかなくても構わない。At the same time, the time constant with which the output amplitude of the voiced sound source unit 15 rises is controlled so that the amplitude envelopes of the outputs of the voiced sound generation unit 14 and the consonant waveform generation unit 17 connect smoothly. Peak value information is used for this control: the time constant is chosen so that the amplitude of the voiced sound generation unit 14 at the peak label equals the value given by the peak value information. Since the peak value can also be obtained by reading the value of the consonant segment at the peak label, it need not be stored in the consonant waveform label storage unit 18.
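
The patent does not state the shape of the amplitude rise. As a sketch only, assume an exponential approach a(t) = A(1 - exp(-t/tau)) toward a final amplitude A; the time constant that makes the envelope equal the stored peak value at the peak label then follows directly. All names and numbers below are hypothetical.

```python
import math


def rise_time_constant(peak_value, final_amplitude, t_peak):
    """Solve final_amplitude * (1 - exp(-t_peak / tau)) == peak_value for tau.

    Assumes an exponential rise, which is an illustrative choice and not
    something the source specifies.
    """
    if not 0.0 < peak_value < final_amplitude:
        raise ValueError("peak_value must lie between 0 and final_amplitude")
    return -t_peak / math.log(1.0 - peak_value / final_amplitude)


def envelope(t, final_amplitude, tau):
    return final_amplitude * (1.0 - math.exp(-t / tau))


tau = rise_time_constant(peak_value=0.3, final_amplitude=1.0, t_peak=120)
```

With this `tau`, `envelope(120, 1.0, tau)` returns the stored peak value, so the two envelopes meet at the peak label.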

【００４９】次に、波形接続の様子を図９に示す。図９
は上から有声音源部１５の出力波形、有声音生成部１４
（直列型ホルマント合成部１６）の出力波形、子音波形
生成部１７の出力波形、出力部２０の出力波形（合成波
形）を表している。図９において４つの波形全てに渡っ
て記されている破線は子音素片のピークラベルを表して
いる。有声音源部１５のピークが子音素片のピークラベ
ルと同期する事によって、有声音生成部１４の出力が子
音素片と適正なタイミングで接続されていることがわか
る。Next, FIG. 9 shows the waveform connection. From top to bottom, FIG. 9 shows the output waveform of the voiced sound source unit 15, the output waveform of the voiced sound generation unit 14 (series formant synthesis unit 16), the output waveform of the consonant waveform generation unit 17, and the output waveform of the output unit 20 (the synthesized waveform). The broken line drawn across all four waveforms in FIG. 9 marks the peak label of the consonant segment. It can be seen that, because the peak of the voiced sound source unit 15 is synchronized with the peak label of the consonant segment, the output of the voiced sound generation unit 14 is joined to the consonant segment with the correct timing.

【００５０】同様の波形接続手法は有声音源部１５の出
力波形の後に子音素片を接続する際にも用いることがで
きる。子音素片が有声子音の場合は子音素片開始直後の
波形上のピークなどにピークラベルを付与しておき、こ
のピークラベルを先行する有声音源部１５の出力波形の
ピークに同期させるように制御することでスムーズな接
続ができる。The same waveform connection technique can also be used when a consonant segment is joined after the output waveform of the voiced sound source unit 15. When the consonant segment is a voiced consonant, a peak label is attached to, for example, a waveform peak immediately after the start of the consonant segment, and a smooth connection is obtained by controlling the timing so that this peak label is synchronized with a peak of the preceding output waveform of the voiced sound source unit 15.

【００５１】以上のように、接続点での波形不連続及び
ピッチ変動を防ぐために、子音素片にあらかじめラベル
を付与し、これを手がかりとして有声音生成部１４と子
音波形生成部１７の出力の同期をはかるものである。ま
た、無声化のために専用の子音素片を用意する必要をな
くするために、無声化していない通常の子音素片にラベ
ルを付与し、合成時にラベルを利用して無声化を再現す
るものである。そして、音韻性開始ラベルの利用により
文頭と文中で共通の子音素片を用いて合成することを可
能とするものである。As described above, to prevent waveform discontinuity and pitch fluctuation at the connection point, labels are attached to the consonant segments in advance and used as cues to synchronize the outputs of the voiced sound generation unit 14 and the consonant waveform generation unit 17. In addition, to eliminate the need for dedicated segments for devoicing, labels are attached to ordinary, non-devoiced consonant segments, and devoicing is reproduced at synthesis time by means of those labels. Furthermore, use of the phonological start label makes it possible to synthesize with consonant segments shared between sentence-initial and sentence-internal positions.

【００５２】その結果、有声音生成部１４の出力と子音
波形生成部１７の出力がなめらか、かつ適正なタイミン
グで接続され、雑音やピッチの不連続のない高品質な音
声を合成することができる。また、無声化や文頭、文中
のための専用の波形素片を用意する必要がなく、共通の
子音素片を用いることができ、記憶容量及び録音作業の
時間を縮小することができる。As a result, the output of the voiced sound generation unit 14 and the output of the consonant waveform generation unit 17 are joined smoothly and with the correct timing, and high-quality speech free of noise and pitch discontinuity can be synthesized. Moreover, there is no need to prepare dedicated waveform segments for devoicing or for sentence-initial and sentence-internal positions; common consonant segments can be used, which reduces both the storage capacity and the time spent on recording.

【００５３】なお、上記実施例では、波形素片として子
音素片を用いる場合について説明したが、用いる波形素
片はそれ以外の音韻のものでも勿論構わない。The above embodiment has been described for the case where consonant segments are used as the waveform segments, but the waveform segments may of course be those of other phonemes.

【００５４】また、上記実施例では、制御部２１は、波
形上のピーク位置を一致させるのに有声音源部１５の発
音開始タイミングを制御するようにしたが、これに限ら
ず、有声音源部１５の出力波形及び子音波形生成部１７
の発音時期のいずれか一方、またはその両方を制御する
ようにしても良い。In the above embodiment, the control unit 21 controls the output start timing of the voiced sound source unit 15 in order to align the peak positions on the waveforms. The invention is not limited to this: the control unit may control the output timing of either the output waveform of the voiced sound source unit 15 or the consonant waveform generation unit 17, or of both.

【００５５】また、上記実施例では、各処理部を専用の
ハードウェアにより構成したが、これに代えて、同様の
機能をコンピュータを用いてソフトウェア的に実現して
も勿論良い。Further, in the above-mentioned embodiment, each processing unit is constituted by dedicated hardware, but instead of this, the same function may be realized by software using a computer.

【００５６】以上これまで、無声子音、有声摩擦音、有
声破裂音などの合成のための構成法について説明した
が、鼻音のように特徴パラメータが相当長い時間長にお
よぶ音韻については、上記の子音素片の構成では十分な
音質が得られない。前述したように、ピッチを考慮せず
に接続を行うためには素片の長さは十分に短くなければ
ならない。しかし、そのような短い素片の中に鼻音のよ
うな長時間におよぶ特徴パラメータの変化を含めること
は不可能である。また、鼻音以外にも後続母音部分にま
で特徴パラメータが長く存在する音韻は多く、それらに
ついては調音結合を考慮せずにすむ範囲で素片長を長く
することにより音質の向上が期待できる。So far, configurations for synthesizing unvoiced consonants, voiced fricatives, voiced plosives, and the like have been described. For phonemes such as nasals, however, whose characteristic parameters extend over a considerably long time, the consonant segment configuration described above does not give sufficient sound quality. As noted earlier, a segment must be sufficiently short if it is to be connected without regard to pitch, but such a short segment cannot contain the long-lasting changes in characteristic parameters found in nasals. Besides nasals, there are many phonemes whose characteristic parameters persist well into the following vowel, and for these an improvement in sound quality can be expected by lengthening the segment as far as is possible without having to take coarticulation into account.

【００５７】素片長を長くしたとき、素片と直列型ホル
マント合成波形との接続は母音の中心付近で行われる。
母音の中心付近はスペクトル変化が少ない比較的定常な
部位なので、接続による急速なスペクトル変化が音質に
与える影響は大きい。この問題を解決するためには接続
点での重ね合わせ処理をより長い区間で行うことが効果
的である。When the segment is lengthened, it is joined to the series formant synthesized waveform near the center of the vowel. The vicinity of the vowel center is a relatively steady region with little spectral change, so any abrupt spectral change caused by the connection has a large effect on sound quality. An effective remedy is to perform the overlap at the connection point over a longer interval.

【００５８】しかし、重ね合わせ区間において素片のピ
ッチと合成ピッチが異なる場合、両波形が干渉し、エコ
ーや雑音を発生する。また、長い素片自身がピッチ性を
強く持つために接続前後のピッチ不連続が大きく音質を
損ねる。However, if the pitch of the segment differs from the synthesis pitch within the overlap interval, the two waveforms interfere and produce echo or noise. In addition, because a long segment itself has a strong pitch property, the pitch discontinuity across the connection is large and impairs the sound quality.

【００５９】そこで、合成ピッチに合わせた各種のピッ
チを持つ子音素片を用意しておくことが考えられるが、
十分に精度の高いピッチ整合を行うためには極めて多く
の種類の素片を用意しなくてはならない。また、合成ピ
ッチはイントネーションパターンによって変化してお
り、子音素片の継続時間内にも大きく変化が起こる。こ
のように多様なピッチ変化に対応した子音素片を用意す
ることは実質的に不可能である。One conceivable approach is to prepare consonant segments at various pitches to match the synthesis pitch, but an extremely large number of segments would be needed for sufficiently accurate pitch matching. Moreover, the synthesis pitch varies with the intonation pattern and can change considerably even within the duration of a consonant segment. Preparing consonant segments for such varied pitch changes is practically impossible.

【００６０】そこで、用意した子音素片にピッチ変更操
作を加えることが不可欠になる。ピッチ変更法として簡
単なものには線形伸縮法がある。この方法は記憶された
波形を読み出す際に通常は１サンプルずつを順番に読み
出すところを、１以外の間隔で読み出すことによって時
間軸に沿って伸縮した波形を得る方法である。非整数の
間隔によって記憶波形の読み出し番地が実際には存在し
ない非整数の番地になるので、前後の値から直線を用い
て内挿する。It is therefore essential to apply a pitch-changing operation to the prepared consonant segments. A simple pitch-changing method is linear expansion/contraction. When a stored waveform is read out, samples are normally read one at a time in sequence; in this method they are instead read at a step other than 1, which yields a waveform stretched or compressed along the time axis. With a non-integer step the read address falls on non-integer positions that do not actually exist in the stored waveform, so the value is obtained by linear interpolation between the neighboring samples.
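
The linear expansion/contraction method can be sketched as below. The function name is hypothetical; the read step and the linear interpolation between neighboring samples follow the description above.

```python
def linear_stretch(wave, step):
    """Read `wave` at intervals of `step` samples, interpolating linearly
    at non-integer read addresses.

    step > 1 shortens the waveform (raises the pitch);
    step < 1 lengthens it (lowers the pitch).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    out = []
    last = len(wave) - 1
    k = 0
    while k * step <= last:
        pos = k * step              # fractional read address
        i = int(pos)                # sample just before the address
        frac = pos - i
        nxt = wave[min(i + 1, last)]
        out.append(wave[i] * (1.0 - frac) + nxt * frac)
        k += 1
    return out


stretched = linear_stretch([0, 2, 4], 0.5)
compressed = linear_stretch([0, 1, 2, 3, 4], 2)
```

Because the whole waveform is rescaled in time, its spectrum is rescaled too, which is the drawback discussed in the next paragraph.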

【００６１】しかし、重ね合わせ区間内でピッチが整合
しても、位相の同期を正確に行うことは困難である。そ
れは、線形伸縮法があくまで原ピッチを元に一定の割合
でピッチ変更を行う方法のため、極めて正確な原ピッチ
およびその揺らぎに関する情報を持っていなければなら
ないからである。このため、上記の実施例による波形同
期方法では長期に渡る位相同期は不可能といえる。ま
た、線形伸縮によるピッチ変更操作はスペクトル形状の
変化を伴うため、音質劣化、音韻性の低下、接続による
スペクトル不連続の発生などの問題を引き起こす。この
ため、原ピッチに比べて極めて小さい範囲でしかピッチ
変更ができない。However, even if the pitches match within the overlap interval, it is difficult to synchronize the phases accurately. Because linear expansion/contraction changes the pitch by a fixed ratio relative to the original pitch, it requires extremely accurate information about the original pitch and its fluctuations. For this reason, the waveform synchronization method of the above embodiment cannot maintain phase synchronization over a long interval. Furthermore, changing the pitch by linear expansion/contraction alters the spectral shape, causing problems such as degraded sound quality, reduced phonemic clarity, and spectral discontinuity at the connection. The pitch can therefore be changed only within a very narrow range around the original pitch.

【００６２】そこで、次の実施例では上記の問題を解決
するためにピッチ同期重畳法を用い、ピッチ同期信号を
用いて常に波形の位相同期をはかる方法をとる。To solve these problems, the next embodiment uses pitch-synchronous overlap-add and keeps the waveforms phase-synchronized at all times by means of a pitch synchronization signal.

【００６３】図１０は本発明にかかる第４の実施例の音
声合成装置の構成図である。その音声合成装置にはピッ
チ制御部１が設けられ、その出力はピッチ同期信号生成
部２４、波形読み出し部２６ａ、２６ｂ、２６ｃ、２６
ｄ、窓掛け部２８ａ、２８ｂ、２８ｃ、２８ｄに接続さ
れている。ピッチ同期信号生成部２４の出力はピッチ同
期信号分配部２５ａおよび遅延部３７に接続されてい
る。ピッチ同期信号分配部２５ａの第１の出力は波形読
み出し部２６ａに、第２の出力は波形読み出し部２６ｂ
にそれぞれ入力されている。遅延部３７の出力はピッチ
同期信号分配部２５ｂに入力され、その第１の出力は波
形読み出し部２６ｃに、第２の出力は波形読み出し部２
６ｄにそれぞれ入力されている。FIG. 10 is a block diagram of a speech synthesizer according to a fourth embodiment of the present invention. The speech synthesizer has a pitch control unit 1, whose output is connected to the pitch synchronization signal generation unit 24, the waveform reading units 26a, 26b, 26c, and 26d, and the windowing units 28a, 28b, 28c, and 28d. The output of the pitch synchronization signal generation unit 24 is connected to the pitch synchronization signal distribution unit 25a and the delay unit 37. The first output of the pitch synchronization signal distribution unit 25a is input to the waveform reading unit 26a, and its second output to the waveform reading unit 26b. The output of the delay unit 37 is input to the pitch synchronization signal distribution unit 25b, whose first output is input to the waveform reading unit 26c and whose second output is input to the waveform reading unit 26d.

【００６４】波形読み出し部２６ａ、２６ｂには有声音
源波形記憶部２７とオフセット制御部４１の出力が接続
されている。オフセット制御部４１の入力には有声音源
ピーク位置記憶部２９の出力が接続されている。波形読
み出し部２６ａの出力は窓掛け部２８ａに、波形読み出
し部２６ｂの出力は窓掛け部２８ｂにそれぞれ入力され
ている。窓掛け部２８ａの出力は混合部３１ａに接続さ
れている。窓掛け部２８ｂの出力は利得制御部３０を介
して混合部３１ａに接続されている。混合部３１ａの出
力は利得制御部４０ａを介して直列型ホルマント合成部
３２に入力されている。The outputs of the voiced sound source waveform storage unit 27 and the offset control unit 41 are connected to the waveform reading units 26a and 26b. The output of the voiced sound source peak position storage unit 29 is connected to the input of the offset control unit 41. The output of the waveform reading unit 26a is input to the windowing unit 28a, and the output of the waveform reading unit 26b to the windowing unit 28b. The output of the windowing unit 28a is connected to the mixing unit 31a. The output of the windowing unit 28b is connected to the mixing unit 31a via the gain control unit 30. The output of the mixing unit 31a is input to the series formant synthesis unit 32 via the gain control unit 40a.

【００６５】波形読み出し部２６ｃ、２６ｄには子音波
形記憶部３３、子音波形ピーク位置記憶部３４、および
子音波形ラベル記憶部４２の出力が接続され、波形読み
出し部２６ｃの出力は窓掛け部２８ｃに、波形読み出し
部２６ｄの出力は窓掛け部２８ｄにそれぞれ入力されて
いる。窓掛け部２８ｃおよび窓掛け部２８ｄの出力はと
もに混合部３１ｂに入力されている。混合部３１ｂの出
力は利得制御部４０ｂに接続されている。The outputs of the consonant waveform storage unit 33, the consonant waveform peak position storage unit 34, and the consonant waveform label storage unit 42 are connected to the waveform reading units 26c and 26d. The output of the waveform reading unit 26c is input to the windowing unit 28c, and the output of the waveform reading unit 26d to the windowing unit 28d. The outputs of the windowing units 28c and 28d are both input to the mixing unit 31b. The output of the mixing unit 31b is connected to the gain control unit 40b.

【００６６】直列型ホルマント合成部３２および利得制
御部４０ｂの出力は合成部３５に接続され、その出力は
出力部３６に接続されている。The outputs of the series formant synthesis unit 32 and the gain control unit 40b are connected to the synthesis unit 35, whose output is connected to the output unit 36.

【００６７】続いて、以上のように構成された音声合成
装置の動作について説明する。Next, the operation of the speech synthesizer configured as above will be described.

【００６８】ピッチ制御部２３がイントネーションパタ
ーンに従って生成したF0パラメータはピッチ同期信号生
成部２４、波形読み出し部２６ａ、２６ｂ、２６ｃ、２
６ｄ、窓掛け部２８ａ、２８ｂ、２８ｃ、２８ｄに伝達
される。ピッチ同期信号生成部２４はF0パラメータに従
った周期のピッチ同期信号を生成し、ピッチ同期信号分
配部２５ａおよび遅延部３７に出力する。The F0 parameter that the pitch control unit 23 generates according to the intonation pattern is sent to the pitch synchronization signal generation unit 24, the waveform reading units 26a, 26b, 26c, and 26d, and the windowing units 28a, 28b, 28c, and 28d. The pitch synchronization signal generation unit 24 generates a pitch synchronization signal whose period follows the F0 parameter and outputs it to the pitch synchronization signal distribution unit 25a and the delay unit 37.

【００６９】それではまずピッチ同期重畳法を用いた有
声音源の生成方法について説明する。First, the method of generating the voiced sound source by pitch-synchronous overlap-add is described.

【００７０】ピッチ同期信号分配部２５ａは入力された
ピッチ同期信号を２つの波形読み出し部２６ａ、２６ｂ
に交互に出力する。The pitch synchronization signal distribution unit 25a outputs the incoming pitch synchronization signal alternately to the two waveform reading units 26a and 26b.

【００７１】波形読み出し部２６ａはピッチ同期信号を
受け取ったとき、オフセット制御部４１を通じて有声音
源ピーク位置記憶部２９から最初のピーク位置を読み取
る。オフセット制御部４１は有声音源ピーク位置記憶部
２９の出力にオフセットNoffを加算して出力する。Noff
については後述する。波形読み出し部２６ａはこうして
得られたオフセット付きピーク位置を元に有声音源波形
記憶部２７に記憶された有声音源波形の読み出しを開始
する。読み出し開始位置N0は（数１）で与えられる。When the waveform reading unit 26a receives a pitch synchronization signal, it reads the first peak position from the voiced sound source peak position storage unit 29 through the offset control unit 41. The offset control unit 41 adds the offset Noff to the output of the voiced sound source peak position storage unit 29 and outputs the result; Noff is described later. Based on the offset peak position obtained in this way, the waveform reading unit 26a starts reading the voiced sound source waveform stored in the voiced sound source waveform storage unit 27. The read start position N0 is given by (Equation 1).

【００７２】[0072]

【数１】 N0 = P0 - Noff - Tsyn ここで、P0は有声音源ピーク位置記憶部２９に記憶され
た０番目のピーク位置、TsynはF0パラメータに基づいた
合成ピッチ周期である。(Equation 1) N0 = P0 - Noff - Tsyn, where P0 is the 0th peak position stored in the voiced sound source peak position storage unit 29 and Tsyn is the synthesis pitch period derived from the F0 parameter.

【００７３】波形読み出し部２６ａの出力は窓掛け部２
８ａに入力され、Hanning窓によって窓掛けが行われ
る。Hanning窓の長さTwinは合成ピッチ周期Tsynと有声
音源波形の原ピッチ周期Torgのどちらか小さい方の２倍
である。これは、TwinがTorgの２倍を越えると両隣のピ
ークがHanning窓の中に入ることによる音質劣化を防ぐ
ためである。このようにしてピッチ波形が生成される。The output of the waveform reading unit 26a is input to the windowing unit 28a, where it is windowed with a Hanning window. The Hanning window length Twin is twice the smaller of the synthesis pitch period Tsyn and the original pitch period Torg of the voiced sound source waveform. This prevents the degradation in sound quality that would occur if Twin exceeded twice Torg and the neighboring peaks on either side fell inside the Hanning window. A pitch waveform is generated in this way.
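
Reading and windowing one pitch waveform can be sketched as follows. Several points are assumptions made for the sketch and are not stated in the patent: the function names, the periodic form of the Hanning window, and the choice to read a frame of 2 * Tsyn samples with the window centered in it (zero outside the window when Torg < Tsyn).

```python
import math


def hann(length):
    """Periodic Hanning window of the given length (assumed form)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def pitch_waveform(source, peak, noff, tsyn, torg):
    """Cut one windowed pitch waveform out of a stored source waveform."""
    twin = 2 * min(tsyn, torg)       # Twin = 2 * min(Tsyn, Torg)
    start = peak - noff - tsyn       # (Equation 1): N0 = P0 - Noff - Tsyn
    pad = tsyn - twin // 2           # zero margin each side when Torg < Tsyn
    window = [0.0] * pad + hann(twin) + [0.0] * pad
    frame = []
    for n, w in enumerate(window):
        idx = start + n
        sample = source[idx] if 0 <= idx < len(source) else 0.0
        frame.append(sample * w)
    return frame


# Hypothetical source: a single unit pulse at the stored peak position 50.
source = [0.0] * 100
source[50] = 1.0
centered = pitch_waveform(source, peak=50, noff=0, tsyn=10, torg=20)
shifted = pitch_waveform(source, peak=50, noff=4, tsyn=10, torg=20)
lowered = pitch_waveform(source, peak=50, noff=0, tsyn=30, torg=20)
```

With Noff = 0 the peak sits at the window center and passes at full amplitude; a non-zero Noff moves it toward the window edge, where it is attenuated.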

【００７４】この動作よりも１ピッチ周期遅れて波形読
み出し部２６ｂにピッチ同期信号が伝達される。波形読
み出し部２６ｂは先ほどと同様に波形を読み出し、窓掛
け部２８ｂによって窓掛けが行われる。この時の波形読
み出し開始位置は（数２）で与えられる。One pitch period after this operation, a pitch synchronization signal is delivered to the waveform reading unit 26b. The waveform reading unit 26b reads the waveform in the same way as before, and windowing is performed by the windowing unit 28b. The read start position in this case is given by (Equation 2).

【００７５】[0075]

【数２】 N1 = P1 - Noff - Tsyn ここで、P1は有声音源ピーク位置記憶部２９に記憶され
た１番目のピーク位置である。(Equation 2) N1 = P1 - Noff - Tsyn, where P1 is the first peak position stored in the voiced sound source peak position storage unit 29.

【００７６】窓掛け部２８ｂの出力は利得制御部３０に
おいて０〜１の範囲で利得制御を受ける。この目的は語
頭や語尾などで発生する不安定な声帯振動を模擬するた
めである。すなわち、語頭、語尾においては声帯が１ピ
ッチ周期ごとに大小の振動を繰り返す場合があり、その
結果倍ピッチ周期成分が生まれる。利得制御部３０にお
いて利得を0.5などにすることにより、倍ピッチ周期成
分を発生させることが可能である。The output of the windowing unit 28b is gain-controlled in the range 0 to 1 by the gain control unit 30. The purpose is to simulate the unstable vocal-cord vibration that occurs at the beginnings and ends of words. At word onsets and endings the vocal cords sometimes alternate between large and small vibrations from one pitch period to the next, which gives rise to a component at twice the pitch period. Setting the gain in the gain control unit 30 to, for example, 0.5 makes it possible to generate this double-pitch-period component.
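
The effect of scaling every second pitch waveform can be seen with a simple pulse train. This is an illustrative sketch; the function name and values are hypothetical, and unit pulses stand in for full pitch waveforms.

```python
def pulse_train(period, count, alternate_gain=1.0):
    """Pulses spaced `period` samples apart; every second pulse is scaled by
    `alternate_gain`, standing in for the path through gain control unit 30."""
    out = [0.0] * (period * count)
    for k in range(count):
        out[k * period] = 1.0 if k % 2 == 0 else alternate_gain
    return out


regular = pulse_train(period=10, count=6)
unstable = pulse_train(period=10, count=6, alternate_gain=0.5)
```

`regular` repeats every 10 samples, whereas `unstable` only repeats every 20 samples, so it contains a component at twice the pitch period.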

【００７７】以上のようにして交互に生成されたピッチ
波形を混合部３１ａにおいて重ね合わせることにより、
所望のピッチ周期を持った有声音源波形が生成される。
また、個々のピッチ波形は時間軸に対して伸縮されてい
ないのでスペクトル形状の変化は起きない。By overlapping and adding the pitch waveforms generated alternately in this way in the mixing unit 31a, a voiced sound source waveform with the desired pitch period is produced. Moreover, because the individual pitch waveforms are not stretched or compressed along the time axis, the spectral shape does not change.
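
The mixing step amounts to an overlap-add of pitch waveforms spaced one synthesis pitch period apart. The sketch below uses hypothetical names and, as a check, feeds it windowed frames of a constant signal; the periodic Hanning window form is an assumption.

```python
import math


def overlap_add(frames, hop):
    """Sum frames placed `hop` samples apart (hop = synthesis pitch period)."""
    length = max((k * hop + len(f) for k, f in enumerate(frames)), default=0)
    out = [0.0] * length
    for k, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[k * hop + n] += value
    return out


hop = 8
# Each frame is a constant 1.0 signal multiplied by a Hanning window of length 2 * hop.
window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (2 * hop)) for n in range(2 * hop)]
frames = [window[:] for _ in range(5)]
mixed = overlap_add(frames, hop)
```

Where two frames overlap, the Hanning weights sum to 1, so the constant input is reconstructed. Changing `hop` changes the pitch period of the output without resampling any frame.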

【００７８】このようにして生成された有声音源波形は
利得制御部４０ａにおいて振幅の制御を受けた後、従来
通りの直列型ホルマント合成部３２によって調音を受け
て母音成分となる。After its amplitude has been controlled by the gain control unit 40a, the voiced sound source waveform generated in this way is articulated by the series formant synthesis unit 32, as in the conventional arrangement, and becomes the vowel component.

【００７９】続いて前述のNoffについて説明する。有声
音源波形のピッチ変更を行うと以下のような理由でスペ
クトル歪を生じる場合がある。逆フィルタ法により抽出
された声門体積流波形は図１１のような構造を持ってい
る。この中で声門開放部波形は低域のエネルギーを持っ
ており、声門閉鎖部波形は高域のエネルギーを持ってい
る。Next, the aforementioned Noff is described. Changing the pitch of the voiced sound source waveform can introduce spectral distortion for the following reason. The glottal volume flow waveform extracted by inverse filtering has the structure shown in FIG. 11. Within it, the glottal open portion of the waveform carries low-frequency energy, and the glottal closure portion carries high-frequency energy.

【００８０】図１２はNoff=0のもとでピッチ周波数を原
ピッチ周波数よりも低く変更した場合の図である。声門
閉鎖部はHanning窓の端に近い部分に位置するため、両
隣のHanning窓が重なり合っている区間が短くなると減
衰する。このために生成された有声音源波形は低域のエ
ネルギー成分が低下する。FIG. 12 shows the case where, with Noff = 0, the pitch frequency is changed to a value lower than the original pitch frequency. Because the glottal closure portion lies near the edge of the Hanning window, it is attenuated when the interval over which adjacent Hanning windows overlap becomes shorter. As a result, the low-frequency energy of the generated voiced sound source waveform is reduced.

【００８１】このことを防ぐために図１３のように声門
閉鎖部をHanning窓の中心からNoffサンプルずらし、声
門開放部がHanning窓の中心に近付くようにする。ただ
し、Noffを大きくし過ぎるとピッチを上げたときに声門
閉鎖部のパルス状波形が減衰し、高域のエネルギーが低
下する。これは、ピッチ周波数を原ピッチ周波数よりも
高く変更したときにHanning窓長が短くなることによ
り、Hanning窓の端に近付いた声門閉鎖パルスが減衰す
るためである。このような理由からNoffは例えば0.1To
程度を用いる。To prevent this, the glottal closure portion is shifted by Noff samples from the center of the Hanning window, as shown in FIG. 13, so that the glottal open portion moves closer to the window center. If Noff is made too large, however, the pulse-like waveform of the glottal closure is attenuated when the pitch is raised, and the high-frequency energy falls. This is because the Hanning window becomes shorter when the pitch frequency is raised above the original pitch frequency, so the glottal closure pulse, now closer to the window edge, is attenuated. For these reasons a value of about 0.1To, for example, is used for Noff.

【００８２】子音の生成過程では有声音源と同様に波形
の読み出しおよび窓掛けが行われるが、その入力である
ピッチ同期信号は遅延部３７によってNoffサンプルの遅
延を受ける。これにより子音波形のピーク位置と有声音
源波形のピーク位置の同期が行われる。また、第３の実
施例と同様に子音波形ラベル記憶部４２に従って発音タ
イミングの制御が行われる。In the consonant generation process, waveform reading and windowing are performed in the same way as for the voiced sound source, but the pitch synchronization signal that drives them is delayed by Noff samples by the delay unit 37. This synchronizes the peak positions of the consonant waveform with those of the voiced sound source waveform. As in the third embodiment, output timing is controlled according to the consonant waveform label storage unit 42.
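
The role of the Noff delay can be checked with simple arithmetic on sample times. This sketch assumes that consonant reading starts Tsyn samples before the stored consonant peak, with no offset; the patent does not spell this out, and the function names are hypothetical.

```python
def voiced_peak_time(sync_time, noff, tsyn):
    """Reading starts at N0 = P0 - Noff - Tsyn when the sync signal arrives,
    so the stored peak P0 is output Noff + Tsyn samples later."""
    return sync_time + noff + tsyn


def consonant_peak_time(sync_time, noff, tsyn):
    """The consonant path sees the sync signal Noff samples late (delay unit 37).
    Assumption: its reading starts Tsyn samples before the stored peak."""
    delayed_sync = sync_time + noff
    return delayed_sync + tsyn


voiced = voiced_peak_time(sync_time=1000, noff=8, tsyn=80)
consonant = consonant_peak_time(sync_time=1000, noff=8, tsyn=80)
```

Under these assumptions both peaks are output at the same time, which is the synchronization the delay unit is there to provide.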

【００８３】このように互いの同期をとって生成された
母音成分波形と子音成分波形は合成部３５においてなめ
らかに重ね合わせが行われ、出力部３６で音声に変換さ
れて出力される。その結果、子音部分に波形素片を用い
た波形不連続、ピッチ不連続、位相不連続のない極めて
高品質な合成音が得られる。The vowel component waveform and the consonant component waveform generated in mutual synchronization in this way are overlapped smoothly in the synthesis unit 35, converted to sound by the output unit 36, and output. As a result, extremely high-quality synthesized speech is obtained that uses waveform segments for the consonant portions yet is free of waveform, pitch, and phase discontinuities.

【００８４】本実施例では有声音源部に単一の有声音源
波形を用いたが、簡単な拡張により複数の音源波形を用
いたさらに高品質な合成音を得ることも可能である。例
えば、高調波成分が多い音源と少ない音源を場合によっ
て混合することや、５母音に対して専用の音源を用意し
ておいて切り替えながら合成することなどが考えられ
る。In this embodiment a single voiced sound source waveform is used in the voiced sound source unit, but a simple extension allows several source waveforms to be used to obtain still higher-quality synthesized speech. For example, a source rich in harmonics and one poor in harmonics could be mixed as the situation requires, or a dedicated source could be prepared for each of the five vowels and switched during synthesis.

[0085] FIG. 14 is a block diagram of a speech synthesizer according to the fifth embodiment of the present invention. This synthesizer provides five instances of the voiced sound source unit 38 of the fourth embodiment. A pitch control unit 1 is provided, and its output is fed to the pitch synchronization signal generation unit 24 and to the voiced sound source units 38a, 38b, 38c, 38d, and 38e. The output of the pitch synchronization signal generation unit 24 is fed to the pitch synchronization signal distribution unit 25a and to the delay unit 37. The two outputs of the pitch synchronization signal distribution unit 25a are connected to the two corresponding inputs provided on each of the voiced sound source units 38a, 38b, 38c, 38d, and 38e. Inside the voiced sound source units 38a to 38e, voiced sound sources are generated as in the fourth embodiment, and their outputs are mixed and fed to the serial formant synthesis unit 32.

[0086] The output of the delay unit 37, on the other hand, is connected to the pitch synchronization signal distribution unit 25b. The two outputs of the pitch synchronization signal distribution unit 25b are connected to the consonant generation unit 39. Inside the consonant generation unit 39, consonant components are generated from consonant waveform segments as in the fourth embodiment.

[0087] The outputs of the serial formant synthesis unit 32 and the consonant generation unit 39 are fed to the synthesis unit 35, and the output of the synthesis unit 35 is fed to the output unit 36.

[0088] The five voiced sound source units 38a to 38e store glottal volume flow waveforms extracted from the five vowels /a/ to /o/ by the inverse filter method. The sound source waveforms extracted by the inverse filter method differ subtly from vowel to vowel. Synthesizing each of the five vowels from the sound source waveform extracted from that vowel therefore gives higher-quality speech than synthesizing all five from a common sound source waveform.

[0089] Accordingly, switching among these sound sources at vowel or syllable boundaries improves the sound quality of each vowel. At the time of switching, the gain control unit 40a smoothly raises and lowers the gains of the respective sound sources, which suppresses noise and audible artifacts. Because the sound sources are precisely peak-synchronized, an extremely natural sound source waveform is produced even when they are superimposed or switched in this way.
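
The smooth gain change at a switching point can be sketched as a crossfade. This is a minimal illustration assuming a linear gain ramp and two source signals that are already sample-aligned (peak-synchronized); the patent does not specify the ramp shape, and the function name is hypothetical.

```python
def crossfade(src_a, src_b, start, length):
    """Switch from src_a to src_b with complementary linear gains.

    Before `start` only src_a is heard; after `start + length` only src_b.
    In between, the gain of src_a falls while the gain of src_b rises,
    and the two gains always sum to 1. `length` must be positive.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    n = min(len(src_a), len(src_b))
    out = []
    for i in range(n):
        if i < start:
            g = 0.0
        elif i >= start + length:
            g = 1.0
        else:
            g = (i - start) / length
        out.append((1.0 - g) * src_a[i] + g * src_b[i])
    return out


if __name__ == "__main__":
    a = [1.0] * 10   # stand-in for the source of one vowel
    b = [3.0] * 10   # stand-in for the source of the next vowel
    print(crossfade(a, b, start=2, length=4))
```

Because the sources are peak-synchronized, mixing them sample by sample during the ramp does not smear the glottal pulses; without that alignment a crossfade of this kind would partially cancel or double them.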

[0090] Because the original pitches of the five vowel sound sources differ from one another, and because each pitch contains fluctuations, complete synchronization is extremely difficult with a conventional voiced sound source unit based on linear expansion and contraction. With the configuration of the present invention, however, the original pitches of the sound sources may differ, and the pitch may contain fluctuations without causing any problem.

[0091] In this embodiment the voiced sound source unit is multiplied per vowel, one for each of the five vowels, but it may of course be multiplied on some other basis: for example, a harmonic-rich source and a harmonic-poor source, sources for different pitch ranges, or sources for different positions in a sentence (beginning, middle, end, and so on).

[0092] In this embodiment all voiced sound source units and the consonant unit are synchronized using a common pitch synchronization signal, but each unit may instead calculate the pitch period from the F0 parameter and read out its waveform accordingly. In that case it is sufficient to synchronize the units with one another at the start of sounding.
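
The alternative described here, in which each unit derives its own pitch marks from the F0 parameter, might look like the following sketch. It assumes F0 is available as a function of the sample index (in Hz) and that the pitch period in samples is the sample rate divided by F0, rounded to an integer; the function and parameter names are hypothetical.

```python
def pitch_marks(f0_at, sample_rate, n_samples, start=0):
    """Generate pitch mark positions (sample indices) from an F0 contour.

    f0_at: callable returning F0 in Hz at a given sample index (assumed input).
    Each mark is followed by the next one a pitch period later, where the
    period is round(sample_rate / F0) samples, at least 1.
    """
    marks = []
    t = start
    while t < n_samples:
        marks.append(t)
        period = max(1, round(sample_rate / f0_at(t)))
        t += period
    return marks


if __name__ == "__main__":
    constant_f0 = lambda t: 100.0  # flat 100 Hz contour, for illustration
    print(pitch_marks(constant_f0, 8000, 400))
```

Two units that call this with the same F0 contour and the same `start` produce identical marks, which is why synchronizing only at the start of sounding is enough in this arrangement.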

[0093] The window function used here is a Hanning window whose length is twice the smaller of the synthesis pitch period and the original pitch period, but windows of other shapes or lengths may of course be used.
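
The window choice can be seen in a generic pitch-synchronous overlap-add sketch. This is an illustration under assumptions, not the patented circuit: a stored waveform with known peak positions is windowed around a peak and added into the output at each pitch mark, using a Hann window whose length is twice the smaller of the two pitch periods. The alternation between two cutting units is folded into a single loop, and all names are hypothetical.

```python
import math


def overlap_add(stored, peaks, marks, synth_period, orig_period):
    """Pitch-synchronous overlap-add of windowed pitch waveforms.

    stored: stored source waveform (list of floats).
    peaks:  peak positions in `stored`, reused cyclically.
    marks:  output positions (sample indices) of the pitch marks.
    """
    half = min(synth_period, orig_period)
    # Hann window of length 2 * half; copies shifted by `half` sum to 1.
    win = [0.5 - 0.5 * math.cos(math.pi * i / half) for i in range(2 * half)]
    out = [0.0] * (marks[-1] + half)
    for k, mark in enumerate(marks):
        peak = peaks[k % len(peaks)]
        for i in range(2 * half):
            src = peak - half + i   # read position around the stored peak
            dst = mark - half + i   # write position around the pitch mark
            if 0 <= src < len(stored) and 0 <= dst < len(out):
                out[dst] += win[i] * stored[src]
    return out


if __name__ == "__main__":
    stored = [1.0] * 400            # constant signal, to expose the window sum
    out = overlap_add(stored, [200], [100, 150, 200, 250], 50, 80)
    print(round(out[160], 6), round(out[175], 6))
```

With a constant input, the interior of the output stays at 1 when the marks are spaced by the shorter period, showing that overlapping windows of this length do not introduce gain ripple in that case.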

【００９４】[0094]

[Effects of the Invention] As is clear from the above, according to the present invention, vowel-like signals can be given flexible and varied voice quality and intonation by the serial formant synthesis method, while consonant-like signals obtain, through the waveform segment method, a quality that formant synthesis cannot achieve. Synthesized speech combining the two is therefore both high in quality and adaptable to various voice qualities. Moreover, unlike conventional waveform-segment methods, this method stores waveform segments only for consonants, which have short durations, so it can be implemented with a small-capacity storage device.

[0095] Further, by providing an unvoiced sound source unit and a parallel formant synthesis unit, the parallel formant synthesis unit can be used for those consonants for which formant synthesis achieves sufficiently high quality, further reducing the storage capacity needed for waveform segments. In addition, using the parallel formant synthesis unit and waveform segments at the same time makes it possible to change the characteristics of the waveform segments themselves, which is effective for securing intelligibility over telephone bandwidth or in noisy environments.

[0096] Further, because the present invention includes control means that, based on feature values of the waveform segments, combines the speech signal generated by the sound generation means with the waveform segment signal obtained from the waveform segment storage means, it can suppress the noise caused by joining speech waveforms and can reduce both the storage capacity needed for waveform segments and the recording work.

[0097] Further, by using the pitch-synchronous overlap-add method for pitch control of the voiced sound source unit and the consonant generation unit, the present invention achieves complete synchronization between the voiced sound source waveform and the consonant waveform, and can synthesize extremely high-quality speech free of waveform, pitch, and phase discontinuities. Changes in spectral shape that would otherwise accompany pitch changes are also avoided. In addition, voiced sound sources with different characteristics can be mixed or switched according to the purpose, so that high-quality speech can be synthesized with a sound source suited to each situation.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech synthesizer according to the first embodiment of the present invention.

FIG. 2 shows how the waveform segment of the consonant /k/ and the synthesized signal of the vowel "a" are combined to form "ka".

FIG. 3 is a block diagram of a speech synthesizer according to the second embodiment of the present invention, which additionally has a parallel formant synthesis unit.

FIG. 4 is a block diagram of a conventional formant speech synthesizer.

FIG. 5 is a block diagram of a conventional speech synthesizer using waveform segments.

FIG. 6 is a block diagram of a speech synthesizer according to the third embodiment of the present invention.

FIG. 7 is a diagram illustrating the labeling of unvoiced consonant segments in the same embodiment.

FIG. 8 is a diagram illustrating the labeling of voiced consonant segments in the same embodiment.

FIG. 9 is a diagram illustrating waveform connection in the same embodiment.

FIG. 10 is a block diagram of a speech synthesizer according to the fourth embodiment of the present invention.

FIG. 11 is a diagram illustrating a glottal volume flow waveform.

FIG. 12 is a diagram illustrating the operation of lowering the pitch frequency by the pitch-synchronous overlap-add method.

FIG. 13 is a diagram illustrating the positional relationship between the Hanning window and the glottal volume flow waveform in the present invention.

FIG. 14 is a block diagram of a speech synthesizer according to the fifth embodiment of the present invention.

[Explanation of symbols]

1 voiced sound source unit
2 serial formant synthesis unit
3 consonant waveform storage unit
4 consonant waveform reading unit
5 synthesis unit
6 unvoiced sound source unit
7 parallel formant synthesis unit
8 formant synthesizer control coefficient generation unit
9 formant synthesizer control rule storage unit
10 formant synthesizer
11 speech segment selection unit
12 speech segment database storage unit
13 segment concatenation synthesis unit
14 voiced sound generation unit
15 voiced sound source unit
16 serial formant synthesis unit
17 consonant waveform generation unit
18 consonant waveform label storage unit
19 consonant waveform storage unit
20 output unit
21 control unit
22 synthesis unit
23 pitch control unit
24 pitch synchronization signal generation unit
25 pitch synchronization signal distribution unit
26 waveform reading unit
27 voiced sound source waveform storage unit
28 windowing unit
29 voiced sound source peak position storage unit
30 gain control unit
31 mixing unit
32 serial formant synthesis unit
33 consonant waveform storage unit
34 consonant waveform peak position storage unit
35 synthesis unit
36 output unit
37 delay unit
38 voiced sound source unit
39 consonant generation unit
40 gain control unit
41 offset control unit
42 consonant waveform label storage unit

Claims

[Claims]

1. A speech synthesizer comprising: a voiced sound source unit that outputs a voiced sound source signal; a serial formant synthesis unit that receives the voiced sound source signal from the voiced sound source unit, has a plurality of formant resonators connected in series, and synthesizes predetermined sounds such as vowels; a waveform storage unit that stores waveforms of a plurality of predetermined sounds such as consonants; a waveform reading unit that reads a required waveform from the waveform storage unit; and a waveform combining unit that outputs synthesized speech by superimposing, or switching between, the output of the serial formant synthesis unit and the waveform read by the waveform reading unit.

2. A speech synthesizer comprising: a voiced sound source unit that outputs a voiced sound source signal; a serial formant synthesis unit that receives the voiced sound source signal from the voiced sound source unit, has a plurality of formant resonators connected in series, and synthesizes predetermined sounds such as vowels; a waveform storage unit that stores waveforms of a plurality of predetermined sounds such as consonants; a waveform reading unit that reads a required waveform from the waveform storage unit; an unvoiced sound source unit that outputs an unvoiced sound source signal such as white noise; a parallel formant synthesis unit that receives the sound source signal from the unvoiced sound source unit, has a plurality of resonators connected in parallel, and synthesizes predetermined sounds such as plosives and fricatives; and a waveform combining unit that outputs synthesized speech by superimposing, or switching among, the output of the serial formant synthesis unit, the output of the parallel formant synthesis unit, and the waveform read by the waveform reading unit.

3. A speech synthesizer comprising: sound generation means for generating a speech signal based on feature parameters extracted from speech; waveform segment storage means for storing waveform segments cut out from speech; waveform segment feature storage means for storing a predetermined feature value for each stored waveform segment; and control means for combining, based on the stored feature value of the waveform segment, the speech signal generated by the sound generation means with the waveform segment signal obtained from the waveform segment storage means.

4. The speech synthesizer according to claim 3, wherein the predetermined feature value is a gain value of the waveform segment, and the amplitude of the waveform segment signal is controlled by the gain value.

5. The speech synthesizer according to claim 3, wherein the waveform segment feature storage means stores the time at which a waveform having a desired feature occurs on the waveform segment.

6. The voice synthesizing apparatus according to claim 5, wherein the desired feature is any peak position or peak value on the waveform of the waveform segment.

7. The described speech synthesizer, wherein the sound generation means has a voiced sound source generation unit for generating a voiced sound source, and the control means controls one or both of the phase of the output waveform of the voiced sound source generation unit and the sounding (output) timing of the waveform segment so that the peak position of the output waveform of the sound generation means coincides with the peak position on the waveform of the waveform segment.

8. The speech synthesizer according to claim 7, wherein driving of the voiced sound source generation unit is started so that the peak of the output waveform of the voiced sound source generation unit coincides with the peak position on the waveform of the waveform segment.

9. The speech synthesizer, wherein the amplitude of the voiced sound source generation unit is controlled so that the amplitude envelope of the output of the sound generation means reaches the peak amplitude value at the peak position.

10. The speech synthesizer according to claim 3, claim 4, claim 5, claim 6, claim 7, claim 8, or claim 9, wherein the waveform segment is cut out from the start of a consonant up to a predetermined number of pitch periods of the following vowel.

11. The described speech synthesizer, wherein the desired feature is the articulation timing of each consonant, such as the plosion event when the consonant element is a plosive and the frication event when the consonant element is a fricative.

12. The speech synthesizer according to claim 11, wherein the sound generation of the consonant element is started in advance with reference to the timing of articulation.

13. The speech synthesizer according to claim 5, wherein when the consonant element is an unvoiced consonant, the desired characteristic is the time of existence of the vocal cord vibration start event of the unvoiced consonant.

14. The speech synthesizer according to claim 11, wherein when synthesizing a devoiced consonant, the pronunciation of a consonant element is stopped by using the position of the vocal cord vibration start event.

15. The speech synthesizer according to claim 5, wherein, when the consonant segment is a voiced consonant, the desired feature is the position of the phonological onset event, that is, the position before which the waveform can be removed without changing the phonological character of the consonant.

16. The speech synthesis apparatus according to claim 15, wherein when the consonant to be pronounced is not silent or pause immediately before, the pronunciation is started from the time of existence of the phonological start event.

17. A speech synthesizer comprising voiced sound source waveform generation means for generating a voiced sound, a serial formant synthesis unit, consonant waveform generation means for generating a consonant, waveform connection means for connecting waveforms, and pitch synchronization signal generation means, wherein: the pitch synchronization signal generation means outputs a pitch synchronization signal corresponding to a desired pitch period; the voiced sound source waveform generation means and the consonant waveform generation means both generate waveforms whose phase is synchronized with the pitch synchronization signal; the serial formant synthesis unit applies to the output waveform of the voiced sound source waveform generation means a change in frequency characteristics by a transfer function simulating vocal tract characteristics; and the waveform connection means generates a speech waveform by connecting or mixing the output waveform of the serial formant synthesis unit and the output waveform of the consonant waveform generation means.

18. The speech synthesizer according to claim 17, further comprising pitch synchronization signal distribution means, wherein: the voiced sound source waveform generation means comprises voiced sound source waveform storage means, voiced sound source peak position storage means for storing peak positions on the voiced sound source waveform stored in the voiced sound source waveform storage means, first pitch waveform cutting means, second pitch waveform cutting means, and a mixing unit; the pitch synchronization signal distribution means distributes the pitch synchronization signal alternately into two and outputs the results to the first pitch waveform cutting means and the second pitch waveform cutting means, respectively; each of the first pitch waveform cutting means and the second pitch waveform cutting means cuts a pitch waveform out of the voiced sound source waveform storage means with a window function that is centered on a peak position stored in the voiced sound source peak position storage means, has a window length of about twice the desired pitch period, and converges to near zero at both ends, and outputs the cut-out pitch waveform to the mixing unit immediately after receiving the distributed pitch synchronization signal; and the mixing unit mixes the outputs of the first pitch waveform cutting means and the second pitch waveform cutting means.

19. The voice synthesizing apparatus according to claim 18, wherein the voiced sound source generation means includes a gain control means, and controls the gain of either one of the two waveforms input to the mixing means.

20. The speech synthesizer according to claim 17, further comprising pitch synchronization signal distribution means, wherein: the consonant waveform generation means comprises a plurality of consonant waveform storage means, a plurality of consonant waveform peak position storage means corresponding to the plurality of consonant waveform storage means, first pitch waveform cutting means, second pitch waveform cutting means, and mixing means; the pitch synchronization signal distribution means distributes the pitch synchronization signal alternately into two and outputs the results to the first pitch waveform cutting means and the second pitch waveform cutting means, respectively; each of the first pitch waveform cutting means and the second pitch waveform cutting means cuts a pitch waveform out of the consonant waveform corresponding to the desired consonant in the consonant waveform storage means with a window function that is centered on the peak position corresponding to the desired consonant stored in the consonant waveform peak position storage means, has a window length of about twice the desired pitch period, and converges to near zero at both ends, and outputs the cut-out pitch waveform to the mixing means immediately after receiving the distributed pitch synchronization signal; and the mixing means mixes the outputs of the first windowing unit and the second windowing unit.

21. The speech synthesizer according to claim 18, further comprising pitch synchronization signal distribution means, wherein: the consonant waveform generation means comprises a plurality of consonant waveform storage means, a plurality of consonant waveform peak position storage means corresponding to the plurality of consonant waveform storage means, first pitch waveform cutting means, second pitch waveform cutting means, and mixing means; the pitch synchronization signal distribution means distributes the pitch synchronization signal alternately into two and outputs the results to the first pitch waveform cutting means and the second pitch waveform cutting means, respectively; each of the first pitch waveform cutting means and the second pitch waveform cutting means cuts a pitch waveform out of the consonant waveform corresponding to the desired consonant in the consonant waveform storage means with a window function that is centered on the peak position corresponding to the desired consonant stored in the consonant waveform peak position storage means, has a window length of about twice the desired pitch period, and converges to near zero at both ends, and outputs the cut-out pitch waveform to the mixing means immediately after receiving the distributed pitch synchronization signal; and the mixing means mixes the outputs of the first windowing unit and the second windowing unit.

22. The speech synthesizer according to claim 21, further comprising pitch synchronization signal delay means, wherein: the voiced sound source waveform generation means comprises offset control means; the offset control means advances the read start position of the pitch waveform cutting means by an offset value, thereby shifting the voiced sound source waveform relative to the center of the window; and the pitch synchronization signal delay means delays the pitch synchronization signal by the offset value, so that the output of the consonant waveform generation means is delayed by the offset and synchronized with the voiced sound source waveform.

23. The speech synthesizer according to claim 18 or claim 21, comprising a plurality of voiced sound source generation means, wherein all the voiced sound source generation means are synchronized using a common pitch synchronization signal or distributed pitch synchronization signal.

24. The speech synthesizer according to claim 22, comprising a plurality of voiced sound source generation means, wherein all the voiced sound source generation means are synchronized using a common pitch synchronization signal or distributed pitch synchronization signal and a common offset value.