JPH0836397A

JPH0836397A - Voice synthesizer

Info

Publication number: JPH0836397A
Application number: JP6169280A
Authority: JP
Inventors: Kenji Matsui; 謙二松井; Takahiro Kamai; 孝浩釜井
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1994-07-21
Filing date: 1994-07-21
Publication date: 1996-02-06

Abstract

PURPOSE: To provide a speech synthesizer that delivers high quality, requires less memory, and can generate speech with a variety of voice qualities. CONSTITUTION: The synthesizer is provided with a natural speech unit storage means 2 and a feature parameter storage means 3, which extracts, for each unit of the natural speech unit set, parameter vectors characterizing the acoustic properties at positions such as the vowel steady-state portion, and stores them together with their position information. The feature parameter vector corresponding to a given natural speech unit is taken as a target. A speech synthesis means 6 synthesizes the speech transition from the target to the desired phoneme, and the result is joined to the natural speech unit, so that vowels can be given flexible and varied voice quality and intonation while consonants retain high quality. In addition, the capacity required to store the speech units is reduced and the joins between speech units are made smoother.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesizer, and in particular to a speech synthesizer that converts arbitrary content into speech.

[0002]

[Prior Art] Speech synthesis methods that convert arbitrary content into speech fall broadly into two types. The first is based on an understanding of the speech production mechanism, that is, the movements of the vocal cords, mouth, and throat, and encodes that knowledge as rules that control an electric circuit or the like. The second requires little knowledge of speech; it prepares a large number of speech units and concatenates the units suited to the input. A well-known example of the former is the combination of formant synthesis with formant control rules. FIG. 6 shows an example configuration of this combination. In the figure, the formant synthesizer control rule storage unit 9 stores a plurality of rules for controlling the formant synthesizer; the formant synthesizer control coefficient generation unit 8 generates coefficients for controlling the formant synthesizer on the basis of those control rules; the formant synthesizer 10 actually synthesizes speech; the voiced sound source unit 11 simulates vocal cord vibration; the serial formant synthesis unit 13 connects formant resonators in series and synthesizes voiced sounds such as vowels and nasals; the unvoiced sound source unit 12 is the turbulent noise source needed to synthesize fricatives, plosives, and the like; and the parallel formant synthesis unit 14 has resonators connected in parallel and synthesizes unvoiced consonant portions such as fricatives and plosives. The combining unit 15 combines the output of the serial formant synthesis unit 13 with the output of the parallel formant synthesis unit 14 and outputs the synthesized sound.

[0003] When the phonetic symbols, accent positions, intonation information, and other data needed for speech synthesis are input to the formant synthesizer control coefficient generation unit 8, that unit looks up the necessary rules in the formant synthesizer control rule storage unit 9 and outputs formant synthesizer control coefficients to the formant synthesizer 10. Inside the formant synthesizer 10, operation proceeds as follows. When a voiced sound such as a vowel is synthesized, the voiced sound source unit 11 simulates the pulse-like source waveform produced by the human vocal cords. This pulse-like signal is input to the serial formant synthesis unit 13, which uses a plurality of formant resonators to give the source waveform the characteristics appropriate to a vowel or nasal and outputs the result to the combining unit 15. Meanwhile, the unvoiced sound source unit 12 sends a noise-like waveform, which serves as the source for fricatives and plosives, to the parallel formant synthesis unit 14, which uses a plurality of resonators to form the spectral characteristics required for each consonant at each instant and outputs the result to the combining unit 15. The combining unit 15 combines the vowels and nasals from the serial formant synthesis unit 13 with the consonants from the parallel formant synthesis unit 14 and outputs the result as synthesized speech.
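
The serial (cascade) path described above, a pulse-like source passed through formant resonators connected in series, can be illustrated with a minimal sketch. It assumes the two-pole digital resonator commonly used in formant synthesizers, a plain impulse train as the voiced source, and illustrative formant values. None of these specifics come from this document, and the function names are hypothetical.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c


def resonate(signal, freq_hz, bw_hz, fs):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, fs, n_samples):
    """Crude stand-in for the voiced source: one unit pulse per pitch period."""
    period = int(round(fs / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def cascade_vowel(formants, f0_hz=120.0, fs=16000, n_samples=1600):
    """Pass the source through each (frequency, bandwidth) resonator in series."""
    signal = impulse_train(f0_hz, fs, n_samples)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs)
    return signal


# Hypothetical (frequency, bandwidth) pairs in Hz for an /o/-like vowel.
vowel = cascade_vowel([(500.0, 80.0), (900.0, 100.0), (2500.0, 150.0)])
```

Because the resonators are chained, each one shapes the output of the previous one, which is why a cascade arrangement suits vowels and nasals whose formants all act on a single source signal.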

[0004] The advantage of this method is that, because every sound is built by rule, it is highly flexible and can synthesize speech with a variety of voice qualities and intonations. Its disadvantage is that synthesis rules are still not well established for sounds with complex production mechanisms, consonants in particular, so natural-sounding consonants are difficult to generate.

[0005] Next, the other conventional approach, which uses speech units, is described. FIG. 7 is a block diagram of this method. On the basis of the input phonetic symbol string and accent information, the speech unit selection unit 16 selects the speech units needed for synthesis from the speech unit database storage unit 17. The speech units are stored either compressed into coefficients such as linear prediction coefficients, or as time-domain waveforms to which simple compression has been applied. The selected speech units are connected in the unit connection synthesis unit 18 and synthesized into a speech waveform at an appropriate fundamental frequency.

[0006] The advantage of this method is that, because the speech units are essentially cut out of natural speech from a model speaker, synthesis quality is extremely high provided the units can be joined smoothly. Its disadvantages are that storing the speech units requires a large-capacity storage device, which raises cost, and that only the voice quality of the model speaker can be synthesized, so the method lacks flexibility.

[0007] The method shown in FIG. 8, which merges the two methods above, is described next. In the figure, the voiced sound source unit 11 simulates vocal cord vibration and generates a source signal. The serial formant synthesis unit 13 synthesizes voiced sounds such as vowels. The speech unit database storage unit 17 stores consonant speech units cut out of natural speech; the speech unit selection unit 16 selects and retrieves the required speech units; and the unit connection synthesis unit 18 combines the output of the serial formant synthesis unit 13 with the output of the speech unit selection unit 16 and outputs the result as synthesized speech.

[0008] The operation of this third conventional example, configured as described above, is as follows. The voiced sound source unit 11 generates the desired source signal from the fundamental frequency information, source amplitude information, and similar items in the formant synthesizer control coefficients, and inputs it to the serial formant synthesis unit 13. No source signal is output during consonant or unvoiced intervals. The serial formant synthesis unit 13 determines the characteristics of its series-connected resonators from the formant frequency information, the formant bandwidth information, and similar items in the formant synthesizer control coefficients, and converts the source signal into a speech signal such as a vowel. The output of the serial formant synthesis unit 13 is sent to the unit connection synthesis unit 18. Meanwhile, using the phoneme information in the formant synthesizer control coefficients, the speech unit selection unit 16 checks whether the phoneme exists in the speech unit database storage unit 17 and, if it does, retrieves the speech unit from that storage unit and sends it to the unit connection synthesis unit 18. For example, when the phoneme to be synthesized is "k" and the following vowel is "a", the speech unit selection unit 16 searches the speech unit database storage unit 17 for a speech unit of the consonant "k" cut out of "ka". The unit connection synthesis unit 18 combines the vowel signal from the serial formant synthesis unit 13 with the consonant signal from the speech unit selection unit 16 by addition, overlap processing, or the like. With this configuration, formant synthesis gives vowels flexible and varied voice quality and intonation, while the speech-unit method gives consonants a quality that formant synthesis cannot achieve. In addition, because only short-duration syllables are stored as speech units, a small-capacity storage device suffices. However, the sound quality produced by the formant synthesis unit differs considerably from that of the speech units, so unnaturalness and degradation at the joins become a problem.

[0009]

[Problems to Be Solved by the Invention] As described in the conventional examples above, a method that builds every sound by rule is highly flexible and can synthesize speech with a variety of voice qualities and intonations, but sounds with complex production mechanisms, such as consonants, are difficult to synthesize because their synthesis rules are not yet well established. A method that uses speech units achieves extremely high synthesis quality, but it requires a large-capacity storage device for the units and can synthesize only the voice quality of the model speaker, so it lacks flexibility. A method that merges model-based synthesis with unit-based synthesis has the advantages of both the rule-based and the unit-based methods, but a difference in sound quality arises at the joins and becomes a major cause of quality degradation.

[0010] In view of the above problems of conventional speech synthesizers, an object of the present invention is to provide a speech synthesizer that (1) offers flexible voice quality, (2) requires far less storage capacity than the conventional method using speech units, and (3) achieves high synthesis quality because differences in sound quality caused by joining speech units are inconspicuous.

[0011]

[Means for Solving the Problems] The speech synthesizer of the present invention comprises: natural speech unit storage means for storing a natural speech unit set; feature parameter storage means for extracting, from one or more positions in each unit of the natural speech unit set stored in the natural speech unit storage means, a parameter vector characterizing the acoustic properties at that position, and storing the feature parameter vector together with information on the position from which it was extracted; speech synthesis means for taking as a target one of the feature parameter vectors corresponding to a given natural speech unit in the stored natural speech unit set, and synthesizing the speech transition from the target to the desired phoneme that is to be joined to the natural speech unit; and speech unit connection means for joining the natural speech unit and the synthesized speech unit near the position of the target. The above object is thereby achieved.

[0012] The natural speech unit set may be a set of units each extending from the consonant onset of a single syllable to at least the consonant-vowel transition or the vowel steady-state portion, and the feature parameter vector may be a vector extracted from at least the consonant onset position, the consonant-vowel transition, or the vowel steady-state portion of the single-syllable unit.

[0013] The feature parameter vector stored by the feature parameter storage means may include at least the formant frequencies, the formant bandwidths, and the characteristics of the sound source.

[0014] The natural speech unit storage means may cut out and store, from each of a plurality of original single-syllable units taken from natural speech, the span from the consonant onset to the transition between the consonant and its following vowel. The feature parameter storage means may store, as a first target, a parameter vector characterizing the acoustic properties extracted at the consonant-to-vowel transition position of each single-syllable unit stored in the natural speech unit storage means, together with information on that position, and may store, as a second target, a parameter vector characterizing the acoustic properties extracted from the vowel steady-state portion of the original single-syllable unit corresponding to that single-syllable unit, together with information on that position. The speech synthesis means may interpolate between the first target and the second target and synthesize the speech transition from the second target to the desired phoneme that is to be joined to the single-syllable unit. The speech unit connection means may join the single-syllable unit and the speech unit synthesized by the speech synthesis means near the position of the first target.

[0015]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings.

[0016] (First Embodiment) FIG. 1 shows the configuration of a speech synthesizer according to one embodiment of the present invention. In the figure, the synthesis control unit 1 sends control information such as timing signals and the fundamental frequency to each unit. The natural speech unit storage unit 2 stores, after normalization and compression, a set of units cut out of natural speech; in this embodiment it is a set of single-syllable units, each extending from the consonant onset to the vowel steady-state portion. The feature parameter storage unit 3 extracts, from one or more positions in each unit of the natural speech unit set, a parameter vector characterizing the acoustic properties, and stores the feature parameter vector together with information on the position from which it was extracted; in this embodiment the parameter vector consists of the formant frequencies and formant bandwidths of the vowel steady-state portion of each single-syllable unit, plus the average spectral tilt as the source characteristic. The formant synthesis rule storage unit 4 generates formant frequencies, bandwidths, and so on appropriate to a given phonemic context. The voiced sound source unit 5 simulates vocal cord vibration and generates a source signal. The formant synthesis unit 6 generates synthesized speech by simulating the spectral characteristics of voiced sounds such as vowels with a plurality of resonators. The unit connection unit 7 smoothly joins natural speech units and formant-synthesized speech.
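
As a rough illustration of what the storage units 2 and 3 hold, the sketch below pairs each stored syllable waveform with one or more feature vectors and the positions at which they were extracted. The class names, field names, the dB-per-octave unit for spectral tilt, and all numeric values are assumptions made for this example; the document does not specify a data layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FeatureVector:
    """Acoustic snapshot taken at one position inside a stored unit."""
    position: int                        # sample index where the vector was extracted
    formant_freqs_hz: Tuple[float, ...]  # F1, F2, F3, ...
    formant_bws_hz: Tuple[float, ...]    # bandwidth of each formant
    spectral_tilt: float                 # average spectral slope (unit assumed: dB/octave)


@dataclass
class UnitEntry:
    waveform: List[float]         # samples of the natural speech unit
    targets: List[FeatureVector]  # one or more vectors usable as synthesis targets


class UnitStore:
    """Hypothetical combined view of storage units 2 (waveforms) and 3 (features)."""

    def __init__(self) -> None:
        self._units: Dict[str, UnitEntry] = {}

    def add(self, syllable: str, waveform: List[float],
            targets: List[FeatureVector]) -> None:
        for t in targets:
            if not 0 <= t.position < len(waveform):
                raise ValueError("target position lies outside the unit")
            if len(t.formant_freqs_hz) != len(t.formant_bws_hz):
                raise ValueError("each formant needs a bandwidth")
        self._units[syllable] = UnitEntry(waveform, targets)

    def lookup(self, syllable: str) -> UnitEntry:
        return self._units[syllable]


store = UnitStore()
store.add(
    "so",
    waveform=[0.0] * 3200,  # placeholder samples, not real speech
    targets=[FeatureVector(2400, (500.0, 900.0, 2500.0), (80.0, 100.0, 150.0), -12.0)],
)
entry = store.lookup("so")
```

Keeping the extraction position next to each vector is the point of the design: the same index later tells the connection stage where the natural unit and the synthesized vowel should meet.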

[0017] The operation of the speech synthesizer of this embodiment, configured as described above, is as follows.

[0018] The synthesis operation is described using the word "sora" as an example. First, the synthesis control unit 1 selects the syllable "so" from the natural speech unit storage unit 2 and, at the same time, selects from the feature parameter storage unit 3 the parameter vector corresponding to the steady-state portion of the vowel "o" in "so". The spectral tilt information in this parameter vector is sent to the voiced sound source unit 5, which generates a source signal whose spectral tilt is close to that of the vowel "o" in "so". The formant frequency and bandwidth information in the parameter vector is sent to the formant synthesis unit 6 and stored as target information. Next, the synthesis control unit 1 selects from the formant synthesis rule storage unit 4 the rule describing the behavior of the vowel "o" in the context "s-o-r", and this rule is sent to the formant synthesis unit 6. Taking the target's formant frequency and bandwidth values as starting values, the formant synthesis unit 6 synthesizes the transition to the consonant "r" specified by the synthesis rule while varying the coefficients of the synthesis filter. The unit connection unit 7 then joins the natural speech unit "so" and the synthesized "o" at the point given by the position information of the parameter vector. One possible joining method is pitch-synchronous windowed overlap-add, as shown in FIG. 2. Because the natural speech unit "so" and the synthesized "o" can be joined smoothly in this way, the naturalness and intelligibility of the output speech of the synthesizer are improved. FIG. 3 shows how the formant-synthesized speech is matched to the spectrum of the natural speech unit at the join. As the figure shows, when the parameter vector is chosen appropriately, the formant-synthesized speech agrees quite well with the natural speech. In the same way, the synthesis control unit 1 then selects "ra" and the formant synthesis unit synthesizes "a", completing the synthesized speech for the whole word. FIG. 4 shows how the synthesized speech and the natural speech units are joined.
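
The joining step can be sketched in simplified form. The code below is a stand-in for the pitch-synchronous windowed overlap-add mentioned above: it crossfades the natural unit into the synthesized vowel with a raised-cosine window over a fixed number of samples (for instance, one pitch period) starting at the stored target position. A real pitch-synchronous method would align its windows to pitch marks, which are not modeled here; the function name and parameters are hypothetical.

```python
import math


def crossfade_join(natural, synthetic, join_pos, overlap):
    """Join two sample lists, aligning synthetic[0] with natural[join_pos].

    Over `overlap` samples the natural unit fades out while the synthesized
    speech fades in; the two weights always sum to 1.
    """
    if overlap <= 0 or join_pos < 0:
        raise ValueError("overlap must be positive and join_pos non-negative")
    if join_pos + overlap > len(natural) or overlap > len(synthetic):
        raise ValueError("overlap region exceeds one of the signals")

    out = list(natural[:join_pos])
    for i in range(overlap):
        # Raised-cosine weight rising from near 0 to near 1 across the overlap.
        w = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / overlap)
        out.append((1.0 - w) * natural[join_pos + i] + w * synthetic[i])
    out.extend(synthetic[overlap:])
    return out


# Toy signals: a constant "natural" unit and a silent "synthetic" continuation.
natural_so = [1.0] * 100
synthetic_o = [0.0] * 50
joined = crossfade_join(natural_so, synthetic_o, join_pos=60, overlap=20)
```

The output length is `join_pos + len(synthetic)`, since everything in the natural unit after the overlap region is discarded in favor of the synthesized continuation.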

[0019] As described above, the speech synthesizer of the first embodiment of the present invention obtains the feature parameter vector from the vowel steady-state portion of the single-syllable speech. It is of course also possible to hold, as a feature vector, the feature vector at the consonant onset position of each syllable, and to have the formant synthesis unit 6 synthesize from a given vowel toward the feature vector of the next syllable as its target. Depending on the type of syllable, this yields a large improvement in sound quality.

[0020] (Second Embodiment) Next, a speech synthesizer according to a second embodiment of the present invention is described with reference to FIGS. 1 and 5. In FIG. 1, the natural speech unit storage unit 2 cuts out and stores, from each of a plurality of original single-syllable units taken from natural speech, the span from the consonant onset to the transition between the consonant and its following vowel. The feature parameter storage unit 3 stores, as a first target, a parameter vector characterizing the acoustic properties extracted at the consonant-to-vowel transition position of each single-syllable unit stored in the natural speech unit storage unit 2, together with information on that position, and stores, as a second target, a parameter vector characterizing the acoustic properties extracted from the vowel steady-state portion of the original single-syllable unit corresponding to that single-syllable unit, together with information on that position. FIG. 5 illustrates this arrangement.

[0021] As in the first embodiment, the synthesis operation is described using the word "sora" as an example. First, the synthesis control unit 1 selects the syllable "so" from the natural speech unit storage unit 2 and, at the same time, selects from the feature parameter storage unit 3 the two parameter vectors corresponding to the consonant-vowel transition of "so" and the steady-state portion of the vowel "o" as the first target and the second target, respectively. The spectral tilt information in each parameter vector is sent to the voiced sound source unit 5. The formant frequency and bandwidth information in each parameter vector is sent to the formant synthesis unit 6 and stored as first target and second target information, respectively. Next, the synthesis control unit 1 selects from the formant synthesis rule storage unit 4 the rule describing the behavior of the vowel "o" in the context "s-o-r", and this rule is sent to the formant synthesis unit 6. Taking the formant frequency and bandwidth values of the first target as starting values and the second target as a waypoint for each parameter, the formant synthesis unit 6 synthesizes the transition to the consonant "r" specified by the synthesis rule while varying the coefficients of the synthesis filter. The unit connection unit 7 then joins the natural speech unit "so" and the synthesized "o" at the s-o transition, which is the position of the first target. FIG. 5 summarizes these synthesis and joining operations. With this configuration, each syllable stored in the natural speech unit storage unit 2 need extend only as far as the consonant-vowel transition, which greatly reduces the storage capacity required for the unit set. Moreover, because the second target allows the vowel steady-state portion of each unit to be reproduced accurately, degradation in sound quality caused by the absence of the vowel steady-state portion from the unit can be prevented. In the same way, the synthesis control unit 1 selects "ra" and the formant synthesis unit synthesizes "a", completing the synthesized speech for the whole word.
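
The two-target trajectory can be sketched as a per-frame formant track that starts at the first target, passes through the second target, and ends at the values a rule would specify for the approach to the next consonant. Linear interpolation, the frame counts, and every numeric value below are assumptions for illustration; the document says only that the synthesizer interpolates between the targets and does not name the interpolation method.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def formant_track(first, second, end, n_to_second, n_to_end):
    """Per-frame formant values: first -> second (waypoint) -> end.

    `first` and `second` play the role of the two stored targets; `end`
    stands for the rule-derived values at the onset of the next consonant.
    """
    if n_to_second <= 0 or n_to_end <= 0:
        raise ValueError("frame counts must be positive")
    if not (len(first) == len(second) == len(end)):
        raise ValueError("all formant vectors must have the same length")

    frames = []
    for i in range(n_to_second):          # first target up to, not including, the waypoint
        t = i / n_to_second
        frames.append([lerp(a, b, t) for a, b in zip(first, second)])
    for i in range(n_to_end + 1):         # waypoint through to the final values
        t = i / n_to_end
        frames.append([lerp(a, b, t) for a, b in zip(second, end)])
    return frames


# Hypothetical F1-F3 values in Hz for the "o" of "so" heading toward "r".
first_target = [450.0, 1100.0, 2500.0]   # at the s-o transition
second_target = [500.0, 900.0, 2500.0]   # vowel steady state
rule_end = [400.0, 1300.0, 2000.0]       # rule-specified values near "r"
track = formant_track(first_target, second_target, rule_end, n_to_second=5, n_to_end=10)
```

Each frame of `track` would then be turned into resonator coefficients, which is one way to read "varying the coefficients of the synthesis filter" during the transition.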

[0022] With this configuration, formant synthesis gives vowels flexible and varied voice quality and intonation, while the speech-unit method gives consonants a quality that formant synthesis cannot achieve. Because only short-duration consonants are stored as speech units, a small-capacity storage device suffices.

[0023]

[Effects of the Invention] As described above, according to the present invention, vowel-like signals can be given flexible and varied voice quality and intonation by serial formant synthesis, and consonant-like signals can be rendered with a quality that formant synthesis cannot achieve by using speech units, so the synthesized sound combining them is of high quality and can accommodate a variety of voice qualities. Furthermore, compared with the conventional method using speech units, each stored speech unit need cover only the short span from the consonant onset to the transition portion of the syllable, so the invention can be realized with a small-capacity storage device.

[0024] Furthermore, because the acoustic characteristics are matched at the join between units, degradation caused by differences in sound quality at the join is greatly reduced.

[Brief description of drawings]

【図１】本発明第１および第２の実施例における音声合
成装置のブロック図FIG. 1 is a block diagram of a speech synthesizer according to first and second embodiments of the present invention.

【図２】自然音声素片「ｓｏ」と合成母音「ｏ」の接続
の様子を示す図FIG. 2 is a diagram showing a state of connection between a natural voice unit “so” and a synthetic vowel “o”.

【図３】自然音声素片と合成音声との整合の様子を示す
図FIG. 3 is a diagram showing a state of matching between natural speech units and synthetic speech.

【図４】自然音声素片と合成音声素片とで単語「ｓｏ
ｒａ」が合成される様子を示す図FIG. 4 shows the word “so” in a natural speech unit and a synthetic speech unit.
Diagram showing how "ra" is combined

【図５】本発明第２の実施例における自然音声素片と合
成音声素片の接続方法を示す図FIG. 5 is a diagram showing a method of connecting a natural speech unit and a synthetic speech unit according to the second embodiment of the present invention.

【図６】従来のホルマント合成装置の構成図FIG. 6 is a block diagram of a conventional formant synthesizer.

【図７】従来の音声素片を用いた音声合成装置の構成図FIG. 7 is a block diagram of a conventional speech synthesizer using speech units.

【図８】ホルマントと音声素片の融合により音声合成を行う場合の装置の構成図FIG. 8 is a block diagram of an apparatus that performs speech synthesis by fusing formant synthesis and speech units.

[Explanation of symbols]

1 合成制御部（手段） Synthesis control unit (means)
2 自然音声素片格納部（手段） Natural speech unit storage unit (means)
3 特徴パラメタ格納部（手段） Feature parameter storage unit (means)
4 ホルマント合成規則格納部（手段） Formant synthesis rule storage unit (means)
5 音源部（手段） Sound source unit (means)
6 ホルマント合成部（手段） Formant synthesis unit (means)
7 素片接続部（手段） Speech unit connection unit (means)
8 ホルマント合成器制御用係数生成部（手段） Formant synthesizer control coefficient generation unit (means)
9 ホルマント合成器制御規則格納部（手段） Formant synthesizer control rule storage unit (means)
10 ホルマント合成器（手段） Formant synthesizer (means)
11 有声音源部（手段） Voiced sound source unit (means)
12 無声音源部（手段） Unvoiced sound source unit (means)
13 直列型ホルマント合成部（手段） Serial formant synthesis unit (means)
14 並列型ホルマント合成部（手段） Parallel formant synthesis unit (means)
15 合成部（手段） Synthesis unit (means)
16 音声素片選択部（手段） Speech unit selection unit (means)
17 音声素片データベース格納部（手段） Speech unit database storage unit (means)
18 素片接続合成部（手段） Speech unit connection and synthesis unit (means)

Claims

[Claims]

1. A speech synthesizer comprising: natural speech unit storage means for storing a natural speech unit set; feature parameter storage means for extracting, from a single position or a plurality of positions in each unit of the natural speech unit set stored in the natural speech unit storage means, a parameter vector characterizing the acoustic features at that position, and for storing the feature parameter vector together with information on the position from which it was extracted; speech synthesis means for taking as a target one of the feature parameter vectors corresponding to a given natural speech unit of the natural speech unit set stored in the natural speech unit storage means, and for synthesizing the transition of speech from the target to a desired phoneme to be connected to the natural speech unit; and speech unit connection means for connecting the natural speech unit and the synthesized speech unit near the position of the target.

2. The speech synthesizer according to claim 1, wherein the natural speech unit set is a set of units each extending from the consonant start of a single syllable to at least the consonant-vowel transition part or the vowel steady part, and the feature parameter vector is extracted from at least the consonant start position, the consonant-vowel transition part, or the vowel steady part of the monosyllabic unit.

3. The speech synthesizer according to claim 1, wherein the feature parameter vector stored in the feature parameter storage means includes at least each formant frequency, each formant bandwidth, and the characteristics of the sound source.

4. The speech synthesizer according to claim 1, wherein: the natural speech unit storage means stores a plurality of monosyllabic units cut out from natural speech, each cut from the consonant start position to the transition position between the consonant and the following vowel; the feature parameter storage means stores, as a first target, a parameter vector characterizing the acoustic features extracted at the consonant-to-vowel transition position of each monosyllabic unit stored in the natural speech unit storage means, together with information on the extraction position, and stores, as a second target, a parameter vector characterizing the acoustic features extracted from the vowel steady part of the original monosyllable corresponding to that monosyllabic unit, together with information on the extraction position; the speech synthesis means interpolates between the first target and the second target, and from the second target to a desired phoneme to be connected to the monosyllabic unit; and the speech unit connection means connects the monosyllabic unit and the speech unit synthesized by the speech synthesis means near the position of the first target.
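The feature parameter vector of claim 3 and the target-to-target interpolation of claim 4 can be pictured with a minimal sketch. The claims only say that the synthesis means "interpolates", so the linear interpolation used here is an assumption, as are the class and field names, the use of a single amplitude value to stand in for the sound source characteristics, the choice of sample indices for positions, and all numeric values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FeatureVector:
    """One stored target: where it was extracted and what was measured there."""
    position: int                        # sample index in the original recording (assumed unit)
    formant_freqs_hz: Tuple[float, ...]  # each formant frequency
    formant_bws_hz: Tuple[float, ...]    # each formant bandwidth
    source_amplitude: float              # stand-in for the sound source characteristics


def _lerp(a: float, b: float, w: float) -> float:
    return a + (b - a) * w


def interpolate_targets(first: FeatureVector, second: FeatureVector,
                        n_frames: int) -> List[FeatureVector]:
    """Linearly interpolate from the first target to the second (assumed scheme)."""
    if n_frames < 2:
        raise ValueError("need at least two frames (the two targets)")
    if (len(first.formant_freqs_hz) != len(second.formant_freqs_hz)
            or len(first.formant_bws_hz) != len(second.formant_bws_hz)):
        raise ValueError("targets must describe the same number of formants")
    frames = []
    for i in range(n_frames):
        w = i / (n_frames - 1)  # 0.0 at the first target, 1.0 at the second
        frames.append(FeatureVector(
            position=round(_lerp(first.position, second.position, w)),
            formant_freqs_hz=tuple(_lerp(a, b, w) for a, b in
                                   zip(first.formant_freqs_hz, second.formant_freqs_hz)),
            formant_bws_hz=tuple(_lerp(a, b, w) for a, b in
                                 zip(first.formant_bws_hz, second.formant_bws_hz)),
            source_amplitude=_lerp(first.source_amplitude, second.source_amplitude, w),
        ))
    return frames


# First target: consonant-to-vowel transition; second target: vowel steady part.
first_target = FeatureVector(800, (450.0, 1100.0), (70.0, 100.0), 0.6)
second_target = FeatureVector(1600, (500.0, 900.0), (60.0, 90.0), 1.0)
track = interpolate_targets(first_target, second_target, 5)
```

Each frame in `track` could drive a formant synthesizer for one short interval, so the synthesized portion starts from parameters measured on the stored natural unit and moves smoothly toward the vowel steady part.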