JP2006227367A

JP2006227367A - Speech synthesizer

Info

Publication number: JP2006227367A
Application number: JP2005042130A
Authority: JP
Inventors: Takashi Yato; 隆矢頭
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2005-02-18
Filing date: 2005-02-18
Publication date: 2006-08-31

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer capable of generating natural synthesized speech by solving the conventional problems concerning devoicing in speech.

SOLUTION: A speech synthesizer comprises a segment dictionary in which speech segments, the basic units of speech, are registered; a parameter generation unit that generates synthesis parameters, including at least the speech segment, phoneme duration, and fundamental frequency, for a phoneme/prosodic symbol string; and a waveform generation unit that generates a synthesized waveform from the synthesis parameters supplied by the parameter generation unit while referring to the segment dictionary. The parameter generation unit 300 includes a vowel devoicing level determination unit 304 that corrects the result of a devoicing determination made under ordinary devoicing rules according to the speech rate and the type of syllable, and the vowel devoicing level determination unit 304 makes the devoicing determination using two or more levels.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a text-to-speech synthesizer that takes as input ordinary text written in a mixture of kanji and kana and converts it into speech, and more particularly to processing for realizing vowel devoicing.

Text-to-speech conversion technology takes as input the mixed kanji-kana text that we read and write every day, converts it into speech, and outputs it. Because the output vocabulary is unrestricted, it is expected to find application in many fields as an alternative to recording/playback speech synthesis.

Conventionally, a typical speech synthesizer of this type has the processing configuration shown in FIG. 2. In FIG. 2, 101 is a text analysis unit, 102 is a parameter generation unit, 103 is a waveform generation unit, 104 is a word dictionary, and 105 is a segment dictionary.

The text analysis unit 101 receives a mixed kanji-kana sentence, performs morphological analysis with reference to the word dictionary, determines the reading, accent, and intonation, and outputs phonetic symbols with prosodic symbols (an intermediate language).

The parameter generation unit 102 sets the pitch frequency pattern, phoneme durations, and the like, and the waveform generation unit 103 performs the speech synthesis. From the phonetic symbol string given in the intermediate language, the waveform generation unit 103 selects the speech synthesis units to be used from the segment dictionary 105, then concatenates and modifies them according to the parameters determined by the parameter generation unit to synthesize speech. Speech synthesis units include phonemes, syllables (CV), VCV, CVC (C: consonant, V: vowel), and units that extend phoneme chains.

In the above configuration, when ordinary mixed kanji-kana text (hereinafter, text) is input, the text analysis unit 101 generates a phoneme/prosodic symbol string from the character information. A phoneme/prosodic symbol string is a character-string description of the reading, accent, intonation, and so on of the input sentence (hereinafter, intermediate language). The word dictionary 104 is a pronunciation dictionary in which word readings, accents, and the like are registered, and the text analysis unit 101 generates the intermediate language while referring to it.

From the intermediate language generated by the text analysis unit 101, the parameter generation unit 102 determines synthesis parameters consisting of patterns such as speech segments (type of sound), phoneme durations (length of sound), and fundamental frequency (height of the voice, hereinafter pitch), and sends them to the waveform generation unit 103. A speech segment is a basic unit of speech that is concatenated to build a synthesized waveform; there are various kinds depending on the type of sound.

Using the parameters generated by the parameter generation unit 102, the waveform generation unit 103 generates a synthesized waveform while referring to the segment dictionary 105, which consists of a ROM or the like storing speech segments, and the synthesized speech is output through a speaker. This is the flow of text-to-speech conversion.

Next, the processing in the parameter generation unit 102 is described in detail with reference to FIG. 3. FIG. 3 is a block diagram showing the configuration of the parameter generation unit 102 of a conventional speech synthesizer. In FIG. 3, the parameter generation unit 102 comprises an intermediate language analysis unit 201, a pitch pattern generation unit 202, a vowel devoicing determination unit 203, a phoneme power determination unit 204, a phoneme duration calculation unit 205, and a duration correction unit 206.

The intermediate language input to the parameter generation unit 102 is a phoneme character string that includes accent positions, pause positions, and the like. From it, the unit determines the parameters needed to generate a waveform (hereinafter, waveform generation parameters), such as the change of pitch over time (hereinafter, pitch pattern), the duration of each phoneme (hereinafter, phoneme duration), and speech power.

The intermediate language analysis unit 201 analyzes the input intermediate language string, determines word boundaries from the word delimiter symbols written in the intermediate language, and obtains the mora position of the accent nucleus from the accent symbols.

An accent nucleus is the position within a word at which the pitch falls. A word whose accent nucleus is on the first mora is called a type-1 accent word, and a word whose accent nucleus is on the n-th mora is called a type-n accent word; these are collectively called undulating accent words. Conversely, a word with no accent nucleus (for example "shinbun" (newspaper) or "pasokon" (personal computer)) is called a type-0 or flat accent word.
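The naming convention above can be illustrated with a small sketch. The function name `accent_type` and the convention of passing `None` or `0` for "no accent nucleus" are assumptions made for this example, not part of the patent.

```python
from typing import Optional


def accent_type(nucleus_mora: Optional[int]) -> str:
    """Label a word by the 1-based mora position of its accent nucleus.

    None or 0 means the word has no accent nucleus (assumed convention).
    """
    if nucleus_mora is None or nucleus_mora == 0:
        return "type-0 (flat)"
    if nucleus_mora < 0:
        raise ValueError("mora position must be non-negative")
    return f"type-{nucleus_mora} (undulating)"
```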

The pitch pattern generation unit 202 generates a pitch pattern from the phrase symbols and accent symbols in the intermediate language, the number of morae in each word, the number of morae in each phrase, and the like.

The vowel devoicing determination unit 203 determines vowel devoicing from the phoneme symbols, accent symbols, and the like in the intermediate language, and sends the result to the phoneme power determination unit 204 and the phoneme duration calculation unit 205. Vowel devoicing is described later.

The phoneme duration calculation unit 205 calculates the duration of each phoneme from the phoneme character string and sends it to the duration correction unit 206. When the user specifies a speech rate level, the duration correction unit 206 linearly stretches or compresses the phoneme durations calculated by the unit 205 according to the specified level. The phoneme durations scaled according to the speech rate level by the duration correction unit 206 are sent to the waveform generation unit (not shown).
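The linear stretching performed by the duration correction unit can be sketched as a single multiplication per phoneme. The level-to-factor table below is purely hypothetical (the patent gives no numeric values), as are the names `DEFAULT_FACTORS` and `scale_durations`.

```python
from typing import Dict, List, Optional

# Hypothetical table: speech rate level -> duration scale factor.
# Level 3 is assumed to be the normal rate; all values are illustrative only.
DEFAULT_FACTORS: Dict[int, float] = {1: 1.5, 2: 1.25, 3: 1.0, 4: 0.8, 5: 0.5}


def scale_durations(durations_ms: List[float], rate_level: int,
                    factors: Optional[Dict[int, float]] = None) -> List[float]:
    """Linearly stretch or compress phoneme durations for a speech rate level."""
    table = DEFAULT_FACTORS if factors is None else factors
    if rate_level not in table:
        raise ValueError(f"unknown speech rate level: {rate_level}")
    factor = table[rate_level]
    return [d * factor for d in durations_ms]
```

Because the scaling is linear, every phoneme is stretched by the same ratio regardless of its type; this is the behaviour the description attributes to the conventional device.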

The phoneme power determination unit 204 calculates the waveform amplitude values and sends them to the waveform generation unit 103 (see FIG. 2).

The waveform generation parameters described above are sent to the waveform generation unit 103, which generates the synthesized waveform.

Next, vowel devoicing is described in detail. When a person speaks, air pushed out of the lungs is turned into a sound source by the opening and closing of the vocal cords, and various phonemes are expressed by moving the jaw, tongue, lips, and so on to change the resonance characteristics of the vocal tract. The pitch mentioned above corresponds to the vibration period of the vocal cords, and its change over time expresses accent and intonation.

Besides sounds produced by such vocal cord vibration, there are fricatives, in which the tongue forms a constriction at some point in the vocal tract and the airflow passing through it becomes turbulent and produces a noise-like sound, and plosives, in which the tongue or lips close the vocal tract to stop the airflow momentarily and then release it at once to produce an impulse-like sound.

Phonemes accompanied by vocal cord vibration, such as vowels, the plosives /b/, /d/, /g/, the fricatives /j/, /z/, and the nasals and liquids /m/, /n/, /r/, are collectively called voiced sounds. Phonemes not accompanied by vocal cord vibration, such as the plosives /p/, /t/, /k/ and the fricatives /s/, /h/, /f/, are called unvoiced sounds. Focusing on consonants in particular, consonants accompanied by vocal cord vibration are called voiced consonants and those without it are called voiceless consonants. Voiced sounds produce a periodic waveform from the vocal cord vibration, whereas unvoiced sounds produce a noise-like waveform.

When the vowel /i/ or /u/ is sandwiched between voiceless consonants (/p/, /t/, /k/, /s/, /f/, /h/, etc.), a phenomenon is observed in which the speaker keeps only the mouth shape and pronounces the vowel with breath alone, without vibrating the vocal cords. This is called vowel devoicing.

FIG. 4 shows the actual speech waveform of "shuzai shita" (取材した). As shown in FIG. 4, the vowel of "shi" in "shuzai shita" does not appear as a periodic waveform; the waveform instead looks as if the vowel had merged into the voiceless fricative "sh".

In a text-to-speech system as well, expressing vowel devoicing is important for improving perceived quality, and this determination is made by the vowel devoicing determination unit 203 in FIG. 3. A vowel determined here to be devoiced receives corresponding special processing in the phoneme power determination unit 204 and the phoneme duration calculation unit 205. Unlike a normal vowel, a devoiced vowel is sent to the waveform generation unit 103 with phoneme power 0 and phoneme duration 0.

The devoicing determination is performed according to rules such as the following.
(1) A vowel /i/ or /u/ sandwiched between voiceless consonants (including silence) is devoiced.
(2) However, it is not devoiced if it carries the accent nucleus.
(3) However, it is not devoiced if the preceding vowel has already been devoiced.
(4) However, it is not devoiced at the end of an interrogative sentence.
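Rules (1) to (4) can be read as one predicate over the context of a single vowel. The following is a minimal sketch under stated assumptions: the `VowelContext` record, the `VOICELESS` set (including the "sh" symbol), and the use of `None` to represent silence are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Voiceless consonants named in the description, plus "sh" (an assumption).
VOICELESS = frozenset({"p", "t", "k", "s", "sh", "h", "f"})


@dataclass
class VowelContext:
    vowel: str                      # "a", "i", "u", "e" or "o"
    prev_consonant: Optional[str]   # None stands for silence
    next_consonant: Optional[str]   # None stands for silence
    has_accent_nucleus: bool = False
    prev_vowel_devoiced: bool = False
    is_question_end: bool = False


def _voiceless_or_silence(consonant: Optional[str]) -> bool:
    return consonant is None or consonant in VOICELESS


def is_devoiced(ctx: VowelContext) -> bool:
    """Apply rules (1)-(4) to one vowel."""
    # Rule (1): only /i/ and /u/ between voiceless consonants (or silence).
    if ctx.vowel not in ("i", "u"):
        return False
    if not (_voiceless_or_silence(ctx.prev_consonant)
            and _voiceless_or_silence(ctx.next_consonant)):
        return False
    # Rules (2)-(4): exceptions that block devoicing.
    if ctx.has_accent_nucleus or ctx.prev_vowel_devoiced or ctx.is_question_end:
        return False
    return True
```

For instance, the "shi" of "deshita" would be described as `VowelContext("i", "sh", "t")`, which satisfies rule (1) and none of the exceptions.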

These rules, however, are derived from general tendencies, and in actual speech devoicing does not always occur as the rules predict. There are also differences between individuals. The rules are quite reasonable for speech at a normal rate, but devoicing generally becomes less likely as the speech rate slows. FIG. 5 shows speech waveforms of /kisoku/ at a normal rate and at a slow rate. The underlined portions of /kisoku/ satisfy the devoicing conditions; devoicing occurs at the normal rate, whereas in the slow utterance the vowel appears clearly in every case.

As one countermeasure for this phenomenon, Patent Document 1 proposes a speech synthesizer provided with vowel devoicing determination means that changes the criterion for deciding whether to perform vowel devoicing according to the speech rate. In that document a threshold is set on the speech rate, and at speech rates at or below the threshold no devoicing is performed even if the rule-based devoicing conditions are satisfied.
Patent Document 1: JP-A-11-116263
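The threshold scheme attributed to Patent Document 1 amounts to gating the rule-based result on the speech rate. A minimal sketch follows; the function name and the default threshold of 300 morae per minute are hypothetical, since no numeric value is given here.

```python
def devoice_with_rate_threshold(rule_says_devoice: bool,
                                rate_mora_per_min: float,
                                threshold: float = 300.0) -> bool:
    """Binary decision: devoice only if the rules allow it AND the rate is above the threshold.

    The default threshold is an illustrative placeholder, not a value from the source.
    """
    return rule_says_devoice and rate_mora_per_min > threshold
```

The result flips from devoiced to fully voiced at a single boundary, which is the all-or-nothing behaviour the present invention sets out to avoid.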

It is true that devoicing tends to become less likely as the speech rate slows, but this does not hold for every syllable. In syllables such as the one in "~deshita", the vowel is often pronounced devoiced even when the speech rate is slow. As a tendency, syllables with fricatives such as /s/, /f/, /h/, whose consonant part can be sustained, are easily devoiced even at low speech rates.

If devoicing is uniformly suppressed at slow speech rates, including for such syllables, the synthesized speech becomes halting and unnatural. Moreover, even for syllables in which devoicing becomes less likely as the speech rate slows, the nature of the utterance does not flip between two extremes at a threshold, from completely devoiced to not devoiced at all. Devoicing originally arises because, in a vowel interval sandwiched between consonants with a narrow opening area, the narrow opening is maintained and the expiratory airflow needed to vibrate the vocal cords for a voiced sound cannot be secured. It results from the structure of the human vocal organs and the speech production mechanism, and its character changes gradually. The conventional uniform devoicing determination, however, cannot express the gradual change in the nature of devoicing that accompanies a gradual change in speech rate, and unnatural synthesized speech inevitably results.

Accordingly, an object of the present invention is to provide a speech synthesizer that solves the conventional problems concerning devoicing in speech and yields natural synthesized speech.

The speech synthesizer according to the present invention comprises a segment dictionary in which speech segments, the basic units of speech, are registered; a parameter generation unit that generates synthesis parameters, including at least speech segments, phoneme durations, and fundamental frequency, for a phoneme/prosodic symbol string; and a waveform generation unit that generates a synthesized waveform from the synthesis parameters supplied by the parameter generation unit while referring to the segment dictionary. The parameter generation unit includes determination correction means that corrects the result of a devoicing determination made under ordinary devoicing rules according to the speech rate and the type of syllable, and this determination correction means makes the devoicing determination using two or more levels according to the degree of devoicing.

In the speech synthesizer according to the present invention, two or more devoicing levels are provided for the devoicing determination, and the devoicing level is assigned in steps according to the speech rate level. As a result, the degree of devoicing does not change abruptly around a speech rate boundary, and natural devoicing can be realized at any speech rate. In addition, because the devoicing level is set from the speech rate level using a setting rule suited to each syllable type, natural devoicing that reflects the characteristics of the consonant can be realized as the speech rate changes, yielding natural and smooth synthesized speech.
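The idea of graded levels assigned per syllable type can be sketched with a small lookup table. Everything concrete below is an assumption for illustration: the three-level scale (0 = voiced, 1 = partially devoiced, 2 = fully devoiced), the class names, the speech rate levels 1 (slowest) to 5 (fastest), and the boundary values. Only the general ordering, with fricatives staying devoiced at slower rates than plosives, follows the description.

```python
from typing import Dict, Tuple

# Hypothetical table: syllable class -> (lowest rate level giving partial
# devoicing, lowest rate level giving full devoicing). Illustrative values only.
LEVEL_TABLE: Dict[str, Tuple[int, int]] = {
    "s_fricative": (1, 2),
    "hf_fricative": (2, 3),
    "plosive": (3, 4),
}


def devoicing_level(syllable_class: str, rate_level: int) -> int:
    """Return 0 (voiced), 1 (partially devoiced) or 2 (fully devoiced)."""
    if syllable_class not in LEVEL_TABLE:
        raise ValueError(f"unknown syllable class: {syllable_class}")
    partial_from, full_from = LEVEL_TABLE[syllable_class]
    if rate_level >= full_from:
        return 2
    if rate_level >= partial_from:
        return 1
    return 0
```

With such a table, slowing the speech rate moves a candidate vowel through an intermediate level instead of switching it abruptly between devoiced and voiced.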

Embodiments of the present invention are described below with reference to the drawings. The drawings are only schematic, to the extent needed for the invention to be understood.

FIG. 1 is a block diagram showing the configuration of the parameter generation unit of the speech synthesizer according to the embodiment. The characteristic part of the present invention lies in the devoicing determination means and the method of realizing it. In this embodiment, conventional techniques can be used for the text analysis unit 101, the word dictionary 104, the waveform generation unit 103, and the segment dictionary 105 shown in FIG. 2.

In FIG. 1, the parameter generation unit 300 comprises an intermediate language analysis unit 301, a pitch pattern generation unit 302, a vowel devoicing primary determination unit 303, a vowel devoicing level determination unit 304, a phoneme power determination unit 305, and a phoneme duration calculation unit 306.

As in the conventional device, the inputs to the parameter generation unit 300 are the intermediate language with prosodic symbols added and a speech rate parameter specified by the user. Depending on the user's preferences and the usage scenario, voice quality parameters such as the height of the voice and the inflection, which indicates the magnitude of the intonation, may also be specified externally.

The intermediate language to be synthesized is input to the intermediate language analysis unit 301, and the user-specified speech rate parameter is input to the vowel devoicing level determination unit 304. Of the output data from the intermediate language analysis unit 301, parameters such as phrase delimiters, word delimiters, and accent symbols are input to the pitch pattern generation unit 302; parameters such as the phoneme symbol string, word delimiters, and accent symbols are input to the phoneme power determination unit 305 and the phoneme duration calculation unit 306; and parameters such as the phoneme symbol string and accent symbols are input to the vowel devoicing primary determination unit 303.

From the input parameters, the pitch pattern generation unit 302 calculates data such as the onset time and magnitude of each phrase command and the start time, end time, and magnitude of each accent command, and generates the pitch pattern. The generated pitch pattern is input to the waveform generation unit 103 (see FIG. 2). The pitch pattern generation process is not directly related to the present invention and is not described further.

The vowel devoicing primary determination unit 303 determines vowel devoicing based solely on the input text, such as its written form and accents, and outputs the result to the vowel devoicing level determination unit 304.

The vowel devoicing level determination unit 304 makes the final devoicing determination from the primary vowel devoicing determination result and the speech rate level specified by the user, and outputs the final result to the phoneme power determination unit 305 and the phoneme duration calculation unit 306.

The phoneme power determination unit 305 calculates the amplitude shape of each phoneme from the vowel devoicing determination result and the phoneme symbol string input from the intermediate language analysis unit 301, and outputs it to the waveform generation unit 103.

The phoneme duration calculation unit 306 calculates the duration of each phoneme from the vowel devoicing determination result and the phoneme symbol string input from the intermediate language analysis unit 301, and outputs the result to the waveform generation unit 103.

The operation of the speech synthesizer configured as described above is explained below. Because the difference from the prior art lies in the processing inside the parameter generation unit 300, the other processing is omitted.

First, the user specifies the speech rate level in advance. Speech rate is usually given as a parameter of the form "number of morae uttered per minute", and for convenience of use it is quantized into about 5 to 10 levels and supplied as a level value. Processing such as stretching of the phoneme durations is performed according to this level. Parameters for voice quality control, such as the height of the voice and the intonation, can also be specified. If the user specifies nothing, predetermined values (default values) are used.
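Quantizing a rate in morae per minute into a small number of levels could look like the following sketch. The range of 200 to 600 morae per minute, the choice of 5 uniform levels, and the name `quantize_rate` are assumptions for illustration; the description only says that about 5 to 10 levels are used.

```python
def quantize_rate(mora_per_min: float, lo: float = 200.0, hi: float = 600.0,
                  n_levels: int = 5) -> int:
    """Map a speech rate to a level in 1..n_levels using uniform bins.

    Rates outside [lo, hi] are clamped. The default range is a placeholder.
    """
    if n_levels < 1 or hi <= lo:
        raise ValueError("need n_levels >= 1 and hi > lo")
    clamped = min(max(mora_per_min, lo), hi)
    step = (hi - lo) / n_levels
    level = int((clamped - lo) / step) + 1
    return min(level, n_levels)  # the upper edge falls into the top level
```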

The speech rate control parameter specified by the user is sent to the vowel devoicing level determination unit 304 and the phoneme duration calculation unit 306. The other input, the intermediate language, is sent to the intermediate language analysis unit 301, which analyzes the input character string. The unit of analysis here is assumed to be one sentence.

From the intermediate language corresponding to one sentence, information such as the number of phrase commands and the mora count of each, and the number of accent commands and the mora count and accent type of each, is sent to the pitch pattern generation unit 302 as parameters for pitch pattern generation.

From the input parameters, the pitch pattern generation unit 302 calculates the magnitude and the rise and fall positions of each phrase command and accent command, and generates the pitch pattern using predefined response functions. The calculated pitch pattern is sent to the waveform generation unit 103 (see FIG. 2).

The accent symbols, the phoneme character string, and the like are also sent to the vowel devoicing primary determination unit 303, which determines vowel devoicing. This primary determination uses only the sequence of characters and the presence or absence of an accent nucleus, and the provisional result is sent to the vowel devoicing level determination unit 304.

The speech rate level specified by the user is also input to the vowel devoicing level determination unit 304, and the final determination is made together with the primary determination result. In this determination, the speech rate is compared against a specified value, and vowel devoicing is withheld only when the comparison shows the speech rate to be slow. After this processing, the determination result is sent to the phoneme power determination unit 305 and the phoneme duration calculation unit 306 for the final vowel devoicing determination. The processing of the vowel devoicing primary determination unit 303 and the vowel devoicing level determination unit 304 is described in detail later.

The phoneme power determination unit 305 calculates the waveform amplitude value of each phoneme or syllable from parameters such as the phoneme character string input from the intermediate language analysis unit 301, and outputs it to the waveform generation unit 103.

The phoneme duration calculation unit 306 calculates the duration of each phoneme or syllable from parameters such as the phoneme character string input from the intermediate language analysis unit 301.

Next, the vowel devoicing determination processing in the vowel devoicing primary determination unit 303 and the vowel devoicing level determination unit 304 is described in detail with reference to a flowchart. FIG. 6 is a flowchart showing a specific example of the vowel devoicing determination processing; in the figure, ST denotes each processing step of the flow.

The steps up to ST7 correspond to the processing of the vowel devoicing primary determination unit 303, and the subsequent steps correspond to the processing of the vowel devoicing level determination unit 304. The processing of the primary determination unit up to ST7 is the same as the devoicing determination of the prior art: following general devoicing rules, a primary devoicing determination is made for the vowel of interest from the surrounding phoneme environment, the presence or absence of an accent nucleus, and so on.

First, in step ST1, the syllable pointer i used to scan the input intermediate language syllable by syllable is initialized to 0. In step ST2, the vowel type of the i-th syllable (a, i, u, e, o), the consonant type of that syllable (voiceless consonant, or silence/voiced consonant), and the consonant type of the following syllable are set in V1, C1, and C2, respectively.

Steps ST3 and ST4 test the preconditions for a vowel to be devoiced, and steps ST5 to ST7 test the cases in which devoicing does not occur even though the preconditions of steps ST3 and ST4 are satisfied.

In step ST3, it is determined whether the vowel V1 of the syllable is "i" or "u". If it is, the process proceeds to step ST4; otherwise the syllable is judged not to be subject to devoicing and the process proceeds to step ST15.

In step ST4, it is determined whether the consonant C1 of the syllable is a voiceless consonant and the consonant C2 of the following syllable is a voiceless consonant or a sentence end or pause. If both conditions are satisfied, the process proceeds to step ST5, which determines whether the syllable carries an accent nucleus. Japanese expresses the accent position through changes in pitch height. Therefore, in a syllable with an accent nucleus, where a transition from high pitch to low pitch marks the accent position, devoicing, which has no pitch structure, is unlikely to occur.

In step ST5, it is determined whether the syllable is an accent nucleus. If it is, the syllable is judged not to be devoiced, and the process proceeds to step ST15. If the syllable has no accent nucleus, step ST6 determines whether the preceding syllable was devoiced. Devoicing tends not to occur in succession: if the preceding syllable has already been devoiced, devoicing normally does not occur in the syllable that follows.

In step ST7, it is determined whether the syllable is at the end of an interrogative sentence. Devoicing does not occur at the end of a question because the pitch rises sharply. For example, comparing 「〜します」 ("...shimasu") with 「〜します？」 ("...shimasu?"), the final syllable of the latter, interrogative sentence is clearly uttered with emphatic intent, so devoicing does not occur.

Syllables that satisfy the conditions of steps ST3 to ST7 and reach ST8 are candidates for devoicing. From ST8 onward, the devoicing level is determined according to utterance speed and syllable type, which is the characteristic feature of the present invention.
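By way of illustration only, the primary determination of steps ST3 to ST7 may be sketched as a single predicate. The `Syllable` type, the `VOICELESS` consonant inventory, and the function name are hypothetical and do not appear in the specification; a following syllable of `None` is assumed to stand for a sentence end or pause.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed inventory of voiceless consonants; the specification does not list one.
VOICELESS = {"k", "s", "t", "h", "f", "p", "sh", "ch", "ts"}


@dataclass
class Syllable:
    consonant: Optional[str]      # C1 (None when the syllable has no consonant)
    vowel: str                    # V1
    accent_nucleus: bool = False  # True if the accent nucleus falls on this syllable
    question_final: bool = False  # True if this syllable ends an interrogative sentence


def is_devoicing_candidate(syl: Syllable,
                           next_syl: Optional[Syllable],
                           prev_devoiced: bool) -> bool:
    """Return True if the syllable would reach ST8 as a devoicing candidate."""
    # ST3: only the vowels "i" and "u" are subject to devoicing.
    if syl.vowel not in ("i", "u"):
        return False
    # ST4: C1 must be voiceless, and C2 must be voiceless or a sentence end/pause.
    if syl.consonant not in VOICELESS:
        return False
    if next_syl is not None and next_syl.consonant not in VOICELESS:
        return False
    # ST5: a syllable carrying the accent nucleus is not devoiced.
    if syl.accent_nucleus:
        return False
    # ST6: devoicing does not normally occur in consecutive syllables.
    if prev_devoiced:
        return False
    # ST7: the final syllable of a question is not devoiced.
    if syl.question_final:
        return False
    return True
```

For instance, under these assumptions a syllable "ki" followed by "so" is a candidate, whereas the same syllable followed by "ra" is not, because the following consonant is voiced.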

First, the type of the syllable containing the candidate vowel is identified. Preferably, syllables could be classified more finely with a rule for each class, but to make the processing structure easy to understand, this embodiment classifies syllables into three types and sets the utterance speed decision levels for each type individually.

As described above, syllables whose consonant part C1 is a fricative are often pronounced devoiced even when the utterance speed is slow. When the consonant part C1 is a plosive such as /p/, /t/, or /k/, devoicing becomes less likely as the utterance speed slows. With the fricatives /s/, /h/, and /f/, for which the opening area during articulation stays narrow even when the duration is lengthened, devoicing is likely even at a slow utterance speed. Among these, /h/ and /f/, whose opening area is relatively wide, devoice less readily than /s/. For a plosive, the opening area becomes extremely narrow at the moment of the burst but widens afterward. Because it is difficult to sustain a narrow opening area while speaking, when the utterance speed slows, the opening area widens over time and vocal fold vibration begins.

Reflecting these properties, this embodiment classifies syllables into three groups: C1 = /s/ (ST8, ST10), C1 = /h/ or /f/ (ST9, ST11), and all others (ST12). For each group, devoicing decision levels TH1 and TH2 are set according to utterance speed. Here, TH1 and TH2 are utterance speed decision thresholds in units of morae per minute. In the subsequent steps, TH1 is used to decide the boundary between (no devoicing) and (quasi-devoicing), and TH2 is used to decide the boundary between (quasi-devoicing) and (devoicing).

Specifically, the thresholds are set to TH1 = Na1 and TH2 = Na2 for the first class (C1 = /s/), to TH1 = Nb1 and TH2 = Nb2 for the second class (C1 = /h/ or /f/), and to TH1 = Nc1 and TH2 = Nc2 for the third class (all others). As is clear from the properties of each syllable type, these thresholds are set so that Na1 ≤ Nb1 ≤ Nc1, Na2 ≤ Nb2 ≤ Nc2, Na1 ≤ Na2, Nb1 ≤ Nb2, and Nc1 ≤ Nc2.
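As a non-limiting sketch, the threshold selection of steps ST8 to ST12 may be expressed as a table lookup keyed by consonant class. The numeric values below are hypothetical placeholders chosen only to satisfy the stated ordering; the specification gives no concrete values for Na1 to Nc2, and the names `THRESHOLDS`, `consonant_class`, `select_thresholds`, and `check_ordering` are likewise illustrative.

```python
from typing import Dict, Tuple

# (TH1, TH2) per class, in morae per minute.
# HYPOTHETICAL values: the specification only constrains their ordering.
THRESHOLDS: Dict[str, Tuple[int, int]] = {
    "s":     (300, 400),  # first class:  (Na1, Na2)
    "h_f":   (400, 500),  # second class: (Nb1, Nb2)
    "other": (500, 600),  # third class:  (Nc1, Nc2)
}


def consonant_class(c1: str) -> str:
    """Map the consonant C1 to one of the three syllable classes."""
    if c1 == "s":
        return "s"        # ST8 -> ST10
    if c1 in ("h", "f"):
        return "h_f"      # ST9 -> ST11
    return "other"        # ST12


def select_thresholds(c1: str) -> Tuple[int, int]:
    """Return (TH1, TH2) for a devoicing candidate whose consonant is C1."""
    return THRESHOLDS[consonant_class(c1)]


def check_ordering(table: Dict[str, Tuple[int, int]]) -> bool:
    """Verify Na1<=Nb1<=Nc1, Na2<=Nb2<=Nc2, and TH1<=TH2 within each class."""
    (na1, na2), (nb1, nb2), (nc1, nc2) = table["s"], table["h_f"], table["other"]
    return (na1 <= nb1 <= nc1 and na2 <= nb2 <= nc2
            and na1 <= na2 and nb1 <= nb2 and nc1 <= nc2)
```

Keeping the thresholds in one table makes the ordering constraint easy to validate whenever the values are tuned.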

In the steps from ST13 onward, the devoicing level is determined using the utterance speed decision thresholds TH1 and TH2 that have been set. The devoicing level does not split the decision for a given utterance speed into the two extremes of devoicing or not devoicing at all; instead, multiple levels are provided according to the degree of devoicing. This embodiment is described with one intermediate level, "quasi-devoicing", placed between "not devoiced" and "fully devoiced", for a total of three levels.

A devoiced vowel, unlike a normal vowel, is realized with phoneme power 0 and phoneme duration 0. "Quasi-devoicing" does not eliminate the vowel as devoicing does; it retains the vowel to some degree, with both power and duration smaller than those of a normal vowel. In this embodiment, a flag variable uvflag[i] representing the vowel devoicing level is provided for each syllable, and the utterance speed decision thresholds TH1 and TH2 are compared with the utterance speed Tlevel to make the final vowel devoicing decision, which is one of: no devoicing (uvflag[i] = 0), quasi-devoicing (uvflag[i] = 2), or devoicing (uvflag[i] = 1).
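The level decision from ST13 onward may be sketched, by way of example, as a two-threshold comparison. The flag values 0, 2, and 1 follow the text. The direction of the comparison (a faster utterance speed yields stronger devoicing) is inferred from the surrounding description, and the treatment of values exactly equal to a threshold is an assumption, since the specification does not state it.

```python
# Flag values as given in the text.
UV_NONE = 0   # no devoicing
UV_FULL = 1   # devoicing
UV_QUASI = 2  # quasi-devoicing


def devoicing_level(tlevel: float, th1: float, th2: float) -> int:
    """Return the uvflag value for a devoicing candidate.

    tlevel is the utterance speed in morae per minute; th1 <= th2 is expected.
    ASSUMPTION: a speed exactly equal to a threshold falls in the stronger level.
    """
    if tlevel < th1:
        return UV_NONE   # too slow: the vowel is kept as a normal vowel
    if tlevel < th2:
        return UV_QUASI  # intermediate: reduced power and duration
    return UV_FULL       # fast enough: power 0, duration 0
```

With hypothetical thresholds of 300 and 400 morae per minute, a speed of 250 gives no devoicing, 350 gives quasi-devoicing, and 450 gives devoicing.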

In step ST18, the syllable counter i is incremented by 1. In step ST19, it is determined whether the syllable counter i is equal to or greater than the total number of morae sum_mora (i ≥ sum_mora). If i < sum_mora, processing is judged to be incomplete, and the process returns to step ST2 to repeat the same processing for the next syllable.

The above processing ends once it has been performed on all syllables of the input text, that is, when the syllable counter i reaches the total number of morae sum_mora in step ST19 (i ≥ sum_mora).
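The overall loop of steps ST1, ST18, and ST19 may be sketched as follows. The per-syllable decision of steps ST2 to ST17 is passed in as a hypothetical callback `decide`, and the length of the syllable sequence is taken to be sum_mora. Whether a quasi-devoiced preceding syllable counts as "devoiced" for the test of step ST6 is not stated in the specification; this sketch assumes that any nonzero flag does.

```python
from typing import Callable, List, Sequence


def assign_uvflags(syllables: Sequence[object],
                   decide: Callable[[int, bool], int]) -> List[int]:
    """Fill uvflag[i] for every syllable, mirroring ST1, ST18, and ST19.

    decide(i, prev_devoiced) stands in for ST2-ST17 and returns
    0 (no devoicing), 2 (quasi-devoicing), or 1 (devoicing).
    """
    sum_mora = len(syllables)
    uvflag = [0] * sum_mora
    i = 0                                  # ST1: initialize the syllable pointer
    while i < sum_mora:                    # ST19: continue while i < sum_mora
        # ASSUMPTION: any nonzero flag on the preceding syllable blocks ST6.
        prev_devoiced = i > 0 and uvflag[i - 1] != 0
        uvflag[i] = decide(i, prev_devoiced)
        i += 1                             # ST18: advance to the next syllable
    return uvflag
```

Because each decision depends on the flag already stored for the preceding syllable, the syllables are processed strictly in order.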

As described above, according to this embodiment, multiple devoicing levels are provided for the devoicing decision, and the devoicing level is assigned in stages according to the utterance speed level. As a result, the degree of devoicing does not change abruptly around a boundary in utterance speed, and natural devoicing can be achieved at any utterance speed. In addition, because the devoicing level is set from the utterance speed level using a rule suited to each syllable type, natural devoicing that reflects the characteristics of the consonants is achieved as the utterance speed changes, and natural, smooth synthesized speech can be obtained.

It is a block diagram showing the configuration of the parameter generation unit of the speech synthesizer according to the embodiment.
It is a functional block diagram of text-to-speech conversion processing.
It is a functional block diagram showing the configuration of the parameter generation unit in the prior art.
It is a figure showing the waveform of an actual utterance of 「取材した」 ("shuzai shita").
It is a figure showing the waveform of an actual utterance of 「規則」 ("kisoku").
It is a flowchart of the vowel devoicing determination according to the embodiment.

Explanation of symbols

300 Parameter generation unit
301 Intermediate language analysis unit
302 Pitch pattern generation unit
303 Vowel devoicing primary determination unit
304 Vowel devoicing level determination unit
305 Vowel power determination unit
306 Phoneme duration calculation unit

Claims

A speech synthesizer comprising: a speech unit dictionary in which speech units, which are the basic units of speech, are registered; a parameter generation unit that generates synthesis parameters including at least speech units, phoneme durations, and fundamental frequencies for a phoneme/prosodic symbol string; and a waveform generation unit that generates a synthesized waveform from the synthesis parameters supplied by the parameter generation unit while referring to the speech unit dictionary,
wherein the parameter generation unit includes a vowel devoicing determination unit that determines whether or not to devoice a vowel, and the vowel devoicing determination unit makes the determination using at least two devoicing levels according to the degree of devoicing.

The speech synthesizer according to claim 1, wherein the devoicing levels include at least one intermediate level between "not devoiced" and "devoiced".

The speech synthesizer according to claim 1, wherein the vowel devoicing determination unit includes primary determination means that performs a vowel devoicing determination based on the arrangement of the character string and the presence or absence of an accent nucleus, and has a devoicing level determination criterion for selecting the devoicing level according to the result of the primary determination means and the utterance speed designated by the user.

The speech synthesizer according to claim 3, wherein the devoicing level determination criterion is determined according to the type of syllable to be devoiced.