JP2003140678A

JP2003140678A - Voice quality control method for synthesized voice and voice synthesizer

Info

Publication number: JP2003140678A
Application number: JP2001333991A
Authority: JP
Inventors: Yumiko Kato; 弓子加藤; Katsuyoshi Yamagami; 勝義山上; Takahiro Kamai; 孝浩釜井
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2001-10-31
Filing date: 2001-10-31
Publication date: 2003-05-16
Anticipated expiration: 2021-10-31
Also published as: JP3900892B2

Abstract

PROBLEM TO BE SOLVED: To synthesize clear, emphasized speech for words or parts that convey the semantic content of the information, and indistinct speech for words or parts that do not directly convey semantic content but indicate sentence structure. SOLUTION: A speech synthesizer that synthesizes speech according to a sound source-vocal tract model comprises a language processing part that outputs part-of-speech information or a voice quality tag in addition to reading information and accent information, a prosody control part that generates prosody information from the reading information and accent information, and an acoustic processing part that adjusts sound source parameters and vocal tract parameters according to the part-of-speech information or voice quality tag and the prosody information, converts the voice quality, and generates a speech waveform based on the prosody information.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesis method and a speech synthesis apparatus for converting text into speech.

[0002]

[Description of the Related Art] In conventional speech synthesizers, typified by the waveform superposition method, words that convey the semantic content of an utterance, such as independent words, and words that indicate syntactic structure, such as adjunct words, are uttered with the same clarity and the same intensity. As a result, the listener cannot focus attention, and the speech becomes tiring to listen to for long periods.

[0003]

[Problems to Be Solved by the Invention] However, in speech synthesis by the waveform superposition method, changing the sound quality between independent words and adjunct words requires holding separate speech segments for independent words and for adjunct words, which greatly increases the volume of segment data.

[0004] In view of the above problem, an object of the present invention is to provide a sound quality adjustment method for synthesized speech and a speech synthesizer that change the voice quality in accordance with linguistic information or semantic information without increasing the amount of data.

[0005]

[Means for Solving the Problems] A first means for achieving the above object is a sound quality adjustment method, in a speech synthesis method that synthesizes speech by controlling vocal tract parameters and sound source parameters, in which the voice quality is converted by controlling at least one of the vocal tract parameters and the sound source parameters based on input language information.

[0006] A second means is a sound quality adjustment method, in a speech synthesis method that synthesizes speech by controlling vocal tract parameters and sound source parameters, in which the voice quality is converted by controlling at least one of the vocal tract parameters and the sound source parameters based on input language information and semantic information.

[0007] A third means is the sound quality adjustment method in which the language information includes part-of-speech information, and a sound source opening degree parameter among the sound source parameters is controlled based on the part-of-speech information.

[0008] A fourth means is the sound quality adjustment method in which the language information includes conjugation information, and a sound source opening degree parameter among the sound source parameters is controlled based on the conjugation information.

[0009] A fifth means is the sound quality adjustment method in which the language information includes part-of-speech information, and a formant center frequency parameter among the vocal tract parameters is controlled based on the part-of-speech information.

[0010] A sixth means is the sound quality adjustment method in which the language information includes conjugation information, and a formant center frequency parameter among the vocal tract parameters is controlled based on the conjugation information.

[0011] A seventh means is the sound quality adjustment method in which the language information includes part-of-speech information, and a formant bandwidth parameter among the vocal tract parameters is controlled based on the part-of-speech information.

[0012] An eighth means is the sound quality adjustment method in which the language information includes conjugation information, and a formant bandwidth parameter among the vocal tract parameters is controlled based on the conjugation information.

[0013] A ninth means is a speech synthesizer that synthesizes speech based on vocal tract parameters and sound source parameters, comprising: prosody control means for generating prosody information based on at least one of a phonetic symbol string and language information; and speech generation means for converting the voice quality by controlling at least one of the vocal tract parameters and the sound source parameters based on the language information, and for synthesizing speech based on the phonetic symbol string and the prosody information.

[0014] A tenth means is a speech synthesizer that synthesizes speech based on vocal tract parameters and sound source parameters, comprising: prosody control means for generating prosody information based on at least one of a phonetic symbol string, language information, and semantic information; and speech generation means for converting the voice quality by controlling at least one of the vocal tract parameters and the sound source parameters based on at least one of the language information and the semantic information, and for synthesizing speech based on the phonetic symbol string and the prosody information.

[0015]

[Embodiments of the Invention] The sound quality adjustment method and the speech synthesizer of the present invention are described below with reference to embodiments.

[0016] (Embodiment 1) FIG. 1 is a functional block diagram showing the conceptual configuration of the speech synthesizer according to Embodiment 1 of the present invention and the format of the input/output data of each unit.

[0017] In FIG. 1, reference numeral 110 denotes a language processing unit that takes mixed kanji-kana text as input, performs morphological and syntactic analysis, and outputs the reading, accent information, and independent-word/adjunct-word discrimination information. Reference numeral 120 denotes a prosody control unit that generates the duration, pitch, and power of each phoneme (prosody information) according to the reading and accent information output from the language processing unit 110. Reference numeral 130 denotes an acoustic processing unit that generates a speech waveform by controlling the parameters of a sound source-vocal tract model according to the prosody information output from the prosody control unit 120 and the independent-word/adjunct-word discrimination information output from the language processing unit 110.
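The data handed from the language processing unit to the prosody control unit, and on to the acoustic processing unit, can be pictured with a small sketch. This is only an illustration under stated assumptions: the class names `Word` and `PhonemeSpec`, the function `prosody_control`, the constant prosody values, and the treatment of each katakana character as one phoneme are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical output item of the language processing unit (110)."""
    reading: str       # katakana reading
    is_adjunct: bool   # True for an adjunct word, False for an independent word


@dataclass
class PhonemeSpec:
    """Hypothetical per-phoneme record passed to the acoustic processing unit (130)."""
    symbol: str
    duration_ms: float
    pitch_hz: float
    power: float
    quality: str       # "standard" or "indistinct"


def prosody_control(words: List[Word]) -> List[PhonemeSpec]:
    """Attach prosody and a voice quality label to every phoneme.

    The prosody values are constant placeholders; a real prosody control
    unit would derive them from mora count and accent type.
    """
    specs: List[PhonemeSpec] = []
    for word in words:
        quality = "indistinct" if word.is_adjunct else "standard"
        for symbol in word.reading:  # assumption: one character = one phoneme
            specs.append(PhonemeSpec(symbol, 100.0, 120.0, 1.0, quality))
    return specs
```

For example, `prosody_control([Word("アシタ", False), Word("ワ", True)])` yields four records, of which only the last carries the indistinct label.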

[0018] The operation of the speech synthesizer configured as described above is as follows. The language processing unit 110 performs morphological and syntactic analysis on the input mixed kanji-kana text (101), "明日は全国的に晴れるところが多く、日中の気温は最高気温が30度を超えるところが多くなる見込みです。" ("Tomorrow it will be sunny in many places nationwide, and daytime high temperatures are expected to exceed 30 degrees in many places."), and outputs language information (102) containing the reading, accent phrase boundaries, accents, and adjunct-word marks. The language information 102 represents phonemes in katakana, marks accent phrases with line breaks, marks accents with apostrophes, and marks adjunct words by enclosing their phoneme symbols in braces.

The prosody control unit 120 determines the pitch and power of each phoneme according to the number of morae and the accent type of each accent phrase, for example as in Japanese Patent Laid-Open No. 12-075883, determines the duration of each phoneme from the phoneme sequence, and generates prosody information consisting of duration, pitch, and power for each phoneme. In addition, based on the adjunct-word information received from the language processing unit 110, it generates voice quality information for each phoneme, specifying the standard voice quality for phonemes in independent words and an indistinct voice quality for phonemes in adjunct words, and outputs the per-phoneme prosody information and voice quality information (103).

The acoustic processing unit 130 synthesizes speech according to the per-phoneme prosody information and voice quality information (103). For a phoneme for which the indistinct voice quality is specified, the formant frequencies of the vowel part are moved toward the centroid of the characteristic formant frequencies of the vowels, and the formant bandwidths are set to twice the standard. The energy is adjusted at this point so that the formant energy remains the same as with the standard formant bandwidth. By modifying the standard voice quality parameters in this way, indistinct speech is produced phoneme by phoneme; the parameters are then concatenated, the sound source parameters are modified to match the prosody information, and the speech is synthesized.
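The parameter change for the indistinct voice quality (formants pulled toward the vowel centroid, bandwidth doubled, energy held constant) can be sketched as follows. The specification does not give a resonator model, a degree of centralization, or vowel formant values, so everything beyond the factor of 2 is an assumption made for illustration: each formant is modeled as a two-pole digital resonator, `pull=0.5` is an arbitrary centralization amount, the sampling rate is 16 kHz, the table `ILLUSTRATIVE_VOWELS` holds rough placeholder values, and the energy reference is taken to be a resonator at the moved frequency with the standard bandwidth.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Formant:
    freq_hz: float       # formant center frequency
    bandwidth_hz: float  # formant bandwidth
    gain: float = 1.0    # linear amplitude gain of the resonator


# Rough (F1, F2) placeholders for five vowels; NOT taken from the patent.
ILLUSTRATIVE_VOWELS = {
    "a": (800.0, 1300.0), "i": (300.0, 2300.0), "u": (350.0, 1400.0),
    "e": (500.0, 1900.0), "o": (500.0, 900.0),
}


def vowel_centroid(table):
    """Mean of each formant across all vowels in the table."""
    rows = list(table.values())
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def impulse_energy(f, fs=16000.0, n=4000):
    """Impulse-response energy of a two-pole resonator (assumed model)."""
    r = math.exp(-math.pi * f.bandwidth_hz / fs)
    theta = 2.0 * math.pi * f.freq_hz / fs
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y1 = y2 = energy = 0.0
    for i in range(n):
        x = f.gain if i == 0 else 0.0
        y = x + a1 * y1 + a2 * y2
        energy += y * y
        y2, y1 = y1, y
    return energy


def make_indistinct(formants, centroid_hz, pull=0.5, bw_scale=2.0, fs=16000.0):
    """Centralize formants, widen bandwidths, and compensate the energy."""
    result = []
    for f, c in zip(formants, centroid_hz):
        moved = replace(f, freq_hz=f.freq_hz + pull * (c - f.freq_hz),
                        bandwidth_hz=f.bandwidth_hz * bw_scale)
        reference = replace(f, freq_hz=moved.freq_hz)  # standard bandwidth
        gain = f.gain * math.sqrt(
            impulse_energy(reference, fs) / impulse_energy(moved, fs))
        result.append(replace(moved, gain=gain))
    return result
```

Computing the compensation gain from the simulated impulse response, rather than from a closed-form approximation, keeps the sketch valid for any bandwidth scale, including values below 1.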

[0019] As described above, the speech synthesizer of this embodiment synthesizes only the phonemes contained in adjunct words with an indistinct voice quality, so that the independent words, which convey the semantic content, are uttered with a relatively clear voice quality. The listener can therefore attend naturally to the semantic content, and natural synthesized speech that is less tiring to listen to can be generated.

[0020] (Embodiment 2) FIG. 2 is a functional block diagram showing the conceptual configuration of the speech synthesizer according to Embodiment 2 of the present invention and the format of the input/output data of each unit.

[0021] In FIG. 2, reference numeral 210 denotes an FM teletext receiving unit that receives FM radio waves and outputs the character data multiplexed onto them, and 220 denotes a traffic information extraction unit that extracts traffic information from the character data output by the FM teletext receiving unit 210 and outputs it. Reference numeral 230 denotes a sound-quality-tagged example sentence database that holds example sentences for traffic information together with voice quality designation information (emphasized or indistinct) predetermined for each example sentence. Reference numeral 240 denotes a language information output unit that matches the traffic information output by the traffic information extraction unit 220 against the data in the sound-quality-tagged example sentence database 230 and outputs the reading, accent information, and sound quality information for speech output. The prosody control unit 120 and the acoustic processing unit 130 are the same as in FIG. 1.

[0022] The operation of the speech synthesizer configured as described above is as follows. The FM teletext receiving unit 210 receives FM radio waves, extracts the character data, and outputs it. The traffic information extraction unit 220 refers to the sound-quality-tagged example sentence database 230, extracts from the character data output by the FM teletext receiving unit 210 only the information that has a traffic information pattern, and outputs a character string (201). The language information output unit 240 refers to the sound-quality-tagged example sentence database 230, matches constituent elements such as the route, direction, and starting point, and selects the example sentence best suited to the character string 201. The extraction of traffic information and the selection of the example sentence are performed by matching, for example as shown in Japanese Patent Laid-Open No. 08-339490. The language information output unit 240 fits the constituent elements of the character string 201 into the example sentence to generate a complete sentence, and outputs language information (202) containing the reading, accent phrase boundaries, accents, and sound quality tags of that sentence. The language information 202 represents phonemes in katakana, marks accent phrases with line breaks, marks accents with apostrophes, marks phoneme strings to which emphasized speech is applied by enclosing them in < >, and marks phoneme strings to which indistinct speech is applied by enclosing them in braces.

The prosody control unit 120 determines the pitch and power of each phoneme according to the number of morae and the accent type of each accent phrase, for example as in Japanese Patent Laid-Open No. 12-075883, determines the duration of each phoneme from the phoneme sequence, and generates prosody information consisting of duration, pitch, and power for each phoneme. In addition, based on the sound quality tags output from the language information output unit 240, it attaches an emphasis tag to phonemes for which emphasized speech is specified and an indistinct-speech tag to phonemes for which indistinct speech is specified, and outputs the per-phoneme prosody information and voice quality information (103).

The acoustic processing unit 130 synthesizes speech according to the per-phoneme prosody information and voice quality information (103). For a phoneme with an emphasis tag, the power of the consonant part is set to 1.1 times the standard and the formant bandwidth of the vowel part to 0.8 times the standard. For a phoneme for which the indistinct voice quality is specified, the formant frequencies of the vowel part are moved toward the centroid of the characteristic formant frequencies of the vowels, and the formant bandwidths are set to twice the standard. For both the emphasized and the indistinct parameter changes, the energy is adjusted so that the formant energy remains the same as with the standard formant bandwidth. By modifying the standard voice quality parameters in this way, emphasized speech and indistinct speech are produced phoneme by phoneme; the parameters are then concatenated, the sound source parameters are modified to match the prosody information, and the speech is synthesized.
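The tagged notation of the language information 202 and the multipliers stated above can be combined into a short sketch. Several points are assumptions made for illustration: the function and class names are hypothetical, each katakana character is treated as one unit (real mora segmentation would merge small kana with the preceding character), the apostrophe is taken to follow the accented unit, and the sample string is invented rather than taken from FIG. 2. Only the factors 1.1, 0.8, and 2 come from the description.

```python
from dataclasses import dataclass

STANDARD, EMPHASIZED, INDISTINCT = "standard", "emphasized", "indistinct"

# Multipliers relative to the standard voice quality, as given in the text.
SCALES = {
    STANDARD:   {"consonant_power": 1.0, "formant_bandwidth": 1.0},
    EMPHASIZED: {"consonant_power": 1.1, "formant_bandwidth": 0.8},
    INDISTINCT: {"consonant_power": 1.0, "formant_bandwidth": 2.0},
}


@dataclass
class Unit:
    kana: str
    quality: str
    accented: bool = False


def parse_phrase(line):
    """Parse one accent phrase written in the tagged katakana notation."""
    units = []
    quality = STANDARD
    for ch in line:
        if ch == "<":
            quality = EMPHASIZED
        elif ch == "{":
            quality = INDISTINCT
        elif ch in ">}":
            quality = STANDARD
        elif ch in "'’":
            if units:  # assumption: the mark follows the accented unit
                units[-1].accented = True
        elif not ch.isspace():
            units.append(Unit(ch, quality))
    return units


def parse_language_info(text):
    """Split on line breaks (accent phrases) and parse each phrase."""
    return [parse_phrase(line) for line in text.splitlines() if line.strip()]


def scales_for(unit):
    """Look up the parameter multipliers for a parsed unit."""
    return SCALES[unit.quality]
```

Given the invented input `"<ジュ'ータイ>\n{デス}"`, the parser returns two accent phrases: the first is entirely emphasized with the accent on its second character, and the second is entirely indistinct.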

[0023] As described above, the speech synthesizer of this embodiment synthesizes the constituent elements that the listener should attend to with emphasized speech, and the phonemes of parts with little bearing on the semantic content with an indistinct voice quality. The listener can therefore attend naturally to the semantic content, and natural synthesized speech that is less tiring to listen to can be generated.

[0024]

[Effects of the Invention] By presenting the parts that convey the semantic content the listener should attend to in emphasized, clear speech, and the parts that do not directly convey semantic content in indistinct speech, the listener can attend naturally to the parts that convey the semantic content, and natural synthesized speech that does not tire the listener can be generated.

[Brief description of drawings]

FIG. 1 is a block diagram showing the conceptual configuration of the speech synthesizer of Embodiment 1 and the input/output data format of each unit.

FIG. 2 is a block diagram showing the conceptual configuration of the speech synthesizer of Embodiment 2 and the input/output data format of each unit.

[Explanation of symbols]

110 Language processing unit
120 Prosody control unit
130 Acoustic processing unit
210 FM teletext receiving unit
220 Traffic information extraction unit
230 Sound-quality-tagged example sentence database

(Continuation of front page) (72) Inventor: Takahiro Kamai, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma-shi, Osaka. F-term (reference): 5D045 AA00

Claims

[Claims]

1. A method for adjusting the sound quality of synthesized speech, in a speech synthesis method that synthesizes speech by controlling vocal tract parameters and sound source parameters, wherein the voice quality is converted by controlling at least one of the vocal tract parameters and the sound source parameters based on input language information.

2. A method for adjusting the sound quality of synthesized speech, in a speech synthesis method that synthesizes speech by controlling vocal tract parameters and sound source parameters, wherein the voice quality is converted by controlling at least one of the vocal tract parameters and the sound source parameters based on input language information and semantic information.

3. The method for adjusting the sound quality of synthesized speech according to claim 1, wherein the language information includes part-of-speech information, and a sound source opening degree parameter among the sound source parameters is controlled based on the part-of-speech information.

4. The method for adjusting the sound quality of synthesized speech according to claim 1, wherein the language information includes conjugation information, and a sound source opening degree parameter among the sound source parameters is controlled based on the conjugation information.

5. The method for adjusting the sound quality of synthesized speech according to claim 1, wherein the language information includes part-of-speech information, and a formant center frequency parameter among the vocal tract parameters is controlled based on the part-of-speech information.

6. The method for adjusting the sound quality of synthesized speech according to claim 1, wherein the language information includes conjugation information, and a formant center frequency parameter among the vocal tract parameters is controlled based on the conjugation information.

7. The method for adjusting the sound quality of synthesized speech according to claim 1, wherein the language information includes part-of-speech information, and a formant bandwidth parameter among the vocal tract parameters is controlled based on the part-of-speech information.

8. The method for adjusting the sound quality of synthesized speech according to claim 1, wherein the language information includes conjugation information, and a formant bandwidth parameter among the vocal tract parameters is controlled based on the conjugation information.

9. A speech synthesizer that synthesizes speech based on vocal tract parameters and sound source parameters, comprising: prosody control means for generating prosody information based on at least one of a phonetic symbol string and language information; and speech generation means for converting the voice quality by controlling at least one of the vocal tract parameters and the sound source parameters based on the language information, and for synthesizing speech based on the phonetic symbol string and the prosody information.

10. A speech synthesizer that synthesizes speech based on vocal tract parameters and sound source parameters, comprising: prosody control means for generating prosody information based on at least one of a phonetic symbol string, language information, and semantic information; and speech generation means for converting the voice quality by controlling at least one of the vocal tract parameters and the sound source parameters based on at least one of the language information and the semantic information, and for synthesizing speech based on the phonetic symbol string and the prosody information.