JPH09244680A - Device and method for prosody control - Google Patents

Device and method for prosody control

Info

Publication number
JPH09244680A
Authority
JP
Japan
Prior art keywords
word
prosody
voice
sentence
range
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP5315996A
Other languages
Japanese (ja)
Other versions
JP3241582B2 (en)
Inventor
Takahiko Niimura
貴彦 新村
Keiji Hayashi
慶士 林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
N T T DATA TSUSHIN KK
NTT Data Corp
Original Assignee
N T T DATA TSUSHIN KK
NTT Data Communications Systems Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by N T T DATA TSUSHIN KK, NTT Data Communications Systems Corp
Priority to JP05315996A
Publication of JPH09244680A
Application granted
Publication of JP3241582B2
Anticipated expiration
Current legal status: Expired - Fee Related


Abstract

PROBLEM TO BE SOLVED: To obtain natural-sounding speech as the synthesis result when speech is synthesized by combining the speech of a sentence and the speech of a word that were uttered in different speaking styles.
SOLUTION: In addition to a database 1 in which the speech of various words is registered and a database 5 in which the speech of the fixed parts of various sentences is registered, a table 3 holding the prosody of those words and a table 7 holding the prosody ranges that the word portions of those sentences should have are prepared. The speech and prosody of a desired word are read out, as are the speech of the fixed part of a desired sentence and the prosody range of its word portion. The prosody 23 of the word is compared with the prosody range 27 of the sentence's word portion to calculate the difference 29 between them. Prosody control 17 based on the difference 29 is then applied to the speech 21 of the word, converting it into speech 31 whose prosody falls within the prosody range 27. Finally, the converted word speech 31 is combined with the speech 25 of the fixed part of the sentence to synthesize the speech 33 of the complete sentence.

Description

Detailed Description of the Invention

[0001]

Technical Field of the Invention

The present invention relates generally to speech synthesis technology and, more particularly, to a prosody control technique suited to synthesizing natural-sounding speech for the kind of sentence that, like a system response sentence, is a fixed template in which only specific words are replaced.

[0002]

Description of the Related Art

The sentences used in the recorded prompts of widely deployed automatic voice response systems are generally called system response sentences, and they are often fixed templates in which only specific words can be replaced. For example, in the system response sentence "There was a transfer from ABC Bank," the bank-name word is replaced with different words such as "ABC" or "CDF".

[0003] When such a system response sentence is synthesized, the word and the sentence have conventionally been joined by an edit synthesis method. The edit synthesis method pastes an arbitrary word such as "ABC" or "CDF" into the word portion "XX" of a fixed sentence such as "There was a transfer from XX Bank" in the example above.

[0004]

Problems to be Solved by the Invention

In conventional speech synthesis processing using the edit synthesis method, the speech of a word such as "ABC" is simply concatenated with the speech of a fixed sentence part such as "There was a transfer from ... Bank." However, the same sentence or word uttered in different speaking styles has subtly different prosody and therefore gives a different auditory impression. For example, if the speech of a conversation among several people is compared with the speech of a passage read aloud by one person, the same sentence sounds different because the manner of speaking, that is, the speaking style, differs. Likewise, the speech of a word uttered in isolation sounds different from the speech of the same word uttered as part of a sentence. Because the individual words and sentences registered in a speech synthesis system are normally recorded in different speaking styles, simply joining the word speech to the sentence speech, as in the prior art, makes the speech of the whole sentence sound unnatural.

[0005] Under this prior art, making response sentences with diverse speaking styles sound natural would require registering, for every word, many speech variants whose prosody matches the various registered sentences. In small systems in particular, however, database capacity limits make it impossible to prepare such a variety of recordings for every registered word. From the standpoint of voice recording as well, it is difficult to utter a word with every conceivable prosody.

[0006] Accordingly, an object of the present invention is to make the synthesized result sound natural when the speech of a sentence and the speech of a word uttered in different speaking styles are combined.

[0007]

Means for Solving the Problems

According to the present invention, when the speech of the fixed part of a sentence and the speech of a word are joined to synthesize the speech of the complete sentence, the prosody of the word's speech is controlled so that it matches the sentence. In this prosody control, data indicating the prosody of the word's speech and data indicating the prosody range of the sentence's word portion are first obtained and compared. Based on the result of the comparison, the word's speech is then adjusted so that its prosody falls within the prosody range of the word portion. Joining the word speech whose prosody has been controlled in this way to the speech of the fixed part of the sentence yields natural-sounding speech for the complete sentence.

[0008] The prosody parameters manipulated in the prosody control can be of three kinds: pitch frequency, amplitude, and duration. A pitch-synchronous waveform overlap-add method can be used as the prosody control method.

[0009] According to the present invention, the speech of a word uttered in one speaking style can be converted into speech whose prosody matches each of various sentences uttered in various speaking styles, so there is no need to prepare multiple speech variants with different prosody for every word. Even a small system with limited database capacity can therefore synthesize sentences with natural-sounding speech.

[0010]

Embodiments of the Invention

FIG. 1 shows the configuration of a speech synthesizer according to one embodiment of the present invention. In practice, this device is a computer system on which an application program for speech synthesis processing according to the present invention is installed.

[0011] A word database 1, a word prosody table 3, a sentence database 5, and an in-sentence prosody table 7 are provided in suitable storage such as a magnetic disk device. The word database 1 stores the word numbers of various words together with speech data obtained from actual utterances of those words. The word prosody table 3 stores, together with the word numbers, prosody data measured from the actual utterances of those words. The prosody data of each word consists of the mean values of three prosody parameters measured over the word's speech interval: pitch frequency, amplitude, and duration.
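As an illustration of how the word database 1 and the word prosody table 3 might be laid out, the following Python sketch models one registered word as a waveform plus the three per-word prosody means. The class names, field names, and sample values are assumptions made for illustration and are not taken from the patent.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class WordProsody:
        """Mean prosody of one registered word, measured over its speech interval."""
        pitch_hz: float      # mean pitch frequency
        amplitude: float     # mean amplitude
        duration_s: float    # time length of the word

    @dataclass
    class WordEntry:
        """One record covering the word database 1 and the word prosody table 3."""
        word_id: int
        samples: List[float]     # recorded waveform of the isolated word
        prosody: WordProsody

    # Word database keyed by word number, as used by the word search process 11.
    # The 277 Hz pitch anticipates the numerical example in paragraph [0016].
    word_database: Dict[int, WordEntry] = {
        101: WordEntry(word_id=101, samples=[], prosody=WordProsody(277.0, 0.35, 0.48)),
    }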

[0012] The sentence database 5 stores the sentence numbers of various sentences together with speech data, obtained from actual utterances of those sentences, for the fixed part of each sentence (that is, the remainder left after removing the replaceable word portion). The in-sentence prosody table 7 stores data indicating the range of prosody of the replaceable word portion of each of those sentences. The prosody-range data of each sentence consists of the mean and standard deviation of the above three prosody parameters for the word portion of that sentence, and it is created as follows. For each sentence, a large number of versions in which only the word portion is replaced with various words are actually uttered, the prosody of those word portions (the values of the three prosody parameters) is measured, and the mean and standard deviation of the measured values are computed for each parameter. Experiments by the inventors showed that each individual sentence has a range of pitch frequency, amplitude, and duration that is characteristic of its word portion. The prosody-range data created in this way therefore indicates, for each sentence, the prosody range characteristic of that sentence's word portion. This means that when a sentence is uttered, if the word portion is spoken with a prosody within that range, the word is auditorily consistent with the fixed part of the sentence and the sentence as a whole sounds natural.
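The prosody-range data described above can be prepared offline by measuring the word portion in many recorded utterances of each template sentence and taking the mean and standard deviation of each parameter. A minimal Python sketch of that preparation, with invented measurement values, might look like this:

    from dataclasses import dataclass
    from statistics import mean, stdev
    from typing import List, Tuple

    @dataclass
    class ProsodyRange:
        """(mean, standard deviation) of each parameter for one sentence's word portion."""
        pitch_hz: Tuple[float, float]
        amplitude: Tuple[float, float]
        duration_s: Tuple[float, float]

    def build_prosody_range(measurements: List[Tuple[float, float, float]]) -> ProsodyRange:
        """measurements holds (pitch, amplitude, duration) of the word portion in each utterance."""
        pitches, amps, durs = zip(*measurements)
        stats = lambda values: (mean(values), stdev(values))
        return ProsodyRange(stats(pitches), stats(amps), stats(durs))

    # Invented measurements of the slot "XX" in recordings of
    # "There was a transfer from XX Bank" spoken with different bank names.
    slot_measurements = [(320.0, 0.40, 0.50), (331.0, 0.38, 0.55), (298.0, 0.42, 0.47),
                         (355.0, 0.36, 0.52), (321.0, 0.41, 0.49)]
    sentence_range = build_prosody_range(slot_measurements)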

[0013] By executing the application program, the CPU 9 carries out five processes: a word search process 11, a sentence search process 13, a prosody generation process 15, a prosody control process 17, and an edit synthesis process 19. The inputs to the CPU 9 are the word number and the sentence number of the word and the sentence that make up the desired system response sentence.

[0014] The word search process 11 is performed in response to the input word number. In this process 11, the speech data 21 and the prosody data 23 of the word identified by the input word number are retrieved from the word database 1 and the word prosody table 3. The retrieved speech data 21 of the word is passed to the prosody control process 17, and the prosody data 23 is passed to the prosody generation process 15.

[0015] The sentence search process 13 is performed in response to the input sentence number. In this process 13, the speech data 25 of the fixed part and the prosody-range data 27 of the word portion of the sentence identified by the input sentence number are retrieved from the sentence database 5 and the in-sentence prosody table 7. The retrieved speech data 25 of the fixed part is passed to the edit synthesis process 19, and the prosody-range data 27 of the word portion is passed to the prosody generation process 15.

[0016] In the prosody generation process 15, the prosody data 23 of the word is compared with the prosody-range data 27 of the sentence's word portion, and the difference 29 between them is calculated. For example, if the pitch frequency indicated by the prosody data 23 is 277 Hz, and the prosody-range data 27 indicates a mean pitch frequency of 325 Hz with a standard deviation of 34 Hz, then the pitch-frequency range indicated by the prosody-range data 27 is 291 to 359 Hz, so the pitch-frequency difference 29 is 291 - 277 = 14 Hz. The difference 29 is calculated in the same way for the other prosody parameters. The differences 29 for these three prosody parameters are passed to the prosody control process 17. If any of the three parameter values indicated by the word's prosody data 23 already lies within the range indicated by the prosody-range data 27, the difference 29 for that parameter is zero.
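The following small sketch reproduces the difference calculation of the prosody generation process 15 for the numerical example above, assuming, as the figures imply, that the range is taken as the mean plus or minus one standard deviation:

    def prosody_difference(word_value: float, slot_mean: float, slot_std: float) -> float:
        """Signed amount by which a word's parameter must change to enter the
        word portion's range [mean - std, mean + std]; zero if already inside."""
        lo, hi = slot_mean - slot_std, slot_mean + slot_std
        if word_value < lo:
            return lo - word_value    # positive: raise the parameter
        if word_value > hi:
            return hi - word_value    # negative: lower the parameter
        return 0.0

    # Pitch example from the text: range 291 to 359 Hz, word pitch 277 Hz, difference 14 Hz.
    print(prosody_difference(277.0, 325.0, 34.0))   # -> 14.0
    # The same calculation is applied to the amplitude and duration parameters.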

[0017] In the prosody control process 17, prosody control based on the prosody differences 29 is applied to the speech data 21 of the word. The pitch-synchronous waveform overlap-add method, for example, is used as the prosody control method. As a result of the prosody control process 17, the original speech data 21 is converted into speech data 31 having a prosody that lies within the range indicated by the prosody-range data 27. For any parameter whose difference 29 is zero, no control is performed and the value of the original speech data 21 is kept as it is. The speech data 31 with controlled prosody obtained by this process 17 is passed to the edit synthesis process 19.

[0018] In the edit synthesis process 19, the speech data 31 of the word with controlled prosody is inserted into the blank word portion of the speech data 25 of the sentence's fixed part, producing the speech data 33 of the complete system response sentence. The speech data 33 is played back as sound by an audio output device such as a loudspeaker.
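The edit synthesis step itself amounts to inserting the prosody-controlled word waveform into the blank slot of the fixed part. A minimal sketch, assuming the fixed part is stored as the audio before and after the slot (the function and variable names are illustrative):

    import numpy as np

    def splice_sentence(fixed_before: np.ndarray,
                        fixed_after: np.ndarray,
                        word_audio: np.ndarray) -> np.ndarray:
        """Insert the controlled word speech 31 into the fixed-part speech 25,
        producing the complete response sentence 33."""
        return np.concatenate([fixed_before, word_audio, fixed_after])

In practice a short crossfade at the two joins could be added to avoid clicks, although the patent does not describe one.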

[0019] Incidentally, the pitch-synchronous waveform overlap-add method used for the prosody control has the advantages that the synthesized speech resulting from the control is of high quality and that processing in units of pitch waveforms is easy. The method is described in detail in, for example, E. Moulines and F. Charpentier, "Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis using Diphones," Speech Communication, Vol. 9, pp. 453-467, Dec. 1990.

[0020] FIG. 2 shows the flow of prosody control by the pitch-synchronous waveform overlap-add method, and FIG. 3 shows the speech waveform at each stage of this prosody control.

[0021] First, a window function is applied to the original speech data 21, which represents the original speech waveform shown in FIG. 3A, to extract the individual pitch waveforms shown in FIG. 3B (S1). Next, a weighting function determined by the amplitude difference is applied to each pitch waveform to adjust its amplitude, as shown in FIG. 3C (S2). Next, the duration is adjusted as shown in FIG. 3D by increasing or decreasing the number of pitch waveforms in the speech interval according to the duration difference (S3). Finally, the spacing (period) of the pitch waveforms is changed according to the pitch-frequency difference to adjust the pitch frequency, and the pitch waveforms are recombined to create the speech data 31 representing a waveform with controlled prosody, as shown in FIG. 3E (S4).

[0022] Through this prosody control, the prosody of the word's original speech is corrected to lie within the prosody range suited to the sentence into which it is to be inserted, so the speech of the whole sentence containing that word sounds natural. A single speech recording per word can therefore be adapted for use with a variety of sentences having different speaking styles.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.

FIG. 2 is a flowchart showing the flow of prosody control by the pitch-synchronous waveform overlap-add method.

FIG. 3 is a waveform diagram showing the speech waveform at each step of the prosody control.

[Explanation of Reference Numerals]

1 Word database
3 Word prosody table
5 Sentence database
7 In-sentence prosody table
9 CPU
11 Word search process
13 Sentence search process
15 Prosody generation process
17 Prosody control process
19 Edit synthesis process
21 Speech data of a word
23 Prosody data of a word
25 Speech data of the fixed part of a sentence
27 Prosody-range data of the word portion of a sentence
29 Prosody difference
31 Speech data of a word with controlled prosody
33 Synthesized system response sentence

Claims (6)

[Claims]

1. In a system that synthesizes the complete speech of a sentence having a word portion and a fixed part by joining the speech of the fixed part with the speech of a word to be placed in the word portion, a device for controlling the prosody of the word so that it matches the sentence, the device comprising:
word acquisition means for acquiring the speech and the prosody of the word;
prosody-range acquisition means for acquiring the prosody range of the word portion of the sentence;
prosody comparison means for comparing the acquired prosody of the word with the prosody range of the word portion; and
prosody control means for performing, on the acquired speech of the word, prosody control based on the comparison result from the prosody comparison means, thereby converting the speech of the word into the speech of a controlled word having a prosody lying within the prosody range.
2. The device according to claim 1, wherein
the prosody of the word includes the values of three prosody parameters of the speech of the word, namely pitch frequency, amplitude, and duration; and
the prosody range of the word portion includes the ranges of the values of those three prosody parameters for the word portion.
3. The device according to claim 1, wherein the prosody control means performs prosody control using a pitch-synchronous waveform overlap-add method.
4. The device according to claim 1, wherein
the word acquisition means comprises a word database storing the speech of a plurality of words, a word prosody table storing the prosody of the plurality of words, and means for retrieving the speech and the prosody of a selected word from the word database and the word prosody table; and
the prosody-range acquisition means comprises an in-sentence prosody table storing the prosody ranges of the word portions of a plurality of sentences, and means for retrieving the prosody range of the word portion of a selected sentence from the in-sentence prosody table.
5. In a system that synthesizes the complete speech of a sentence having a word portion and a fixed part by joining the speech of the fixed part with the speech of a word to be placed in the word portion, a method of controlling the prosody of the word so that it matches the sentence, the method comprising the steps of:
acquiring the speech and the prosody of the word;
acquiring the prosody range of the word portion of the sentence;
comparing the acquired prosody of the word with the prosody range of the word portion; and
performing, on the acquired speech of the word, prosody control based on the result of the comparing step, thereby converting the speech of the word into the speech of a controlled word having a prosody lying within the prosody range.
6. A speech synthesis system that synthesizes the complete speech of a sentence having a word portion and a fixed part by joining the speech of the fixed part with the speech of a word to be placed in the word portion, the system comprising:
word acquisition means for acquiring the speech and the prosody of the word;
sentence acquisition means for acquiring the speech of the fixed part of the sentence and the prosody range of the word portion;
prosody comparison means for comparing the acquired prosody of the word with the prosody range of the word portion;
prosody control means for performing, on the acquired speech of the word, prosody control based on the comparison result from the prosody comparison means, thereby converting the speech of the word into the speech of a controlled word having a prosody lying within the prosody range; and
edit synthesis means for combining the speech of the controlled word from the prosody control means with the speech of the fixed part from the sentence acquisition means to create the complete speech of the sentence.
JP05315996A 1996-03-11 1996-03-11 Prosody control device and method Expired - Fee Related JP3241582B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP05315996A JP3241582B2 (en) 1996-03-11 1996-03-11 Prosody control device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP05315996A JP3241582B2 (en) 1996-03-11 1996-03-11 Prosody control device and method

Publications (2)

Publication Number Publication Date
JPH09244680A 1997-09-19
JP3241582B2 JP3241582B2 (en) 2001-12-25

Family

ID=12935078

Family Applications (1)

Application Number Title Priority Date Filing Date
JP05315996A Expired - Fee Related JP3241582B2 (en) 1996-03-11 1996-03-11 Prosody control device and method

Country Status (1)

Country Link
JP (1) JP3241582B2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11237971A (en) * 1998-02-23 1999-08-31 Nippon Telegr & Teleph Corp <Ntt> Voice responding device
JP2009282236A (en) * 2008-05-21 2009-12-03 Mitsubishi Electric Corp Speech synthesizer


Also Published As

Publication number Publication date
JP3241582B2 (en) 2001-12-25

Similar Documents

Publication Publication Date Title
JP2885372B2 (en) Audio coding method
JPH10153998A (en) Auxiliary information utilizing type voice synthesizing method, recording medium recording procedure performing this method, and device performing this method
JPH031200A (en) Regulation type voice synthesizing device
US20020049594A1 (en) Speech synthesis
JP2003337592A (en) Method and equipment for synthesizing voice, and program for synthesizing voice
US6832192B2 (en) Speech synthesizing method and apparatus
US10643600B1 (en) Modifying syllable durations for personalizing Chinese Mandarin TTS using small corpus
JP2002525663A (en) Digital voice processing apparatus and method
JPH08335096A (en) Text voice synthesizer
JP3241582B2 (en) Prosody control device and method
JP4451665B2 (en) How to synthesize speech
JP5175422B2 (en) Method for controlling time width in speech synthesis
JPH11249679A (en) Voice synthesizer
US7130799B1 (en) Speech synthesis method
JP3785892B2 (en) Speech synthesizer and recording medium
JP2008058379A (en) Speech synthesis system and filter device
JP3081300B2 (en) Residual driven speech synthesizer
JP3059751B2 (en) Residual driven speech synthesizer
JPH09179576A (en) Voice synthesizing method
JP3113101B2 (en) Speech synthesizer
JP2577372B2 (en) Speech synthesis apparatus and method
JP3967571B2 (en) Sound source waveform generation device, speech synthesizer, sound source waveform generation method and program
JP3310217B2 (en) Speech synthesis method and apparatus
JPH11109992A (en) Phoneme database creating method, voice synthesis method, phoneme database, voice element piece database preparing device and voice synthesizer
JP2809769B2 (en) Speech synthesizer

Legal Events

Code Title Description
R250 Receipt of annual fees (JAPANESE INTERMEDIATE CODE: R250)
R250 Receipt of annual fees (JAPANESE INTERMEDIATE CODE: R250)
R250 Receipt of annual fees (JAPANESE INTERMEDIATE CODE: R250)
FPAY Renewal fee payment (payment until 2007-10-19; year of fee payment: 6)
FPAY Renewal fee payment (payment until 2008-10-19; year of fee payment: 7)
FPAY Renewal fee payment (payment until 2009-10-19; year of fee payment: 8)
FPAY Renewal fee payment (payment until 2010-10-19; year of fee payment: 9)
FPAY Renewal fee payment (payment until 2011-10-19; year of fee payment: 10)
FPAY Renewal fee payment (payment until 2012-10-19; year of fee payment: 11)
LAPS Cancellation because of no payment of annual fees