JP3313310B2 - Speech synthesis apparatus and synthesis method

Info

Publication number: JP3313310B2
Application number: JP24169897A
Authority: JP
Inventors: 利光蓑輪; 康彦新居; 洋文西村
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1997-08-25
Filing date: 1997-08-25
Publication date: 2002-08-12
Anticipated expiration: 2017-08-25
Also published as: JPH1165596A

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis apparatus and method that need only a small memory for the speech unit database, by keeping the quality of the synthesized speech high even when few speech units are stored in the database. SOLUTION: Input characters are converted into phonetic notation (means 101). The prosody of the synthesized speech is calculated from the converted phonetic notation (means 102). Based on the converted phonetic notation, speech units including at least one of the first two syllables of a word, to which human hearing is especially sensitive, speech units with high pitch frequency up to the accent nucleus, speech units including word endings, and other speech units are selected (means 103) from a memory 104. The prosody of these speech units is then modified (means 105). For the speech units including at least one of the first two syllables of a word, those with high pitch frequency up to the accent nucleus, and those including word endings, the modification preserves the fine structure of the prosody from the time the speech units were recorded. For the other speech units, the modification either uses the fine structure of the recorded prosody or uses the calculated prosody.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesis apparatus and a synthesis method that convert input characters according to rules and synthesize them into speech.

[0002]

[Prior Art] Conventionally, a speech synthesis apparatus of this kind has been disclosed in Japanese Patent Application Laid-Open No. 6-95692, "Speech synthesis apparatus". The conventional speech synthesis apparatus is described below with reference to FIG. 10. FIG. 10 is a block diagram showing the configuration of this conventional speech synthesis apparatus.

[0003] In FIG. 10, 501 is an input unit that receives a character string and its prosodic information and converts them into phonetic notation; 502 is a prosody parameter generation unit that generates prosody parameters for the phonetic notation from the input unit 501 based on a prosody rule dictionary 503; 504 is a speech unit selection unit that, from a speech unit set 507 chosen by a speech unit set selection unit 506 out of a large-capacity speech unit database 505 containing many speech units, uses the prosody parameters to select speech units such that the spectral change at the connection points between speech units is smooth; 508 is a speech synthesizer that connects the selected speech units to synthesize speech; and 509 is an output unit that outputs the synthesized speech.

[0004] Next, the operation of the conventional speech synthesis apparatus is described with reference to FIG. 10. First, the input character string is converted into phonetic notation in the input unit 501, and the prosody parameter generation unit 502 generates prosody parameters based on the prosody rule dictionary 503. The speech unit selection unit 504 then uses the prosody parameters to select, from the speech unit set 507 that the speech unit set selection unit 506 has chosen out of the large-capacity speech unit database 505 containing many units of speech (hereinafter "speech units"), speech units such that the spectral change at the connection points between speech units is smooth. The selected speech units are connected by the speech synthesizer 508, synthesized into continuous speech, and output from the output unit 509. In this way, the conventional speech synthesis apparatus can also provide high-quality synthesized speech.
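The selection criterion of this conventional apparatus can be illustrated with a minimal sketch. It picks, from a set of candidate speech units, the one whose starting spectrum is closest to the ending spectrum of the preceding unit. The function names, the Euclidean distance, and the toy spectra are assumptions made for illustration; the text does not state which smoothness measure the cited apparatus uses.

```python
import math


def spectral_distance(a, b):
    """Euclidean distance between two spectral envelopes of equal length (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_smoothest(prev_end_spectrum, candidates):
    """Return the candidate whose start spectrum is closest to the previous unit's end spectrum."""
    return min(
        candidates,
        key=lambda c: spectral_distance(prev_end_spectrum, c["start_spectrum"]),
    )


# Hypothetical candidates for the same phonetic unit, recorded in different contexts.
candidates = [
    {"name": "ka_1", "start_spectrum": [1.0, 0.2, 0.1]},
    {"name": "ka_2", "start_spectrum": [0.6, 0.5, 0.3]},
]
best = select_smoothest([0.7, 0.5, 0.2], candidates)
print(best["name"])
```

The quality of such a selection depends on how many candidates exist for each unit, which is why this approach tends to need a large database.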

[0005]

[Problems to Be Solved by the Invention] However, in the conventional speech synthesis apparatus described above, securing speech units whose spectral change at the connection points is smooth requires preparing a large number of speech units with a wide variety of spectra, and storing so many speech units requires a large-capacity memory.

[0006] The present invention has been made to solve this conventional problem. Its object is to keep the quality of the synthesized speech high even when few speech units are held in the database, so that the memory storing the speech unit database can be made small.

[0007]

[Means for Solving the Problems] The speech synthesis apparatus and synthesis method according to the present invention convert an input character string and prosodic information into phonetic notation, calculate the prosody of the synthesized speech from the converted phonetic notation, and, based on the converted phonetic notation, select from a memory speech units including at least one of the first two syllables of a word, where human hearing is especially sensitive, speech units with high pitch frequency up to the accent nucleus, speech units including the word ending, or the other speech units. For the perceptually sensitive parts, namely the speech units including at least one of the first two syllables of a word, those with high pitch frequency up to the accent nucleus, and those including the word ending, the prosody is modified while preserving the fine structure of the prosody from the time the speech units were recorded. The prosody of the other speech units is modified either by using the fine structure of the recorded prosody or by using the calculated prosody.

[0008] According to the present invention, the prosody of the speech units that matter most for maintaining the quality of the synthesized speech (the first two syllables of a word, units with high pitch frequency up to the accent nucleus, and word endings) is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. The prosody of the other speech units, which have little effect on the quality of the synthesized speech, is modified either using the fine structure of the recorded prosody or by calculation. This yields a speech synthesis apparatus and synthesis method that keep the quality of the synthesized speech high while reducing the number of speech units stored in the database, and hence the capacity of the memory that stores the speech unit database.
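The grouping of speech units described above can be sketched as a small classifier. The sketch below assigns each syllable of a word to one of the four groups and reports whether the recorded fine structure is kept. The enum, the function names, the 0-based indexing, and the simplification that every syllable up to the accent nucleus counts as "high pitch" are assumptions for illustration, not details given in the text.

```python
from enum import Enum


class UnitClass(Enum):
    WORD_INITIAL = "one of the first two syllables"
    PRE_NUCLEUS = "up to the accent nucleus"
    WORD_FINAL = "word ending"
    OTHER = "other"


def classify(index, n_syllables, nucleus_index):
    """Classify the syllable at `index` (0-based) in a word of `n_syllables` syllables."""
    if index < 2:
        return UnitClass.WORD_INITIAL
    if index == n_syllables - 1:
        return UnitClass.WORD_FINAL
    if index <= nucleus_index:
        # Assumption: every syllable up to the nucleus is treated as high pitch.
        return UnitClass.PRE_NUCLEUS
    return UnitClass.OTHER


def keeps_fine_structure(unit_class, others_use_calculated=True):
    """Sensitive units always keep the recorded fine structure; other units may not."""
    if unit_class is UnitClass.OTHER:
        return not others_use_calculated
    return True


# A hypothetical six-syllable word whose accent nucleus is the fourth syllable.
classes = [classify(i, 6, 3) for i in range(6)]
for i, c in enumerate(classes):
    print(i, c.name, keeps_fine_structure(c))
```

The `others_use_calculated` flag reflects the two options the text allows for the other units: reuse the recorded fine structure, or rely on the calculated prosody.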

[0009]

[Embodiments of the Invention] The invention according to claim 1 comprises: input character string conversion means for converting an input character string of mixed kanji and kana, or input kana readings with prosodic information, into phonetic notation; prosody calculation means for calculating the prosody of the synthesized speech from the converted phonetic notation; a memory storing at least speech units including at least one of the first two syllables of a word, speech units with high pitch frequency up to the accent nucleus, speech units including the word ending, and speech units other than these; speech unit selection means for selecting the corresponding speech units from the memory according to the phonetic notation; speech unit modification means for modifying the prosody of the selected speech units while preserving the fine structure of the prosody from the time the speech units were recorded; speech unit connection means for connecting the modified speech units; and synthesized speech output means for outputting the connected synthesized speech. The prosody of the speech units that are especially important for maintaining the quality of the synthesized speech (the first two syllables of a word, units with high pitch frequency up to the accent nucleus, and word endings), and of the other speech units, is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. This has the effect that the number of speech units stored in the database, and hence the capacity of the memory that stores them, can be reduced while the quality of the synthesized speech is kept high.

[0010] The invention according to claim 2 comprises: input character string conversion means for converting an input character string of mixed kanji and kana, or input kana readings with prosodic information, into phonetic notation; prosody calculation means for calculating the prosody of the synthesized speech from the converted phonetic notation; a memory storing at least speech units including at least one of the first two syllables of a word, speech units with high pitch frequency up to the accent nucleus, speech units including the word ending, and speech units other than these; speech unit selection means for selecting the corresponding speech units from the memory according to the phonetic notation; speech unit modification means that modifies the prosody of the selected speech units including at least one of the first two syllables of a word, those with high pitch frequency up to the accent nucleus, and those including the word ending while preserving the fine structure of the prosody from the time of recording, and that modifies the prosody of the other speech units using the calculated prosody, without using the fine structure of the prosody from the time the speech units were recorded; speech unit connection means for connecting the modified speech units; and synthesized speech output means for outputting the connected synthesized speech. The prosody of the speech units that are especially important for maintaining the quality of the synthesized speech (the first two syllables of a word, units with high pitch frequency up to the accent nucleus, and word endings) is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. The prosody of the other speech units, which have little effect on the quality of the synthesized speech, is modified using calculated values rather than the fine structure of the recorded prosody. This has the effect that the number of speech units stored in the database, and hence the capacity of the memory that stores them, can be reduced while the quality of the synthesized speech is kept high.
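The two modification modes distinguished in claim 2 can be contrasted in a minimal sketch. Equations (1-1) and (1-2), which the description refers to, are not reproduced in this part of the text, so the straight-line correction below is only an assumed stand-in: it shifts a recorded pitch contour so that two chosen frames hit target values while the frame-to-frame detail of the recording is retained. All names and numbers are hypothetical.

```python
def modify_keep_fine_structure(recorded, onset_idx, steady_idx, target_onset, target_steady):
    """Add a straight-line correction to a recorded pitch contour (Hz per frame).

    The corrected contour equals the targets at the two anchor frames, and the
    local fluctuations of the recording are retained (assumed illustration, not
    the patent's equations).
    """
    if steady_idx == onset_idx:
        raise ValueError("onset and steady-state frames must differ")
    d0 = target_onset - recorded[onset_idx]
    d1 = target_steady - recorded[steady_idx]
    slope = (d1 - d0) / (steady_idx - onset_idx)
    return [f + d0 + slope * (t - onset_idx) for t, f in enumerate(recorded)]


def modify_with_calculated(calculated):
    """Discard the recorded contour and use the calculated prosody as is."""
    return list(calculated)


recorded = [120.0, 124.0, 121.0, 126.0, 123.0]    # recorded contour with local fluctuation
calculated = [130.0, 129.5, 129.0, 128.5, 128.0]  # smooth contour from the prosody calculation

kept = modify_keep_fine_structure(recorded, 0, 4, calculated[0], calculated[4])
replaced = modify_with_calculated(calculated)
print(kept)
print(replaced)
```

In the first mode the output still rises and falls where the recording did; in the second mode only the calculated contour remains.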

[0011] The invention according to claim 3 comprises: input character string conversion means for converting an input character string of mixed kanji and kana, or input kana readings with prosodic information, into phonetic notation; prosody calculation means for calculating the prosody of the synthesized speech from the converted phonetic notation; a memory storing at least speech units including at least one of the first two syllables of a word, speech units including the word ending, and speech units other than these; speech unit selection means for selecting the corresponding speech units from the memory according to the phonetic notation; speech unit modification means for modifying the prosody of the selected speech units while preserving the fine structure of the prosody from the time the speech units were recorded; speech unit connection means for connecting the modified speech units; and synthesized speech output means for outputting the connected synthesized speech. The prosody of the speech units that are especially important for maintaining the quality of the synthesized speech (the first two syllables of a word and word endings), and of the other speech units, is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. This has the effect that the number of speech units stored in the database, and hence the capacity of the memory that stores them, can be reduced while the quality of the synthesized speech is kept high.

[0012] The invention according to claim 4 comprises: input character string conversion means for converting an input character string of mixed kanji and kana, or input kana readings with prosodic information, into phonetic notation; prosody calculation means for calculating the prosody of the synthesized speech from the converted phonetic notation; a memory storing at least speech units including at least one of the first two syllables of a word, speech units including the word ending, and speech units other than these; speech unit selection means for selecting the corresponding speech units from the memory according to the phonetic notation; speech unit modification means that modifies the prosody of the selected speech units including at least one of the first two syllables of a word and those including the word ending while preserving the fine structure of the prosody from the time of recording, and that modifies the prosody of the other speech units using the calculated prosody, without using the fine structure of the prosody from the time the speech units were recorded; speech unit connection means for connecting the modified speech units; and synthesized speech output means for outputting the connected synthesized speech. The prosody of the speech units that are especially important for maintaining the quality of the synthesized speech (the first two syllables of a word, or word endings) is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. The prosody of the other speech units, which have little effect on the quality of the synthesized speech, is modified using calculated values rather than the fine structure of the recorded prosody. This has the effect that the number of speech units stored in the database, and hence the capacity of the memory that stores them, can be reduced while the quality of the synthesized speech is kept high.

[0013] The invention according to claim 5 converts an input character string of mixed kanji and kana, or input kana readings with prosodic information, into phonetic notation; based on the converted phonetic notation, selects from a memory the speech units corresponding to the phonetic notation out of speech units including at least one of the first two syllables of a word, speech units with high pitch frequency up to the accent nucleus, speech units including the word ending, and speech units other than these; modifies the prosody of the selected speech units while preserving the fine structure of the prosody from the time the speech units were recorded; and connects the modified speech units and outputs them as synthesized speech. The prosody of the speech units that are especially important for maintaining the quality of the synthesized speech (the first two syllables of a word, units with high pitch frequency up to the accent nucleus, and word endings), and of the other speech units, is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. This has the effect that the number of speech units stored in the database, and hence the capacity of the memory that stores them, can be reduced while the quality of the synthesized speech is kept high.

[0014] The invention according to claim 6 converts an input character string of mixed kanji and kana, or input kana readings with prosodic information, into phonetic notation; calculates the prosody of the synthesized speech from the converted phonetic notation; based on the converted phonetic notation, selects from a memory the speech units corresponding to the phonetic notation out of speech units including at least one of the first two syllables of a word, speech units with high pitch frequency up to the accent nucleus, speech units including the word ending, and speech units other than these; modifies the prosody of the selected speech units including at least one of the first two syllables of a word, those with high pitch frequency up to the accent nucleus, and those including the word ending while preserving the fine structure of the prosody from the time the speech units were recorded; modifies the prosody of the other speech units using the calculated prosody, without using the fine structure of the prosody from the time of recording; and connects the modified speech units and outputs them as synthesized speech. The prosody of the speech units that are especially important for maintaining the quality of the synthesized speech (the first two syllables of a word, units with high pitch frequency up to the accent nucleus, or word endings) is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. The prosody of the other speech units, which have little effect on the quality of the synthesized speech, is modified using calculated values rather than the fine structure of the recorded prosody. This has the effect that the number of speech units stored in the database, and hence the capacity of the memory that stores them, can be reduced while the quality of the synthesized speech is kept high.

[0015] The invention according to claim 7 converts an input character string of mixed kanji and kana, or input kana readings with prosodic information, into phonetic notation; based on the converted phonetic notation, selects from a memory the speech units corresponding to the phonetic notation out of speech units including at least one of the first two syllables of a word, speech units including the word ending, and speech units other than these; modifies the prosody of the selected speech units while preserving the fine structure of the prosody from the time the speech units were recorded; and connects the modified speech units and outputs them as synthesized speech. The prosody of the speech units that are especially important for maintaining the quality of the synthesized speech (the first two syllables of a word and word endings), and of the other speech units, is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. This has the effect that the number of speech units stored in the database, and hence the capacity of the memory that stores them, can be reduced while the quality of the synthesized speech is kept high.

[0016] The invention according to claim 8 converts an input character string of mixed kanji and kana, or input kana readings with prosodic information, into phonetic notation; calculates the prosody of the synthesized speech from the converted phonetic notation; based on the converted phonetic notation, selects from a memory the speech units corresponding to the phonetic notation out of speech units including at least one of the first two syllables of a word, speech units including the word ending, and speech units other than these; modifies the prosody of the selected speech units including at least one of the first two syllables of a word and those including the word ending while preserving the fine structure of the prosody from the time the speech units were recorded; modifies the prosody of the other speech units using the calculated prosody, without using the fine structure of the prosody from the time of recording; and connects the modified speech units and outputs them as synthesized speech. The prosody of the speech units that are especially important for maintaining the quality of the synthesized speech (the first two syllables of a word, or word endings) is modified using the fine structure of the prosody from the time the speech units were recorded, which does not require a large storage capacity. The prosody of the other speech units, which have little effect on the quality of the synthesized speech, is modified using calculated values rather than the fine structure of the recorded prosody. This has the effect that the number of speech units stored in the database, and hence the capacity of the memory that stores them, can be reduced while the quality of the synthesized speech is kept high.

[0017] Embodiments of the present invention are described in detail below with reference to the accompanying drawings, FIGS. 1 to 9. FIG. 1 is a block diagram showing the configuration of a speech synthesis apparatus according to a first embodiment of the present invention. FIG. 2 is a block diagram showing the configuration of a speech synthesis apparatus according to a second embodiment. FIG. 3 is a block diagram showing the configuration of a speech synthesis apparatus according to a third embodiment. FIG. 4 is a block diagram showing the configuration of a speech synthesis apparatus according to a fourth embodiment. FIG. 5 is a diagram explaining how the input is converted into phonetic notation in the speech synthesis apparatus shown in FIG. 1 when the input is a kanji character string. FIG. 6 is a flowchart showing a speech synthesis method according to a fifth embodiment. FIG. 7 is a flowchart showing a speech synthesis method according to a sixth embodiment. FIG. 8 is a flowchart showing a speech synthesis method according to a seventh embodiment. FIG. 9 is a flowchart showing a speech synthesis method according to an eighth embodiment.

[0018] First, the configuration of the speech synthesis apparatus according to the first embodiment of the present invention is described with reference to FIG. 1. In FIG. 1, 101 is input character string conversion means that receives a character string of mixed kanji and kana, or kana readings with prosodic information, and converts it into phonetic notation; 102 is prosody calculation means that calculates the prosody (pitch frequency) of the entire synthesized speech based on the phonetic notation; 104 is a memory storing a database that contains speech units including at least one of the first two syllables of a word, speech units up to the accent nucleus, speech units including the word ending, and the other speech units; and 103 is speech unit selection means that converts the phonetic notation output from the input character string conversion means 101 into speech unit notation in CV or VCV units and searches the memory 104 to select the speech unit corresponding to each.
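The conversion into CV or VCV speech unit notation can be illustrated with a minimal sketch. The phonetic notation format used by the apparatus is not shown in this part of the text, so the sketch assumes a plain romanized string in which every mora is an optional consonant followed by one of the vowels a, i, u, e, o. The function names and the way VCV units overlap on the shared vowel are likewise assumptions.

```python
VOWELS = set("aiueo")


def to_cv_units(phonemes):
    """Split a romanized string into consonant-vowel (CV) units."""
    units, current = [], ""
    for ch in phonemes:
        current += ch
        if ch in VOWELS:
            units.append(current)
            current = ""
    if current:
        units.append(current)  # trailing consonant, kept as its own unit
    return units


def to_vcv_units(phonemes):
    """Build vowel-consonant-vowel (VCV) units; the first unit stays CV."""
    cv = to_cv_units(phonemes)
    units = cv[:1]
    for prev, cur in zip(cv, cv[1:]):
        units.append(prev[-1] + cur)  # prepend the final vowel of the preceding unit
    return units


print(to_cv_units("sakura"))
print(to_vcv_units("sakura"))
```

Each resulting label would then serve as a lookup key into the speech unit database held in the memory 104.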

[0019] Further, 105 is speech unit modification means that modifies the prosody (pitch frequency) of the selected speech units, namely the speech units including at least one of the first two syllables of a word, the speech units with high pitch frequency up to the accent nucleus, the speech units including the word ending, and the other speech units, according to equation (1-1) or (1-2) (described later) while preserving the fine structure of the prosody from the time of recording; 106 is speech unit connection means that connects the modified speech units by averaging; and 107 is synthesized speech output means that outputs the synthesized speech formed from the connected speech units.

[0020] Next, the operation of the speech synthesis apparatus according to the first embodiment of the present invention is described with reference to FIG. 1. The input character string conversion means 101 converts the input character string of mixed kanji and kana, or the input kana readings with prosodic information, into phonetic notation. An example of this conversion is described later with reference to FIG. 5. Next, the prosody calculation means 102 calculates the prosody (pitch frequency) of the entire synthesized speech based on the converted phonetic notation. The phonetic notation output from the input character string conversion means 101 is also converted into speech unit notation in CV or VCV units by the speech unit selection means 103, which searches the database in the memory 104, storing the speech units including at least one of the first two syllables of a word, the speech units up to the accent nucleus, the speech units including the word ending, and the other speech units, and selects the speech unit corresponding to each.

【００２１】 Next, these speech units are sent to the speech unit transformation means 105. The speech units containing at least one of the first two syllables, the high-pitch speech units up to the accent nucleus, the speech units containing the word ending, and the other speech units are all transformed in prosody (pitch frequency) according to formula (1-1) or (1-2) (described later), while retaining the fine structure of the prosody at the time of recording, so that they match the model pitch frequency of the synthesized speech at the vowel onset and at the vowel steady-state portion. The transformed speech units are connected in the speech unit connection means 106 by averaging over a range of about one pitch period, and the connected synthesized speech is output from the synthesized speech output means 107. In addition, the stored speech units for the first two syllables, for the high-pitch portion before the accent nucleus, and for the word ending stay closest to the waveforms of the recorded speech units, so the synthesized speech has high quality even though the total amount of stored speech units is small.

【００２２】 Next, the configuration of the speech synthesis apparatus according to the second embodiment of the present invention is described with reference to FIG. 2. In FIG. 2, reference numeral 201 denotes input character string conversion means that receives a character string of mixed kanji and kana, or kana readings with prosodic information, and converts it into phonetic notation. Reference numeral 202 denotes prosody calculation means that calculates the prosody (pitch frequency) of the entire synthesized speech from the converted phonetic notation. Reference numeral 204 denotes a memory storing a database that contains speech units containing at least one of the first two syllables of a word, speech units up to the accent nucleus, speech units containing the word ending, and speech units for the other positions. Reference numeral 203 denotes speech unit selection means that converts the phonetic notation output from the input character string conversion means 201 into speech unit notation in CV or VCV units, searches the memory 204, and selects the corresponding speech unit for each.

【００２３】 Reference numeral 205 denotes speech unit transformation means. For the selected speech units containing at least one of the first two syllables, the high-pitch speech units up to the accent nucleus, and the speech units containing the word ending, it transforms the prosody (pitch frequency) according to formula (2-1) or (2-2) (described later) while retaining the fine structure of the prosody at the time of recording. For the other speech units, it does not use the fine structure of the prosody at the time of recording and instead transforms them using the prosody previously calculated by the prosody calculation means 202. Reference numeral 206 denotes speech unit connection means that connects the transformed speech units by averaging, and 207 denotes synthesized speech output means that outputs the synthesized speech formed from the connected speech units.

【００２４】 Next, the operation of the speech synthesis apparatus according to the second embodiment of the present invention is described with reference to FIG. 2. The input character string conversion means 201 converts the input, which is either a character string of mixed kanji and kana or kana readings with prosodic information, into phonetic notation. An example of this conversion is described later with reference to FIG. 5. Next, the prosody calculation means 202 calculates the prosody (pitch frequency) of the entire synthesized speech from the converted phonetic notation. The speech unit selection means 203 converts the phonetic notation output from the input character string conversion means 201 into speech unit notation in CV or VCV units. It then searches the database in the memory 204, which stores speech units containing at least one of the first two syllables of a word, speech units up to the accent nucleus, speech units containing the word ending, and speech units for the other positions, and selects the corresponding speech unit for each.

【００２５】 Next, these speech units are sent to the speech unit transformation means 205. The speech units containing at least one of the first two syllables, the high-pitch speech units up to the accent nucleus, and the speech units containing the word ending are transformed in prosody (pitch frequency) according to formula (2-1) or (2-2) (described later), while retaining the fine structure of the prosody at the time of recording, so that they match the model pitch frequency of the synthesized speech at the vowel onset and at the vowel steady-state portion. The other speech units are transformed without using the fine structure of the prosody at the time of recording, using instead the prosody previously calculated by the prosody calculation means 202. The transformed speech units are connected in the speech unit connection means 206 by averaging over a range of about one pitch period, and the connected synthesized speech is output from the synthesized speech output means 207.
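The two transformation paths of the second embodiment can be contrasted in a short sketch. The Python below is an illustration only, under stated assumptions: the function names, the category labels, and the frame-by-frame list representation of a pitch contour are hypothetical and do not come from the patent, and the first and last frames stand in for the vowel onset and the vowel steady-state portion.

```python
from typing import List

# Hypothetical category labels; the patent names the categories, not these identifiers.
PRESERVE = {"word_start", "accent_nucleus", "ending"}


def scale_contour(recorded: List[float], p1: float, p2: float) -> List[float]:
    """Scale a recorded pitch contour so its first and last frames hit p1 and p2.

    The scaling ratio moves linearly from p1/recorded[0] to p2/recorded[-1],
    so the small fluctuations of the recording (its fine structure) survive.
    """
    n = len(recorded)
    r_start = p1 / recorded[0]
    r_end = p2 / recorded[-1]
    out = []
    for i, f in enumerate(recorded):
        frac = i / (n - 1) if n > 1 else 0.0
        out.append(f * (r_start + (r_end - r_start) * frac))
    return out


def transform_unit(category: str, recorded: List[float], model: List[float]) -> List[float]:
    """Return the target pitch contour (Hz per frame) for one speech unit.

    recorded: pitch contour of the stored unit
    model:    calculated model pitch contour for the same frames
    """
    if category in PRESERVE:
        # Keep the recorded fine structure, matching only the end points.
        return scale_contour(recorded, model[0], model[-1])
    # Other units: ignore the recorded fine structure and use the model contour.
    return list(model)
```

With this split, only the perceptually important units need carefully recorded prosody, while the remaining units simply follow the calculated contour.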

【００２６】 Next, the configuration of the speech synthesis apparatus according to the third embodiment of the present invention is described with reference to FIG. 3. In FIG. 3, reference numeral 301 denotes input character string conversion means that receives a character string of mixed kanji and kana, or kana readings with prosodic information, and converts it into phonetic notation. Reference numeral 302 denotes prosody calculation means that calculates the prosody (pitch frequency) of the entire synthesized speech from the phonetic notation. Reference numeral 304 denotes a memory storing a database that contains speech units containing at least one of the first two syllables of a word, speech units containing the word ending, and speech units for the other positions. Reference numeral 303 denotes speech unit selection means that converts the phonetic notation output from the input character string conversion means 301 into speech unit notation in CV or VCV units, searches the memory 304, and selects the corresponding speech unit for each.

【００２７】 Reference numeral 305 denotes speech unit transformation means that transforms the prosody (pitch frequency) of the selected speech units containing at least one of the first two syllables, the speech units containing the word ending, and the other speech units according to formula (3-1) or (3-2) (described later), while retaining the fine structure of the prosody at the time of recording. Reference numeral 306 denotes speech unit connection means that connects the transformed speech units by averaging, and 307 denotes synthesized speech output means that outputs the synthesized speech formed from the connected speech units.

【００２８】 Next, the operation of the speech synthesis apparatus according to the third embodiment of the present invention is described with reference to FIG. 3. The input character string conversion means 301 converts the input, which is either a character string of mixed kanji and kana or kana readings with prosodic information, into phonetic notation. An example of this conversion is described later with reference to FIG. 5. Next, the prosody calculation means 302 calculates the prosody (pitch frequency) of the entire synthesized speech from the converted phonetic notation. The speech unit selection means 303 converts the phonetic notation output from the input character string conversion means 301 into speech unit notation in CV or VCV units. It then searches the database in the memory 304, which stores speech units containing at least one of the first two syllables of a word, speech units containing the word ending, and speech units for the other positions, and selects the corresponding speech unit for each.

【００２９】 Next, these speech units are sent to the speech unit transformation means 305. The speech units containing at least one of the first two syllables, the speech units containing the word ending, and the other speech units are all transformed in prosody (pitch frequency) according to formula (3-1) or (3-2) (described later), while retaining the fine structure of the prosody at the time of recording, so that they match the model pitch frequency of the synthesized speech at the vowel onset and at the vowel steady-state portion. The transformed speech units are connected in the speech unit connection means 306 by averaging over a range of about one pitch period, and the connected synthesized speech is output from the synthesized speech output means 307.

【００３０】 Next, the configuration of the speech synthesis apparatus according to the fourth embodiment of the present invention is described with reference to FIG. 4. In FIG. 4, reference numeral 401 denotes input character string conversion means that receives a character string of mixed kanji and kana, or kana readings with prosodic information, and converts it into phonetic notation. Reference numeral 402 denotes prosody calculation means that calculates the prosody (pitch frequency) of the entire synthesized speech from the converted phonetic notation. Reference numeral 404 denotes a memory storing a database that contains speech units containing at least one of the first two syllables of a word, speech units containing the word ending, and speech units for the other positions. Reference numeral 403 denotes speech unit selection means that converts the phonetic notation output from the input character string conversion means 401 into speech unit notation in CV or VCV units, searches the memory 404, and selects the corresponding speech unit for each.

【００３１】 Reference numeral 405 denotes speech unit transformation means. For the selected speech units containing at least one of the first two syllables and the speech units containing the word ending, it transforms the prosody (pitch frequency) according to formula (4-1) or (4-2) (described later) while retaining the fine structure of the prosody at the time of recording. For the other speech units, it does not use the fine structure of the prosody at the time of recording and instead transforms their prosody using the prosody previously calculated by the prosody calculation means 402. Reference numeral 406 denotes speech unit connection means that connects the transformed speech units by averaging, and 407 denotes synthesized speech output means that outputs the synthesized speech formed from the connected speech units.

【００３２】 Next, the operation of the speech synthesis apparatus according to the fourth embodiment of the present invention is described with reference to FIG. 4. The input character string conversion means 401 converts the input, which is either a character string of mixed kanji and kana or kana readings with prosodic information, into phonetic notation. An example of this conversion is described later with reference to FIG. 5. Next, the prosody calculation means 402 calculates the prosody (pitch frequency) of the entire synthesized speech from the converted phonetic notation. The speech unit selection means 403 converts the phonetic notation output from the input character string conversion means 401 into speech unit notation in CV or VCV units. It then searches the database in the memory 404, which stores speech units containing at least one of the first two syllables of a word, speech units containing the word ending, and speech units for the other positions, and selects the corresponding speech unit for each.

【００３３】 Next, these speech units are sent to the speech unit transformation means 405. The speech units containing at least one of the first two syllables and the speech units containing the word ending are transformed in prosody (pitch frequency) according to formula (4-1) or (4-2) (described later), while retaining the fine structure of the prosody at the time of recording, so that they match the model pitch frequency of the synthesized speech at the vowel onset and at the vowel steady-state portion. The other speech units are transformed without using the fine structure of the prosody at the time of recording, using instead the prosody previously calculated by the prosody calculation means 402. The transformed speech units are connected in the speech unit connection means 406 by averaging over a range of about one pitch period, and the connected synthesized speech is output from the synthesized speech output means 407.

【００３４】 Next, the flow of the speech synthesis method according to the fifth embodiment of the present invention is described with reference to the flowchart of FIG. 6. In step 110, the phonetic notation is determined from the input character string, which is written in mixed kanji and kana or given as kana readings, and this phonetic notation is further decomposed into CV or VCV units (C denotes a consonant, V a vowel). Next, in step 111, the prosody (in particular the pitch frequency) of the word to be synthesized is calculated, based on the Fujisaki model or the like, from the mixed kanji-kana notation of the word or its kana reading together with information such as accent information. This calculated pitch frequency is hereinafter called the model pitch frequency.
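As a rough illustration of how a model pitch frequency could be generated, the sketch below evaluates the standard form of the Fujisaki model, in which the logarithm of the pitch frequency is a base value plus phrase and accent components. The patent only says "the Fujisaki model or the like", so the command timings, the magnitudes, and the constants `alpha`, `beta`, and `gamma` used here are illustrative assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def phrase_response(t: float, alpha: float) -> float:
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t: float, beta: float, gamma: float) -> float:
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def model_pitch(
    t: float,
    base_hz: float,
    phrases: List[Tuple[float, float]],         # (onset time in s, magnitude)
    accents: List[Tuple[float, float, float]],  # (onset s, offset s, amplitude)
    alpha: float = 3.0,   # illustrative value
    beta: float = 20.0,   # illustrative value
    gamma: float = 0.9,   # illustrative value
) -> float:
    """Model pitch frequency in Hz at time t (seconds)."""
    log_f0 = math.log(base_hz)
    for t0, ap in phrases:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t_on, t_off, aa in accents:
        log_f0 += aa * (accent_response(t - t_on, beta, gamma)
                        - accent_response(t - t_off, beta, gamma))
    return math.exp(log_f0)
```

Sampling `model_pitch` at the vowel onset and vowel steady-state times of each speech unit would give the reference values that the later transformation step matches.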

【００３５】 Next, in step 112, a speech unit counter is initialized. In step 113, the speech units are taken in order from the beginning of the word, and each is classified as one of the first two syllables of the synthesized speech, a high-pitch unit up to the accent nucleus, a unit containing the word ending, or none of these. If the unit is one of the first two syllables, a high-pitch unit up to the accent nucleus, or a unit containing the ending, a CV or VCV speech unit with the same phonetic notation is selected from the speech units recorded for that purpose (step 114). The speech units used for the remaining portions are taken from speech uttered with a flat accent, and a CV or VCV speech unit with the same phonetic notation is likewise selected (step 115).

【００３６】 FIG. 5 illustrates the case of synthesizing the character string 「横須賀市」 (Yokosuka City). In FIG. 5, the character string notation is 「横須賀市」, its phonetic notation after conversion is "yo ko sxu ka1 shi", and the corresponding synthesis speech units are "yo(S) ko(S) osxu xuka(A) ashi i(E)".
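The unit string of FIG. 5 can be turned into database lookups along the following lines. This is a sketch under assumptions: the text here does not spell out the tags, so reading S, A, and E as word start, accent nucleus, and ending is a guess, and the category names, the function names, and the dictionary-style database are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed meaning of the FIG. 5 tags; untagged units fall into the "other" set.
TAG_TO_CATEGORY = {"S": "word_start", "A": "accent_nucleus", "E": "ending", "": "other"}


def parse_units(spec: str) -> List[Tuple[str, str]]:
    """Split 'yo(S) ko(S) osxu ...' into (notation, category) pairs."""
    units = []
    for token in spec.split():
        match = re.fullmatch(r"([a-z0-9]+)(?:\(([SAE])\))?", token)
        if match is None:
            raise ValueError(f"unrecognized unit token: {token!r}")
        notation, tag = match.group(1), match.group(2) or ""
        units.append((notation, TAG_TO_CATEGORY[tag]))
    return units


def select_units(spec: str, database: Dict[Tuple[str, str], str]) -> List[str]:
    """Look up each unit in a (notation, category) -> waveform-id table."""
    return [database[key] for key in parse_units(spec)]


FIG5_UNITS = "yo(S) ko(S) osxu xuka(A) ashi i(E)"
```

Keying the table on both the notation and the category reflects the idea that the same CV or VCV notation may be stored once per category, recorded in the matching position.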

【００３７】 The speech units selected in steps 114 and 115, whether a unit for one of the first two syllables, a high-pitch unit up to the accent nucleus, a unit containing the word ending, or one of the other units, are transformed in step 116 to match the prosody (pitch frequency) of the synthesized speech. That is, the pitch frequency of each speech unit is transformed by formula [1-1] or [1-2] below so that it matches the model pitch frequency of the synthesized speech at the vowel onset and at the vowel steady-state portion. The pitch frequency is changed by superimposing pitch waveforms.

【００３８】 For a CV speech unit:

$$p(t) = p_0(t)\left\{\frac{p_1}{p_{01}} + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\frac{t}{t_1}\right\} \qquad \text{[1-1]}$$

where
$p(t)$: pitch frequency of the speech unit after transformation
$p_0(t)$: pitch frequency of the speech unit before transformation
$p_{01}$: pitch frequency at the vowel onset of the speech unit before transformation
$p_{02}$: pitch frequency at the vowel steady-state portion of the speech unit before transformation
$p_1$: model pitch frequency at the position corresponding to the vowel onset of the unit
$p_2$: model pitch frequency at the position corresponding to the vowel steady-state portion of the unit
$t$: time, with the vowel onset as the origin
$t_1$: time of the vowel steady-state portion (the origin is the time of the vowel onset)

【００３９】 For a VCV speech unit:

$$p(t) = p_0(t)\left\{\frac{p_1}{p_{01}} + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\frac{t}{t_1}\right\} \qquad \text{[1-2]}$$

where
$p(t)$: pitch frequency of the speech unit after transformation
$p_0(t)$: pitch frequency of the speech unit before transformation
$p_{01}$: pitch frequency at the first-vowel steady-state portion of the speech unit before transformation
$p_{02}$: pitch frequency at the second-vowel steady-state portion of the speech unit before transformation
$p_1$: model pitch frequency at the position corresponding to the first-vowel steady-state portion of the unit
$p_2$: model pitch frequency at the position corresponding to the second-vowel steady-state portion of the unit
$t$: time, with the first-vowel steady-state point as the origin
$t_1$: time of the second-vowel steady-state portion (the origin is the time of the vowel onset)
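The transformation formula can be written directly as a small function. The following is a minimal sketch, not the patented implementation: the name `transform_pitch` and the representation of the pitch contour as a Python callable are assumptions made for illustration. The same expression serves both the CV and the VCV case, since only the meaning of the reference points differs.

```python
from typing import Callable


def transform_pitch(
    p0: Callable[[float], float],
    p01: float,
    p02: float,
    p1: float,
    p2: float,
    t1: float,
) -> Callable[[float], float]:
    """Return p(t) = p0(t) * (p1/p01 + (p2/p02 - p1/p01) * t / t1).

    p0:       pitch contour of the unit before transformation (Hz)
    p01, p02: recorded pitch at the first and second reference points
    p1, p2:   model pitch at the corresponding positions
    t1:       time of the second reference point, measured from the first
    """
    if t1 <= 0.0 or p01 <= 0.0 or p02 <= 0.0:
        raise ValueError("t1, p01 and p02 must be positive")
    r_start = p1 / p01
    r_end = p2 / p02

    def p(t: float) -> float:
        # The ratio is interpolated linearly, so local wiggles of p0 are kept.
        return p0(t) * (r_start + (r_end - r_start) * t / t1)

    return p
```

Because the recorded contour is multiplied by a slowly varying ratio instead of being replaced, the short-term fluctuations of the recording remain in the result while both reference points land on the model values.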

【００４０】 After the speech units have been transformed as described above, they are connected by averaging the pitch waveforms of the second-vowel steady-state portion of the preceding speech unit and the first-vowel steady-state portion of the current speech unit (step 117). After the connection, it is determined whether there is a next speech unit (step 118), and if there is, the next speech unit is selected (step 113). The synthesized speech is created by transforming and connecting speech units in this way, one after another.
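The connection step can be pictured as a sample-by-sample average over one pitch waveform at the boundary. The sketch below assumes that the two units are already aligned so that the last `period` samples of the preceding unit and the first `period` samples of the current unit cover the same pitch period; that alignment, the function name, and the list-of-samples representation are assumptions, not details given in the patent.

```python
from typing import List


def connect_units(prev: List[float], cur: List[float], period: int) -> List[float]:
    """Join two unit waveforms, averaging one pitch period at the boundary.

    prev:   samples of the preceding unit, ending in its second-vowel steady part
    cur:    samples of the current unit, starting in its first-vowel steady part
    period: length in samples of the pitch waveform to be averaged
    """
    if period <= 0 or period > len(prev) or period > len(cur):
        raise ValueError("period must fit inside both units")
    tail = prev[-period:]
    head = cur[:period]
    blended = [(a + b) / 2.0 for a, b in zip(tail, head)]
    return prev[:-period] + blended + cur[period:]
```

For example, `connect_units([1, 1, 1, 1], [3, 3, 3, 3], 2)` yields a six-sample result whose two middle samples are the average of the overlapping parts.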

【００４１】 Needless to say, this speech synthesis method applies not only to words but equally to the creation of synthesized speech for phrases, clauses, sentences, and the like. It also goes without saying that the same processing is possible with speech units in formats other than CV and VCV, for example speech units of other formats or types such as VCV or VV.

【００４２】 Next, the flow of the speech synthesis method according to the sixth embodiment of the present invention is described with reference to the flowchart of FIG. 7. In step 120, the phonetic notation is determined from the input character string, which is written in mixed kanji and kana or given as kana readings, and this phonetic notation is further decomposed into CV or VCV units (C denotes a consonant, V a vowel). Next, in step 121, the prosody (in particular the pitch frequency) of the word to be synthesized is calculated, based on the Fujisaki model or the like, from the mixed kanji-kana notation of the word or its kana reading together with information such as accent information. This calculated pitch frequency is hereinafter called the model pitch frequency.

【００４３】 Next, in step 122, a speech unit counter is initialized. In step 123, the speech units are taken in order from the beginning of the word, and each is classified as one of the first two syllables of the synthesized speech, a high-pitch unit up to the accent nucleus, a unit containing the word ending, or none of these. If the unit is one of the first two syllables, a high-pitch unit up to the accent nucleus, or a unit containing the ending, a CV or VCV speech unit with the same phonetic notation is selected from the speech units recorded for that purpose (step 124). The speech units used for the remaining portions are taken from speech uttered with a flat accent, and a CV or VCV speech unit with the same phonetic notation is likewise selected (step 126).

【００４４】 FIG. 5 illustrates the case of synthesizing the character string 「横須賀市」 (Yokosuka City). In FIG. 5, the character string notation is 「横須賀市」, its phonetic notation after conversion is "yo ko sxu ka1 shi", and the corresponding synthesis speech units are "yo(S) ko(S) osxu xuka(A) ashi i(E)".

【００４５】 The speech units selected in step 124, whether a unit for one of the first two syllables, a high-pitch unit up to the accent nucleus, or a unit containing the word ending, are transformed in step 125 to match the prosody (pitch frequency) of the synthesized speech. That is, the pitch frequency of each speech unit is transformed by formula [2-1] or [2-2] below so that it matches the model pitch frequency of the synthesized speech at the vowel onset and at the vowel steady-state portion. The pitch frequency is changed by superimposing pitch waveforms.

【００４６】 For a CV speech unit:

$$p(t) = p_0(t)\left\{\frac{p_1}{p_{01}} + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\frac{t}{t_1}\right\} \qquad \text{[2-1]}$$

where
$p(t)$: pitch frequency of the speech unit after transformation
$p_0(t)$: pitch frequency of the speech unit before transformation
$p_{01}$: pitch frequency at the vowel onset of the speech unit before transformation
$p_{02}$: pitch frequency at the vowel steady-state portion of the speech unit before transformation
$p_1$: model pitch frequency at the position corresponding to the vowel onset of the unit
$p_2$: model pitch frequency at the position corresponding to the vowel steady-state portion of the unit
$t$: time, with the vowel onset as the origin
$t_1$: time of the vowel steady-state portion (the origin is the time of the vowel onset)

【００４７】 For a VCV speech unit:

$$p(t) = p_0(t)\left\{\frac{p_1}{p_{01}} + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\frac{t}{t_1}\right\} \qquad \text{[2-2]}$$

where
$p(t)$: pitch frequency of the speech unit after transformation
$p_0(t)$: pitch frequency of the speech unit before transformation
$p_{01}$: pitch frequency at the first-vowel steady-state portion of the speech unit before transformation
$p_{02}$: pitch frequency at the second-vowel steady-state portion of the speech unit before transformation
$p_1$: model pitch frequency at the position corresponding to the first-vowel steady-state portion of the unit
$p_2$: model pitch frequency at the position corresponding to the second-vowel steady-state portion of the unit
$t$: time, with the first-vowel steady-state point as the origin
$t_1$: time of the second-vowel steady-state portion (the origin is the time of the vowel onset)
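A short worked check, added as an illustration and not part of the original text, shows why formulas [2-1] and [2-2] reach the model pitch at both reference points. It assumes that $t$ and $t_1$ are measured from the same origin, so that $p_0(0) = p_{01}$ and $p_0(t_1) = p_{02}$.

```latex
\begin{aligned}
p(0)   &= p_0(0)\,\frac{p_1}{p_{01}}
        = p_{01}\,\frac{p_1}{p_{01}} = p_1,\\
p(t_1) &= p_0(t_1)\left\{\frac{p_1}{p_{01}}
          + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\right\}
        = p_{02}\,\frac{p_2}{p_{02}} = p_2.
\end{aligned}
```

Between the two points the multiplying ratio changes linearly from $p_1/p_{01}$ to $p_2/p_{02}$, so the recorded contour $p_0(t)$ is rescaled smoothly and its fine structure is carried over.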

【００４８】 On the other hand, the other speech units, that is, those that are neither a unit for one of the first two syllables nor a unit containing the word ending, are transformed in step 127 by superimposing them pitch waveform by pitch waveform in accordance with the pitch period at each time determined by the model pitch frequency of the synthesized speech.
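One way to picture step 127 is to place pitch marks whose spacing follows the model pitch period and then superimpose a pitch waveform at each mark. The sketch below is an assumption-laden illustration: the function names are hypothetical, the model pitch is passed in as a callable, and a single pitch waveform is reused at every mark for brevity, whereas a real system would presumably take the waveform from the corresponding position of the recorded unit.

```python
from typing import Callable, List


def pitch_marks(model_hz: Callable[[float], float], duration: float) -> List[float]:
    """Place pitch marks so each gap equals the model pitch period at that time."""
    marks = []
    t = 0.0
    while t < duration:
        marks.append(t)
        f = model_hz(t)
        if f <= 0.0:
            raise ValueError("model pitch frequency must be positive")
        t += 1.0 / f
    return marks


def overlap_add(pitch_waveform: List[float], marks: List[float],
                sample_rate: int, duration: float) -> List[float]:
    """Superimpose one pitch waveform at every pitch mark."""
    out = [0.0] * int(round(duration * sample_rate))
    for mark in marks:
        start = int(round(mark * sample_rate))
        for i, sample in enumerate(pitch_waveform):
            if start + i < len(out):
                out[start + i] += sample
    return out
```

Since the mark spacing comes only from the model pitch frequency, the recorded prosody of these units does not influence the result, which matches the description above.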

【００４９】 After the speech units have been transformed as described above, they are connected by averaging the pitch waveforms of the second-vowel steady-state portion of the preceding speech unit and the first-vowel steady-state portion of the current speech unit (step 128). After the connection, it is determined whether there is a next speech unit (step 129), and if there is, the next speech unit is selected (step 123). The synthesized speech is created by transforming and connecting speech units in this way, one after another.

【００５０】 Needless to say, this speech synthesis method applies not only to words but equally to the creation of synthesized speech for phrases, clauses, sentences, and the like. It also goes without saying that the same processing is possible with speech units in formats other than CV and VCV, for example speech units of other formats or types such as VCV or VV.

【００５１】 Next, the flow of the speech synthesis method according to the seventh embodiment of the present invention is described with reference to the flowchart of FIG. 8. In step 130, the phonetic notation is determined from the input character string, which is written in mixed kanji and kana or given as kana readings, and this phonetic notation is further decomposed into CV or VCV units (C denotes a consonant, V a vowel). Next, in step 131, the prosody (in particular the pitch frequency) of the word to be synthesized is calculated, based on the Fujisaki model or the like, from the mixed kanji-kana notation of the word or its kana reading together with information such as accent information. This calculated pitch frequency is hereinafter called the model pitch frequency.

[0052] Next, in step 132, a speech unit counter is initialized. In step 133, the speech units are selected in order from the beginning of the word, each being treated as a unit for one of the first two syllables of the synthesized speech, a unit that includes the word ending, or any other unit. If the unit is for one of the first two syllables or includes the word ending, a CV or VCV speech unit with the same phonetic notation is selected from the speech units recorded for those positions (step 134). The speech units used for the remaining parts are units uttered with a flat accent, and here too a CV or VCV speech unit with the same phonetic notation is selected (step 135).
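The selection in steps 132 to 135 can be sketched as a loop over the unit notations of one word. This is only an illustration: it assumes, following the FIG. 5 example, that the first two units cover the first two syllables and the last unit covers the word ending. The names `select_units`, `position_db`, and `flat_db`, and the dictionary layout, are hypothetical.

```python
def select_units(notations, position_db, flat_db):
    """Pick one stored unit per notation, in order from the word beginning.

    notations:   unit notations of one word, e.g. ["yo", "ko", "osxu", ...]
    position_db: {(role, notation): unit} recorded for word head / word ending
    flat_db:     {notation: unit} taken from flat-accent utterances
    """
    selected = []
    last_index = len(notations) - 1
    # The enumerate index plays the role of the speech unit counter (step 132).
    for counter, notation in enumerate(notations):  # step 133
        if counter < 2:
            role = "head"   # assumed: one of the first two syllables
        elif counter == last_index:
            role = "tail"   # assumed: unit that includes the word ending
        else:
            role = "other"
        if role == "other":
            unit = flat_db[notation]                # step 135
        else:
            unit = position_db[(role, notation)]    # step 134
        selected.append((notation, role, unit))
    return selected
```

Keeping the two stores separate mirrors the text: only the head and ending positions need dedicated recordings, while everything else comes from the flat-accent material.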

[0053] FIG. 5 illustrates the synthesis of the character string "Yokosuka City" (横須賀市). In FIG. 5, the character string notation is 横須賀市, which is converted into the phonetic notation "yo ko sxu ka1 shi"; the corresponding synthesis units are "yo(S) ko(S) osxu xuka(A) ashi i(E)".

[0054] The speech units selected in steps 134 and 135, whether a unit for one of the first two syllables, a unit including the word ending, or any other unit, are transformed in step 136 to match the prosody (pitch frequency) of the synthesized speech. Specifically, using equation [3-1] or [3-2] below, the pitch frequency of the speech unit is transformed so that it matches the model pitch frequency of the synthesized speech at the vowel onset and at the vowel steady-state portion. The pitch frequency is changed by superimposing pitch waveforms.

[0055] For a CV speech unit:

$$p(t) = p_0(t)\left\{\frac{p_1}{p_{01}} + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\frac{t}{t_1}\right\} \qquad [3\text{-}1]$$

where
$p(t)$: pitch frequency of the speech unit after transformation
$p_0(t)$: pitch frequency of the speech unit before transformation
$p_{01}$: pitch frequency at the vowel onset of the speech unit before transformation
$p_{02}$: pitch frequency at the vowel steady-state portion of the speech unit before transformation
$p_1$: model pitch frequency at the position corresponding to the vowel onset of the unit
$p_2$: model pitch frequency at the position corresponding to the vowel steady-state portion of the unit
$t$: time, with the vowel onset as the origin
$t_1$: time of the vowel steady-state portion (origin at the vowel onset)
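Equation [3-1] can be written directly as a small function. The sketch below assumes the recorded contour is available as a callable `p0` and that times are in seconds measured from the vowel onset; the name `transform_pitch` and this calling convention are illustrative choices, not part of the patent.

```python
def transform_pitch(p0, p01, p02, p1, p2, t1):
    """Return p(t) as defined by equation [3-1].

    p0:  callable giving the recorded pitch frequency at time t
    p01: recorded pitch at the vowel onset
    p02: recorded pitch at the vowel steady-state portion
    p1:  model pitch at the position of the vowel onset
    p2:  model pitch at the position of the vowel steady-state portion
    t1:  time of the vowel steady-state portion (same origin as t)
    """
    start_ratio = p1 / p01
    end_ratio = p2 / p02

    def p(t):
        # Scale the recorded contour by a factor that moves linearly
        # from start_ratio at t = 0 to end_ratio at t = t1.
        return p0(t) * (start_ratio + (end_ratio - start_ratio) * t / t1)

    return p
```

Because the recorded contour is multiplied rather than replaced, its small movements between the two anchor points survive the transformation.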

[0056] For a VCV speech unit:

$$p(t) = p_0(t)\left\{\frac{p_1}{p_{01}} + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\frac{t}{t_1}\right\} \qquad [3\text{-}2]$$

where
$p(t)$: pitch frequency of the speech unit after transformation
$p_0(t)$: pitch frequency of the speech unit before transformation
$p_{01}$: pitch frequency at the first vowel steady-state portion of the speech unit before transformation
$p_{02}$: pitch frequency at the second vowel steady-state portion of the speech unit before transformation
$p_1$: model pitch frequency at the position corresponding to the first vowel steady-state portion of the unit
$p_2$: model pitch frequency at the position corresponding to the second vowel steady-state portion of the unit
$t$: time, with the first vowel steady-state point as the origin
$t_1$: time of the second vowel steady-state portion (origin at the vowel onset time)
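As a quick check of equations [3-1] and [3-2], the expression can be evaluated at its two anchor points. This assumes, as the definitions suggest, that $p_0(0) = p_{01}$ and $p_0(t_1) = p_{02}$:

$$p(0) = p_0(0)\,\frac{p_1}{p_{01}} = p_1, \qquad p(t_1) = p_0(t_1)\left\{\frac{p_1}{p_{01}} + \frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right\} = p_0(t_1)\,\frac{p_2}{p_{02}} = p_2$$

Between the anchors the ratio $p(t)/p_0(t)$ varies linearly from $p_1/p_{01}$ to $p_2/p_{02}$, so the recorded contour is rescaled rather than replaced.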

[0057] After the speech unit has been transformed as described above, it is connected by averaging the pitch waveforms of the second vowel steady-state portion of the preceding speech unit with those of the first vowel steady-state portion of the current speech unit (step 137). Once the connection is complete, it is determined whether there is a next speech unit (step 138); if there is, the next speech unit is selected (step 133). Synthesized speech is created by transforming and connecting speech units in this way.
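The connection of steps 137 and 138 can be sketched as follows. The representation is an assumption made for illustration: each unit is a list of pitch waveforms, each pitch waveform is a list of samples, and only the single boundary waveform on each side is averaged. Shorter waveforms are zero-padded, which the patent does not specify.

```python
def average_waveforms(a, b):
    """Sample-wise mean of two pitch waveforms, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def connect(units):
    """Join units in order, averaging the waveforms that meet at each boundary.

    units: list of units; each unit is a list of pitch waveforms.
    """
    out = []
    for unit in units:  # step 138: continue while another unit exists
        if out:
            # Step 137: the last waveform of the preceding unit and the first
            # waveform of the current unit are replaced by their average.
            out[-1] = average_waveforms(out[-1], unit[0])
            out.extend(unit[1:])
        else:
            out.extend(unit)
    return out
```

Averaging at the shared vowel is what lets a unit that ends in a vowel meet a unit that starts with the same vowel without an audible step.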

[0058] Needless to say, this speech synthesis method applies not only to words but equally to the creation of synthesized speech for phrases, clauses, or sentences. Likewise, the same processing is possible with speech units in formats other than CV or VCV, for example other formats or types of speech unit such as VCV or VV.

[0059] Next, the flow of the speech synthesis method according to the eighth embodiment of the present invention is described with reference to the flowchart of FIG. 9. In step 140, a phonetic notation is determined from the input character string, such as mixed kanji-kana text or its kana reading, and this phonetic notation is decomposed into CV or VCV units (C denotes a consonant, V a vowel). Next, in step 141, the prosody (in particular the pitch frequency) of the word to be synthesized is calculated, based on the Fujisaki model or the like, from information such as the mixed kanji-kana notation of the word, or its kana reading, and accent information. The result is hereinafter referred to as the model pitch frequency.

[0060] Next, in step 142, a speech unit counter is initialized. In step 143, the speech units are selected in order from the beginning of the word, each being treated as a unit for one of the first two syllables of the synthesized speech, a unit that includes the word ending, or any other unit. If the unit is for one of the first two syllables or includes the word ending, a CV or VCV speech unit with the same phonetic notation is selected from the speech units recorded for those positions (step 144). The speech units used for the remaining parts are units uttered with a flat accent, and here too a CV or VCV speech unit with the same phonetic notation is selected (step 146).

[0061] FIG. 5 illustrates the synthesis of the character string "Yokosuka City" (横須賀市). In FIG. 5, the character string notation is 横須賀市, which is converted into the phonetic notation "yo ko sxu ka1 shi"; the corresponding synthesis units are "yo(S) ko(S) osxu xuka(A) ashi i(E)".

[0062] The speech units selected in step 144, that is, a unit for one of the first two syllables or a unit including the word ending, are transformed in step 145 to match the prosody (pitch frequency) of the synthesized speech. Specifically, using equation [4-1] or [4-2] below, the pitch frequency of the speech unit is transformed so that it matches the model pitch frequency of the synthesized speech at the vowel onset and at the vowel steady-state portion. The pitch frequency is changed by superimposing pitch waveforms.

[0063] For a CV speech unit:

$$p(t) = p_0(t)\left\{\frac{p_1}{p_{01}} + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\frac{t}{t_1}\right\} \qquad [4\text{-}1]$$

where
$p(t)$: pitch frequency of the speech unit after transformation
$p_0(t)$: pitch frequency of the speech unit before transformation
$p_{01}$: pitch frequency at the vowel onset of the speech unit before transformation
$p_{02}$: pitch frequency at the vowel steady-state portion of the speech unit before transformation
$p_1$: model pitch frequency at the position corresponding to the vowel onset of the unit
$p_2$: model pitch frequency at the position corresponding to the vowel steady-state portion of the unit
$t$: time, with the vowel onset as the origin
$t_1$: time of the vowel steady-state portion (origin at the vowel onset)

[0064] For a VCV speech unit:

$$p(t) = p_0(t)\left\{\frac{p_1}{p_{01}} + \left(\frac{p_2}{p_{02}} - \frac{p_1}{p_{01}}\right)\frac{t}{t_1}\right\} \qquad [4\text{-}2]$$

where
$p(t)$: pitch frequency of the speech unit after transformation
$p_0(t)$: pitch frequency of the speech unit before transformation
$p_{01}$: pitch frequency at the first vowel steady-state portion of the speech unit before transformation
$p_{02}$: pitch frequency at the second vowel steady-state portion of the speech unit before transformation
$p_1$: model pitch frequency at the position corresponding to the first vowel steady-state portion of the unit
$p_2$: model pitch frequency at the position corresponding to the second vowel steady-state portion of the unit
$t$: time, with the first vowel steady-state point as the origin
$t_1$: time of the second vowel steady-state portion (origin at the vowel onset time)

[0065] The remaining speech units, that is, those other than a unit for one of the first two syllables or a unit including the word ending, are transformed in step 147 by superimposing them one pitch waveform at a time, in accordance with the pitch period at each point in time determined by the model pitch frequency of the synthesized speech.
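Step 147 places pitch waveforms at intervals dictated by the model pitch frequency. The sketch below is one simple way to do that and is not taken from the patent: it assumes each new pitch mark follows the previous one by the pitch period `1 / f(t)`, that each pitch waveform is a non-empty list of samples, and that overlapping samples are simply summed. The names `pitch_marks` and `overlap_add` are hypothetical.

```python
def pitch_marks(model_pitch_hz, duration_s):
    """Times at which successive pitch waveforms start.

    model_pitch_hz: callable giving the model pitch frequency (Hz) at time t
    """
    marks = []
    t = 0.0
    while t < duration_s:
        marks.append(t)
        t += 1.0 / model_pitch_hz(t)  # advance by one pitch period
    return marks


def overlap_add(pitch_waveforms, marks, sample_rate):
    """Place the i-th pitch waveform at marks[i] and sum where they overlap."""
    pairs = list(zip(marks, pitch_waveforms))
    length = max(int(round(m * sample_rate)) + len(w) for m, w in pairs)
    out = [0.0] * length
    for mark, waveform in pairs:
        start = int(round(mark * sample_rate))
        for i, sample in enumerate(waveform):
            out[start + i] += sample
    return out
```

In this scheme the pitch of the output is set entirely by the spacing of the marks, so nothing of the recorded contour needs to be stored for these units.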

[0066] After the speech unit has been transformed as described above, it is connected by averaging the pitch waveforms of the second vowel steady-state portion of the preceding speech unit with those of the first vowel steady-state portion of the current speech unit (step 148). Once the connection is complete, it is determined whether there is a next speech unit (step 149); if there is, the next speech unit is selected (step 143). Synthesized speech is created by transforming and connecting speech units in this way.
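The control flow of the eighth embodiment (steps 143 to 149) differs from the seventh only in that the transformation depends on the kind of unit. A minimal sketch, with the two transformations and the connection passed in as callables so that the loop itself stays independent of signal processing details; all names and the `(role, data)` tuple layout are illustrative assumptions.

```python
def synthesize_word(units, transform_keep_fine, transform_by_model, connect_pair):
    """Transform and connect the units of one word in order.

    units:               list of (role, data), role in {"head", "tail", "other"}
    transform_keep_fine: transformation that keeps the recorded fine structure
    transform_by_model:  transformation driven by the model pitch alone
    connect_pair:        joins the output so far with the next transformed unit
    """
    output = None
    for role, data in units:  # steps 143 and 149: iterate until no unit remains
        if role in ("head", "tail"):
            shaped = transform_keep_fine(data)   # step 145
        else:
            shaped = transform_by_model(data)    # step 147
        # Step 148: connect to what has been synthesized so far.
        output = shaped if output is None else connect_pair(output, shaped)
    return output
```

With string placeholders standing in for audio, the branch taken for each unit is easy to see, which is useful when checking the role labels before real waveforms are involved.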

[0067] Needless to say, this speech synthesis method applies not only to words but equally to the creation of synthesized speech for phrases, clauses, or sentences. Likewise, the same processing is possible with speech units in formats other than CV or VCV, for example other formats or types of speech unit such as VCV or VV.

【００６８】[0068]

[Effects of the Invention] The present invention is configured as described above. In particular, instead of using spectral conversion, which requires a large storage capacity as in the prior art, the (prosodic) transformation of the speech units that matter most for maintaining the quality of synthesized speech, namely those for the first two syllables of a word, those with a high pitch frequency up to the accent nucleus, and those for the word ending, makes use of the fine structure of the prosody at the time the speech units were recorded, which requires less storage than spectral conversion. The (prosodic) transformation of the remaining speech units, which have little effect on the quality of the synthesized speech, is performed by calculation without using the fine structure of the prosody at recording time. As a result, the number of speech units stored in the database can be reduced while the quality of the synthesized speech is kept high, so the memory capacity needed to store the speech unit database can be reduced.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a speech synthesis apparatus according to the first embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a speech synthesis apparatus according to the second embodiment of the present invention.

FIG. 3 is a block diagram showing the configuration of a speech synthesis apparatus according to the third embodiment of the present invention.

FIG. 4 is a block diagram showing the configuration of a speech synthesis apparatus according to the fourth embodiment of the present invention.

FIG. 5 is a diagram explaining how a kanji character string input is converted into phonetic notation in the speech synthesis apparatus shown in FIG. 1.

FIG. 6 is a flowchart showing a speech synthesis method according to the fifth embodiment of the present invention.

FIG. 7 is a flowchart showing a speech synthesis method according to the sixth embodiment of the present invention.

FIG. 8 is a flowchart showing a speech synthesis method according to the seventh embodiment of the present invention.

FIG. 9 is a flowchart showing a speech synthesis method according to the eighth embodiment of the present invention.

FIG. 10 is a block diagram showing the configuration of a conventional speech synthesis apparatus.

[Explanation of symbols]

101, 201: Input character string conversion means
102, 202: Prosody calculation means
103, 203: Speech unit selection means
301, 401: Input character string conversion means
302, 402: Prosody calculation means
303, 403: Speech unit selection means
104, 204: Memory (stores the speech unit database for word beginnings, units with a high pitch frequency up to the accent nucleus, word endings, and other units)
105, 205: Speech unit transformation means
106, 206: Speech unit connection means
107, 207: Synthesized speech output means
305, 405: Speech unit transformation means
306, 406: Speech unit connection means
307, 407: Synthesized speech output means
304, 404: Memory (stores the speech unit database for word beginnings, word endings, and other units)
501: Input unit
502: Prosodic parameter generation unit
503: Prosodic rule dictionary
504: Speech unit selection unit
505: Large-capacity speech unit database (DB)
506: Speech unit set selection unit
507: Speech unit set
508: Speech synthesizer
509: Output unit

Continuation of the front page

(56) References: JP-A-6-95692 (JP, A); JP-A-4-281495 (JP, A); JP-A-1-284898 (JP, A); JP-A-2-238497 (JP, A); JP-A-7-181995 (JP, A); JP-A-7-56597 (JP, A)

(58) Fields searched (Int. Cl.⁷, DB name): G10L 13/06; G10L 13/08

Claims

(57) [Claims]

1. A speech synthesis apparatus comprising: input character string conversion means for converting an input mixed kanji-kana character string, or a kana reading and prosodic information, into a phonetic notation; prosody calculation means for calculating the prosody of synthesized speech based on the converted phonetic notation; a memory storing at least speech units each including at least one of the first two syllables of a word, speech units with a high pitch frequency up to the accent nucleus, speech units each including a word ending, and speech units other than these; speech unit selection means for selecting the corresponding speech units from the memory according to the phonetic notation; speech unit transformation means for transforming the prosody of the selected speech units while retaining the fine structure of the prosody at the time the speech units were recorded; speech unit connection means for connecting the transformed speech units; and synthesized speech output means for outputting the connected synthesized speech.

2. A speech synthesis apparatus comprising: input character string conversion means for converting an input mixed kanji-kana character string, or a kana reading and prosodic information, into a phonetic notation; prosody calculation means for calculating the prosody of synthesized speech based on the converted phonetic notation; a memory storing at least speech units each including at least one of the first two syllables of a word, speech units with a high pitch frequency up to the accent nucleus, speech units each including a word ending, and speech units other than these; speech unit selection means for selecting the corresponding speech units from the memory according to the phonetic notation; speech unit transformation means for transforming the prosody of the selected speech units that include at least one of the first two syllables, have a high pitch frequency up to the accent nucleus, or include the word ending while retaining the fine structure of the prosody at the time of recording, and for transforming the prosody of the other speech units using the calculated prosody without using the fine structure of the prosody at the time the speech units were recorded; speech unit connection means for connecting the transformed speech units; and synthesized speech output means for outputting the connected synthesized speech.

3. A speech synthesis apparatus comprising: input character string conversion means for converting an input mixed kanji-kana character string, or a kana reading and prosodic information, into a phonetic notation; prosody calculation means for calculating the prosody of synthesized speech based on the converted phonetic notation; a memory storing at least speech units each including at least one of the first two syllables of a word, speech units each including a word ending, and speech units other than these; speech unit selection means for selecting the corresponding speech units from the memory according to the phonetic notation; speech unit transformation means for transforming the prosody of the selected speech units while retaining the fine structure of the prosody at the time the speech units were recorded; speech unit connection means for connecting the transformed speech units; and synthesized speech output means for outputting the connected synthesized speech.

4. A speech synthesis apparatus comprising: input character string conversion means for converting an input mixed kanji-kana character string, or a kana reading and prosodic information, into a phonetic notation; prosody calculation means for calculating the prosody of synthesized speech based on the converted phonetic notation; a memory storing at least speech units each including at least one of the first two syllables of a word, speech units each including a word ending, and speech units other than these; speech unit selection means for selecting the corresponding speech units from the memory according to the phonetic notation; speech unit transformation means for transforming the prosody of the selected speech units that include at least one of the first two syllables or include the word ending while retaining the fine structure of the prosody at the time of recording, and for transforming the prosody of the other speech units using the calculated prosody without using the fine structure of the prosody at the time the speech units were recorded; speech unit connection means for connecting the transformed speech units; and synthesized speech output means for outputting the connected synthesized speech.

5. A speech synthesis method comprising the steps of: converting an input mixed kanji-kana character string, or a kana reading and prosodic information, into a phonetic notation; selecting, based on the converted phonetic notation, the speech units corresponding to the phonetic notation from a memory, from among speech units each including at least one of the first two syllables of a word, speech units with a high pitch frequency up to the accent nucleus, speech units each including a word ending, and speech units other than these; transforming the prosody of the selected speech units while retaining the fine structure of the prosody at the time the speech units were recorded; and connecting the transformed speech units and outputting them as synthesized speech.

6. A speech synthesis method comprising the steps of: converting an input mixed kanji-kana character string, or a kana reading and prosodic information, into a phonetic notation; calculating the prosody of synthesized speech based on the converted phonetic notation; selecting, based on the converted phonetic notation, the speech units corresponding to the phonetic notation from a memory, from among speech units each including at least one of the first two syllables of a word, speech units with a high pitch frequency up to the accent nucleus, speech units each including a word ending, and speech units other than these; transforming the prosody of the selected speech units that include at least one of the first two syllables, have a high pitch frequency up to the accent nucleus, or include the word ending while retaining the fine structure of the prosody at the time the speech units were recorded, and transforming the prosody of the other speech units using the calculated prosody without using the fine structure of the prosody at the time of recording; and connecting the transformed speech units and outputting them as synthesized speech.

7. A speech synthesis method comprising the steps of: converting an input mixed kanji-kana character string, or a kana reading and prosodic information, into a phonetic notation; selecting, based on the converted phonetic notation, the speech units corresponding to the phonetic notation from a memory, from among speech units each including at least one of the first two syllables of a word, speech units each including a word ending, and speech units other than these; transforming the prosody of the selected speech units while retaining the fine structure of the prosody at the time the speech units were recorded; and connecting the transformed speech units and outputting them as synthesized speech.

8. A speech synthesis method comprising the steps of: converting an input mixed kanji-kana character string, or a kana reading and prosodic information, into a phonetic notation; calculating the prosody of synthesized speech based on the converted phonetic notation; selecting, based on the converted phonetic notation, the speech units corresponding to the phonetic notation from a memory, from among speech units each including at least one of the first two syllables of a word, speech units each including a word ending, and speech units other than these; transforming the prosody of the selected speech units that include at least one of the first two syllables or include the word ending while retaining the fine structure of the prosody at the time the speech units were recorded, and transforming the prosody of the other speech units using the calculated prosody without using the fine structure of the prosody at the time of recording; and connecting the transformed speech units and outputting them as synthesized speech.