JPH1195797A

JPH1195797A - Device and method for voice synthesis

Info

Publication number: JPH1195797A
Application number: JP9258097A
Authority: JP
Inventors: Yoshinori Shiga; 芳則志賀
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1997-09-24
Filing date: 1997-09-24
Publication date: 1999-04-09

Abstract

PROBLEM TO BE SOLVED: To obtain a voice synthesizer that can improve the quality of synthesized voices by an extremely easy means, without recording the uttered voices of an announcer or resegmenting voice elements. SOLUTION: The synthesizer is provided with a voice element memory 19, which accumulates voice elements obtained by analyzing discrete voice signals sampled at a first sampling period; a sampling frequency conversion processing section 33, which selectively reads voice elements from the memory 19 based on given phoneme information and converts the sampling period of the read voice elements into a second period that differs from the first period; and a waveform superimposing process section 20, which synthesizes voices by successively connecting the voice elements whose sampling period has been converted into the second period by the section 33.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to speech synthesis software or a speech synthesis device that reads text (sentences in mixed kanji and kana) aloud, and more particularly to voice quality conversion for diversifying synthesized speech.

[0002]

[Prior art] Speech synthesizers that store finely subdivided speech waveforms and synthesize speech by combining them are known. The prior art of such a speech synthesizer is described below with reference to the drawings. FIG. 6 shows the configuration of a conventional speech synthesizer.

[0003] The speech synthesizer of FIG. 6 performs text-to-speech conversion (hereinafter, TTS) processing: it converts text into a symbol string consisting of phonemes and prosody and generates speech from that symbol string. The TTS processing in the speech synthesizer of FIG. 6 is divided broadly into two processing units, a language processing unit 101 and a speech synthesis unit 102. Taking Japanese rule-based synthesis as an example, it is generally performed as follows.

[0004] First, the language processing unit 101 applies linguistic processing such as morphological analysis and syntactic analysis to the text (a sentence in mixed kanji and kana) input from a text file 103. It decomposes the text into morphemes, estimates dependency relations, and at the same time assigns a reading and an accent type to each morpheme. The language processing unit 101 then uses accent movement rules, such as those for compound words, to determine the accent type of each phrase that serves as a unit of delimitation when reading aloud (hereinafter, an accent phrase). The language processing unit of a TTS system can normally output the reading and accent type obtained in this way for each accent phrase as a symbol string (hereinafter, a speech symbol string).

[0005] Next, in the speech synthesis unit 102, a phoneme duration calculation processing unit 104 determines the duration of each phoneme contained in the obtained reading according to predetermined rules, based on factors such as the phonemic environment of that phoneme. Further, a pitch pattern generation processing unit 105 sets point pitches at the instants where the pitch rises or falls, based on the accent type contained in the speech symbol string, and linearly interpolates between the point pitches to generate the accent component of the pitch. It then superimposes an intonation component (usually a monotonically decreasing straight line on the frequency axis) on the accent component to generate a pitch pattern.

[0006] The speech unit memory 107 in the figure holds a large number of speech units created in advance. The speech units are created by cutting out, from speech waveforms uttered by an announcer or the like, every syllable contained in Japanese speech in a predetermined synthesis unit, for example the Japanese syllable (consonant + vowel; hereinafter, CV). Further, so that the pitch of the speech can be changed during synthesis, each speech unit cut out in this way is additionally divided into pitch units within its voiced section. As shown in FIG. 7, the division into pitch units is generally performed by applying a window function (here, a Hanning window) whose width is roughly one to two times the pitch period.

[0007] The speech synthesis processing unit 107 sequentially reads the required speech units from the speech unit memory 107, based on the "reading" contained in the speech symbol string. Then, in accordance with the determined phoneme durations, it synthesizes the desired speech: in unvoiced sections it stretches or compresses the stored waveform along the time axis, and in voiced sections it superimposes the pitch-unit waveforms of the speech unit at a period based on the previously generated pitch pattern, thinning out or repeating them as needed.

[0008] The processing up to this point is generally performed by a program, so the synthesized speech is a discrete signal. The speech synthesis unit 102 therefore finally supplies this discrete signal to a D/A converter 108, which converts it into an electrical analog signal (analog speech signal). Driving a speaker 110 or the like with this analog signal via an amplifier 109 produces speech that can be perceived by ear.

[0009]

[Problem to be solved by the invention] Although the prior art described above exists for speech synthesizers, the speech synthesized by such a device has the following problem. In the conventional speech synthesizer, the type of voice of the synthesized speech (hereinafter, voice quality) is restricted: speech can be synthesized only with the voice quality of the announcer recorded when the speech unit file was created, or with a somewhat degraded version of it resulting from rule-based synthesis.

[0010] Consequently, to increase the number of voice qualities of the synthesized speech, for example when synthesizing conversational text, the developer of the speech synthesizer must hire a different announcer, record the utterances, and redo the creation of speech units from the beginning. This incurs fees for the announcer, and the developer must expend a great deal of labor on recording the announcer's utterances, cutting out the speech units, and so on. This in turn increases the cost of developing the device.

[0011] The present invention has been made in view of these circumstances. Its object is to provide a speech synthesis device and method that can increase the number of voice qualities of synthesized speech by extremely simple means, without recording announcer utterances or re-cutting speech units.

[0012]

[Means for solving the problem] To solve the above problem, the present invention is characterized by comprising: storage means for storing speech feature parameters obtained by analyzing a discrete speech signal sampled at a first sampling period; sampling period conversion means for selectively reading the feature parameters from the storage means based on given phoneme information and converting the sampling period of the read feature parameters to a second period different from the first period; and speech synthesis means for synthesizing speech by sequentially connecting the feature parameters whose sampling period has been converted to the second period by the sampling period conversion means.

[0013] With this configuration, making the first period and the second period different shifts the spectrum of the speech along the logarithmic frequency axis, so speech of a different voice quality can be synthesized from the same speech parameters.
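The size of this spectral shift can be estimated with a short calculation. The sketch below is illustrative only and rests on an assumption that is not stated in this passage: the converted data is played back at the original rate f1, so every spectral component is scaled by f1/f2. The function name `spectral_shift` is hypothetical.

```python
import math

def spectral_shift(f1_hz, f2_hz):
    """Estimate the spectral scaling when data resampled from f1 to f2
    is played back at the original rate f1 (assumed, see lead-in).

    Returns (scale, semitones): each frequency component is multiplied
    by `scale`, which is a constant offset on a logarithmic axis.
    """
    scale = f1_hz / f2_hz
    semitones = 12.0 * math.log2(scale)
    return scale, semitones

# Example: the two conversions mentioned in the embodiment.
down_scale, down_semitones = spectral_shift(11025, 12000)  # spectrum moves lower
up_scale, up_semitones = spectral_shift(11025, 10000)      # spectrum moves higher
```

Because the scaling is multiplicative, it appears as a uniform translation on a logarithmic frequency axis, which matches the description of the spectrum being shifted along that axis.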

[0014] Further, the present invention is characterized in that the sampling period conversion means converts the sampling period to the second period, different from the first period, only in voiced sections of the speech. With this configuration, making the first and second periods different shifts the speech spectrum along the logarithmic frequency axis, so speech of a different voice quality can be synthesized from the same speech parameters. In addition, because the conversion acts only on voiced sections, the spectrum of unvoiced sections does not change, and the clarity of the unvoiced sections of the synthesized speech is not impaired.

[0015] Further, the present invention is characterized in that the sampling period conversion means has switching means that causes the sampling period of the speech read from the storage means to be converted to a second period different from the first period.

[0016] With this configuration, making the first and second periods different shifts the speech spectrum along the logarithmic frequency axis, so speech of a different voice quality can be synthesized from the same speech parameters. Furthermore, the conversion can be disabled in speech sections where a spectral change would adversely affect the synthesized speech.

[0017]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of a rule-based speech synthesis device according to a first embodiment of the present invention.

[0018] The rule-based speech synthesis device of FIG. 1 is realized by executing dedicated software (text-to-speech conversion software) on an information processing device such as a personal computer. It has a text-to-speech (TTS) processing function, that is, a function of generating speech from text, and its functional configuration is divided broadly into a language processing unit 1 and a speech synthesis unit 3.

[0019] The language processing unit 1 analyzes an input sentence, for example a sentence in mixed kanji and kana, to generate reading information and accent information, and based on this information generates a speech symbol string describing the phoneme symbol sequence and accent information. The speech synthesis unit 3 generates speech from the speech symbol string output by the language processing unit 1.

[0020] In the rule-based speech synthesis device of FIG. 1, the document to be converted to speech (read aloud), here a Japanese document, is stored as a text file 5. Following the text-to-speech conversion software, the device reads sentences in mixed kanji and kana from the file one at a time, and the language processing unit 1 and the speech synthesis unit 3 perform the text-to-speech conversion processing described below to synthesize speech.

[0021] First, a kanji-kana sentence read from the text file 5 is input to a language analysis processing unit 7 in the language processing unit 1. The language analysis processing unit 7 performs morphological analysis on the input sentence and generates reading information and accent information. Morphological analysis is the task of analyzing which character strings in a given sentence constitute words and what the structure of those words is.

[0022] For this purpose, the language analysis processing unit 7 uses a morpheme dictionary 9, whose headwords are morphemes (the smallest constituents of a sentence), and a connection rule file 11, in which the connection rules between morphemes are registered. The language analysis processing unit 7 finds all morpheme sequence candidates obtained by matching the input sentence against the morpheme dictionary 9 (exhaustive search) and, by referring to the connection rule file 11, outputs from among them the combinations whose adjacent morphemes can be connected grammatically.
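The exhaustive matching followed by connection-rule filtering can be sketched as below. This is a toy illustration, not the patent's implementation: the dictionary layout (morpheme mapped to a part-of-speech tag), the `allowed_pairs` rule format, and both function names are assumptions made for the example.

```python
def segmentations(text, dictionary):
    """Enumerate every way to split `text` into dictionary headwords."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            for rest in segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results

def analyze(text, dictionary, allowed_pairs):
    """Keep only candidate sequences whose adjacent tags may connect.

    dictionary:    hypothetical mapping of morpheme -> part-of-speech tag
    allowed_pairs: hypothetical set of (left_tag, right_tag) connection rules
    """
    valid = []
    for candidate in segmentations(text, dictionary):
        tags = [dictionary[m] for m in candidate]
        if all((a, b) in allowed_pairs for a, b in zip(tags, tags[1:])):
            valid.append(candidate)
    return valid

# Toy data standing in for the morpheme dictionary and connection rule file.
toy_dictionary = {"a": "noun", "ab": "noun", "b": "particle", "c": "verb"}
toy_rules = {("noun", "particle"), ("particle", "verb")}
result = analyze("abc", toy_dictionary, toy_rules)
```

Here `"abc"` has two dictionary segmentations, and only the one whose adjacent tags all appear in the rule set survives the filter.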

[0023] The morpheme dictionary 9 registers the reading and accent type of each morpheme together with the grammatical information used during analysis. Therefore, once the morphemes are determined by morphological analysis, their readings and accent types can be assigned at the same time.

[0024] For example, morphological analysis of the sentence 「公園へ行って本を読みます。」 ("I go to the park and read a book.") yields

／公園／へ／行って／本／を／読み／ます／。

[0025] that is, the sentence is divided into morphemes. Each morpheme is given a reading and an accent type, resulting in

／コウエン／エ／イッテ／ホ＾ン／ヲ／ヨミ／マ＾ス／

Here, a morpheme containing "＾" has an accent in which the pitch is high on the syllable immediately before the mark and falls on the syllable immediately after it. A morpheme without "＾" has a flat accent type.

[0026] When people read a sentence aloud, however, they do not accent each morpheme separately in this way. They group several morphemes together and accent each group as a whole.

[0027] Taking this into account, the language analysis processing unit 7 further groups the morphemes into accent phrases (the units to which an accent is assigned) and at the same time estimates the accent movement caused by the grouping. In addition, the language analysis processing unit 7 adds information such as vowel devoicing and pauses (breaths) during reading. For the above example, it finally generates the following reading information.

[0028] ／コーエンエ／イッテ．／ホ＾ンオ／ヨミマ＾（ス）／

Here, the period "." represents a breath pause, and "( )" encloses a syllable whose vowel is devoiced.

[0029] When the language analysis processing unit 7 in the language processing unit 1 has generated the reading information as described above, a phoneme duration calculation processing unit 13 in the speech synthesis unit 3 is activated. In accordance with the reading information generated by the language analysis processing unit 7, the phoneme duration calculation processing unit 13 determines the duration (in ms) of the consonant part and the vowel part of each syllable contained in the input sentence.

[0030] The duration determination in the phoneme duration calculation processing unit 13 is realized by a very simple algorithm: the boundaries between consonant (C) and vowel (V), called CV transitions, are placed at equal intervals. The interval between CV transitions is supplied by a speech rate control unit 15. Although not shown, the software used in this embodiment lets the user specify the speed of the synthesized speech. The speed specified by the user is passed to the speech rate control unit 15, which adjusts the CV transition interval used in the duration determination of the phoneme duration calculation processing unit 13 and thereby actually changes the speed of the synthesized speech. Analysis has shown, however, that in Japanese speech the duration of consonants remains almost constant even when the speaking rate changes. The consonant durations are therefore kept constant, and the CV transition interval is changed by adjusting the vowel durations.
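The equal-interval rule for consonant-vowel (CV) boundaries can be illustrated with a minimal sketch. It assumes that the interval between two successive CV boundaries equals the current vowel plus the next consonant, and that the final vowel simply receives the full interval; both points, like the function names, are assumptions for illustration rather than details given in the text.

```python
def vowel_durations(consonant_ms, cv_interval_ms):
    """Derive vowel durations so CV boundaries fall at equal intervals.

    consonant_ms:   fixed consonant duration of each syllable (0 if none)
    cv_interval_ms: boundary spacing supplied by the speech rate control
    """
    vowels = []
    for i in range(len(consonant_ms)):
        # The next syllable's consonant is taken out of the interval;
        # the last syllable has no successor (assumed to get the full interval).
        next_consonant = consonant_ms[i + 1] if i + 1 < len(consonant_ms) else 0
        vowels.append(cv_interval_ms - next_consonant)
    return vowels

def cv_boundaries(consonant_ms, vowel_ms):
    """Return the time (ms) of each consonant-to-vowel boundary."""
    positions, t = [], 0
    for c, v in zip(consonant_ms, vowel_ms):
        positions.append(t + c)
        t += c + v
    return positions

consonants = [60, 40, 0, 50]           # hypothetical consonant durations
vowels = vowel_durations(consonants, 150)
boundaries = cv_boundaries(consonants, vowels)
```

Changing only `cv_interval_ms` speeds up or slows down the utterance while every consonant duration stays fixed, which mirrors the observation about Japanese speech in the paragraph above.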

[0031] When the phoneme duration calculation processing unit 13 has determined the durations of the consonant and vowel parts of each syllable in the input sentence, a pitch generation processing unit 17 in the speech synthesis unit 3 is activated. Based on the durations determined by the phoneme duration calculation processing unit 13 and the accent information determined by the language analysis processing unit 7 in the language processing unit 1, the pitch generation processing unit 17 first sets the point pitch positions. It then interpolates linearly between the point pitches to obtain a pitch pattern sampled, for example, every 10 ms.
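The interpolation step can be sketched as follows. The representation of point pitches as `(time_ms, pitch_hz)` pairs and the function name `pitch_pattern` are assumptions for illustration; the text specifies only that straight lines connect the point pitches and that the pattern is sampled at a fixed step such as 10 ms.

```python
def pitch_pattern(point_pitches, step_ms=10):
    """Linearly interpolate between point pitches at a fixed time step.

    point_pitches: list of (time_ms, pitch_hz) pairs, sorted by time,
                   with at least two entries and distinct times.
    """
    pattern = []
    segment = 0
    t = point_pitches[0][0]
    end = point_pitches[-1][0]
    while t <= end:
        # Advance to the segment that contains time t.
        while segment < len(point_pitches) - 2 and t > point_pitches[segment + 1][0]:
            segment += 1
        (t0, p0), (t1, p1) = point_pitches[segment], point_pitches[segment + 1]
        pattern.append(p0 + (p1 - p0) * (t - t0) / (t1 - t0))
        t += step_ms
    return pattern

# Hypothetical point pitches: rise to an accent peak, then fall.
points = [(0, 100), (100, 200), (200, 150)]
pattern = pitch_pattern(points)
```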

[0032] In this embodiment, the speech units are created in advance by the following procedure. First, from real speech sampled at a sampling frequency of 11025 Hz, the range from which each predetermined synthesis unit contained in that speech is to be cut out is determined. Within the cut-out range, the waveform of an unvoiced section is stored as is. In a voiced section, a Hanning window function whose width is twice the pitch period is applied to the pitch-periodic waveform, and all pitch waveforms contained in the voiced section are extracted and stored.
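A minimal sketch of the pitch waveform extraction is given below. It assumes a constant pitch period in samples and pitch marks that are already known; real speech has a varying period and needs pitch-mark detection, neither of which is covered here. The function name and arguments are hypothetical.

```python
import math

def extract_pitch_waveforms(signal, pitch_marks, period):
    """Cut one Hanning-windowed waveform of width 2*period around each mark.

    signal:      list of samples of a voiced section
    pitch_marks: sample indices of the pitch marks (assumed known)
    period:      pitch period in samples (assumed constant)
    """
    # Periodic Hanning window of width twice the pitch period.
    window = [0.5 - 0.5 * math.cos(math.pi * n / period) for n in range(2 * period)]
    waveforms = []
    for mark in pitch_marks:
        start = mark - period
        if start < 0 or start + 2 * period > len(signal):
            continue  # skip marks too close to the edges of the section
        waveforms.append([signal[start + n] * window[n] for n in range(2 * period)])
    return waveforms

# Hypothetical voiced section with a 10-sample pitch period.
section = [math.sin(2 * math.pi * n / 10) + 0.3 * math.sin(2 * math.pi * n / 5)
           for n in range(100)]
marks = list(range(10, 100, 10))
units = extract_pitch_waveforms(section, marks, 10)
```

With a window exactly twice the period, two neighbouring windows sum to one at every sample, so overlapping the extracted waveforms at the original spacing reproduces the input. That property is what makes this window width convenient for later re-spacing of the waveforms.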

[0033] As this shows, one synthesis unit (speech unit) consists of an unvoiced-section waveform and a plurality of pitch waveforms from the voiced section (some synthesis units contain no unvoiced section). Here, consonant + vowel (CV) is used as the predetermined synthesis unit, and a total of 137 speech units, covering every syllable needed to synthesize Japanese speech, are prepared and stored in advance.

[0034] Storing the speech waveforms as they are (in pitch waveform units) would require a large amount of memory, however, so they are generally stored in compressed form. In this embodiment, for every pitch waveform contained in the 137 speech units, the linear prediction coefficients are obtained by linear prediction analysis of the pitch waveform, and the linear prediction residual signal is obtained by passing the speech waveform through the linear prediction inverse filter formed from those coefficients. All of the resulting pairs of linear prediction coefficients and residual waveforms are then compressed by vector quantization, which is widely used in the field of speech compression coding. The compressed speech units are stored in a speech unit file (not shown) prepared in advance.

[0035] At the start of the text-to-speech conversion processing performed by the text-to-speech conversion software, the contents of this speech unit file are decoded: the pairs of linear prediction coefficients and residual signals are regenerated from the compressed speech units, and all of the original pitch waveforms are resynthesized by the linear prediction filter. The result is then loaded into a speech unit area (hereinafter, speech unit memory) 19 reserved, for example, in main memory (not shown).

[0036] The speech synthesis processing unit 21 consists of a waveform superposition processing unit 20, which superimposes the pitch waveforms of the speech units read from the speech unit memory 19, and a sampling frequency conversion processing unit 33, which converts the sampling frequency of the speech units read from the speech unit memory 19.

[0037] The sampling frequency conversion processing unit 33 reads pitch waveforms from the speech unit memory 19 and has a switching circuit 35 for selecting whether or not the sampling frequency of each pitch waveform is converted. It controls the switching circuit 35 in accordance with commands from a voice quality control unit 31 (described later).

[0038] The case in which the sampling frequency of the speech unit is not converted is described first. When a sampling frequency conversion control unit 37 receives from the voice quality control unit 31 the information that the sampling frequency is not to be converted, it sets the switching circuit 35 to switch position 1 (S1) and reads the (original) speech unit from the speech unit memory 19.

[0039] The waveform superposition processing unit 20 then processes the read speech units in accordance with the phoneme information in the speech symbol string passed from the language processing unit 1. Depending on the durations determined by the phoneme duration calculation processing unit 13, it linearly stretches or compresses the waveform in unvoiced sections and repeats or thins out pitch waveforms in voiced sections, thereby synthesizing a speech signal with the desired phoneme durations. In voiced sections, it also superimposes the pitch waveforms contained in the read speech unit at the period calculated from the pitch pattern generated by the pitch generation processing unit 17, thereby synthesizing speech with the desired pitch.
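The superposition of pitch waveforms in a voiced section can be sketched as a simple overlap-add. The sketch assumes a constant target period and a plain proportional mapping from output periods to stored waveforms; the function name and that mapping rule are illustrative assumptions, not details given in the text.

```python
def overlap_add(pitch_waveforms, target_period, n_periods):
    """Superimpose stored pitch waveforms at a spacing of target_period samples.

    pitch_waveforms: list of windowed pitch waveforms (lists of samples)
    target_period:   desired pitch period in samples (sets the output pitch)
    n_periods:       number of periods to emit (sets the output duration)
    """
    longest = max(len(w) for w in pitch_waveforms)
    out = [0.0] * (target_period * (n_periods - 1) + longest)
    for k in range(n_periods):
        # Proportional mapping: waveforms repeat when n_periods exceeds the
        # stored count and are thinned out when it is smaller.
        source = pitch_waveforms[k * len(pitch_waveforms) // n_periods]
        offset = k * target_period
        for n, value in enumerate(source):
            out[offset + n] += value
    return out

# Two tiny hypothetical waveforms, each emitted twice at a spacing of 2 samples.
stretched = overlap_add([[1, 2, 1], [10, 20, 10]], target_period=2, n_periods=4)
```

The spacing `target_period` controls the pitch and `n_periods` controls the duration independently, which is why the same stored waveforms can serve different pitch patterns and speaking rates.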

[0040] The speech synthesized by the speech synthesis processing unit 21 in this way is a discrete speech signal. It becomes audible only after it is converted to an analog signal by a D/A converter 23 and output to a speaker 27 through an amplifier 25.

[0041] The key points of this embodiment are that a voice quality switching unit 29 and a voice quality control unit 31 are added to the speech synthesis unit 3, and that the sampling frequency conversion processing unit 33 converts the sampling frequency of the pitch waveforms before they are superimposed in the speech synthesis processing unit 21.

[0042] The voice quality switching unit 29 allows the voice quality used for synthesis to be specified by the user or by an application program or the like that uses the text-to-speech conversion software. In this embodiment, three voice qualities can be specified through the voice quality switching unit 29.

[0043] In accordance with the voice quality specified through the voice quality switching unit 29, the voice quality control unit 31 controls the sampling frequency conversion processing unit 33 so that the sampling frequency of each pitch waveform in the voiced section of a speech unit read from the speech unit memory 19 is converted to one of 11025 Hz (no conversion), 12000 Hz, or 10000 Hz.

[0044] The sampling frequency conversion performed by the sampling frequency conversion processing unit 33 is now described in detail. Various methods can be applied to this conversion; this embodiment uses a simple method with the configuration shown in FIG. 2.

[0045] As shown in FIG. 2(a), the sampling frequency conversion processing unit 33 consists of a sampling frequency expander 51, a low-pass filter (LPF) 52, and a sampling frequency compressor 53.

[0046] To convert the sampling frequency from f1 to f2 = (L/M)f1 in the sampling frequency conversion processing unit 33 of FIG. 2(a), the sampling frequency expander 51 first inserts (L-1) zero samples s0 between adjacent samples s1, as shown in FIG. 2(b). Next, to prevent aliasing, the speech data output from the sampling frequency expander 51, that is, the speech data with (L-1) zero samples s0 inserted between the samples s1, is passed through the low-pass filter (a low-pass digital filter) 52, whose cutoff frequency is the smaller of f1 and f2, min(f1, f2). To compensate for the gain reduction (to 1/L) caused by the zero-sample insertion in the sampling frequency expander 51, the low-pass filter 52 has a gain of L in its passband. Finally, the sampling frequency compressor 53 decimates the speech data that has passed through the low-pass filter 52, keeping only one sample out of every M as shown in FIG. 2(b), which yields speech data with sampling frequency f2 = (L/M)f1. To aid understanding, FIG. 2(b) shows a simple example of conversion from sampling frequency f1 to f2 = (4/3)f1.
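The expander, low-pass filter, and compressor chain can be sketched in plain Python as below. Several details are assumptions made for the example and are not taken from the text: the filter is a Hanning-windowed sinc FIR, its length is set by the hypothetical parameter `taps_per_phase`, and its cutoff is placed at half of min(f1, f2), the usual Nyquist choice, whereas the text states min(f1, f2) as the cutoff frequency.

```python
import math

def resample_rational(samples, L, M, taps_per_phase=16):
    """Convert the sampling frequency by L/M, i.e. f2 = (L/M) * f1."""
    # 1) Expander: insert L-1 zero samples after every input sample.
    expanded = []
    for s in samples:
        expanded.append(s)
        expanded.extend([0.0] * (L - 1))

    # 2) Low-pass filter with pass-band gain L (windowed-sinc FIR).
    #    Cutoff in cycles per sample at the expanded rate; this equals
    #    half of min(f1, f2) (assumed, see lead-in).
    cutoff = 0.5 / max(L, M)
    half = taps_per_phase * max(L, M)
    taps = []
    for n in range(-half, half + 1):
        x = 2.0 * cutoff * n
        sinc = 1.0 if n == 0 else math.sin(math.pi * x) / (math.pi * x)
        hann = 0.5 + 0.5 * math.cos(math.pi * n / half)
        taps.append(L * 2.0 * cutoff * sinc * hann)

    # 3) Compressor: keep one filtered sample out of every M. The filter
    #    is evaluated only at the positions that are kept.
    out = []
    for k in range(0, len(expanded), M):
        acc = 0.0
        for j, h in enumerate(taps):
            i = k - (j - half)
            if 0 <= i < len(expanded):
                acc += h * expanded[i]
        out.append(acc)
    return out

# The f2 = (4/3) f1 example: 300 input samples become 400 output samples.
tone = [math.sin(2 * math.pi * 0.05 * n) for n in range(300)]
converted = resample_rational(tone, 4, 3)
```

This direct form is meant for readability. For ratios such as L = 160, M = 147 the filter becomes long, and a polyphase implementation that skips the inserted zeros would be the more practical choice.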

[0047] As described above, when the sampling frequency is converted from 11025 Hz to 12000 Hz,

f1 = 11025 [Hz]
f2 = 12000 [Hz]
f2 = (12000/11025) f1 = (160/147) f1

so the sampling frequency conversion processing unit 33 performs the above processing with

L = 160
M = 147
(cutoff frequency of LPF) = min(f1, f2) = f1 = 11025 [Hz]

[0048] Similarly, to convert the sampling frequency from 11025 Hz to 10000 Hz:
f1 = 11025 [Hz]
f2 = 10000 [Hz]
f2 = (10000/11025) f1 = (400/441) f1
The sampling frequency conversion processing unit 33 therefore performs the processing described above with:
L = 400
M = 441
(LPF cutoff frequency) = min(f1, f2) = f2 = 10000 [Hz]
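The factors L and M in both examples are simply the target-to-source frequency ratio reduced to lowest terms. The short check below shows this; the helper name `conversion_factors` is made up for illustration.

```python
from fractions import Fraction

def conversion_factors(f1_hz, f2_hz):
    """Return (L, M) such that f2 = (L / M) * f1, reduced to lowest terms."""
    ratio = Fraction(f2_hz, f1_hz)  # Fraction reduces by the GCD automatically
    return ratio.numerator, ratio.denominator

print(conversion_factors(11025, 12000))  # (160, 147)
print(conversion_factors(11025, 10000))  # (400, 441)
```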

[0049] When the sampling frequency conversion processing unit 33 converts the sampling frequency of the pitch waveforms to the same sampling frequency that was used when the speech units were created, namely 11025 Hz (that is, when the sampling frequency is not converted), speech can be synthesized with the voice quality of the original speech, as described above.

[0050] Next, conversion to another sampling frequency is described. Here, conversion means converting from 11025 Hz to 12000 Hz, or from 11025 Hz to 10000 Hz.

[0051] First, suppose the user specifies one of the three voice qualities. The voice quality switching unit 29 notifies the voice quality control unit 31 of the specified voice quality. The voice quality control unit 31 controls the sampling frequency conversion processing unit 33 so that it converts the sampling frequency of the pitch waveforms contained in the speech units read from the speech unit memory 19 according to that voice quality. Under this control, the sampling frequency conversion control unit 37 sets the switch of the switching circuit 35 to switch 2 (S2), so that the sampling frequency of the pitch waveforms contained in the speech units is converted from the original 11025 Hz to 10000 Hz by the method described above. The conversion changes the number of samples in each pitch waveform; in this case the number of samples decreases, as shown in FIG. 3. (In other words, sampling frequency conversion amounts to decimating and interpolating samples.) Conversion to a sampling frequency of 12000 Hz works in the same way, so its description is omitted. If data whose sampling frequency has been changed in this way is nevertheless D/A-converted at 11025 Hz, the spectrum is shifted along the frequency axis, as shown in FIG. 4, and the individuality of the voice therefore changes.
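The spectral shift can be checked numerically. In the sketch below, a pure tone stands in for one spectral component of a pitch waveform: the tone is sampled as if it had already been converted to the new sampling frequency, and the same samples are then interpreted at the 11025 Hz playback rate. The function name and the FFT-peak measurement are assumptions made for illustration only.

```python
import numpy as np

def apparent_frequency(tone_hz, f2_hz, playback_hz, num_samples=10000):
    """Frequency heard when samples taken at f2_hz are played back at playback_hz."""
    k = np.arange(num_samples)
    # Samples of the tone as they would appear after ideal conversion to f2_hz.
    samples = np.sin(2.0 * np.pi * tone_hz * k / f2_hz)
    # Interpret the same samples at the playback rate and locate the FFT peak.
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / playback_hz)
    return freqs[np.argmax(spectrum)]

# Converted to 10000 Hz, played at 11025 Hz: components move up by about 11025/10000.
print(apparent_frequency(1000.0, 10000, 11025))
# Converted to 12000 Hz, played at 11025 Hz: components move down by about 11025/12000.
print(apparent_frequency(1000.0, 12000, 11025))
```

Every component is scaled by the same factor, which corresponds to a constant shift on a logarithmic frequency axis.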

[0052] As shown in FIG. 5, the waveform superimposition processing unit 20 superimposes the plurality of pitch waveforms contained in the read speech unit after their sampling frequency has been converted by the above method, which makes it possible to synthesize speech with a desired pitch. When the resulting discrete speech signal is D/A-converted at 11025 Hz, the voice quality of the analog speech signal differs from that of the speech from which the speech units were created.
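The superposition step can be pictured as a plain overlap-add: each converted pitch waveform is added into the output at an offset set by the desired pitch period. The sketch below assumes a constant pitch period for simplicity, whereas the device takes its pitch from the pitch generation processing unit; the function name is made up.

```python
import numpy as np

def overlap_add(pitch_waveforms, pitch_period):
    """Superimpose pitch waveforms whose start points are pitch_period samples apart."""
    if len(pitch_waveforms) == 0:
        return np.zeros(0)
    starts = [i * pitch_period for i in range(len(pitch_waveforms))]
    total = max(s + len(w) for s, w in zip(starts, pitch_waveforms))
    out = np.zeros(total)
    for s, w in zip(starts, pitch_waveforms):
        out[s:s + len(w)] += w  # overlapping regions add up
    return out

# Three 4-sample waveforms placed 2 samples apart.
print(overlap_add([np.ones(4)] * 3, pitch_period=2))  # [1. 1. 2. 2. 2. 2. 1. 1.]
```

A shorter pitch period packs the waveforms closer together and raises the pitch; the shape of each waveform, and hence the spectral envelope, is left as the conversion step produced it.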

[0053] As described above, the sampling frequency conversion is applied only to voiced sections of the speech and not to unvoiced sections. The spectrum of unvoiced sections is therefore not shifted along the frequency axis, and the loss of clarity in unvoiced fricatives and similar sounds that such a shift would cause does not occur.
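The voiced-only rule can be expressed as a simple gate in front of the converter. In the sketch below, the voice-quality indices, the `TARGET_HZ` table, and the function names are hypothetical. `convert_rate` uses linear interpolation purely as a stand-in and is not the zero-insertion and low-pass method of the embodiment.

```python
import numpy as np

SOURCE_HZ = 11025
# Hypothetical mapping from a voice-quality index to the target sampling frequency.
TARGET_HZ = {0: 11025, 1: 12000, 2: 10000}

def convert_rate(waveform, f1_hz, f2_hz):
    """Stand-in converter based on linear interpolation (assumption, for brevity)."""
    waveform = np.asarray(waveform, dtype=float)
    if f1_hz == f2_hz:
        return waveform
    n_out = int(round(len(waveform) * f2_hz / f1_hz))
    t_out = np.arange(n_out) * (f1_hz / f2_hz)  # output positions in input samples
    return np.interp(t_out, np.arange(len(waveform)), waveform)

def prepare_unit(waveform, voiced, voice_quality):
    """Convert voiced material only; unvoiced material passes through unchanged."""
    if not voiced:
        return np.asarray(waveform, dtype=float)
    return convert_rate(waveform, SOURCE_HZ, TARGET_HZ[voice_quality])

unit = np.ones(441)
print(len(prepare_unit(unit, voiced=True, voice_quality=2)))   # 400
print(len(prepare_unit(unit, voiced=False, voice_quality=2)))  # 441
```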

[0054] Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments. For example, the synthesis unit is the syllable in the embodiment described above, but other units may be used. In the embodiment, compressed (encoded) speech units are decoded and then loaded into the speech unit memory. Alternatively, they may be loaded into the speech unit memory without decoding and decoded during synthesis while the waveforms are superimposed. This increases the amount of computation at synthesis time but reduces the memory required at run time. The number of voice qualities is three in the embodiment, but it may be two, or four or more. In the language processing unit, syntactic analysis or other processing may be added to the morphological analysis without any problem, and the invention is applicable not only to Japanese TTS but also to TTS for English and other languages.

[0055] The duration need not be determined by a method such as keeping the CV transition interval constant; control based on a statistical method may be used instead. Likewise, the present invention is applicable when the pitch is generated by something other than the point-pitch method, for example the Fujisaki model. In short, the present invention can be implemented with various modifications without departing from its gist.

【００５６】[0056]

[Effects of the Invention] As described in detail above, according to the present invention, making the first period and the second period differ shifts the speech spectrum along the logarithmic frequency axis, so speech with different voice qualities can be synthesized from the same speech parameters. Because the conversion acts only on voiced sections, the spectrum of unvoiced sections does not change, and clarity is not impaired in the unvoiced sections of the synthesized speech. The conversion can also be disabled in speech sections where a spectral change would adversely affect the synthesized speech.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of a rule-based speech synthesizer according to an embodiment of the present invention.

FIG. 2 is a diagram explaining the configuration of the sampling frequency conversion processing unit of the embodiment together with its operation.

FIG. 3 is a diagram showing the width of a pitch waveform after sampling frequency conversion in the embodiment.

FIG. 4 is a diagram explaining the effect obtained in the embodiment by changing the sampling frequency of the synthesized speech that is input to the D/A converter.

FIG. 5 is a diagram showing how a synthesized speech waveform is generated in the embodiment by superimposing a plurality of pitch waveforms contained in speech units.

FIG. 6 is a block diagram showing the schematic configuration of a conventional rule-based speech synthesizer.

FIG. 7 is a diagram showing a conventional method of dividing a voiced section of speech into pitch units.

[Explanation of symbols]

1, 101 ... language processing unit
3, 102 ... speech synthesis unit
5, 103 ... text file
7 ... language analysis processing unit
9 ... morpheme dictionary
11 ... connection rules
13, 104 ... phoneme duration calculation processing unit
15 ... speech rate control unit
17, 105 ... pitch generation processing unit
19, 106 ... speech unit memory
20 ... waveform superimposition processing unit
21, 107 ... speech synthesis processing unit
23, 108 ... D/A converter (digital-to-analog converter)
25, 109 ... amplifier
27, 110 ... speaker
29 ... voice quality switching unit
31 ... voice quality control unit
33 ... sampling frequency conversion processing unit
35 ... switching circuit
37 ... sampling frequency conversion control unit

Claims

[Claims]

1. A speech synthesizer comprising: storage means for storing speech units created from a discrete speech signal sampled at a first sampling period; sampling period conversion means for selectively reading the speech units from the storage means based on given phoneme information and converting the sampling period of the read speech units into a second period different from the first period; and speech synthesis means for sequentially connecting the speech units whose sampling period has been converted into the second period by the sampling period conversion means to synthesize speech.

2. A speech synthesizer comprising: storage means for storing a plurality of pitch waveforms having a first sampling period; sampling period conversion means for sequentially reading pitch waveforms from the storage means based on given phoneme information and converting the sampling period of the read pitch waveforms into a second period different from the first period; and waveform superimposing means for superimposing, based on given prosody information, the pitch waveforms whose sampling period has been converted by the sampling period conversion means to synthesize speech.

3. A speech synthesizer comprising: storage means for storing speech units created from a discrete speech signal sampled at a first sampling period; sampling period conversion means for converting, in units of pitch waveforms, the sampling period of the speech units read from the storage means based on given phoneme information from the first period into a second period different from the first period; and speech synthesis means for sequentially connecting the converted pitch waveforms to synthesize speech.

4. The speech synthesizer according to claim 1, wherein the sampling period conversion means converts the sampling period into the second period different from the first period only in voiced sections of the speech.

5. The speech synthesizer according to claim 1, wherein the sampling period conversion means further comprises switching means that switches so as to convert the sampling period of the speech read from the storage means into the second period different from the first period.

6. The speech synthesizer according to any one of claims 1 to 3, further comprising voice quality selection means for selecting a voice quality of the speech to be synthesized, wherein the sampling period conversion means converts the sampling period into a second period corresponding to the voice quality selected by the voice quality selection means.

7. A speech synthesis method comprising: storing, in storage means, speech units created from a discrete speech signal sampled at a first sampling period; selectively reading the speech units from the storage means based on given phoneme information; converting the sampling period of the read speech units into a second sampling period different from the first sampling period; and sequentially connecting the converted speech units to synthesize speech.

8. A speech synthesis method for superimposing a plurality of pitch waveforms having a first sampling period to synthesize speech of desired content, wherein the sampling period of the pitch waveforms is converted into a second period different from the first period, and the pitch waveforms whose sampling period has been converted are then superimposed.

9. A speech synthesis method comprising: storing, in storage means, speech units created from a discrete speech signal sampled at a first sampling period; converting, in units of pitch waveforms, the sampling period of the speech units read from the storage means based on given phoneme information from the first period into a second period different from the first period; and sequentially connecting the converted pitch waveforms to synthesize speech.

10. The speech synthesis method according to claim 7, wherein, when converting into the second period different from the first period, the sampling period is converted only in voiced sections of the speech.

11. The speech synthesis method according to claim 7, wherein switching is performed so as to convert the sampling period of the speech read from the storage means into the second period different from the first period.

12. The speech synthesis method according to claim 7, wherein a voice quality of the speech to be synthesized is selected, and the sampling period is converted into a second period corresponding to the selected voice quality.