JP2009258498A - Speech synthesis device and speech synthesis method

Info

Publication number: JP2009258498A
Application number: JP2008109190A
Authority: JP
Inventors: Tadashi Yamaura (山浦 正); Satoshi Furuta (古田 訓); Takahiro Otsuka (大塚 貴弘); Hirohisa Tazaki (田崎 裕久)
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2008-04-18
Filing date: 2008-04-18
Publication date: 2009-11-05
Anticipated expiration: 2028-04-18
Also published as: JP5089473B2

Abstract

PROBLEM TO BE SOLVED: To generate high-quality synthesized speech with a small storage capacity.

SOLUTION: The speech synthesis device comprises: a preceding section length determination part 2 which stores in advance the correspondence between phonological symbols and preceding section lengths, and determines, according to that correspondence, the preceding section length corresponding to a phonological symbol obtained from an input text; and a compressed speech waveform reading part 3 which reads, from the compressed speech waveforms stored in a speech fragment dictionary 1, a compressed speech waveform that includes the speech waveform of the speech fragment corresponding to the phonological symbol obtained from the input text and that has, added to its front, a speech waveform whose section length corresponds to the preceding section length determined by the determination part 2.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech synthesis device and a speech synthesis method that are implemented in, for example, a car navigation system or a mobile phone and that artificially generate a speech signal from arbitrary text.

In text-to-speech synthesis, which artificially generates a speech signal from arbitrary text, phoneme symbols and prosodic information (for example, pitch and phoneme duration) obtained by applying language analysis and prosody generation to the input text are taken as input, and a speech signal is generated from them.
Specifically, with vowels denoted "V" and consonants denoted "C", feature parameters of small basic units such as "CV", "CVC", and "VCV" are stored as speech segments. When phoneme symbols and prosodic information are input, the speech segments corresponding to the phoneme symbols are selectively read out, their pitch and phoneme duration are controlled according to the prosodic information, and the segments are concatenated in sequence to synthesize speech.

A conventional speech synthesis device holds compressed speech segment data so that the segments occupy little storage.
However, compressing speech segments with a high-ratio compression method reduces the required storage but tends to increase distortion at the beginning of the speech section, and with it the overall distortion.
Because such distortion degrades the quality of the synthesized speech, the compression ratio of the speech segments cannot be raised very far.

To address this, a technique has been developed that mitigates distortion within the speech segment section (see, for example, Patent Document 1). When the speech waveform of a desired speech segment is extracted from recorded human utterances such as single sounds, words, or simple sentences and is then compressed, the waveform of the section immediately preceding the segment is compressed along with it. When the segment is decompressed, that preceding waveform is decompressed first and discarded.
In the speech synthesis device disclosed in Patent Document 1 below, the length of the section that is decompressed ahead of the speech segment is determined dynamically from the compression distortion of the segment. For that purpose, information indicating this length is stored separately, paired with each speech segment.

Patent Document 1: JP 2002-287784 A (paragraph [0058])

Because the conventional speech synthesis device is configured as described above, it can mitigate distortion within the speech segment section, but it must separately store information indicating the length of the section to be decompressed in advance. As a result, the storage savings from compression are not fully realized.

The present invention has been made to solve the above problem, and its object is to provide a speech synthesis device and a speech synthesis method capable of generating high-quality synthesized speech with a small storage capacity.

A speech synthesis device according to the present invention comprises: a speech segment dictionary that stores, as the speech waveforms of speech segments, compressed speech waveforms obtained by compressing speech waveforms that include the section preceding each speech segment section; preceding section length determination means for determining a preceding section length corresponding to a phoneme symbol obtained from input text; and compressed speech waveform reading means for reading, from the compressed speech waveforms stored in the speech segment dictionary, a compressed speech waveform that contains the speech waveform of the speech segment corresponding to the phoneme symbol obtained from the input text and that has, prepended to that waveform, a speech waveform whose section length equals the preceding section length determined by the preceding section length determination means. Speech waveform extraction means decompresses the compressed speech waveform read by the compressed speech waveform reading means and extracts the speech waveform of the speech segment from the decompressed waveform.

According to the present invention, the device comprises: a speech segment dictionary that stores, as the speech waveforms of speech segments, compressed speech waveforms obtained by compressing speech waveforms that include the section preceding each speech segment section; preceding section length determination means for determining a preceding section length corresponding to a phoneme symbol obtained from input text; and compressed speech waveform reading means for reading, from the compressed speech waveforms stored in the speech segment dictionary, a compressed speech waveform that contains the speech waveform of the speech segment corresponding to the phoneme symbol and that has, prepended to that waveform, a speech waveform whose section length equals the determined preceding section length. The speech waveform extraction means decompresses the compressed speech waveform that was read and extracts the speech waveform of the speech segment from the decompressed waveform. As a result, high-quality synthesized speech can be generated with a small storage capacity.

Embodiment 1.
FIG. 1 is a block diagram showing a speech synthesis device according to Embodiment 1 of the present invention. In the figure, a speech segment dictionary 1 stores, as the speech waveforms of speech segments, compressed speech waveforms obtained by compressing the speech waveforms of single sounds, words, simple sentences, and the like uttered in advance by a human. Accordingly, when part of a word or simple sentence is used as a speech segment, the speech waveform is compressed and stored together with the section preceding the segment section.
For example, as shown in FIG. 3, when the CV segment "ta" is taken from the compressed speech waveform of the utterance "atamaga" (あたまが), it is stored in the speech segment dictionary 1 together with the section preceding "ta".

The preceding section length determination unit 2 stores in advance the correspondence between phoneme symbols and preceding section lengths and, by referring to that correspondence, determines the preceding section length corresponding to a phoneme symbol obtained from the input text. The preceding section length determination unit 2 constitutes the preceding section length determination means.
The compressed speech waveform reading unit 3 reads, from the compressed speech waveforms stored in the speech segment dictionary 1, a compressed speech waveform that contains the speech waveform of the speech segment corresponding to the phoneme symbol obtained from the input text and that has, prepended to that waveform, a speech waveform whose section length equals the preceding section length determined by the preceding section length determination unit 2. The compressed speech waveform reading unit 3 constitutes the compressed speech waveform reading means.

The speech waveform extraction unit 4 decompresses the compressed speech waveform read by the compressed speech waveform reading unit 3 and extracts the speech waveform of the speech segment from the decompressed waveform. The speech waveform extraction unit 4 constitutes the speech waveform extraction means.
The speech generation unit 5 generates synthesized speech by concatenating, in sequence and according to the phoneme symbols obtained from the input text, the speech segment waveforms extracted by the speech waveform extraction unit 4, while controlling pitch and phoneme duration according to the prosodic information obtained from the input text. The speech generation unit 5 constitutes the speech generation means.
FIG. 2 is a flowchart showing the processing performed by the speech synthesis device according to Embodiment 1 of the present invention.

Next, the operation will be described.
The phoneme symbols and prosodic information input to the speech synthesis device are information such as phoneme symbols, pitch, phoneme duration, and power, and are obtained, for example, by applying language analysis and prosody generation to the input text.
A plurality of phoneme symbols and items of prosodic information obtained from the input text are input to the speech synthesis device in sequence.

The speech segment dictionary 1 stores the speech waveforms of speech segments, compressed in order to reduce the storage capacity.
When part of a word or simple sentence is used as a speech segment, however, the speech waveform is compressed and stored together with the section preceding the segment section.
For example, as shown in FIG. 3, when the CV segment "ta" is taken from the compressed speech waveform of the utterance "atamaga" (あたまが), it is stored in the speech segment dictionary 1 together with the section preceding "ta".

The preceding section length determination unit 2 stores in advance the correspondence between phoneme symbols and preceding section lengths. When a phoneme symbol obtained from the input text is input, the unit refers to that correspondence and determines the preceding section length corresponding to the phoneme symbol (step ST1).
Specifically, for example, the preceding section length for the phoneme symbol of a speech segment that begins with a consonant (such as "CV") is set to "Nc", and the preceding section length for the phoneme symbol of a speech segment that begins with a vowel (such as "VC") is set to "Nv". This correspondence is stored in the preceding section length determination unit 2.
Referring to this correspondence, the preceding section length determination unit 2 sets the preceding section length to "Nc" if the phoneme symbol obtained from the input text is that of a speech segment beginning with a consonant.
If, on the other hand, the phoneme symbol obtained from the input text is that of a speech segment beginning with a vowel, the unit sets the preceding section length to "Nv".

Here, "Nc" and "Nv" are assumed to be set so that, for example, the distortion within the speech segment section becomes sufficiently small.
Alternatively, they are assumed to be set so that the degradation perceived by the listener becomes sufficiently small.
On average, a consonant (C) has lower signal power than a vowel (V), so even when distortion in a consonant section is large, the listener perceives less degradation than in a vowel section. This property may be exploited by setting "Nc" < "Nv".
Doing so shortens the preceding section length for CV segments (the section length that must be stored in the speech segment dictionary 1) and therefore reduces the storage capacity. Because the section of the CV segment waveform that must be decompressed also becomes shorter, the processing required for decompression is reduced as well.
Since the preceding section length determination unit 2 thus determines the preceding section length uniquely from the phoneme symbol, there is no need to hold preceding section length information for each speech segment, and any increase in storage capacity is suppressed.
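The lookup performed in step ST1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the sample counts `N_C` and `N_V`, the vowel set, and the function name `preceding_length` are assumptions, since the text states only that the length is derived from the phoneme symbol and that Nc < Nv may be chosen.

```python
# Hypothetical lengths in samples; the patent gives no numeric values.
N_C = 80    # preceding section length for segments that begin with a consonant
N_V = 160   # preceding section length for segments that begin with a vowel

# Assumed vowel inventory (romanized Japanese vowels).
VOWELS = set("aiueo")


def preceding_length(symbol: str) -> int:
    """Return the preceding section length for a segment symbol such as 'ta' or 'am'.

    The length depends only on the class of the first phoneme, so nothing
    has to be stored per segment.
    """
    if not symbol:
        raise ValueError("empty phoneme symbol")
    return N_V if symbol[0] in VOWELS else N_C
```

Because the table is keyed by phoneme class rather than by individual segment, its size stays constant no matter how many segments the dictionary holds.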

Once the preceding section length determination unit 2 has determined the preceding section length, the compressed speech waveform reading unit 3 reads, from the compressed speech waveforms stored in the speech segment dictionary 1, a compressed speech waveform that contains the speech waveform of the speech segment corresponding to the phoneme symbol obtained from the input text and that has, prepended to that waveform, a speech waveform whose section length equals the preceding section length determined by the preceding section length determination unit 2 (step ST2).
For example, if the speech segment corresponding to the phoneme symbol obtained from the input text is "am" and the preceding section length determined by the preceding section length determination unit 2 is "Nv", the unit reads, from the compressed speech waveforms stored in the speech segment dictionary 1, the one that contains the waveform of the segment "am" with a waveform of section length "Nv" prepended to it.

When the compressed speech waveform reading unit 3 has read a compressed speech waveform, the speech waveform extraction unit 4 decompresses it (step ST3). That is, it decompresses the compressed waveform covering the preceding section plus the speech segment section.
Next, the speech waveform extraction unit 4 extracts the speech waveform of the speech segment (for example, the waveform of "am") from the decompressed waveform (step ST4).
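Steps ST3 and ST4 amount to decoding the whole stored span and then discarding its first samples. The sketch below illustrates that data flow under stated assumptions: the delta codec (`toy_encode`, `toy_decode`) is a lossless stand-in chosen only to keep the example short, and `extract_segment` is a hypothetical name. The patent does not specify the codec, and a real lossy speech codec would be the component whose start-up distortion the discarded section absorbs.

```python
from itertools import accumulate


def toy_encode(samples):
    """Delta-code a list of integer samples (stand-in for a real speech codec)."""
    codes = []
    previous = 0
    for sample in samples:
        codes.append(sample - previous)
        previous = sample
    return codes


def toy_decode(codes):
    """Invert toy_encode by taking a running sum of the deltas."""
    return list(accumulate(codes))


def extract_segment(compressed, preceding_len):
    """Decompress preceding section + segment (ST3), then drop the preceding part (ST4)."""
    decoded = toy_decode(compressed)
    return decoded[preceding_len:]


# Usage: the dictionary entry is built from the preceding section followed by the segment.
preceding = [5, 7, 6]
segment = [1, 2, 3, 4]
entry = toy_encode(preceding + segment)
recovered = extract_segment(entry, len(preceding))
```

The decoder always starts at the beginning of the stored span, so any codec state has the full preceding section in which to settle before the samples that are actually kept.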

When the speech waveform extraction unit 4 has extracted the speech waveform of a speech segment, the speech generation unit 5 generates synthesized speech by concatenating the segment waveforms in sequence according to the phoneme symbols obtained from the input text, while controlling pitch and phoneme duration according to the prosodic information obtained from the input text (step ST5).
Steps ST1 to ST5 are repeated until the input of phoneme symbols and prosodic information ends (step ST6).
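The loop over steps ST1 to ST6 can be summarized in a few lines. The following is a sketch under simplifying assumptions: the function name `synthesize`, the dictionary keyed by `(symbol, length)`, and the pluggable `preceding_length` and `decode` callables are all hypothetical, and the pitch and duration control of step ST5 is omitted so that only plain concatenation remains.

```python
def synthesize(symbols, dictionary, preceding_length, decode):
    """Concatenate decoded speech segments for a sequence of phoneme symbols.

    symbols          -- iterable of segment symbols, e.g. ["ta", "am"]
    dictionary       -- maps (symbol, preceding length) to compressed data
    preceding_length -- callable implementing the ST1 lookup
    decode           -- callable that decompresses one dictionary entry
    """
    output = []
    for symbol in symbols:                    # ST6: repeat until input ends
        n = preceding_length(symbol)          # ST1: determine preceding length
        compressed = dictionary[(symbol, n)]  # ST2: read the compressed waveform
        decoded = decode(compressed)          # ST3: decompress preceding + segment
        segment = decoded[n:]                 # ST4: keep only the segment itself
        output.extend(segment)                # ST5: concatenate (prosody control omitted)
    return output


# Usage with an identity "codec" and made-up sample values.
demo_dictionary = {
    ("ta", 1): [9, 1, 2],     # 1 preceding sample, then the segment [1, 2]
    ("am", 2): [9, 9, 3, 4],  # 2 preceding samples, then the segment [3, 4]
}
demo_length = lambda s: 2 if s[0] in "aiueo" else 1
speech = synthesize(["ta", "am"], demo_dictionary, demo_length, lambda c: list(c))
```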

As is apparent from the above, Embodiment 1 provides: the speech segment dictionary 1, which stores, as the speech waveforms of speech segments, compressed speech waveforms obtained by compressing speech waveforms that include the section preceding each speech segment section; the preceding section length determination unit 2, which determines the preceding section length corresponding to a phoneme symbol obtained from the input text; and the compressed speech waveform reading unit 3, which reads, from the compressed speech waveforms stored in the speech segment dictionary 1, a compressed speech waveform that contains the speech waveform of the speech segment corresponding to the phoneme symbol and that has, prepended to that waveform, a speech waveform whose section length equals the preceding section length determined by the preceding section length determination unit 2. The speech waveform extraction unit 4 decompresses the compressed speech waveform read by the compressed speech waveform reading unit 3 and extracts the speech waveform of the speech segment from the decompressed waveform. High-quality synthesized speech can therefore be generated with a small storage capacity.

That is, because the preceding section length determination unit 2 determines the preceding section length uniquely from the phoneme symbol, there is no need to hold preceding section length information for each speech segment, and high-quality synthesized speech with little distortion can be generated without increasing the storage capacity.
In addition, because phoneme symbols are classified according to their characteristics and the preceding section length is shortened for phoneme symbols whose degradation is less perceptible to the listener, the storage capacity for compressed speech waveforms and the processing required for decompression can be reduced while avoiding degradation of the synthesized speech quality.

Embodiment 2.
In Embodiment 1 above, the preceding section length determination unit 2 sets the preceding section lengths "Nc" and "Nv" so that, for example, the distortion within the speech segment section becomes sufficiently small. The preceding section length may instead be set as follows.
For example, for CV segments, consonants are classified into voiceless and voiced consonants, the preceding section length for voiceless consonants is set to "Ncu", and that for voiced consonants is set to "Ncv", with "Ncu" < "Ncv".
This exploits the fact that a voiceless consonant is a noise-like signal, so even when distortion in a voiceless section is large, the listener perceives less degradation than in a voiced section.
The preceding section length can thus be shortened for CV segments that begin with a voiceless consonant, which reduces the storage capacity for compressed speech waveforms and the processing required for decompression.

Further, for CV segments that begin with a voiceless consonant, the consonants may be classified into voiceless plosives and voiceless fricatives, with the preceding section length for voiceless plosives set to "Ns" and that for voiceless fricatives set to "Nf", where "Ns" < "Nf".
This exploits the fact that a voiceless plosive is accompanied by a closure interval in which the speech signal is nearly silent, so shortening this silent interval does not greatly increase the distortion of the plosive.
The preceding section length can thus be shortened for CV segments that begin with a voiceless plosive, which reduces the storage capacity for compressed speech waveforms and the processing required for decompression.

As another example, the CV segments may include, as a special case, segments in which "C" is silence.
In that case, the preceding section length is set to "Nsi" when "C" is silence and to "Nvo" when it is sounded, with "Nsi" < "Nvo".
This exploits the fact that shortening the preceding silent section does not greatly increase the distortion in the speech section.
The preceding section length can thus be shortened for CV segments in which "C" is silence, which reduces the storage capacity for compressed speech waveforms and the processing required for decompression.

The factors that determine the preceding section length are not limited to those described above. Needless to say, any information held in association with a speech segment, such as prosodic information in addition to the phoneme symbol, may be used.
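The classifications described in this embodiment can be combined into a single lookup, sketched below. Several things here are assumptions made for illustration: the numeric lengths, the phoneme sets, the `#` marker for a silent "C", and the decision to merge the three pairwise rules into one ordering. The text itself states only the pairwise relations Ncu < Ncv, Ns < Nf, and Nsi < Nvo.

```python
# Hypothetical preceding section lengths in samples (not from the patent).
N_SI = 0     # "C" is silence
N_S = 40     # voiceless plosive (Ns)
N_F = 80     # voiceless fricative (Nf)
N_CV = 120   # voiced consonant (Ncv)

# Simplified, assumed phoneme classes for single-letter consonants.
SILENCE = "#"
VOICELESS_PLOSIVES = {"p", "t", "k"}
VOICELESS_FRICATIVES = {"s", "h", "f"}


def cv_preceding_length(consonant: str) -> int:
    """Return the preceding section length for a CV segment by consonant class."""
    if consonant == SILENCE:
        return N_SI
    if consonant in VOICELESS_PLOSIVES:
        return N_S
    if consonant in VOICELESS_FRICATIVES:
        return N_F
    # Everything else is treated as a voiced consonant in this sketch.
    return N_CV
```

Both voiceless classes receive shorter lengths than voiced consonants, which is one way to satisfy Ncu < Ncv while also honoring Ns < Nf.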

Embodiment 3.
FIG. 4 is a block diagram showing a speech synthesis device according to Embodiment 3 of the present invention. In the figure, the same reference numerals as in FIG. 1 denote the same or corresponding parts, and their description is omitted.
The speech segment dictionary 11 stores, as the speech waveforms of speech segments, compressed speech waveforms obtained by compressing speech waveforms that include the section preceding each speech segment section.
Unlike the speech segment dictionary 1 of FIG. 1, however, when time-reversing the speech waveform of a speech segment yields a shorter preceding section length than leaving it unreversed, the speech segment dictionary 11 holds, in place of the unreversed waveform, a compressed speech waveform obtained by prepending the preceding-section waveform to the time-reversed waveform and compressing the result.
For example, the speech waveform of a VC segment is time-reversed and treated as the speech waveform of a CV segment, and the compressed data of that waveform is held.

Like the preceding section length determination unit 2 of FIG. 1, the preceding section length determination unit 12 determines the preceding section length corresponding to a phoneme symbol obtained from the input text. It also determines whether the speech waveform of the speech segment corresponding to the phoneme symbol is time-reversed. The preceding section length determination unit 12 constitutes the preceding section length determination means.

The compressed speech waveform reading unit 13 reads, from the compressed speech waveforms stored in the speech segment dictionary 11, a compressed speech waveform that contains the speech waveform of the speech segment corresponding to the phoneme symbol obtained from the input text and that has, prepended to that waveform, a speech waveform whose section length equals the preceding section length determined by the preceding section length determination unit 12.
Unlike the compressed speech waveform reading unit 3 of FIG. 1, however, when the preceding section length determination unit 12 determines that time reversal applies, the compressed speech waveform reading unit 13 reads a compressed speech waveform that contains the time-reversed speech waveform of the speech segment and that has, prepended to that waveform, a speech waveform whose section length equals the preceding section length determined by the preceding section length determination unit 12. The compressed speech waveform reading unit 13 constitutes the compressed speech waveform reading means.

When the preceding section length determination unit 12 determines that time reversal applies, the time reversal unit 14 time-reverses the speech waveform of the speech segment extracted by the speech waveform extraction unit 4. The time reversal unit 14 constitutes the time reversal means.
FIG. 5 is a flowchart showing the processing performed by the speech synthesis device according to Embodiment 3 of the present invention.

Next, the operation will be described.
For example, comparing the preceding section length for the phoneme symbol of a CV segment with that for the phoneme symbol of a VC segment, the length for the VC segment is longer, so its compressed speech waveform requires more storage.
With typical speech compression techniques, compressing the time-reversed waveform of a VC segment gives no significant difference in compression efficiency or quality compared with compressing the waveform of a CV segment.

Therefore, when time-reversing the speech waveform of a speech segment yields a shorter preceding section length than leaving it unreversed, the speech segment dictionary 11 does not hold compressed data of the unreversed waveform. Instead, it holds a compressed speech waveform obtained by prepending the preceding-section waveform to the time-reversed waveform and compressing the result.
For example, the speech waveform of a VC segment is not held as is. It is time-reversed and treated as the speech waveform of a CV segment, and the compressed data of that waveform is held.

Like the preceding section length determination unit 2 in FIG. 1, the preceding section length determination unit 12 stores in advance the correspondence between phoneme symbols and preceding section lengths. When a phoneme symbol obtained from the input text is input, it refers to that correspondence and determines the preceding section length corresponding to the phoneme symbol (step ST11).
The method of determining the preceding section length corresponding to the phoneme symbol is the same as in Embodiment 1, so its description is omitted.

The preceding section length determination unit 12 also determines whether the speech waveform of the speech unit corresponding to the phoneme symbol is time-reversed (step ST12).
For example, it compares the preceding section length corresponding to the phoneme symbol with the preceding section length corresponding to the reversed phoneme symbol. If the preceding section length for the reversed phoneme symbol is shorter, it determines that time reversal is “present”; if the preceding section length for the reversed phoneme symbol is longer, it determines that time reversal is “absent”.
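The determination in step ST12 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table contents, the function names, and the representation of a reversed phoneme symbol as the reversed string are all assumptions. The text does not say what happens when the two lengths are equal, nor which length is used after reversal; the sketch assumes "absent" for ties and uses the shorter length when reversal is present.

```python
# Hypothetical table: phoneme symbol -> preceding section length in samples.
# Symbols and lengths are illustrative values, not taken from the patent.
PRECEDING_LENGTH = {"ka": 40, "ak": 120, "sa": 60, "as": 60}


def reversed_symbol(symbol: str) -> str:
    """Return the phoneme symbol read backwards (e.g. VC 'ak' -> CV 'ka')."""
    return symbol[::-1]


def decide(symbol: str) -> tuple[int, bool]:
    """Return (preceding section length to use, time-reversal present?)."""
    own = PRECEDING_LENGTH[symbol]
    rev = PRECEDING_LENGTH[reversed_symbol(symbol)]
    if rev < own:
        # The reversed unit needs a shorter preceding section: reversal "present".
        return rev, True
    # Longer (or, by assumption, equal): reversal "absent".
    return own, False
```

With this table, the VC-like symbol `"ak"` is served from the reversed entry, while `"ka"` and `"sa"` are read without reversal.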

If the determination result of the preceding section length determination unit 12 is time reversal “absent”, the compressed speech waveform reading unit 13, like the compressed speech waveform reading unit 3 in FIG. 1, reads from the compressed speech waveforms stored in the speech unit dictionary 11 a compressed speech waveform that includes the speech waveform of the speech unit corresponding to the phoneme symbol obtained from the input text and that has, added in front of that speech waveform, a speech waveform whose section length matches the preceding section length determined by the preceding section length determination unit 12 (step ST13).
If the determination result of the preceding section length determination unit 12 is time reversal “present”, the compressed speech waveform reading unit 13 instead reads from the compressed speech waveforms stored in the speech unit dictionary 11 a compressed speech waveform that includes the time-reversed speech waveform of the speech unit and that has, added in front of that speech waveform, a speech waveform whose section length matches the preceding section length determined by the preceding section length determination unit 12 (step ST13).

When the compressed speech waveform reading unit 13 reads a compressed speech waveform, the speech waveform extraction unit 4 decompresses it as in Embodiment 1 (step ST14). That is, it decompresses the compressed speech waveform covering the preceding section plus the speech unit section.
Next, as in Embodiment 1, the speech waveform extraction unit 4 extracts the speech waveform of the speech unit from the decompressed speech waveform (step ST15).
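Steps ST14 and ST15 can be sketched as follows. The text does not name the codec, so `zlib` from the Python standard library is used purely as a stand-in. The function names and the 16-bit sample format are likewise assumptions made for illustration.

```python
import zlib
from array import array


def compress(samples: list[int]) -> bytes:
    """Stand-in encoder: pack 16-bit samples and compress them with zlib."""
    return zlib.compress(array("h", samples).tobytes())


def expand_and_extract(compressed: bytes, preceding_length: int) -> list[int]:
    """Decompress 'preceding section + speech unit section' (ST14),
    then drop the preceding section and keep only the speech unit (ST15)."""
    decoded = array("h")
    decoded.frombytes(zlib.decompress(compressed))
    return decoded[preceding_length:].tolist()
```

For example, `expand_and_extract(compress([9, 9, 9, 1, 2, 3]), 3)` keeps only the last three samples. A lossless general-purpose compressor like this one does not actually need the preceding section; it stands in for a speech codec whose output quality depends on the samples decoded just before the unit.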

When the speech waveform extraction unit 4 has extracted the speech waveform of the speech unit and the determination result of the preceding section length determination unit 12 is time reversal “present”, the time reversing unit 14 time-reverses the extracted speech waveform and outputs the time-reversed speech waveform to the speech generation unit 5 (step ST16).
If the determination result of the preceding section length determination unit 12 is time reversal “absent”, the time reversing unit 14 outputs the speech waveform of the speech unit extracted by the speech waveform extraction unit 4 to the speech generation unit 5 unchanged.

On receiving the speech waveform of the speech unit from the time reversing unit 14, the speech generation unit 5 generates synthesized speech by sequentially connecting the speech waveforms of the speech units in accordance with the phoneme symbols obtained from the input text, while controlling the pitch and phoneme duration in accordance with the prosodic information obtained from the input text (step ST17).
Steps ST11 to ST17 are repeated until the input of phoneme symbols and prosodic information ends (step ST18).
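The loop over steps ST11 to ST18 can be sketched end to end as follows. This is a simplified illustration under several assumptions: the dictionary layout and its contents are hypothetical, stored samples stand in for already decompressed waveforms, the reversal flag is stored with each entry rather than derived by comparing lengths, and pitch and duration control is omitted so that connection is plain concatenation.

```python
from typing import Iterable

# Hypothetical entries: symbol -> (stored samples, preceding length, reversed?).
# Stored samples are "preceding section + speech unit section".
DICTIONARY = {
    "ka": ([0, 0, 1, 2, 3], 2, False),
    "ak": ([0, 6, 5, 4], 1, True),  # held time-reversed; playback order is 4, 5, 6
}


def synthesize(symbols: Iterable[str]) -> list[int]:
    """Connect the speech units for the given phoneme symbols."""
    output: list[int] = []
    for symbol in symbols:  # ST18: repeat until the input ends
        stored, preceding_length, reversal = DICTIONARY[symbol]  # ST11 to ST13
        unit = stored[preceding_length:]  # ST14, ST15: keep the unit only
        if reversal:  # ST16: undo the time reversal
            unit = unit[::-1]
        output.extend(unit)  # ST17: connect (no prosody control here)
    return output
```

Calling `synthesize(["ka", "ak"])` drops the preceding section of each entry, restores the second unit to forward time order, and concatenates the two units.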

As is apparent from the above, according to Embodiment 3, when the compressed speech waveforms stored in the speech unit dictionary 11 include time-reversed compressed speech waveforms, the preceding section length determination unit 12, when determining the preceding section length corresponding to a phoneme symbol, also determines whether the speech waveform of the corresponding speech unit is time-reversed, and if time reversal is “present”, the speech waveform of the speech unit extracted by the speech waveform extraction unit 4 is time-reversed. This reduces the storage capacity required for the compressed speech waveforms and the processing required for decompression without degrading the quality of the synthesized speech.

Embodiment 4.
In Embodiments 1 to 3, the preceding section length determination units 2 and 12 determine the preceding section length based on the phoneme symbol of the speech unit. Instead of using the phoneme symbol, the preceding section length may be determined based on the starting-end power of the speech unit.

One way to determine the preceding section length from the starting-end power of a speech unit is, for example, to shorten the preceding section length when the starting-end power is small and to lengthen it when the starting-end power is large.
This exploits the property that when the power of a speech signal is small, the listener perceives little degradation even if the distortion is large.
As a result, the preceding section length can be shortened for speech units with small starting-end power, which reduces the storage capacity required for the compressed speech waveforms and the processing required for decompression.
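A power-based rule of this kind can be sketched as follows. The text only states the direction of the relationship (smaller power, shorter preceding section), so the power measure, the window size, the thresholds, and the returned lengths are all hypothetical values chosen for illustration.

```python
def start_power(unit: list[float], window: int = 4) -> float:
    """Mean squared amplitude over the first `window` samples.
    Assumes `unit` is non-empty."""
    head = unit[:window]
    return sum(x * x for x in head) / len(head)


def preceding_length_from_power(power: float) -> int:
    """Map starting-end power to a preceding section length in samples.
    Thresholds and lengths are illustrative assumptions."""
    if power < 0.01:
        return 0    # very quiet start: distortion is hard to hear
    if power < 0.1:
        return 40
    return 120      # loud start: keep a long preceding section
```

A unit that starts in silence gets no preceding section at all under these thresholds, while a unit that starts at full amplitude gets the longest one.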

Furthermore, the powers at the start and end of a speech unit may be compared, and when the end power is smaller than the start power, a compressed speech waveform obtained by time-reversing and then compressing the speech waveform is held. When speech is synthesized using that compressed speech waveform, the speech waveform obtained by decompression is time-reversed before being used for synthesis. This further reduces the storage capacity required for the compressed speech waveforms and the processing required for decompression.
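The start/end power comparison can be sketched as follows. The function names, the power measure, and the window size are assumptions; compression itself is left out so that only the reversal decision and its undoing at synthesis time are shown.

```python
def edge_power(samples: list[float]) -> float:
    """Mean squared amplitude of a short run of samples (assumed non-empty)."""
    return sum(x * x for x in samples) / len(samples)


def prepare_for_storage(unit: list[float], window: int = 2) -> tuple[list[float], bool]:
    """Return (waveform to compress and hold, time-reversed?)."""
    start = edge_power(unit[:window])
    end = edge_power(unit[-window:])
    if end < start:
        # The quieter end becomes the new start, allowing a shorter preceding section.
        return unit[::-1], True
    return list(unit), False


def restore(stored: list[float], was_reversed: bool) -> list[float]:
    """Undo the reversal after decompression, before synthesis."""
    return stored[::-1] if was_reversed else list(stored)
```

A unit such as `[0.9, 0.8, 0.1, 0.0]`, which is loud at the start and quiet at the end, is held reversed, and `restore` returns it to its original order.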

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the processing performed by the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 3 is an explanatory diagram showing the compressed speech waveforms stored in the speech unit dictionary.
FIG. 4 is a block diagram showing the speech synthesizer according to Embodiment 3 of the present invention.
FIG. 5 is a flowchart showing the processing performed by the speech synthesizer according to Embodiment 3 of the present invention.

Explanation of symbols

1 speech unit dictionary, 2 preceding section length determination unit (preceding section length determining means), 3 compressed speech waveform reading unit (compressed speech waveform reading means), 4 speech waveform extraction unit (speech waveform extraction means), 5 speech generation unit (speech generation means), 11 speech unit dictionary, 12 preceding section length determination unit (preceding section length determining means), 13 compressed speech waveform reading unit (compressed speech waveform reading means), 14 time reversing unit (time reversing means)

Claims

A speech synthesizer comprising: a speech unit dictionary storing, as the speech waveform of each speech unit, a compressed speech waveform obtained by compressing a speech waveform that includes a section preceding the speech unit section; preceding section length determining means for determining the preceding section length corresponding to a phoneme symbol obtained from input text; compressed speech waveform reading means for reading, from the compressed speech waveforms stored in the speech unit dictionary, a compressed speech waveform that includes the speech waveform of the speech unit corresponding to the phoneme symbol obtained from the input text and that has, added in front of that speech waveform, a speech waveform whose section length matches the preceding section length determined by the preceding section length determining means; speech waveform extraction means for decompressing the compressed speech waveform read by the compressed speech waveform reading means and extracting the speech waveform of the speech unit from the decompressed speech waveform; and speech generation means for generating synthesized speech by sequentially connecting the speech waveforms of the speech units extracted by the speech waveform extraction means in accordance with the phoneme symbols obtained from the input text.

The speech synthesizer according to claim 1, wherein, when the compressed speech waveforms stored in the speech unit dictionary include a time-reversed compressed speech waveform, the preceding section length determining means, when determining the preceding section length corresponding to a phoneme symbol, determines whether the speech waveform of the speech unit corresponding to that phoneme symbol is time-reversed, the speech synthesizer further comprising time reversing means for time-reversing the speech waveform of the speech unit extracted by the speech waveform extraction means when the preceding section length determining means determines that time reversal is present.

The speech synthesizer according to claim 1 or 2, wherein the preceding section length determining means determines the preceding section length corresponding to the power obtained from the input text instead of the phoneme symbol.

A speech synthesis method comprising: a preceding section length determination step in which preceding section length determining means determines the preceding section length corresponding to a phoneme symbol obtained from input text; a compressed speech waveform reading step in which compressed speech waveform reading means reads, from a speech unit dictionary storing, as the speech waveform of each speech unit, a compressed speech waveform obtained by compressing a speech waveform that includes a section preceding the speech unit section, a compressed speech waveform that includes the speech waveform of the speech unit corresponding to the phoneme symbol obtained from the input text and that has, added in front of that speech waveform, a speech waveform whose section length matches the preceding section length determined by the preceding section length determining means; a speech waveform extraction step in which speech waveform extraction means decompresses the compressed speech waveform read by the compressed speech waveform reading means and extracts the speech waveform of the speech unit from the decompressed speech waveform; and a speech generation step in which speech generation means generates synthesized speech by sequentially connecting the speech waveforms of the speech units extracted by the speech waveform extraction means in accordance with the phoneme symbols obtained from the input text.