JP4762553B2

JP4762553B2 - Text-to-speech synthesis method and apparatus, text-to-speech synthesis program, and computer-readable recording medium recording the program

Info

Publication number: JP4762553B2
Application number: JP2005000498A
Authority: JP
Inventors: 訓古田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2005-01-05
Filing date: 2005-01-05
Publication date: 2011-08-31
Anticipated expiration: 2025-01-05
Also published as: JP2006189554A

Abstract

PROBLEM TO BE SOLVED: Conventionally, the waveform distortion used to select speech units during synthesized speech generation is computed per speech synthesis unit. Consequently, a unit whose consonant part has extremely good sound quality but whose vowel part has poor sound quality is excluded from representative unit selection, the good quality of the consonant part is not reflected, and only synthesized speech of average quality, "modestly good in both consonants and vowels", results. High-quality speech synthesis has therefore been unobtainable.

SOLUTION: A plurality of signal waveforms are cut out of a plurality of training speech units, and a plurality of fused speech units are generated by combining the cut-out signal waveforms. At least one of the pitch and duration of the fused speech units is changed according to the corresponding parameters of prescribed speech units to generate a plurality of synthesized speech units. The distances between the generated synthesized speech units and the prescribed speech units are evaluated, and fused speech units are stored in a speech unit dictionary on the basis of the evaluation. Fused speech units corresponding to the input phonemes obtained by analyzing the input text are selected from the dictionary and connected, and synthesized speech is output.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to text-to-speech synthesis, and more particularly to a text-to-speech synthesis technique that improves the quality of synthesized speech generated from information such as pitch length and duration.

Artificially generating a speech signal from arbitrary text is called text-to-speech synthesis. It is generally performed in three stages: a language processing unit, a phonological processing unit (prosody setting), and a speech synthesis unit.
The input text first undergoes morphological and syntactic analysis in the language processing unit, and then accent and intonation processing in the phonological processing unit, which outputs phoneme environment information such as phoneme symbols, pitch length, and duration. Speech units registered in a speech unit dictionary are then selected on the basis of this phoneme environment information. Finally, the speech synthesis unit synthesizes speech from the phoneme symbols, pitch length, duration, and related information.
In this technical field, one known method changes at least one of the pitch and duration of already-generated representative speech units in accordance with at least one of the pitch and duration of a plurality of training speech units, thereby generating a plurality of synthesized speech units. It then evaluates the distortion between the generated synthesized speech units and the training units, selects the speech units with the minimum distortion (called representative speech units), and connects them to output synthesized speech (see, for example, Patent Document 1).

Here, with V denoting a vowel and C a consonant, a speech unit is a segment cut out of a speech signal in a synthesis unit such as CV, VC, or VCV. It denotes either the cut-out speech waveform or a parameter sequence extracted from that waveform by some method. The phoneme environment comprises the environmental factors of the speech unit, for example its phoneme name, the preceding phoneme, the following phoneme, the pitch period, the pitch pattern, the duration, the phoneme boundary position between C and V, the power, the number of morae, and the accent position.
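As an illustration of the definitions above, a speech unit and its phoneme environment could be represented roughly as follows. This is only a sketch: the class names, field names, and units (samples) are assumptions made for the example, and the pitch pattern is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhonemeEnvironment:
    """Environmental factors of a speech unit (field names are hypothetical)."""
    phoneme_name: str                  # e.g. "ma" for a CV unit
    preceding_phoneme: Optional[str]   # None at the start of an utterance
    following_phoneme: Optional[str]
    pitch_period: float                # average pitch period, assumed in samples
    duration: int                      # duration, assumed in samples
    cv_boundary: int                   # sample index of the C/V phoneme boundary
    power: float = 0.0
    mora_count: int = 1
    accent_position: int = 0


@dataclass
class SpeechUnit:
    """A cut-out waveform together with its phoneme environment."""
    waveform: List[float]
    env: PhonemeEnvironment


unit = SpeechUnit(
    waveform=[0.0, 0.1, 0.3, 0.2, -0.1, -0.2],
    env=PhonemeEnvironment("ma", None, "ka", pitch_period=2.0, duration=6, cv_boundary=2),
)
```

A parameter sequence (for example LSP or cepstrum frames) could replace the raw waveform field without changing the rest of the structure.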

JP-A-9-319391 (pages 4 to 8, FIG. 1)

The conventional speech synthesis method is configured as described above, but the waveform distortion used for unit selection is computed over a whole synthesis unit such as CV, VC, or VCV. Taking speech units composed in CV units as an example, if a representative speech unit candidate has very good sound quality (or high robustness to modification) in its C (consonant) part but poor sound quality (or low robustness to modification) in its V (vowel) part, the candidate is excluded from the representative speech units finally selected, and the good quality of its C part is not reflected in them. As a result, only speech units of average quality, in which "both consonant and vowel are passably good", are obtained as representative speech units.

The present invention has been made to solve the above problem. Its object is to obtain a speech synthesis method and apparatus that enable high-quality synthesized speech by newly generating an optimal speech unit, in the unit selection process that selects optimal speech units from training speech units or a speech unit dictionary, by combining an arbitrary plurality of speech waveforms or the parameters that constitute them.

A further object of the present invention is to obtain a speech synthesis method and apparatus that can greatly reduce the storage capacity required for speech units while still providing high-quality synthesized speech. In the unit selection process that selects optimal speech units from training speech units or a speech unit dictionary, for a part common to a plurality of speech units (for example, the vowel phoneme /a/ in /ma/, /ka/, and /ba/, or the consonant phoneme /n/ in /na/, /ni/, and /nu/), the waveforms that minimize distortion are combined from a plurality of cut-out speech waveforms to generate an optimal common part for those speech units, and the common part is degenerated.

A text-to-speech synthesis method according to the present invention comprises:
a waveform separation step of preparing two training speech units from a plurality of training speech units and cutting one speech unit out of one of the training speech units at a waveform separation position set to a predetermined phoneme boundary, thereby generating a plurality of signal waveforms;
a waveform fusion step of generating a plurality of fused speech units by combining and fusing any one or more of the signal waveforms cut out at the waveform separation position;
a speech unit synthesis step of generating a plurality of synthesized speech units by changing at least one of the pitch and duration of the generated fused speech units in accordance with at least one of the pitch and duration of the other prepared training speech unit;
a distortion evaluation step of evaluating the distance between each of the other prepared training speech units and each of the generated synthesized speech units and, on the basis of the evaluation, holding or storing in a speech unit dictionary a fused speech unit obtained by combining and fusing a plurality of signal waveforms whose cut-out positions in the waveform separation step have been finely adjusted before and after the phoneme boundary so that the distance is minimized; and
a synthesized speech generation step of outputting synthesized speech by selecting, from the plurality of fused speech units held or stored in the speech unit dictionary, the fused speech units corresponding to input phonemes obtained by analyzing input text, and connecting them.

A text-to-speech synthesis apparatus according to the present invention comprises:
a prosody setting unit that analyzes input text to obtain input phonemes;
fused speech unit generation means that prepares two training speech units from a plurality of training speech units, cuts one speech unit out of one of the training speech units at a waveform separation position set to a predetermined phoneme boundary to generate a plurality of signal waveforms, arbitrarily combines the signal waveforms to generate a plurality of fused speech unit candidates, generates a plurality of synthesized speech units by changing at least one of the pitch and duration of the fused speech unit candidates in accordance with at least one of the pitch and duration of any one of the other prepared training speech units, evaluates the distance between the synthesized speech units and the other prepared training speech unit, and, on the basis of the evaluation, generates a fused speech unit by combining and fusing a plurality of signal waveforms whose cut-out positions in the waveform separation step have been finely adjusted before and after the phoneme boundary so that the distance is minimized;
fused speech unit storage means that holds or stores the fused speech unit;
unit selection means that selects, from the fused speech units held or stored by the fused speech unit storage means, the fused speech units corresponding to the input phonemes; and
speech synthesis means that connects the selected fused speech units to generate synthesized speech.

A text-to-speech synthesis program according to the present invention causes a computer to function as:
prosody setting means that analyzes input text to obtain input phonemes;
fused speech unit generation means that prepares two training speech units from a plurality of training speech units, cuts one speech unit out of one of the training speech units at a waveform separation position set to a predetermined phoneme boundary to generate a plurality of signal waveforms, generates a plurality of synthesized speech units by changing at least one of the pitch and duration of the fused speech unit candidates in accordance with at least one of the pitch and duration of any one of the other prepared training speech units, evaluates the distance between the synthesized speech units and the other prepared training speech unit, and, on the basis of the evaluation, generates a fused speech unit by combining and fusing a plurality of signal waveforms whose cut-out positions in the waveform separation step have been finely adjusted before and after the phoneme boundary so that the distance is minimized;
fused speech unit storage means that holds or stores the fused speech unit;
unit selection means that selects, from the fused speech units held or stored by the fused speech unit storage means, the fused speech units corresponding to the input phonemes; and
speech synthesis means that connects the selected fused speech units to generate synthesized speech.

A computer-readable recording medium according to the present invention records a text-to-speech synthesis program for causing a computer to function as:
prosody setting means that analyzes input text to obtain input phonemes;
fused speech unit generation means that prepares two training speech units from a plurality of training speech units, cuts one speech unit out of one of the training speech units at a waveform separation position set to a predetermined phoneme boundary to generate a plurality of signal waveforms, arbitrarily combines the signal waveforms to generate a plurality of fused speech unit candidates, generates a plurality of synthesized speech units by changing at least one of the pitch and duration of the fused speech unit candidates in accordance with at least one of the pitch and duration of any one of the other prepared training speech units, evaluates the distance between the synthesized speech units and the other prepared training speech unit, and, on the basis of the evaluation, generates a fused speech unit by combining and fusing a plurality of signal waveforms whose waveform separation positions in the waveform separation step have been finely adjusted before and after the phoneme boundary so that the distance is minimized;
fused speech unit storage means that holds or stores the fused speech unit;
unit selection means that selects, from the fused speech units held or stored by the fused speech unit storage means, the fused speech units corresponding to the input phonemes; and
speech synthesis means that connects the selected fused speech units to generate synthesized speech.

According to the present invention, high-quality synthesized speech can be generated by combining an arbitrary plurality of speech waveforms from the training speech units to newly generate speech units that have good sound quality at the synthesized-speech level.

Also according to the present invention, the storage capacity of the speech unit dictionary can be greatly reduced while high-quality synthesized speech is still provided, by combining, from a plurality of cut-out speech waveforms, the waveforms that minimize distortion to generate an optimal common part for a plurality of speech units and degenerating that common part.

Embodiments of the present invention are described below with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech synthesis apparatus that implements the speech synthesis method according to Embodiment 1 of the present invention. In FIG. 1, 1 is an input terminal, 2 is a language processing unit, 3 is a language dictionary, 4 is a prosody setting unit, 5 is a fused speech unit generation unit, 6 is a speech unit dictionary, 7 is a unit selection unit, 8 is a speech synthesis unit, and 9 is an output terminal.

In FIG. 1, input text 101 entered from the input terminal 1 undergoes morphological and syntactic analysis in the language processing unit 2, which consults the language dictionary 3 and outputs an analysis result 102 such as readings and part-of-speech information.

Next, on the basis of the analysis result 102 output by the language processing unit 2, the prosody setting unit 4 performs control processing for the phoneme sequence, accent, and intonation, and sets prosody information 103. The prosody information consists of acoustic feature parameters, such as the phoneme symbol string and the pitch pattern, pitch period, pitch marks, and duration of the speech units, or of prosodic parameters.

The fused speech unit generation unit 5 receives first training speech units 104 and second training speech units 105. For example, it cuts out a plurality of signal waveforms belonging to a given phoneme from the second training speech units 105 and generates a plurality of fused speech units by combining and fusing any one or more of those signal waveforms. It then selects the first training speech units 104 that belong to the same phoneme as the second training speech units and, in accordance with information contained in those units such as pitch period and duration, changes the pitch period, duration, and so on of the fused speech units generated from the second training speech units 105, thereby generating a plurality of synthesized speech units. Next, it evaluates the distortion between each of the synthesized speech units and each of the first training speech units 104, and stores the fused speech unit 106 that minimizes the distortion in the speech unit dictionary 6.
In this embodiment, natural speech signals uttered by a human are used as the training speech units.
The processing of the fused speech unit generation unit 5 is described in detail later.

Next, the unit selection unit 7 refers to the prosody information 103 and to the speech unit dictionary 6, which holds or stores a plurality of fused speech units 106, and selects representative speech units 107, that is, the fused speech units to be used for speech synthesis.

In accordance with the prosody information 103, the speech synthesis unit 8 changes the pitch period and phoneme duration of the representative speech units 107 selected from the speech unit dictionary 6, connects the units, and outputs a synthesized speech signal 108 to the output terminal 9. Known techniques can be used to change the pitch period and phoneme duration and synthesize the speech, for example the residual-driven LSP method, which synthesizes on LSP (Line Spectral Pair) parameters; the MBE (Multi Band Excitation) method, which synthesizes on spectral parameters; the pitch waveform overlap-add method, which overlaps and adds waveforms two pitch periods long; and the waveform editing method, which connects signal waveforms in units such as phonemes.

Next, the processing of the fused speech unit generation unit 5, which characterizes the present invention, is described concretely. The flowchart of FIG. 2 shows the processing procedure of the fused speech unit generation unit 5 in Embodiment 1.

In the fused speech unit generation processing of Embodiment 1, a large amount of speech data is first labeled phoneme by phoneme, and first training speech units S1[i] (i = 1, 2, 3, ..., Ns1) and second training speech units S2[i] (i = 1, 2, 3, ..., Ns2), cut out according to a synthesis unit such as CV, VC, or VCV, are prepared. Ns1 and Ns2 are the numbers of units belonging to the same phoneme in the respective sets of training speech units. During labeling, the phoneme information, pitch information, duration, and phoneme boundary information of each speech unit, and where necessary other information such as the preceding and following phoneme environment, are also extracted and stored.

After the first and second training speech units have been prepared as described above, waveform separation is performed in the waveform separation step S11. To simplify the explanation, this embodiment takes CV as the synthesis unit and sets the waveform separation position of each speech unit at the phoneme boundary between C and V (hereinafter, the CV boundary), so that each unit is separated into two parts: the signal waveform of the C part (hereinafter, the C unit) and the signal waveform of the V part (hereinafter, the V unit).

In the waveform separation step S11, each second training speech unit S2[i] is separated at its CV boundary into a C unit SC[j] (j = 1, 2, 3, ..., Ns2) and a V unit SV[k] (k = 1, 2, 3, ..., Ns2). For example, when the phoneme of the CV unit is /ma/, the C unit is the waveform of the phoneme /m/ and the V unit is the waveform of the phoneme /a/. Along with the separation, information such as the pitch information, duration, and preceding and following phoneme environment of each C unit and V unit is stored.
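The separation in step S11 can be sketched as a simple split of each waveform at its labeled CV boundary. This is a minimal sketch: waveforms are assumed to be plain lists of samples, the boundary is assumed to be a sample index, and the function name is hypothetical.

```python
def split_at_cv_boundary(waveform, cv_boundary):
    """Split one CV unit into its C part and V part at the given sample index."""
    if not 0 < cv_boundary < len(waveform):
        raise ValueError("cv_boundary must fall strictly inside the waveform")
    return waveform[:cv_boundary], waveform[cv_boundary:]


# Second training speech units S2[i] for one phoneme, as (waveform, CV boundary) pairs.
s2 = [
    ([0.1, 0.2, 0.5, 0.4, 0.3], 2),
    ([0.0, 0.3, 0.2, 0.6, 0.5, 0.1], 3),
]

sc, sv = [], []  # C units SC[j] and V units SV[k]
for waveform, boundary in s2:
    c_part, v_part = split_at_cv_boundary(waveform, boundary)
    sc.append(c_part)
    sv.append(v_part)
```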

Next, in the waveform fusion step S12, C units SC[j] and V units SV[k] that were separated in the waveform separation step S11 and belong to the same phoneme are arbitrarily selected, and fused speech units SM[jk] (j = 1, 2, 3, ..., Ns2; k = 1, 2, 3, ..., Ns2) are generated. SM[jk] is the CV speech unit obtained by connecting and fusing the j-th C unit and the k-th V unit. For the pitch of the fused speech unit SM[jk], the pitch information of the C unit and of the V unit is inherited as the pitch of the C part and of the V part, respectively. For the duration, the sum of the durations of the C unit and the V unit is used as the duration of the fused speech unit.
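The fusion in step S12 amounts to forming every (j, k) pairing of C units and V units. The sketch below assumes the units are lists of samples that already belong to the same phoneme; the function name and dictionary keys are hypothetical, and indices start at 0 rather than 1.

```python
from itertools import product


def fuse_all(c_units, v_units):
    """Build fused units SM[jk] from every pairing of C unit j and V unit k."""
    fused = {}
    for (j, c_part), (k, v_part) in product(enumerate(c_units), enumerate(v_units)):
        fused[(j, k)] = {
            "waveform": c_part + v_part,             # simple concatenation at the CV boundary
            "cv_boundary": len(c_part),              # the C part keeps its own length
            "duration": len(c_part) + len(v_part),   # sum of the two durations
        }
    return fused


sc = [[0.1, 0.2], [0.0, 0.3, 0.2]]
sv = [[0.5, 0.4, 0.3], [0.6, 0.5, 0.1]]
sm = fuse_all(sc, sv)
```

With Ns2 units on each side this yields Ns2 x Ns2 candidates, which is why the later distortion evaluation is what narrows the set down.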

When the fused speech unit SM[jk] is generated, interpolation may be applied to reduce the discontinuity at the joint between the C unit and the V unit. Examples of interpolation include linear interpolation of power or amplitude between frames, a moving average, and methods based on Lagrange interpolation polynomials.
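As one illustration of the moving-average option, the sketch below smooths only the samples around the joint with a 3-point average. Operating on individual samples rather than frames, the window length, and the `half_width` parameter are all assumptions made for the example.

```python
def smooth_joint(waveform, boundary, half_width=2):
    """Apply a 3-point moving average to samples near the C/V joint only."""
    out = list(waveform)
    lo = max(boundary - half_width, 1)
    hi = min(boundary + half_width, len(waveform) - 1)
    for n in range(lo, hi):
        # Average over the original samples so the result does not depend on loop order.
        out[n] = (waveform[n - 1] + waveform[n] + waveform[n + 1]) / 3.0
    return out


fused = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]  # abrupt jump at index 4
smoothed = smooth_joint(fused, boundary=4)
```

Samples away from the joint are left untouched, so the bulk of each donor waveform is preserved.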

In the speech unit synthesis step S13, speech synthesis is performed with the pitch and duration of the fused speech unit SM[jk] changed so that they equal the pitch and duration of the first training speech unit S1[i], generating synthesized speech units G[jk, i] (j = 1, 2, 3, ..., Ns2; k = 1, 2, 3, ..., Ns2; i = 1, 2, 3, ..., Ns1). When the phoneme of the fused speech unit is /ma/, CV units of the same phoneme /ma/ are used as the first training speech units S1[i].

In the distortion evaluation step S14, the distortion of the synthesized speech units G[jk, i] is evaluated. This is done by evaluating the distance e[jk, i] (j = 1, 2, 3, ..., Ns2; k = 1, 2, 3, ..., Ns2; i = 1, 2, 3, ..., Ns1) between the synthesized speech unit G[jk, i] and the first training speech unit S1[i]. The distance e[jk, i] can be, for example, the squared error between the signal waveform of G[jk, i] and that of S1[i], or the squared error between spectra obtained by converting G[jk, i] and S1[i] to power spectra using an FFT (Fast Fourier Transform) or the like. Alternatively, it may be a distance between the units computed on known parameters such as LSP parameters or cepstrum parameters. The synthesized speech unit and the first training speech unit may also be passed through, for example, band-pass filters, with a different evaluation method suited to each band. Evaluating distortion with a method suited to each band allows a more detailed evaluation and improves the quality of the synthesized speech.
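The two simplest distance measures named above can be sketched as follows. The sketch assumes both signals already have the same length (that is, the pitch and duration modification of step S13 has been applied), and it uses a naive DFT in place of an FFT to stay dependency-free; the function names are hypothetical.

```python
import cmath


def squared_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def power_spectrum(x):
    """Power spectrum via a naive O(n^2) DFT (an FFT would give the same values)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))) ** 2
        for f in range(n)
    ]


def spectral_squared_error(a, b):
    """Squared error between the power spectra of two equal-length signals."""
    return squared_error(power_spectrum(a), power_spectrum(b))


g = [1.0, 0.0, 0.0, 0.0]    # stand-in for a synthesized unit G[jk, i]
s1 = [0.0, 1.0, 0.0, 0.0]   # stand-in for a training unit S1[i]
waveform_distance = squared_error(g, s1)
spectral_distance = spectral_squared_error(g, s1)
```

In this toy case the two signals differ only by a one-sample circular shift, so the waveform distance is nonzero while the power spectra coincide, which shows how the choice of measure changes what counts as distortion.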

The distance e[jk, i] may also be evaluated with larger weights near the CV boundary and in parts where the spectrum changes greatly, such as the rising and falling parts of speech at the beginning and end of a word and phoneme transition regions. Weighting the vicinity of the CV boundary, which is the waveform fusion point, more heavily means that distortion caused by waveform discontinuities from the fusion is weighted heavily in the evaluation. This suppresses the generation of degraded fused speech units and improves the quality of the synthesized speech.
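A boundary-weighted distance of this kind might look like the sketch below. The rectangular weighting window, the `radius` and `emphasis` values, and the function names are assumptions chosen for illustration; the text does not specify a particular weighting shape.

```python
def boundary_weights(length, boundary, radius=2, emphasis=3.0):
    """Weight samples within `radius` of the CV boundary more heavily than the rest."""
    return [emphasis if abs(n - boundary) <= radius else 1.0 for n in range(length)]


def weighted_squared_error(a, b, weights):
    """Squared error in which each sample's contribution is scaled by its weight."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


reference = [0.0] * 8
error_near_joint = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # error at the boundary
error_far_away = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # same error, far from it

w = boundary_weights(len(reference), boundary=4)
near = weighted_squared_error(error_near_joint, reference, w)
far = weighted_squared_error(error_far_away, reference, w)
```

The same unit-sized error is penalized three times as much at the joint as at the start of the unit under these assumed settings.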

Furthermore, when the distance e[jk, i] between the synthesized speech unit G[jk, i] and the first training speech unit S1[i] is evaluated, some form of perceptual weighting filter may be applied to G[jk, i] and S1[i]. Known methods can be used for the perceptual weighting, for example inverse filtering using LPC (linear predictive coding) parameters. Applying the perceptual weighting to the training speech units in advance reduces the amount of computation.
In that case, only the fused speech unit that minimizes the distortion is generated from training speech units that have not been perceptually weighted, and it is output to the speech unit dictionary 6.
Alternatively, the function that constitutes the perceptual weighting filter may be incorporated into the distance e[jk, i] as a weighting function for the distance calculation. Perceptual weighting makes it possible to evaluate distortion with emphasis on the perceptually important parts, which further improves the quality of the synthesized speech.

In the overall evaluation step S15, after all the synthesized speech units G[jk, i] have been evaluated in the distortion evaluation step S14, the total distortion E[jk] (j = 1, 2, 3, ..., Ns2; k = 1, 2, 3, ..., Ns2), which represents the waveform modification distortion of the fused speech unit SM[jk], is obtained according to equations (1) and (2).

Here, w(ij) is a weighting function derived from the relationship between the first training speech unit S1[i] and the second training speech unit S2[i], F01[i] is the average pitch period of the first training speech unit S1[i], and F02[i] is the average pitch period of the second training speech unit S2[i]. Wc is a predetermined weighting coefficient, set experimentally to a suitable value for each phoneme.

As described above, the total distortion E[jk] is evaluated for the fused speech units SM[jk] formed from every combination of the C segments SC[j] and V segments SV[k] obtained from the second training speech units S2[i], and the fused speech unit 106 composed of the combination of C segment SC[j] and V segment SV[k] that minimizes the total distortion is output to the speech unit dictionary 6. The speech unit dictionary 6 is constructed by repeating steps S11 to S15 for every phoneme needed for speech synthesis, each time replacing the first training speech units 104 and the second training speech units 105 with those of the phoneme in question.
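The exhaustive search over all C/V combinations can be sketched as follows. This is a minimal illustration, not the patented procedure: equations (1) and (2) are not reproduced in this text, so `total_distortion` uses a plain unweighted sum over the references as an assumed stand-in, and the function names and toy distances are hypothetical.

```python
from itertools import product

def total_distortion(e_jk):
    """Assumed stand-in for equations (1) and (2): plain sum over references i."""
    return sum(e_jk)

def select_best_pair(e):
    """e[j][k] is the list of distances e[jk, i] over all references i.

    Returns (j, k, E) for the C/V combination with the smallest total distortion.
    """
    best = None
    for j, k in product(range(len(e)), range(len(e[0]))):
        E = total_distortion(e[j][k])
        if best is None or E < best[2]:
            best = (j, k, E)
    return best

# Toy distances for 2 C segments, 2 V segments and 3 reference units.
e = [
    [[4.0, 5.0, 6.0], [1.0, 2.0, 1.5]],
    [[3.0, 3.0, 3.0], [2.0, 2.5, 2.0]],
]
best_j, best_k, best_E = select_best_pair(e)
```

With these toy values the pair (j = 0, k = 1) has the smallest summed distance and would be the one written to the dictionary.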

In this embodiment, the waveform separation position is placed exactly at the CV boundary to simplify the description, but it may be moved or adjusted for each phoneme to account for coarticulation and similar effects.

The waveform separation position may also be tracked (finely adjusted) around the CV boundary for each speech unit so that the total distortion E[jk] or the distance e[jk, i] becomes smaller. FIG. 3 shows this variation of the processing in the fused speech unit generation unit 5: a feedback loop returns from the comprehensive evaluation step S15 to the waveform separation step S11, and steps S11 to S15 are repeated until the determination step S16 judges that the total distortion or the distance is at its minimum.
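One simple way to realize such tracking is to try every split position in a small window around the labeled CV boundary and keep the best one. The sketch below assumes this windowed search; the text itself only requires looping until step S16 judges the distortion minimal. The `distortion` callback and the toy curve are hypothetical.

```python
def track_boundary(cv_boundary, max_shift, distortion):
    """Try split positions around the labeled CV boundary and keep the best.

    `distortion(pos)` is a hypothetical callback that reruns steps S11 to S15
    with the waveform split at sample `pos` and returns the resulting
    total distortion (or distance).
    """
    best_pos, best_val = cv_boundary, distortion(cv_boundary)
    for shift in range(-max_shift, max_shift + 1):
        pos = cv_boundary + shift
        if pos < 0:
            continue  # a split before the first sample is not meaningful
        val = distortion(pos)
        if val < best_val:
            best_pos, best_val = pos, val
    return best_pos, best_val

# Toy distortion curve whose minimum sits 3 samples after the labeled boundary.
def toy_distortion(pos):
    return (pos - 103) ** 2 + 1.0

pos, val = track_boundary(100, 5, toy_distortion)
```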

In this embodiment the synthesis unit is a CV unit to simplify the description, but the invention naturally applies to synthesis units such as VC, VCV, and CVC as well. For a speech unit such as /myo/, which contains the semivowel /yo/, a fused speech unit can also be created by dividing the unit into three parts, /m/, /y/, and /o/, and combining them. The unit may also be divided into four half-phoneme parts, /-m/, /m-y/, /y-o/, and /o-/.

This embodiment was illustrated with the voiced consonant /ma/, but the invention may also be applied to CV units with unvoiced consonants such as /sa/. It is likewise applicable to a single vowel /a/, where one /a/ phoneme is connected to another /a/ phoneme to generate a long /a/ (a VV unit). A long sound can be generated for a single consonant in the same way (a CC unit). Because single C and V phonemes can be handled, the invention is also applicable to speech synthesis methods whose synthesis unit is a single C or V (single-phoneme units).

Fused speech units may also be generated from units finer than synthesis units such as C, V, and CV. For example, the two-pitch-length waveforms used in two-pitch-length waveform superposition synthesis may be treated as the units of combination and fused at that granularity. Alternatively, the time-domain signal of a speech unit may be divided into 5 ms frames, and fused speech units may be generated by combining the frames at the parameter level, using parameters such as LSP parameters analyzed frame by frame.

In the distortion evaluation for a given phoneme, the synthesized speech units generated from fused speech units may also be compared with those generated from conventional, unfused speech units. If the distortion is smaller without fusion, an ordinary speech unit can be selected for that phoneme instead of a fused one.

With the configuration of Embodiment 1, even when a sufficient number of training speech units cannot be prepared, waveforms can be combined freely to generate fused speech units. This increases the variety of speech units and makes it possible to generate high-quality synthesized speech.

Embodiment 2.
In the fused speech unit generation unit 5 of Embodiment 1, when there are Ns2 second training speech units, the distortion of (Ns2) × (Ns2) combinations of fused speech units must be evaluated, so the amount of processing grows rapidly (as the square of Ns2) as Ns2 increases. Preselecting the speech units used for the combinations reduces the amount of processing needed to evaluate the fused speech units.

図４は、融合音声素片生成部５の別の実施の形態の処理手順を示すフローチャートである。図４において、予備選択ステップＳ２１が波形分離ステップＳ１１の処理の前にあり、他は図２の処理と同様に構成される。 FIG. 4 is a flowchart showing a processing procedure of another embodiment of the fused speech unit generation unit 5. In FIG. 4, the preliminary selection step S21 is before the processing of the waveform separation step S11, and the others are configured in the same manner as the processing of FIG.

In the preliminary selection step S21, a predetermined preselection method selects from the second training speech units only those suitable as candidates for generating fused speech units. The selected speech units then undergo waveform separation and waveform fusion, and the result is passed to the speech unit synthesis step S13. Where appropriate, the first training speech units may be preselected in the same way so that only suitable units are passed to the speech unit synthesis step S13.

Preselection can be implemented, for example, by using the top-ranked speech units chosen by a conventional representative speech unit selection method (units with small distortion at the synthesized speech level), or by selecting only speech units whose pitch period or duration lies within a predetermined range, units that share the same preceding and following phoneme environment, or units whose spectra are similar (for example, whose spectral distance lies within a certain range).
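As an illustration of range-based preselection and its effect on the number of combinations, the following sketch filters units by pitch period and duration. The dictionary fields (`pitch_ms`, `dur_ms`), the thresholds, and the toy units are all hypothetical; the text does not prescribe particular ranges.

```python
def preselect(units, pitch_range, duration_range):
    """Keep only units whose pitch period and duration fall inside the ranges."""
    lo_p, hi_p = pitch_range
    lo_d, hi_d = duration_range
    return [u for u in units
            if lo_p <= u["pitch_ms"] <= hi_p and lo_d <= u["dur_ms"] <= hi_d]

units = [
    {"id": 0, "pitch_ms": 8.0, "dur_ms": 120.0},
    {"id": 1, "pitch_ms": 12.5, "dur_ms": 110.0},  # pitch period out of range
    {"id": 2, "pitch_ms": 7.5, "dur_ms": 300.0},   # duration out of range
    {"id": 3, "pitch_ms": 9.0, "dur_ms": 140.0},
]
kept = preselect(units, pitch_range=(6.0, 10.0), duration_range=(80.0, 200.0))

full_cost = len(units) ** 2    # Ns2 x Ns2 combinations without preselection
reduced_cost = len(kept) ** 2  # combinations after preselection
```

Here two of the four units survive, so the number of C/V combinations to evaluate drops from 16 to 4.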

Another preselection method is to exclude (screen out) speech units of poor sound quality in advance using known techniques. Examples of such units include units with low power; units whose signal waveform, pitch period, or power is irregular; units for which the voiced/unvoiced decision or pitch extraction was wrong at labeling time; and units unsuitable for speech synthesis because their pitch period deviates greatly from the mean or their duration is too short or too long.
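Two of these screening criteria, low power and a pitch period far from the mean, can be sketched as below. The z-score test, the thresholds, and the field names are assumptions chosen for illustration; the text leaves the screening technique to known methods.

```python
from statistics import mean, pstdev

def screen(units, min_power, max_pitch_z):
    """Drop units with low power or a pitch period far from the mean."""
    pitches = [u["pitch_ms"] for u in units]
    mu = mean(pitches)
    sigma = pstdev(pitches)
    kept = []
    for u in units:
        # Deviation of the pitch period from the mean, in standard deviations.
        z = abs(u["pitch_ms"] - mu) / sigma if sigma > 0 else 0.0
        if u["power"] >= min_power and z <= max_pitch_z:
            kept.append(u)
    return kept

units = [
    {"id": 0, "pitch_ms": 8.0, "power": 0.5},
    {"id": 1, "pitch_ms": 8.2, "power": 0.001},  # power too low
    {"id": 2, "pitch_ms": 7.9, "power": 0.4},
    {"id": 3, "pitch_ms": 8.1, "power": 0.6},
    {"id": 4, "pitch_ms": 15.0, "power": 0.5},   # pitch period far from the mean
]
good = screen(units, min_power=0.01, max_pitch_z=1.5)
```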

Steps S11, S12, S13, S14, and S15 are then executed in sequence, as in Embodiment 1, on the preselected training speech units output from the preliminary selection step S21, and the resulting fused speech unit 106 is output to the speech unit dictionary 6.

In Embodiment 2, the waveform separation position is placed exactly at the CV boundary to simplify the description, but it may be moved or adjusted for each phoneme to account for coarticulation and similar effects.

また、本実施の形態２においても、先の実施の形態１と同様に、総合歪E[jk]あるいは距離e[jk、i]が小さくなるように、音声素片毎に波形分離位置をＣＶ境界の前後にトラッキング（微調整）してもよい。図５はこのときの融合音声素片生成部５の処理の別の変形例であり、総合評価ステップＳ１５から波形分離ステップＳ１１に戻るフィードバックループを形成し、判断ステップＳ１６にて総合歪あるいは距離が最小と判断されるまで、ステップＳ１１からステップＳ１５までの処理を順次実施することとなる。 Also in the second embodiment, similarly to the first embodiment, the waveform separation position is set to CV for each speech unit so that the total distortion E [jk] or the distance e [jk, i] is reduced. Tracking (fine adjustment) may be performed before and after the boundary. FIG. 5 shows another modification of the processing of the fused speech unit generation unit 5 at this time. A feedback loop returning from the comprehensive evaluation step S15 to the waveform separation step S11 is formed, and the total distortion or distance is determined in the determination step S16. Until it is determined to be the minimum, the processing from step S11 to step S15 is sequentially performed.

As described above, preselecting the speech units used for the combinations reduces the amount of processing for evaluating fused speech units and also excludes second training speech units of poor sound quality, which improves the quality of the synthesized speech.

Preselecting the first training speech units, which serve as the reference (teacher data) for the fused speech units, likewise reduces the amount of processing during distortion evaluation and excludes first training speech units of poor sound quality, which improves the quality of the synthesized speech.

Embodiment 3.
As a variation of Embodiment 2, preselection can also be performed separately on the C segments and the V segments after they have been separated.

FIG. 6 is a flowchart showing the processing procedure of this embodiment of the fused speech unit generation unit 5. In FIG. 6, the waveform separation step S11 first separates the waveforms of the second training speech units, and the preliminary selection step S21 then preselects the C segments and the V segments independently. Steps S12, S13, S14, and S15 are then executed in sequence as in Embodiment 1, and the resulting fused speech unit 106 is output to the speech unit dictionary 6.

The preselection can use the same methods as the preliminary selection step S21 of Embodiment 2, for example selecting the top candidates among the C segments and among the V segments with a conventional representative segment selection method and generating fused speech units from those top candidates.

In Embodiment 3, as in Embodiment 2, the waveform separation position may be moved or adjusted for each phoneme to account for coarticulation and similar effects, and it may be tracked (finely adjusted) around the CV boundary for each speech unit so that the total distortion E[jk] or the distance e[jk, i] becomes smaller.

Preselecting each separated waveform independently after waveform separation reduces the amount of processing for the distortion evaluation of fused speech units and excludes training speech units of poor sound quality, which improves the quality of the synthesized speech.

Embodiment 4.
As a variation of Embodiment 1, the C segments and V segments of CV units can be shared with those of other phonemes. This greatly reduces the size of the acoustic dictionary while maintaining the quality of the synthesized speech, and because phonemes are shared, it also makes the synthesized speech perceptually more stable.

FIG. 7 is a block diagram showing the configuration of a text-to-speech synthesizer that implements the speech synthesis method according to Embodiment 4 of the present invention. Parts identical to those in FIG. 1 carry the same reference numerals and are not described again; only the differences are described. This embodiment differs from the preceding ones in that it includes a common speech unit generator 21, which generates and outputs a plurality of common speech unit candidates 201 and a plurality of non-common speech unit candidates 202 from the second training speech units 105, and a fused speech unit generator 22, which receives the first training speech units 104, the common speech unit candidates 201, and the non-common speech unit candidates 202 and outputs a common speech unit 203 and non-common speech units 204. Together, the common speech unit generator 21 and the fused speech unit generator 22 constitute the fused speech unit generation unit 5.

The common speech unit generator 21 cuts out from the second training speech units 105 a plurality of signal waveforms whose constituent phoneme name is, for example, shared across units, and generates and outputs the common speech unit candidates 201 and the non-common speech unit candidates 202. Here, for a group of voiced-consonant units such as /ma/, /za/, and /na/, a common speech unit is the waveform of the vowel part /a/, which is the shared phoneme name, and a non-common speech unit is the waveform of one of the consonant parts /m/, /z/, /n/, and so on.

The fused speech unit generator 22 receives the first training speech units 104 together with the common speech unit candidates 201 and the non-common speech unit candidates 202 that the common speech unit generator 21 generated from the second training speech units 105. It generates a plurality of fused speech units by combining and fusing one or more arbitrarily chosen signal waveforms from the common speech unit candidates 201 and the non-common speech unit candidates 202. It then selects the first training speech units 104 that belong to the same phoneme as each fused speech unit and generates a plurality of synthesized speech units by changing the pitch period, duration, and so on of the fused speech units according to the pitch period, duration, and other information contained in those training speech units.

Next, the distortion between each synthesized speech unit and each first training speech unit 104 is evaluated. The evaluation is carried out for all units belonging to the phonemes being shared, and the speech unit of the shared part that gives the smallest distortion is stored in the speech unit dictionary 6 as the common speech unit 203. In addition, for each phoneme, the speech unit of the non-shared part that gives the smallest distortion when the common speech unit 203 is used is stored in the speech unit dictionary 6 as a non-common speech unit 204.

The segment selection unit 7 refers to the prosody information 103 and to the speech unit dictionary 6, which holds or stores the common speech unit 203 and the non-common speech units 204. Following the phoneme symbol string in the prosody information 103, it selects the common speech unit 203 and a non-common speech unit 204, generates the corresponding fused speech unit, and outputs it as the representative speech unit 107 used for speech synthesis.

The flowchart of FIG. 8 shows the processing procedure of the common speech unit generator 21 and the fused speech unit generator 22, that is, of the fused speech unit generation unit 5, in Embodiment 4. As in the processing of the fused speech unit generation unit 5 in Embodiment 1, a large amount of speech data is first labeled phoneme by phoneme. First training speech units S1[i]|ph (i = 1, 2, 3, ..., Ns1; ph = phoneme name) for multiple phonemes are cut out according to a synthesis unit such as CV, VC, or VCV, and second training speech units S2[i]|ph (i = 1, 2, 3, ..., Ns2; ph = phoneme name) for multiple phonemes are cut out in the same way. Here, Ns1 and Ns2 are the numbers of speech units belonging to the same phoneme in the respective sets of training speech units. At labeling time, the phoneme information, pitch information, duration, phoneme boundary information, and, as needed, other information such as the preceding and following phoneme environment are also extracted and stored for each speech unit.
For convenience in the following description, the numbers of first and second training speech units for the other phonemes are also taken to be Ns1 and Ns2, respectively; this does not narrow the scope of the invention, and any numbers may be used.

After the first and second training speech units have been prepared as described above, the waveform separation step S11 and the common segment extraction step S41 first carry out the internal processing of the common speech unit generator 21 on the second training speech units 105. As in Embodiment 1, the following description assumes that the synthesis unit is CV and that the signal waveform of each speech unit is split at the CV boundary into two parts, a C segment and a V segment.

In the waveform separation step S11, each second training speech unit S2[i]|ph is separated at its CV boundary into a C segment SC[j]|ph (j = 1, 2, 3, ..., Ns2; ph = phoneme name) and a V segment SV[k]|ph (k = 1, 2, 3, ..., Ns2; ph = phoneme name). Along with the separation, information such as the pitch information, duration, and preceding and following phoneme environment of each C segment and V segment is stored. The C segments SC[j]|ph are output as the non-common speech unit candidates 202.

In the common segment extraction step S41, taking /a/ as the element to be shared for example, the /a/ waveform signals are extracted from the speech units of the phonemes that contain /a/, namely /ma/, /ba/, /na/, /sa/, and so on, by referring to the V segments SV[k]|ph separated in the waveform separation step S11. This yields the common speech unit candidates 201, A[k] (k = 1, 2, 3, ..., NA), where NA is the number of common speech unit candidates.
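The pooling performed in step S41 can be pictured with a small sketch. It assumes, purely for illustration, that the V segments from step S11 are held in a dictionary keyed by the romanized CV phoneme name and that each waveform is a plain list of samples; neither representation comes from the text.

```python
def extract_common_candidates(v_segments, vowel):
    """Collect every V segment of `vowel`, regardless of the source phoneme."""
    return [seg
            for ph, segs in v_segments.items() if ph.endswith(vowel)
            for seg in segs]

# V segments obtained in step S11, keyed by the CV phoneme they came from.
v_segments = {
    "ma": [[0.1, 0.2], [0.0, 0.3]],
    "ba": [[0.2, 0.2]],
    "mi": [[0.5, 0.4]],  # vowel /i/, not pooled into the /a/ candidates
}
A = extract_common_candidates(v_segments, "a")
NA = len(A)  # number of common speech unit candidates
```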

続いて、波形融合ステップＳ１２、音声素片合成ステップＳ１３、歪評価ステップＳ１４、総合評価ステップＳ１５では、融合音声素片生成具２２の内部処理を実行する。 Subsequently, in the waveform fusion step S12, the speech unit synthesis step S13, the distortion evaluation step S14, and the comprehensive evaluation step S15, internal processing of the fusion speech unit generator 22 is executed.

In the waveform fusion step S12, a C segment SC[j]|ph (a non-common speech unit candidate 202 separated in the waveform separation step S11) and a candidate A[k] (a common speech unit candidate 201 generated in the common segment extraction step S41) are selected freely and combined to generate the fused speech unit SM[jk]|ph (j = 1, 2, 3, ..., Ns2; k = 1, 2, 3, ..., NA; ph = phoneme name). Here, SM[jk]|ph=ma denotes the CV speech unit /ma/ obtained by connecting and fusing the j-th C segment (/m/) of /ma/ with the k-th common segment candidate (/a/). The pitch of the fused speech unit SM[jk]|ph inherits the pitch information of the C segment as the pitch of the C part and that of the common speech unit candidate as the pitch of the V part; its duration is the sum of the duration of the C segment and the duration of the common speech unit candidate.

前記の融合音声素片SM[jk]|phを生成する際に、Ｃ素片と共通化音声素片候補との接続部の不連続を軽減するために補間処理を行ってもよい。補間処理の例として、フレーム間のパワーや振幅の線形補間、移動平均、Lagrangeの補間多項式を利用した方法等を用いることができる。 When generating the fused speech unit SM [jk] | ph, an interpolation process may be performed to reduce discontinuity in the connection between the C unit and the common speech unit candidate. As an example of the interpolation process, a method using linear interpolation of power and amplitude between frames, a moving average, a Lagrange interpolation polynomial, and the like can be used.
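A minimal sketch of fusion with junction smoothing follows. It picks the moving-average option from the interpolation methods named above and applies a 3-point average in a small window around the joint; the window size, the sample-level (rather than frame-level) smoothing, and the function name are assumptions made for illustration.

```python
def fuse(c_seg, v_seg, half_width=2):
    """Concatenate a C segment and a V segment, then smooth the junction.

    A 3-point moving average is applied to the samples in the window
    [joint - half_width, joint + half_width); other samples are untouched.
    """
    fused = list(c_seg) + list(v_seg)
    joint = len(c_seg)
    smoothed = fused[:]
    start = max(1, joint - half_width)
    stop = min(len(fused) - 1, joint + half_width)
    for n in range(start, stop):
        # Average over the unsmoothed signal so the result is order-independent.
        smoothed[n] = (fused[n - 1] + fused[n] + fused[n + 1]) / 3.0
    return smoothed

c = [0.0, 0.0, 0.0, 0.0]
v = [3.0, 3.0, 3.0, 3.0]
out = fuse(c, v, half_width=2)
```

The fused length equals the sum of the two segment lengths, matching the rule that the fused duration is the sum of the C and V durations, and the step at the joint is turned into a short ramp.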

音声素片合成ステップＳ１３では、第１のトレーニング音声素片S1[i]|phのピッチおよび継続時間長に等しくなるように、融合音声素片SM[jk]|phのピッチおよび継続時間長を変更して音声合成を行って、合成音声素片G[jk、i]|ph（j=１、２、３、…、Ns2、k=１、２、３、…、NA、i=１、２、３、…、Ns1、ph=音韻名）を生成する。ここで、融合音声素片の音韻が/ma/の場合には、同一の音韻/ma/のＣＶ素片である第１のトレーニング音声素片S1[i]|ph=maを用いて音声合成し、合成音声素片G[jk、i]|ph=maと記す。同様に、融合音声素片が/ba/の場合には、第１のトレーニング音声素片も/ba/のＣＶ素片を用いて合成音声素片G[jk、i]|ph=baと記し、全ての音韻に対する合成音声素片を生成する。 In the speech unit synthesis step S13, the pitch and duration of the fusion speech unit SM [jk] | ph are set to be equal to the pitch and duration of the first training speech unit S1 [i] | ph. The synthesized speech unit G [jk, i] | ph (j = 1, 2, 3,..., Ns2, k = 1, 2, 3,..., NA, i = 1, 2, 3,..., Ns1, ph = phoneme name). Here, when the phoneme of the fusion speech unit is / ma /, speech synthesis is performed using the first training speech unit S1 [i] | ph = ma which is a CV unit of the same phoneme / ma /. And a synthesized speech segment G [jk, i] | ph = ma. Similarly, when the fusion speech unit is / ba /, the first training speech unit is also written as a synthesized speech unit G [jk, i] | ph = ba using the CV unit of / ba /. Generate synthetic speech segments for all phonemes.

歪評価ステップＳ１４では、合成音声素片G[jk、i]|phの歪評価を行う。この歪評価は合成音声素片G[jk、i]|phと第１のトレーニング音声素片S1[i]|phとの距離e[jk、i]|phを評価することで行う。距離e[jk、i]|phは、例えば、合成音声素片G[jk、i]|phの信号波形と第１のトレーニング音声素片S1[i]|phの信号波形の２乗誤差や、合成音声素片G[jk、i]|phおよび第１のトレーニング音声素片S1[i]|phをＦＦＴ(Fast Fourier Transform)等を用いて、パワースペクトルに変換し、スペクトル間の２乗誤差を用いることができる。あるいは、ＬＳＰパラメータ、ケプストラムパラメータ等の公知のパラメータを用いたそれぞれの素片間の距離であっても良い。また、合成音声素片と第１のトレーニング音声素片を、例えば帯域通過フィルタ処理し、帯域毎に適した別の評価方法を用いても良い。帯域毎に適した評価方法により歪評価を行うことにより、さらに詳細な歪評価が可能となり、合成音声の品質を向上することができる。 In distortion evaluation step S14, distortion evaluation of the synthesized speech segment G [jk, i] | ph is performed. This distortion evaluation is performed by evaluating a distance e [jk, i] | ph between the synthesized speech element G [jk, i] | ph and the first training speech element S1 [i] | ph. The distance e [jk, i] | ph is, for example, the square error between the signal waveform of the synthesized speech unit G [jk, i] | ph and the signal waveform of the first training speech unit S1 [i] | ph. , The synthesized speech unit G [jk, i] | ph and the first training speech unit S1 [i] | ph are converted into a power spectrum using FFT (Fast Fourier Transform) or the like, and the square between the spectra Errors can be used. Alternatively, it may be a distance between each element using a known parameter such as an LSP parameter or a cepstrum parameter. Alternatively, the synthesized speech unit and the first training speech unit may be subjected to, for example, band-pass filter processing, and another evaluation method suitable for each band may be used. By performing distortion evaluation using an evaluation method suitable for each band, further detailed distortion evaluation becomes possible, and the quality of synthesized speech can be improved.
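Two of the distance measures mentioned above, the waveform squared error and the squared error between power spectra, can be sketched in pure Python. The sketch assumes both signals have the same length, which holds once the fused unit has been resynthesized to the reference duration, and it uses a naive DFT for clarity where a real system would use an FFT. All function names are hypothetical.

```python
import cmath

def squared_error(x, y):
    """Sum of squared sample differences between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def power_spectrum(x):
    """Power spectrum via a naive DFT, for bins 0 .. n/2."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n))) ** 2
            for f in range(n // 2 + 1)]

def spectral_distance(x, y):
    """Squared error between the power spectra of two signals."""
    return squared_error(power_spectrum(x), power_spectrum(y))

g = [0.0, 1.0, 0.0, -1.0]   # toy synthesized unit
s1 = [0.0, 1.0, 0.0, -0.5]  # toy reference unit
e_wave = squared_error(g, s1)
e_spec = spectral_distance(g, s1)
```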

また、ＣＶ境界近傍やスペクトルが大きく変動する部分、例えば、語頭・語尾など音声の立ち上がり・立下り部分や音韻変化過渡部においては、他の部分より大きく重み付けして距離e[jk、i]|phを評価してもよい。波形融合点であるＣＶ近傍等を大きく重み付けして距離e[jk、i]|phを評価することにより、波形融合による波形不連続に起因する歪に大きく重み付けして評価することができるので、劣化した融合音声素片の生成を抑制することができ、合成音声の品質を向上することができる。 In addition, in the vicinity of the CV boundary and the part where the spectrum greatly fluctuates, for example, the rising / falling part of the speech such as the beginning / end of the word or the phonological change transition part, the distance e [jk, i] | You may evaluate ph. By evaluating the distance e [jk, i] | ph by weighting the vicinity of the CV that is the waveform fusion point and the like, the distortion caused by the waveform discontinuity due to waveform fusion can be weighted and evaluated. Generation of deteriorated fused speech segments can be suppressed, and the quality of synthesized speech can be improved.
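The boundary emphasis can be illustrated with a weighted squared error. The rectangular weighting window, its width, and the emphasis factor below are hypothetical choices; the text only states that the neighborhood of the CV boundary and similar regions are weighted more heavily.

```python
def boundary_weights(length, boundary, width, emphasis):
    """Weight `emphasis` within `width` samples of the boundary, 1.0 elsewhere."""
    return [emphasis if abs(n - boundary) <= width else 1.0
            for n in range(length)]

def weighted_squared_error(x, y, weights):
    """Squared error in which each sample difference is scaled by its weight."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))

g = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
s1 = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
w = boundary_weights(len(g), boundary=3, width=1, emphasis=4.0)
e = weighted_squared_error(g, s1, w)
```

The two unit errors contribute 1.0 and 4.0 respectively, so the mismatch at the boundary dominates the distance.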

さらに、合成音声素片G[jk、i]|phと第１のトレーニング音声素片S1[i]|phとの距離e[jk、i]|phを評価する際に、合成音声素片G[jk、i]|phと第１のトレーニング音声素片S1[i]|phに対して、なんらかの聴覚的重み付けフィルタ処理を行っても良い。聴覚重み付けの方法としては、例えばＬＰＣ(Linear Predictive Coefficient)パラメータ等を用いた逆フィルタ処理による方法等の公知の方法を用いることができる。この聴覚重み付け処理はトレーニング音声素片に予め処理しておくことで計算を省力化することができる。このとき、音声素片辞書６へは聴覚重み付け処理を行っていないトレーニング音声素片から、前記歪最小となる融合音声素片のみを生成して出力する。また、前記の距離e[jk、i]|phに対して、聴覚重み付けフィルタを構成する関数を距離計算の重み付け関数として組み込んでもよい。聴覚重み付け処理を行うことで、聴覚的に重要な部分を重視した歪評価が可能となり、さらに合成音声の品質を向上することができる。 Further, when evaluating the distance e [jk, i] | ph between the synthesized speech unit G [jk, i] | ph and the first training speech unit S1 [i] | ph, the synthesized speech unit G Some perceptual weighting filter processing may be performed on [jk, i] | ph and the first training speech element S1 [i] | ph. As the auditory weighting method, for example, a known method such as a method by inverse filter processing using an LPC (Linear Predictive Coefficient) parameter or the like can be used. This auditory weighting process can save the calculation by processing the training speech segment in advance. At this time, only the fused speech unit that minimizes the distortion is generated and output to the speech unit dictionary 6 from the training speech unit that is not subjected to auditory weighting processing. In addition, a function constituting an auditory weighting filter may be incorporated as a weighting function for distance calculation with respect to the distance e [jk, i] | ph. By performing auditory weighting processing, distortion evaluation can be performed with emphasis on auditory important parts, and the quality of synthesized speech can be further improved.

総合評価ステップＳ１５では、歪評価ステップＳ１４にて合成音声素片G[jk、i]|phの全ての歪評価を行った後、式（３）、（４）に従って、融合音声素片SM[jk]|phの波形変形歪を評価して、共通化音声素片２０３および非共通化音声素片２０４を音声素片辞書６へ出力する。 In the comprehensive evaluation step S15, after all the distortion evaluations of the synthesized speech unit G [jk, i] | ph are performed in the distortion evaluation step S14, the fusion speech unit SM [ The waveform deformation distortion of jk] | ph is evaluated, and the common speech unit 203 and the non-common speech unit 204 are output to the speech unit dictionary 6.

まず、共通化音声素片を決定するために、第２のトレーニング音声素片S2[i]|phから得られた、Ｃ素片SC[j]|phと共通化音声素片A[k]の全ての組み合わせによる融合音声素片SM[jk]|phの音韻別歪Ep[jk]|phを式（３）から求める。音韻別歪Ep[jk]|phを求めた後、全ての音韻に対する総合歪EA[k]を式（４）で求め、総合歪EA[k]が最小となる共通化音声素片候補A[k]を共通化音声素片２０３として音声素片辞書６に記憶する。 First, to determine the common speech unit, the per-phoneme distortion Ep[jk]|ph of the fused speech unit SM[jk]|ph is obtained from Equation (3) for every combination of the C unit SC[j]|ph, obtained from the second training speech units S2[i]|ph, and the common speech unit A[k]. After Ep[jk]|ph has been obtained, the total distortion EA[k] over all phonemes is obtained from Equation (4), and the common speech unit candidate A[k] that minimizes EA[k] is stored in the speech unit dictionary 6 as the common speech unit 203.

続いて、共通化音声素片A[k]が求まった後、各音韻において決定した共通化音声素片に対応する合成音声素片G[jk、i]|phを再評価し、各音韻別に歪が最小となるＣ素片を、非共通化音声素片２０４として音声素片辞書に記憶する。以上、前記のステップＳ１１、Ｓ４１およびＳ１２〜Ｓ１５の工程を、全ての共通化音素に対して順次実施することで音声素片辞書６を構築する。 Next, once the common speech unit A[k] has been determined, the synthesized speech units G[jk, i]|ph corresponding to the determined common speech unit are re-evaluated for each phoneme, and the C unit that minimizes the distortion for each phoneme is stored in the speech unit dictionary as the non-common speech unit 204. The speech unit dictionary 6 is constructed by carrying out steps S11, S41, and S12 to S15 above in turn for all common phonemes.
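The two-stage selection in step S15 can be sketched as below. Equations (3) and (4) are not reproduced in this part of the text, so the aggregation used here is an assumption made purely for illustration: Ep is taken as the sum of the distances e over the training units i, and EA[k] as the sum over phonemes of the best Ep for that candidate. The function name `select_units` and the array layout are likewise hypothetical.

```python
import numpy as np

def select_units(e):
    """Pick the common unit candidate and the per-phoneme C units.

    e maps a phoneme label to an array of shape (J, K, I) holding the
    distance for C unit j, common candidate k, and training unit i.
    All phonemes are assumed to share the same K candidates.
    """
    # Assumed form of Equation (3): per-phoneme distortion Ep[j, k].
    ep = {ph: d.sum(axis=2) for ph, d in e.items()}
    # Assumed form of Equation (4): total distortion EA[k] over phonemes.
    ea = sum(m.min(axis=0) for m in ep.values())
    k_best = int(np.argmin(ea))
    # Re-evaluate each phoneme against the chosen common unit only.
    c_best = {ph: int(np.argmin(m[:, k_best])) for ph, m in ep.items()}
    return k_best, c_best
```

The returned `k_best` plays the role of the common speech unit 203, and each entry of `c_best` the non-common speech unit 204 for one phoneme.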

なお、本実施の形態４においては、説明の簡略化のために波形分離位置をＣＶ境界丁度としているが、音韻毎に調音結合等を考慮して波形分離位置を移動・調整してもよい。 In the fourth embodiment, the waveform separation position is placed exactly at the CV boundary to simplify the description, but it may be moved and adjusted for each phoneme to take coarticulation and similar effects into account.

本実施の形態４においても、先の実施の形態１と同様に、総合歪EA[k]あるいは距離e[jk、i]|phが小さくなるように、音声素片毎に波形分離位置をＣＶ境界の前後にトラッキング（微調整）してもよい。 In the fourth embodiment as well, as in the first embodiment, the waveform separation position may be tracked (finely adjusted) around the CV boundary for each speech unit so that the total distortion EA[k] or the distance e[jk, i]|ph becomes small.

図９は、このときの共通音声素片生成具２１と融合音声素片生成具２２の処理の別の変形例であり、総合評価ステップＳ１５と、波形分離ステップＳ１１との間にフィードバックループを形成し、判断ステップＳ１６にて総合歪あるいは距離が最小と判断されるまで、ステップＳ１１、Ｓ４１およびＳ１２〜Ｓ１５までの処理を順次実施することとなる。 FIG. 9 shows another modification of the processing of the common speech unit generator 21 and the fusion speech unit generator 22 at this time, and a feedback loop is formed between the comprehensive evaluation step S15 and the waveform separation step S11. Until the total distortion or the distance is determined to be minimum in the determination step S16, the processes in steps S11, S41 and S12 to S15 are sequentially performed.

本実施の形態４においては、母音/a/について共通化を行った一例を提示しているが、例えば、/ma/、/mi/、/mu/、/me/、/mo/等の有声子音の子音部/m/等についても共通化可能である。また、無声子音/sa/、/shi/、/su/、/se/、/so/等の無声子音についてもこの発明は適用可能である。さらに、/m/等のＣ素片と/a/等のＶ素片をそれぞれ共通化し、Ｃ素片とＶ素片の渡りの部分（/m-a/）、すなわち、音韻過渡部だけを非共通化音声素片とすることも可能である。 The fourth embodiment presents an example in which the vowel /a/ is shared, but the consonant part of a voiced consonant, such as /m/ in /ma/, /mi/, /mu/, /me/, /mo/, can be shared in the same way. The invention is also applicable to unvoiced consonants such as those in /sa/, /shi/, /su/, /se/, /so/. Furthermore, C units such as /m/ and V units such as /a/ may each be shared, so that only the transition between the C unit and the V unit (/m-a/), that is, the phoneme transition part, is kept as a non-common speech unit.

本実施の形態４においては、説明の簡略化のために合成単位をＣＶ素片として説明を行ったが、ＶＣ、ＶＣＶ、ＣＶＣといったような合成単位にも勿論適用できる。また、例えば、/myo/のような半母音/yo/を含む音声素片においては、/m/、/y/、/o/と３分割してそれぞれを組み合わせることで融合音声素片を作成することも可能である。また、半音素単位で/-m/、/m-y/、/y-o/、/o-/と４分割してもかまわない。 In the fourth embodiment, for the sake of simplification of explanation, the synthesis unit has been described as a CV segment. However, the present invention can also be applied to synthesis units such as VC, VCV, and CVC. In addition, for example, in a speech unit including a semi-vowel / yo / such as / myo /, a fusion speech unit is created by combining each of them by dividing into / m /, / y /, and / o /. It is also possible. Further, it may be divided into four units of / -m /, / m-y /, / y-o /, / o- / in units of semiphonemes.

さらに、Ｃ、Ｖ、ＣＶといった合成単位よりももっと細分化された単位、例えば、２ピッチ長波形重畳合成方法に用いられる２ピッチ長波形を素片組み合わせ単位と見なし、この２ピッチ長波形単位で組み合わせて共通化音声素片を生成したり、また、音声素片の時間軸信号を5ms単位のフレームに分割し、そのフレーム単位に分析したＬＳＰパラメータなどのパラメータレベルで組み合わせて共通化音声素片を生成しても良い。 Furthermore, units finer than synthesis units such as C, V, and CV may be used. For example, the 2-pitch-length waveform used in the 2-pitch-length waveform overlap-add synthesis method may be regarded as the unit of combination, and common speech units may be generated by combining such waveforms. Alternatively, the time-domain signal of a speech unit may be divided into 5 ms frames, and common speech units may be generated by combining at the parameter level, for example using LSP parameters analyzed for each frame.
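The frame-level variant can be pictured with a short sketch. It only shows the 5 ms framing and a frame-wise splice of two parameter tracks. The per-frame analysis itself (LSP in the text) is not implemented, and the names `split_into_frames` and `splice_frames` as well as the idea of a single splice index are assumptions for illustration.

```python
import numpy as np

def split_into_frames(signal, sample_rate, frame_ms=5.0):
    """Cut a time-domain signal into consecutive frames of frame_ms each.

    Trailing samples that do not fill a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

def splice_frames(params_a, params_b, boundary):
    """Combine two per-frame parameter tracks at a frame boundary.

    params_a and params_b are (n_frames, n_params) arrays, for example
    one row of analysis parameters per 5 ms frame.
    """
    return np.concatenate([params_a[:boundary], params_b[boundary:]])
```

At a 16 kHz sampling rate, for example, each 5 ms frame holds 80 samples.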

また、歪評価の際に、ある音韻において、共通化音声素片を用いて合成音声素片を生成した場合と、共通化音声素片を用いない、すなわち通常の融合音声素片にて合成音声素片を生成した場合とを比較し、共通音声素片を用いない場合の方が歪が小さくなる場合には、当該音韻に関しては共通化音声素片を用いずに通常の融合音声素片を選択することも可能である。 In the distortion evaluation, for a given phoneme, the case where the synthesized speech unit is generated using the common speech unit may be compared with the case where it is generated without the common speech unit, that is, from an ordinary fused speech unit. If the distortion is smaller without the common speech unit, the ordinary fused speech unit may be selected for that phoneme instead of the common speech unit.

本実施の形態４の構成をとることにより、例えば、トレーニング素片の個数が十分用意できない場合でも、任意に波形を組み合わせて融合音声素片を生成して音声素片とすることで、音声素片のバリエーションを増やすことができ、品質の高い合成音声を生成することができる。 With the configuration of the fourth embodiment, even when a sufficient number of training units cannot be prepared, fused speech units can be generated by freely combining waveforms and used as speech units. This increases the variety of speech units and makes it possible to generate high-quality synthesized speech.

また、本実施の形態４の構成をとることにより、ＣＶ素片のＣ素片とＶ素片を、他の音韻のそれらと共通化することで共通部分を縮退化きるので、合成音声の品質を維持したまま音響辞書のメモリ量を大幅に削減したり、さらに、音素を共通化できるため聴感上の合成音声の安定化を図ることが可能となる。 Also, with the configuration of the fourth embodiment, the C unit and V unit of a CV unit are shared with those of other phonemes, so the common parts can be collapsed into a single stored copy. This greatly reduces the memory size of the acoustic dictionary while maintaining the quality of the synthesized speech, and because phonemes are shared, it also makes the synthesized speech perceptually more stable.

実施の形態５． Embodiment 5.
実施の形態４の別の実施の形態５として、先の実施の形態２と同様に、共通化音声素片、融合音声素片の組み合わせに用いる音声素片を予備選択してもよい。予備選択することで融合音声素片評価に対する処理量を削減できるとともに、音質が悪いトレーニング音声素片を排除することができ、合成音声の品質を向上することができる。 As a fifth embodiment, which is a variation of the fourth embodiment, the speech units used for combining common speech units and fused speech units may be preselected, as in the second embodiment. Preselection reduces the processing required to evaluate fused speech units and also excludes training speech units with poor sound quality, which improves the quality of the synthesized speech.

実施の形態６． Embodiment 6.
前記実施の形態１では、第１のトレーニング音声素片１０４中の音声素片が保持するピッチ周期および音韻継続時間長に従って融合音声素片１０６を生成したが、所定の規則により生成されたピッチおよび継続時間長、例えば、韻律設定部４が出力する入力テキストのピッチ周期および音韻継続時間長に従って融合音声素片を変形して合成音声素片を生成し、韻律設定部４の出力するピッチ周期および音韻継続時間長との差が最小となる音声素片を、第１のトレーニング音声素片１０４から抽出して、抽出された第１のトレーニング音声素片のピッチおよび継続時間長と合成音声素片との歪評価を行うことも可能である。 In the first embodiment, the fused speech unit 106 was generated according to the pitch period and phoneme duration held by the speech units in the first training speech units 104. Alternatively, the fused speech unit may be modified according to a pitch and duration generated by predetermined rules, for example the pitch period and phoneme duration of the input text output by the prosody setting unit 4, to generate a synthesized speech unit. The speech unit whose difference from the pitch period and phoneme duration output by the prosody setting unit 4 is smallest is then extracted from the first training speech units 104, and the distortion is evaluated between the synthesized speech unit and the extracted first training speech unit with its pitch and duration.

図１０は、この発明の実施の形態６に係る音声合成方法を実現する音声合成装置の構成を示すブロック図である。図１と同一部分については同一の参照符号を付して説明を省き相違点を説明する。本実施の形態では、韻律設定部４が出力する韻律情報１０３が、融合音声素片生成部５へ入力されていることが、これまでの実施の形態と異なる点である。 FIG. 10 is a block diagram showing the configuration of a speech synthesizer that implements the speech synthesis method according to Embodiment 6 of the present invention. The same parts as those in FIG. 1 are denoted by the same reference numerals, and description thereof will be omitted. In the present embodiment, the prosody information 103 output from the prosody setting unit 4 is input to the fused speech unit generation unit 5, which is different from the previous embodiments.

まず、入力端子１より、入力テキスト１０１として例えば「山の景色を見る」を入力する。言語処理部２では、言語辞書３を相互参照して入力テキスト１０１の解析を行い解析結果１０２を出力する。韻律設定部４では音韻系列、アクセントならびにイントネーションの制御処理が行われ、音響的特徴のパラメータ、例えば、音韻記号列、音声素片のピッチパターン、ピッチ周期、ピッチマーク、継続時間長または韻律のパラメータである韻律情報１０３が設定される。なお、入力テキストとして入力された「山の景色を見る」は、例えばＣＶを合成単位とした場合、式５のような音韻記号列に分解される。 First, for example, “view mountain scenery” is input from the input terminal 1 as the input text 101. The language processing unit 2 analyzes the input text 101 by cross-referencing the language dictionary 3 and outputs an analysis result 102. The prosody setting unit 4 performs control processing of phoneme series, accent and intonation, and parameters of acoustic features such as phoneme symbol string, pitch pattern of speech segment, pitch period, pitch mark, duration length or prosody parameter The prosodic information 103 is set. It should be noted that the “view mountain scenery” input as the input text is decomposed into a phoneme symbol string as shown in Equation 5 when CV is used as a synthesis unit.
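As a rough picture of the decomposition into CV synthesis units, the sketch below splits a romanized reading of the example sentence. Equation 5 itself is not reproduced in this text, so the romanization "yama no keshiki o miru" and the helper name `split_cv` are assumptions, and the regular expression is a toy rule that ignores the moraic nasal, geminate consonants, and long vowels.

```python
import re

def split_cv(romaji):
    """Split a romanized string into consonant-plus-vowel units.

    Each unit is zero or more non-vowel letters followed by one vowel.
    Spaces are removed first so that word boundaries do not matter.
    """
    return re.findall(r"[^aeiou]*[aeiou]", romaji.replace(" ", ""))

units = split_cv("yama no keshiki o miru")
print(units)
```

The first two units are "ya" and "ma", which matches the phonemes /ya/, /ma/ named in the description of the fused speech unit generation unit 5.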

融合音声素片生成部５では、前記の音韻記号列の各音韻/ya/、/ma/、…に対応する韻律情報１０３に従って、順次第２のトレーニング音声素片１０５から融合音声素片を生成し、第１のトレーニング音声素片１０４から韻律情報１０３に最も適した各々の音声素片を選択して、前記融合音声素片との歪評価を行い、歪を最小とする融合音声素片１０６を音声素片辞書６に記憶する。 The fused speech unit generation unit 5 sequentially generates fused speech units from the second training speech units 105 according to the prosody information 103 corresponding to each phoneme /ya/, /ma/, … of the phoneme symbol string, selects from the first training speech units 104 the speech unit best suited to the prosody information 103 for each phoneme, evaluates the distortion against the fused speech unit, and stores the fused speech unit 106 that minimizes the distortion in the speech unit dictionary 6.

図１１は、本実施の形態における融合音声素片生成部５の処理手順を示すフローチャートである。図１１のフローチャートは、図２で説明したステップＳ１１、ステップＳ１２、ステップＳ１３、ステップＳ１４、ステップＳ１５と、新規要素である評価素片選択ステップＳ５１により構成される。 FIG. 11 is a flowchart showing a processing procedure of the fused speech unit generation unit 5 in the present embodiment. The flowchart of FIG. 11 includes step S11, step S12, step S13, step S14, and step S15 described in FIG. 2, and an evaluation element selection step S51 that is a new element.

図１１より、まず、波形分離ステップＳ１１により、第２のトレーニング音声素片１０５の波形分離を行い、波形融合ステップＳ１２で融合音声素片を生成する。音声素片合成ステップＳ１３では、各音韻に対応した韻律情報１０３に含まれるピッチ周期および継続時間長に従って、前記生成された融合音声素片のピッチ周期および継続時間長等を変更することにより、合成音声素片を生成する。 From FIG. 11, first, waveform separation of the second training speech unit 105 is performed in waveform separation step S11, and a fusion speech unit is generated in waveform fusion step S12. In the speech unit synthesis step S13, synthesis is performed by changing the pitch period and duration of the generated fused speech unit in accordance with the pitch period and duration included in the prosodic information 103 corresponding to each phoneme. Generate speech segments.

続いて、評価素片選択ステップＳ５１では、前記生成された合成音声素片の歪評価を行うために、第１のトレーニング音声素片１０４から、各音韻に対応した韻律情報１０３に含まれるピッチ周期および継続時間長に近似した音声素片を選択・抽出する。言い換えれば、合成音声素片が持つピッチパターンと継続時間長に近似した音声素片を選択する。 Next, in the evaluation unit selection step S51, in order to evaluate the distortion of the generated synthesized speech unit, speech units whose pitch period and duration approximate those contained in the prosody information 103 for each phoneme are selected and extracted from the first training speech units 104. In other words, speech units that approximate the pitch pattern and duration of the synthesized speech unit are selected.

第１のトレーニング音声素片１０４から、歪評価に用いる音声素片を選択する方法として、例えば、下記式６に示すピッチ周期と継続時間長の重み付き２乗誤差Ed[i]を用い、Ed[i]が所定の閾値以下の音声素片を選択することで実施できる。 One way to select the speech units used for distortion evaluation from the first training speech units 104 is to use the weighted squared error Ed[i] of the pitch period and duration given by Equation 6 below, and to select the speech units whose Ed[i] is at or below a predetermined threshold.

ここで、F0ruleは、韻律情報１０３に含まれるピッチ周期系列を示すM個の配列であり、F0rule[j]はそのj番目の要素を示す。また、F0[i]（i＝１、２、３、…、Ns1）は、F0ruleの配列長にあわせて正規化した（M次元化）した第１のトレーニング音声素片１０４のピッチ周期系列の配列であり、F0[i][j]はF0[i]のj番目の要素を示す。同様にDURruleは韻律情報１０３に含まれる継続時間長を示し、DUR[i]（i＝１、２、３、…、Ns1）は第１のトレーニング音声素片１０４の継続時間長である。wfおよびwdは所定の重み係数であり、例えば、wf=0.8、wd=0.2である。 Here, F0rule is an array of M elements representing the pitch period sequence contained in the prosody information 103, and F0rule[j] is its j-th element. F0[i] (i = 1, 2, 3, …, Ns1) is the pitch period sequence of the i-th first training speech unit 104, normalized to the array length of F0rule (that is, converted to M elements), and F0[i][j] is the j-th element of F0[i]. Similarly, DURrule is the duration contained in the prosody information 103, and DUR[i] (i = 1, 2, 3, …, Ns1) is the duration of the i-th first training speech unit 104. wf and wd are predetermined weighting coefficients, for example wf = 0.8 and wd = 0.2.
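The selection in step S51 can be sketched as follows. Equation 6 is not reproduced in this text, so the form used here is a reconstruction from the symbol definitions only: wf times the summed squared pitch period difference over the M elements, plus wd times the squared duration difference. Whether the original normalizes by M, and how it resamples a pitch sequence to M elements, is not stated; linear interpolation and the function names are assumptions.

```python
import numpy as np

def resample_to_length(seq, m):
    """Linearly interpolate a pitch period sequence to m elements (assumed method)."""
    seq = np.asarray(seq, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(seq))
    dst = np.linspace(0.0, 1.0, num=m)
    return np.interp(dst, src, seq)

def prosody_error(f0_rule, dur_rule, f0_unit, dur_unit, wf=0.8, wd=0.2):
    """Assumed form of Ed[i]: weighted squared error of pitch period and duration."""
    f0_rule = np.asarray(f0_rule, dtype=float)
    f0_norm = resample_to_length(f0_unit, len(f0_rule))
    pitch_term = np.sum((f0_rule - f0_norm) ** 2)
    dur_term = (dur_rule - dur_unit) ** 2
    return float(wf * pitch_term + wd * dur_term)

def select_evaluation_units(f0_rule, dur_rule, units, threshold):
    """Return indices of training units whose Ed[i] is at or below the threshold.

    units is a list of (pitch_period_sequence, duration) pairs.
    """
    return [i for i, (f0, dur) in enumerate(units)
            if prosody_error(f0_rule, dur_rule, f0, dur) <= threshold]
```

With wf = 0.8 and wd = 0.2 as in the text, the pitch term dominates only if pitch period and duration are expressed on comparable scales, so in practice the units of the two quantities matter as much as the weights.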

歪評価ステップＳ１４では、評価素片選択ステップＳ５１にて選択された音声素片と、音声素片合成ステップＳ１３にて生成された合成音声素片との歪評価を各音韻毎に実行する。 In the distortion evaluation step S14, distortion evaluation between the speech unit selected in the evaluation unit selection step S51 and the synthesized speech unit generated in the speech unit synthesis step S13 is performed for each phoneme.

以上、前記の「山の景色を見る」に続いて、大量の任意の入力テキストを順次入力して、言語処理部２、韻律設定部４および融合音声素片生成部５のステップＳ１１〜Ｓ１３，Ｓ５１，Ｓ１４〜Ｓ１５の処理を順次実行し、各音韻毎にステップＳ１４で得られた歪評価を集計する。総合評価ステップＳ１５ではこうして得られた歪評価を元に、最終的に歪が最小となる融合音声素片１０６を各音韻毎に音声素片辞書６に記憶する。 As described above, following the “viewing the mountain landscape”, a large amount of arbitrary input text is sequentially input, and steps S11 to S13 of the language processing unit 2, the prosody setting unit 4 and the fusion speech unit generation unit 5 are performed. The processes of S51 and S14 to S15 are sequentially executed, and the distortion evaluations obtained in step S14 are tabulated for each phoneme. In the comprehensive evaluation step S15, based on the distortion evaluation obtained in this way, the fusion speech element 106 that finally minimizes the distortion is stored in the speech element dictionary 6 for each phoneme.

なお、評価素片選択ステップＳ５１において用いられた、ピッチ周期系列および継続時間長については、所定の規則によって生成された韻律情報の代わりに、自然音声から抽出されたピッチ周期系列および継続時間長、すなわち自然韻律を用いることもできる。 Note that the pitch period sequence and the duration length used in the evaluation segment selection step S51 are, instead of the prosodic information generated by a predetermined rule, the pitch period sequence and the duration length extracted from natural speech, That is, natural prosody can be used.

実施の形態６の構成をとることにより、韻律設定部４により生成された韻律情報１０３に則した音声素片のみを評価することができるので、さらに合成音声の品質を向上させることができるとともに、韻律情報１０３が対応しないトレーニング音声素片との歪評価を行わずに済むので、処理量を削減する効果がある。 By adopting the configuration of the sixth embodiment, it is possible to evaluate only speech segments in accordance with the prosody information 103 generated by the prosody setting unit 4, so that the quality of the synthesized speech can be further improved, Since it is not necessary to perform distortion evaluation with training speech segments that do not correspond to the prosodic information 103, there is an effect of reducing the processing amount.

実施の形態７． Embodiment 7.
実施の形態６の別の実施の形態７として、先の実施の形態２と同様に、融合音声素片の組み合わせに用いる音声素片を予備選択してもよい。予備選択することで融合音声素片評価に対する処理量を削減できるとともに、音質が悪いトレーニング音声素片を排除することができ、合成音声の品質を向上することができる。 As a seventh embodiment, which is a variation of the sixth embodiment, the speech units used for combining fused speech units may be preselected, as in the second embodiment. Preselection reduces the processing required to evaluate fused speech units and also excludes training speech units with poor sound quality, which improves the quality of the synthesized speech.

実施の形態８． Embodiment 8.
前記の実施の形態１において、音声素片辞書６に格納されている融合音声素片は、メモリ量や通信情報量を削減するために圧縮処理を行ってもよい。 In the first embodiment, the fused speech units stored in the speech unit dictionary 6 may be compressed in order to reduce memory size and the amount of communicated data.

図１２は、この発明の実施の形態８に係る音声合成方法を実現する音声合成装置の構成を示すブロック図である。図１と同一部分については同一の参照符号を付して説明を省き相違点を説明する。本実施の形態では、融合音声素片生成部５が出力する融合音声素片１０６を符号化した符号化音声素片３０１を音声素片辞書６に保持させる符号化部３１と、音声素片辞書６からの符号化音声素片３０１を復号する復号化部３２が備えられている点がこれまでの実施の形態と異なる。 FIG. 12 is a block diagram showing the configuration of a speech synthesizer that implements the speech synthesis method according to Embodiment 8 of the present invention. Parts identical to those in FIG. 1 are given the same reference numerals and their description is omitted; only the differences are described. This embodiment differs from the previous embodiments in that it includes an encoding unit 31, which encodes the fused speech unit 106 output by the fused speech unit generation unit 5 and has the resulting encoded speech unit 301 held in the speech unit dictionary 6, and a decoding unit 32, which decodes the encoded speech unit 301 from the speech unit dictionary 6.

融合音声素片生成部５の出力である融合音声素片１０６が符号化部３１へ入力され、所定の圧縮方法にてデータ圧縮あるいは符号化処理が実施されて符号化音声素片３０１とされ、この符号化音声素片３０１が音声素片辞書６に出力される。素片選択部７は韻律情報１０３に従って音声素片辞書６に保持されている符号化音声素片３０１を復号化部３２へ入力し、復号化部３２でデータ伸長あるいは復号化処理が行われ、復号化音声素片３０２を得て素片選択・接続処理をし、音声合成部８で音声合成して合成音声１０８を得て出力端子９より出力する。 The fusion speech unit 106 that is the output of the fusion speech unit generation unit 5 is input to the encoding unit 31 and is subjected to data compression or encoding processing by a predetermined compression method to obtain an encoded speech unit 301. The encoded speech unit 301 is output to the speech unit dictionary 6. The unit selection unit 7 inputs the encoded speech unit 301 held in the speech unit dictionary 6 according to the prosodic information 103 to the decoding unit 32, and the decoding unit 32 performs data expansion or decoding processing. A decoded speech unit 302 is obtained, a unit selection / connection process is performed, and speech synthesis is performed by the speech synthesizer 8 to obtain a synthesized speech 108 which is output from the output terminal 9.

ここで、融合音声素片１０６が音声素片辞書６に格納されるパラメータまたは波形信号を圧縮する方法として、例えばハフマン圧縮やLZ(Lempel-Ziv)法あるいはその他公知のデータ可逆圧縮方法を用いて可逆圧縮しても良いし、前記のＬＳＰパラメータやスペクトルパラメータ等の音響パラメータを量子化あるいは符号化して非可逆圧縮したり、波形をADPCM法、ITU-T G.729やその他公知の音声音響符号化方法を用いて非可逆圧縮しても良い。 Here, as the method of compressing the parameters or waveform signals of the fused speech unit 106 stored in the speech unit dictionary 6, lossless compression may be applied using, for example, Huffman coding, the LZ (Lempel-Ziv) method, or another known lossless data compression method. Alternatively, acoustic parameters such as the LSP parameters and spectral parameters mentioned above may be quantized or encoded for lossy compression, or the waveform may be lossy-compressed using ADPCM, ITU-T G.729, or another known speech and audio coding method.
また、量子化あるいは符号化して非可逆圧縮した後、非可逆圧縮されたデータを可逆圧縮して更にメモリ量を削減する等、両者を組み合わせて用いることも可能であるし、音声素片毎にその特性を考慮して可逆圧縮のみ、非可逆圧縮のみ、可逆圧縮＋非可逆圧縮等の圧縮パタンを使い分けても良い。さらに、量子化・符号化精度（量子化・符号化に割り当てるビット数）や符号化方法は音声素片毎に異なるものであっても良い。 The two may also be combined, for example by applying lossless compression to data that has already been lossy-compressed by quantization or encoding, to reduce the memory size further. The compression pattern (lossless only, lossy only, or lossless plus lossy) may be chosen for each speech unit according to its characteristics. Furthermore, the quantization or encoding precision (the number of bits allocated to quantization or encoding) and the encoding method may differ from one speech unit to another.

融合音声素片１０６が圧縮されて音声素片辞書６に保管・記憶されるとき、音声素片辞書６の内部に、可逆圧縮の場合には圧縮された音声素片データとデータ伸長時に用いる情報が格納され、非可逆圧縮の場合には、音声素片データを構成する量子化または符号化処理による量子化テーブルのインデックス情報や符号化コードと、量子化テーブルや符号帳など復号化処理に用いる情報が格納されることとなる。 When the fused speech unit 106 is compressed and stored in the speech unit dictionary 6, the dictionary holds the following. For lossless compression, it holds the compressed speech unit data and the information used for decompression. For lossy compression, it holds the quantization table indices or code words produced by the quantization or encoding process, which make up the speech unit data, together with the information used for decoding, such as the quantization table or codebook.
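The combination of lossy and lossless compression, and what a dictionary entry then has to carry, can be sketched as below. The techniques are stand-ins chosen for brevity and are not the codecs named in the text: uniform scalar quantization takes the place of a lossy coder such as ADPCM or G.729, and Python's `zlib` (DEFLATE, which combines LZ77 with Huffman coding) takes the place of the lossless stage. The record layout and the names `encode_unit` and `decode_unit` are hypothetical.

```python
import zlib
import numpy as np

def encode_unit(waveform, bits=8):
    """Quantize a waveform (lossy), then DEFLATE the indices (lossless).

    Assumes a non-empty waveform and bits <= 16. The returned record holds
    the compressed indices plus the side information needed to decode them.
    """
    x = np.asarray(waveform, dtype=np.float64)
    peak = float(np.max(np.abs(x))) or 1.0  # avoid dividing by zero on silence
    levels = 2 ** (bits - 1) - 1
    indices = np.round(x / peak * levels).astype(np.int16)
    return {
        "payload": zlib.compress(indices.tobytes()),
        "peak": peak,
        "levels": levels,
    }

def decode_unit(record):
    """Inflate the indices and map them back to sample values."""
    indices = np.frombuffer(zlib.decompress(record["payload"]), dtype=np.int16)
    return indices.astype(np.float64) / record["levels"] * record["peak"]
```

Here `peak` and `levels` play the part of the decoding information stored alongside the data. A real codebook-based coder would store a quantization table or codebook instead, and per-unit choice of `bits` corresponds to varying the precision from one speech unit to another.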

なお、本実施の形態８で述べた融合音声素片１０６の圧縮は、実施の形態２等にて述べた予備選択を実施した後に行っても良い。 Note that the compression of the fusion speech unit 106 described in the eighth embodiment may be performed after the preliminary selection described in the second embodiment or the like is performed.

実施の形態８の構成をとることにより、音声素片辞書６に格納される融合音声素片１０６を圧縮することが可能となり、音声素片辞書６に要するメモリ量や、音声素片辞書６をダウンロード等するための通信情報量を削減することができる。 By adopting the configuration of the eighth embodiment, it is possible to compress the fusion speech unit 106 stored in the speech unit dictionary 6, and the memory amount required for the speech unit dictionary 6 and the speech unit dictionary 6 can be reduced. The amount of communication information for downloading and the like can be reduced.

実施の形態９． Embodiment 9.
実施の形態８の変形例として、融合音声素片１０６に対する圧縮・伸張処理を、実施の形態４にて述べた、共通化音声素片２０３および非共通化音声素片２０４に対して実施しても良い。 As a modification of the eighth embodiment, the compression and decompression applied to the fused speech unit 106 may instead be applied to the common speech unit 203 and the non-common speech unit 204 described in the fourth embodiment.

図１３は、この発明の実施の形態９に係る音声合成方法を実現する音声合成装置の構成を示すブロック図である。図７と同一部分については同一の参照符号を付し説明を省略する。相違点を説明すると、本実施の形態では、融合音声素片生成具２２の出力の共通化音声素片２０３と非共通化音声素片２０４を符号化し、符号化した符号化音声素片３０１を音声素片辞書６に保持させる符号化部３１と、音声素片辞書６からの符号化音声素片３０１を復号し、復号化音声素片３０２を得る復号化部３２が備えられている点が図７に示す実施の形態４と異なる。 FIG. 13 is a block diagram showing the configuration of a speech synthesizer that implements the speech synthesis method according to Embodiment 9 of the present invention. Parts identical to those in FIG. 7 are given the same reference numerals and their description is omitted. This embodiment differs from the fourth embodiment shown in FIG. 7 in that it includes an encoding unit 31, which encodes the common speech unit 203 and the non-common speech unit 204 output by the fused speech unit generator 22 and has the resulting encoded speech units 301 held in the speech unit dictionary 6, and a decoding unit 32, which decodes the encoded speech units 301 from the speech unit dictionary 6 to obtain decoded speech units 302.

融合音声素片生成具２２で生成された共通化音声素片２０３と非共通化音声素片２０４は、符号化部３１へ入力されて、例えば、前記実施の形態８にて述べられている公知の手法により符号化または圧縮処理が行われて、符号化音声素片３０１として音声素片辞書６へ出力される。 The common speech unit 203 and the non-common speech unit 204 generated by the fused speech unit generator 22 are input to the encoding unit 31, encoded or compressed by, for example, one of the known methods described in the eighth embodiment, and output to the speech unit dictionary 6 as encoded speech units 301.
素片選択部７は韻律情報１０３に従って選択される音声素片辞書６に保持されている共通化音声素片２０３と非共通化音声素片２０４に該当する符号化音声素片３０１を復号化部３２へ入力し、復号化部３２でデータ伸長あるいは復号化処理が行われ、復号化音声素片３０２を得て素片選択・接続処理をし、音声合成部８で音声合成して合成音声１０８を得て出力端子９より出力する。 The unit selection unit 7 passes to the decoding unit 32 the encoded speech units 301, held in the speech unit dictionary 6, that correspond to the common speech unit 203 and the non-common speech unit 204 selected according to the prosody information 103. The decoding unit 32 decompresses or decodes them to obtain decoded speech units 302, on which unit selection and concatenation are performed. The speech synthesis unit 8 then synthesizes the synthesized speech 108, which is output from the output terminal 9.

実施の形態９の構成をとることにより、融合音声素片生成具２２で生成され、音声素片辞書６に格納される共通化音声素片２０３と非共通化音声素片２０４を圧縮することが可能となり、音声素片辞書６に要するメモリ量や、音声素片辞書６をダウンロード等するための通信情報量を削減することができる。 By adopting the configuration of the ninth embodiment, it is possible to compress the common speech unit 203 and the non-common speech unit 204 which are generated by the fusion speech unit generator 22 and stored in the speech unit dictionary 6. Thus, the amount of memory required for the speech unit dictionary 6 and the amount of communication information for downloading the speech unit dictionary 6 can be reduced.

なお、この実施の形態９で述べた共通音声素片および非共通音声素片の圧縮は、実施の形態５にて述べた予備選択を実施した後に行っても良い。 Note that the compression of the common speech element and the non-common speech element described in the ninth embodiment may be performed after the preliminary selection described in the fifth embodiment is performed.

また、共通化音声素片、非共通化音声素片を別々に異なる圧縮方法により情報量圧縮を行っても良いし、例えば、共通化音声素片は圧縮せず、非共通化音声素片のみ圧縮を行うことも可能であるし、その逆も可能である。 In addition, the information amount compression may be performed separately for the common speech unit and the non-common speech unit by different compression methods. For example, the common speech unit is not compressed and only the non-common speech unit is compressed. It is possible to perform compression and vice versa.

実施の形態９の構成をとることにより、音声素片辞書６に格納されている共通化音声素片および非共通化音声素片を圧縮することが可能となり、音声素片辞書６に要するメモリ量や、音声素片辞書６をダウンロード等するための通信情報量を削減することができる。 By adopting the configuration of the ninth embodiment, it becomes possible to compress the common speech unit and the non-common speech unit stored in the speech unit dictionary 6, and the amount of memory required for the speech unit dictionary 6 In addition, it is possible to reduce the amount of communication information for downloading the speech element dictionary 6.

前記実施の形態では、第１のトレーニング音声素片１０４ S1[i]と、第２のトレーニング音声素片１０５ S2[j]は別データとしたが、第１のトレーニング音声素片と第２のトレーニング音声素片は同一のものであっても良い。 In the embodiment described above, the first training speech unit 104 S1 [i] and the second training speech unit 105 S2 [j] are separate data. The training speech element may be the same.

なお、前記実施の形態における、形態素解析、構文解析、ならびに韻律設定の全てまたは一部については、予め処理を行っておいてその解析結果を例えばＲＯＭ(Read Only Memory)、ＲＡＭ(Random Access Memory)、不揮発メモリ、磁気ディスク等の記憶手段に蓄えておき、音声合成時に解析結果を記憶手段から読み出すことで省略することも可能である。 In the embodiments above, all or part of the morphological analysis, syntactic analysis, and prosody setting may be carried out in advance, with the results stored in storage means such as a ROM (Read Only Memory), RAM (Random Access Memory), non-volatile memory, or magnetic disk. These steps can then be omitted at synthesis time by reading the stored results from the storage means.
また、例えばLAN(Local Area Network)、インターネット、赤外線通信、携帯電話パケット通信等の通信手段経由で、サーバコンピュータ等の処理手段により解析された解析結果や韻律情報、あるいはサーバコンピュータ上のハードディスク等の記憶手段に記憶されている解析結果や韻律情報を読み出すことでも省略可能である。 They can also be omitted by reading, via communication means such as a LAN (Local Area Network), the Internet, infrared communication, or mobile phone packet communication, analysis results and prosody information produced by processing means such as a server computer, or analysis results and prosody information stored in storage means such as a hard disk on the server computer.

さらに、解析結果や韻律情報を例えば、コンピュータのＧＵＩ(Graphical User Interface)、キーボード、押しボタン、１次元／２次元バーコードリーダ、ＯＣＲ(Optical Character Reader)等の入力手段から直接入力してもかまわない。これはカーナビゲーションシステム、携帯電話、ＰＤＡ（Personal Digital Assistance）、ビデオレコーダ、監視システム、ゲーム機器、電子書籍、玩具等において決まった文章、例えばナビの市町村名や操作案内（ガイダンス）文、防犯警告合成音声、ゲームのキャラクタ合成音、新聞の文章等を読み上げる場合に有効である。 Furthermore, analysis results and prosodic information may be directly input from input means such as a computer GUI (Graphical User Interface), keyboard, push buttons, 1D / 2D barcode reader, OCR (Optical Character Reader), etc. Absent. This is a fixed text in car navigation systems, mobile phones, PDAs (Personal Digital Assistance), video recorders, surveillance systems, game machines, e-books, toys, etc., such as navigation city names and operation guidance (guidance) sentences, security warnings This is effective when reading out synthesized speech, synthesized character of a game, newspaper sentences, and the like.

前記述べた実施の形態において、前記の全ての機能あるいは一部の機能は、パーソナルコンピュータ等のソフトウエアとしてプログラム実行したり、ＣＰＵ等の組み込みソフトウエアやファームウエアとしてプログラム実行することで達成できるものである。また、同様の動作をする回路、例えばＬＳＩ（Large Scale IC）、ＦＰＧＡ（Field Programmable Gate Array）、論理IC等の集積回路で実現しても良いし、あるいはディスクリート素子を組み合わせて実現しても良い。 In the above-described embodiment, all or some of the above functions can be achieved by executing a program as software such as a personal computer, or executing a program as embedded software such as a CPU or firmware. It is. Further, it may be realized by an integrated circuit such as an LSI (Large Scale IC), an FPGA (Field Programmable Gate Array), or a logic IC, or may be realized by combining discrete elements. .

また、前記のソフトウエア等は、例えばＲＯＭ、磁気ディスク（ハードディスクやリムーバブルディスク等）、不揮発性半導体メモリ等の記憶手段に予め保持しておいたものであってもよいし、例えば、インターネット、ＬＡＮ、赤外線通信、Bluetooth、携帯電話のパケット通信等の有線・無線通信手段を用いてサーバ上の記憶手段からダウンロードしたり、例えば、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＤＶＤ(Digital Versatile Disk)、ＭＯディスク、磁気ディスク(ハードディスクやリムーバブルディスク等)、不揮発性の半導体メモリ、磁気テープ等の記憶媒体や、バーコード等が印刷されたカード等の印刷媒体より配布・提供されるものであってもよい。この場合、記憶媒体等から読み出された前記ソフトウエアのプログラムコードが、前記実施の形態の機能を実現することとなり、これら記憶媒体等はこの発明を構成するものとなる。 The software may be held in advance in storage means such as a ROM, a magnetic disk (hard disk, removable disk, etc.), or a non-volatile semiconductor memory. It may instead be downloaded from storage means on a server using wired or wireless communication means such as the Internet, a LAN, infrared communication, Bluetooth, or mobile phone packet communication, or be distributed on a storage medium such as a CD-ROM, CD-R, DVD (Digital Versatile Disk), MO disk, magnetic disk (hard disk, removable disk, etc.), non-volatile semiconductor memory, or magnetic tape, or on a print medium such as a card printed with a barcode. In this case, the program code of the software read from the storage medium or the like realizes the functions of the embodiments, and the storage medium or the like constitutes the present invention.

前記実施の形態においては、各部を同一の計算機上で構成する場合について説明したが、この発明はこれに限定されるものではなく、例えば、ネットワーク上に分散した計算機や処理装置などに分かれて各部を構成してもよい。 In the above embodiment, the case where each unit is configured on the same computer has been described. However, the present invention is not limited to this, and for example, each unit is divided into computers and processing devices distributed on a network. May be configured.

また、この発明は、１つ以上の複数の機器から構成されるシステムに適用しても良い。サーバコンピュータがこの発明の実施の形態を実現するプログラム等をネットワーク等の通信手段を用いて配信し、複数のクライアントコンピュータや、携帯電話、ＰＤＡ等の携帯端末機器が配信されたプログラムを実行することができる。 The present invention may also be applied to a system made up of one or more devices. A server computer can distribute a program or the like that realizes an embodiment of the invention over communication means such as a network, and multiple client computers or mobile terminals such as mobile phones and PDAs can execute the distributed program.

前記の実施の形態で用いたトレーニング音声素片は、人間が発声した自然音声信号を用いたが、トレーニング音声素片は自然音声だけでなく、自然音声から解析的に生成した音声波形、例えば、所定の基準（例えば、スペクトル上の相互距離が所定の閾値以下）の下に選択された波形の平均的な波形、準最適波形、パワー補正された音声波形などでも良いし、さらに、人工的に生成された波形と自然音声の両者を混合した信号波形でも適用可能である。また、動物の鳴き声、楽器、電子音等の人以外から抽出した擬似的な音声信号波形でも良い。さらに、前記人工的に生成された音声波形等に雑音波形を混入してもよい。 The training speech units used in the embodiments above are natural speech signals uttered by a human, but they need not be limited to natural speech. They may be speech waveforms generated analytically from natural speech, for example the average of waveforms selected under a predetermined criterion (such as a mutual spectral distance at or below a predetermined threshold), a sub-optimal waveform, or a power-corrected speech waveform, and a signal waveform that mixes an artificially generated waveform with natural speech is also applicable. Pseudo speech signal waveforms extracted from non-human sources, such as animal calls, musical instruments, or electronic sounds, may also be used. Furthermore, a noise waveform may be mixed into the artificially generated speech waveform or the like.

Because the present invention can generate high-quality synthesized speech, it is applicable to speech synthesis functions for car navigation systems, read-aloud functions for mobile phone e-mail and information appliances, speech synthesis systems for municipal disaster-prevention radio and highway radio, automatic voice guidance for elevators and escalators, and the like.

Block configuration diagram of the speech synthesizer according to Embodiment 1.
Flowchart of the fused speech unit generation unit in Embodiment 1.
Flowchart of a modified example of the fused speech unit generation unit in Embodiment 1.
Flowchart of the fused speech unit generation unit in Embodiment 2.
Flowchart of a modified example of the fused speech unit generation unit in Embodiment 2.
Flowchart of the fused speech unit generation unit in Embodiment 3.
Block configuration diagram of the speech synthesizer according to Embodiment 4.
Flowchart of the fused speech unit generation unit in Embodiment 4.
Flowchart of a modified example of the fused speech unit generation unit in Embodiment 4.
Block configuration diagram of the speech synthesizer according to Embodiment 6.
Flowchart of the fused speech unit generation unit 5 in Embodiment 6.
Block configuration diagram of the speech synthesizer according to Embodiment 8.
Block configuration diagram of the speech synthesizer according to Embodiment 9.

Explanation of symbols

1 input terminal, 2 language analysis unit, 3 language dictionary, 4 prosody setting unit, 5 fused speech unit generation unit, 6 speech unit dictionary, 7 unit selection unit, 8 speech synthesis unit, 9 output terminal, 21 common speech unit generator, 22 fused speech unit generator, 31 encoding unit, 32 decoding unit, 101 input text, 102 analysis result, 103 prosodic information, 104 first training speech unit, 105 second training speech unit, 106 fused speech unit, 107 representative speech unit, 108 synthesized speech, 201 common speech unit candidate, 202 non-common speech unit candidate, 203 common speech unit, 204 non-common speech unit, 301 encoded speech unit, 302 decoded speech unit.

Claims

1. A text-to-speech synthesis method comprising:
a waveform separation step of preparing two training speech units from a plurality of training speech units and cutting out a plurality of signal waveforms from one of the training speech units, with the waveform separation position set at a predetermined phoneme boundary;
a waveform fusion step of generating a plurality of fused speech units by combining and fusing any one or more of the signal waveforms cut out at the waveform separation position;
a speech unit synthesis step of generating a plurality of synthesized speech units in which at least one of the pitch and the duration of each generated fused speech unit is changed according to at least one of the pitch and the duration of the other prepared training speech unit;
a distortion evaluation step of evaluating the distance between the other prepared training speech unit and each of the generated synthesized speech units and, based on the evaluation, holding or storing in a speech unit dictionary a fused speech unit formed by combining a plurality of signal waveforms whose cut-out positions in the waveform separation step have been finely adjusted before and after the phoneme boundary so that the distance is minimized; and
a synthesized speech generation step of selecting, from the plurality of fused speech units held or stored in the speech unit dictionary, the fused speech units corresponding to input phonemes obtained by analyzing input text, connecting them, and outputting synthesized speech.
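
As an informal illustration of the separation, fusion, synthesis, and distortion evaluation steps recited above, the following is a minimal sketch and not the claimed implementation. It assumes several things the claim does not specify: candidate heads and tails are taken from a small pool of source units, the same cut-position shift is applied to both units, the duration change is approximated by linear-interpolation resampling, and the distance is a mean squared error. The names `resample`, `distance`, and `best_fused_unit` are hypothetical.

```python
from itertools import product


def resample(wave, n):
    """Linearly interpolate `wave` to n samples (a crude stand-in for duration change)."""
    if n == 1 or len(wave) == 1:
        return [wave[0]] * n
    step = (len(wave) - 1) / (n - 1)
    out = []
    for k in range(n):
        pos = k * step
        i = min(int(pos), len(wave) - 2)
        frac = pos - i
        out.append(wave[i] * (1 - frac) + wave[i + 1] * frac)
    return out


def distance(a, b):
    """Mean squared error between two equal-length waveforms (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_fused_unit(sources, boundaries, target, max_shift=2):
    """Try every head/tail pairing and every cut shift around the phoneme
    boundary; return (distance, fused_waveform) for the closest candidate."""
    best = None
    pairs = product(range(len(sources)), repeat=2)
    for (i, j), shift in product(pairs, range(-max_shift, max_shift + 1)):
        cut_i = boundaries[i] + shift
        cut_j = boundaries[j] + shift
        if not (0 < cut_i < len(sources[i]) and 0 < cut_j < len(sources[j])):
            continue  # skip cuts that would leave an empty head or tail
        fused = sources[i][:cut_i] + sources[j][cut_j:]   # waveform fusion
        synth = resample(fused, len(target))               # match target duration
        d = distance(synth, target)                        # distortion evaluation
        if best is None or d < best[0]:
            best = (d, fused)
    return best


# Toy data: unit A has a good first half, unit B a good second half.
unit_a = [0, 0, 0, 9, 9, 9]
unit_b = [7, 7, 7, 1, 1, 1]
target = [0, 0, 0, 1, 1, 1]   # plays the role of the "other" training unit
best = best_fused_unit([unit_a, unit_b], [3, 3], target)
```

In this toy case the search keeps the head of `unit_a` and the tail of `unit_b`, which mirrors the motivation of reusing a good consonant part even when the vowel part of the same unit is poor.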

2. The text-to-speech synthesis method according to claim 1, wherein:
the waveform fusion step generates, from the plurality of cut-out signal waveforms, a plurality of signal waveforms that form parts common with other speech units and a plurality of signal waveforms that form parts not common with other speech units, and generates a plurality of fused speech units containing parts common with other speech units by combining and fusing any one or more signal waveforms from the common-part signal waveforms and the non-common-part signal waveforms;
the distortion evaluation step evaluates the distance between the other prepared training speech unit and each of the generated synthesized speech units and, based on the evaluation, holds or stores the common speech units that minimize the distance together with the other speech units; and
the synthesized speech generation step generates synthesized speech units corresponding to the input phonemes from the held or stored common speech units and other speech units, and connects them to generate synthesized speech.

3. The text-to-speech synthesis method wherein the pitch and duration used when generating the synthesized speech units in the speech unit synthesis step are a pitch and duration generated according to a predetermined rule.

4. The text-to-speech synthesis method according to any one of claims 1 to 3, wherein the waveform separation step selects a speech unit from the two prepared training speech units based on a predetermined criterion and cuts out a plurality of signal waveforms from the selected speech unit.

5. The text-to-speech synthesis method according to any one of the above, wherein:
the information amount of the fused speech units stored or held in the speech unit dictionary is compressed by a predetermined compression method; and
the synthesized speech is generated by expanding the compressed fused speech units from the speech unit dictionary when selecting the fused speech units corresponding to the input phonemes.
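
The compress-on-store and expand-on-select behavior described above can be sketched as follows. This is only an assumption-laden illustration: the claim leaves the compression method open, so `zlib` is used purely as a stand-in, the samples are assumed to be 16-bit PCM, and `encode_unit`, `decode_unit`, and the dictionary key `"ka"` are hypothetical.

```python
import struct
import zlib


def encode_unit(samples):
    """Pack 16-bit PCM samples and compress them (zlib is only a stand-in
    for the unspecified 'predetermined compression method')."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return zlib.compress(raw)


def decode_unit(blob):
    """Expand a stored unit back into a list of 16-bit samples."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


# Store a fused unit in compressed form, then expand it at selection time.
original = [0, 120, -340, 560] * 50
dictionary = {"ka": encode_unit(original)}
restored = decode_unit(dictionary["ka"])
```

A lossless codec is used here so that the round trip is exact; a lossy speech codec would trade some fidelity for a smaller dictionary.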

6. A text-to-speech synthesizer comprising:
a prosody setting unit that analyzes input text and obtains input phonemes;
fused speech unit generation means for preparing two training speech units from a plurality of training speech units, cutting out a plurality of signal waveforms from one of the training speech units with the waveform separation position set at a predetermined phoneme boundary, generating a plurality of fused speech unit candidates by arbitrarily combining the plurality of signal waveforms, generating a plurality of synthesized speech units in which at least one of the pitch and the duration of each fused speech unit candidate is changed according to at least one of the pitch and the duration of the other prepared training speech unit, evaluating the distance between the plurality of synthesized speech units and the other prepared training speech unit, and, based on the evaluation, generating a fused speech unit by combining a plurality of signal waveforms whose cut-out positions in the waveform separation have been finely adjusted before and after the phoneme boundary so that the distance is minimized;
fused speech unit storage means for holding or storing the fused speech unit;
unit selection means for selecting, from the fused speech units held or stored by the fused speech unit storage means, the fused speech unit corresponding to the input phonemes; and
speech synthesis means for connecting the selected fused speech units and generating synthesized speech.

7. The text-to-speech synthesizer according to claim 6, wherein the fused speech unit generation means comprises:
a common speech unit generator that cuts out a plurality of signal waveforms from one of the training speech units with the waveform separation position set at a predetermined phoneme boundary, generates from the cut-out signal waveforms a plurality of signal waveforms that form parts common with other speech units and a plurality of signal waveforms that form parts not common with other speech units, and outputs them to the fused speech unit generation means; and
a fused speech unit generator that generates a plurality of fused speech units containing parts common with other speech units by combining and fusing arbitrary signal waveforms from the common-part signal waveforms and the non-common-part signal waveforms, generates a plurality of synthesized speech units by changing at least one of the pitch and the duration of each generated fused speech unit according to at least one of the pitch and the duration of the other prepared training speech unit, evaluates the distance between the other prepared training speech unit and each of the generated synthesized speech units, and, based on the evaluation, outputs a fused speech unit formed by combining a plurality of signal waveforms whose cut-out positions in the waveform separation have been finely adjusted before and after the phoneme boundary so that the distance is minimized.

8. The text-to-speech synthesizer according to claim 6 or 7, wherein the fused speech unit generation means uses, as the pitch and duration for generating the plurality of synthesized speech units by changing at least one of the pitch and the duration of the fused speech units, a pitch and duration generated in advance according to a predetermined rule.

9. The text-to-speech synthesizer according to any one of claims 6 to 8, wherein the fused speech unit generation means selects a speech unit from the two prepared training speech units based on a predetermined criterion, and generates and evaluates a plurality of fused speech units using the selected speech unit.

10. The text-to-speech synthesizer according to any one of claims 6 to 9, further comprising:
an encoding unit that compresses the information amount of the fused speech units stored or held in the speech unit dictionary by a predetermined compression method; and
a decoding unit that expands the information amount of the compressed fused speech units in the speech unit dictionary.

11. A text-to-speech synthesis program for causing a computer to function as:
prosody setting means for obtaining input phonemes by analyzing input text;
fused speech unit generation means for preparing two training speech units from a plurality of training speech units, cutting out a plurality of signal waveforms from one of the training speech units with the waveform separation position set at a predetermined phoneme boundary, generating a plurality of synthesized speech units in which at least one of the pitch and the duration of each fused speech unit candidate is changed according to at least one of the pitch and the duration of the other prepared training speech unit, evaluating the distance between the plurality of synthesized speech units and the other prepared training speech unit, and, based on the evaluation, generating a fused speech unit by combining a plurality of signal waveforms whose cut-out positions in the waveform separation have been finely adjusted before and after the phoneme boundary so that the distance is minimized;
fused speech unit storage means for holding or storing the fused speech unit;
unit selection means for selecting, from the fused speech units held or stored by the fused speech unit storage means, the fused speech unit corresponding to the input phonemes; and
speech synthesis means for connecting the selected fused speech units and generating synthesized speech.

12. A computer-readable recording medium recording a text-to-speech synthesis program for causing a computer to function as:
prosody setting means for obtaining input phonemes by analyzing input text;
fused speech unit generation means for preparing two training speech units from a plurality of training speech units, cutting out a plurality of signal waveforms from one of the training speech units with the waveform separation position set at a predetermined phoneme boundary, generating a plurality of fused speech unit candidates by arbitrarily combining the plurality of signal waveforms, generating a plurality of synthesized speech units in which at least one of the pitch and the duration of each fused speech unit candidate is changed according to at least one of the pitch and the duration of the other prepared training speech unit, evaluating the distance between the plurality of synthesized speech units and the other prepared training speech unit, and, based on the evaluation, generating a fused speech unit by combining a plurality of signal waveforms whose waveform separation positions have been finely adjusted before and after the phoneme boundary so that the distance is minimized;
fused speech unit storage means for holding or storing the fused speech unit;
unit selection means for selecting, from the fused speech units held or stored by the fused speech unit storage means, the fused speech unit corresponding to the input phonemes; and
speech synthesis means for connecting the selected fused speech units and generating synthesized speech.