JP2008191477A

JP2008191477A - Hybrid type speech synthesis method, its device, its program and its recording medium

Info

Publication number: JP2008191477A
Application number: JP2007026801A
Authority: JP
Inventors: Akihiro Yoshida; 明弘吉田; Takashi Nakamura; 孝中村; Noboru Miyazaki; 昇宮崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-02-06
Filing date: 2007-02-06
Publication date: 2008-08-21
Anticipated expiration: 2027-02-06
Also published as: JP4773988B2

Abstract

PROBLEM TO BE SOLVED: To provide a hybrid speech synthesis device that stably produces high-quality synthesized speech, in a hybrid speech synthesis system that creates synthesized speech by switching between a waveform-concatenation speech synthesis method and a hidden Markov model (HMM) speech synthesis method.

SOLUTION: A waveform-concatenation synthesized speech generation unit creates synthesized speech data whose content matches an input text by searching for and concatenating speech units of arbitrary length contained in a speech corpus. An HMM synthesized speech generation unit builds a statistical model in advance by learning, with a statistical method, speech information extracted from the speech corpus. It then derives speech information parameters corresponding to the input text from the statistical model and creates synthesized speech data from those parameters. The device includes both synthesis systems, and a speech quality comparison/determination means determines, for each syllable, whether the synthesized speech data created by the waveform-concatenation unit or that created by the HMM unit has the higher speech quality.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a hybrid speech synthesis method that combines a waveform-concatenation speech synthesis method with an HMM speech synthesis method and outputs high-quality synthesized speech matching an arbitrary text input, and to an apparatus for the method, a program for it, and a storage medium storing that program.

In recent years, advances in speech synthesis technology have led to its use for information guidance in interactive voice response (IVR) systems and for reading aloud Web articles, e-mail, and the like. The main speech synthesis methods are the waveform-concatenation speech synthesis method, as disclosed in Patent Document 1, and the hidden Markov model speech synthesis method disclosed in Non-Patent Document 1 (hereinafter, the HMM speech synthesis method).
The waveform-concatenation speech synthesis method can generate high-quality synthesized speech when a large-scale speech corpus can be recorded.

The HMM speech synthesis method, on the other hand, can generate synthesized speech of reasonable quality even from a small speech corpus.
When synthesized speech in a variety of speaking styles is required, it is difficult to build a large-scale speech corpus for each style, so the waveform-concatenation speech synthesis method cannot produce high-quality synthesized speech. Where the required speech unit exists in the speech corpus, high-quality synthesized speech can be generated, but where the required unit is missing from the corpus, quality degrades sharply. At the same time, the HMM speech synthesis method alone cannot currently achieve sufficient speech quality.

Non-Patent Document 2 therefore discloses a hybrid speech synthesis method that generates higher-quality synthesized speech from a small speech corpus by switching, in some basic unit, between synthesized speech from the waveform-concatenation speech synthesis method and synthesized speech from the HMM speech synthesis method. In that document, the selection criterion is as follows. When the small corpus contains at least a certain number of instances of the speech unit to be synthesized, the waveform-concatenation output is used. When it contains fewer than that number, the concatenated speech may be degraded, so the HMM output is used.
Non-Patent Document 2 makes this decision per diphone, a unit bounded at phoneme centers (from the center of one phoneme to the center of the next).
Patent Document 1: Japanese Patent No. 2761552
Non-Patent Document 1: Keiichi Tokuda, "Basics of Speech Synthesis with HMM", IEICE Technical Report (Shingaku Giho), SP2000-74, pp. 43-50, Oct. 2000
Non-Patent Document 2: Masafumi Okubo, Ryo Mochizuki, Tetsunori Kobayashi, "Study of speaker conversion method using HMM segment selection", IEICE Technical Report (Shingaku Giho), SP2004-139, pp. 13-18, Jan. 2005

However, the number of instances of the target speech unit in a small speech corpus cannot be said to relate directly to the speech quality of waveform-concatenation synthesis. Quality may well be high when there are few speech units and low when there are many. It is therefore desirable to decide whether to use the speech unit from the waveform-concatenation method or the one from the HMM method after judging the quality of each.
The present invention has been made in view of this problem. Its object is to provide a hybrid speech synthesis method that decides, after judging the quality of each, whether to use the speech unit synthesized by the waveform-concatenation speech synthesis method or the one synthesized by the HMM speech synthesis method, together with an apparatus, a program, and a storage medium for that method.

The hybrid speech synthesizer according to the present invention comprises a waveform-concatenation synthesized speech generation unit, an HMM synthesized speech generation unit, a speech quality comparison/determination means, and a hybrid synthesized speech generation processing means. The waveform-concatenation synthesized speech generation unit generates synthesized speech data whose content matches the input text by searching for and concatenating speech units of arbitrary length contained in a speech corpus. The HMM synthesized speech generation unit builds a statistical model in advance by learning, with a statistical method, speech information extracted from the speech corpus, obtains speech information parameters corresponding to the input text from that statistical model, and generates synthesized speech data from those parameters.
The speech quality comparison/determination means determines, for each syllable unit of the synthesized speech data, whether the synthesized speech data generated by the waveform-concatenation synthesized speech generation unit or that generated by the HMM synthesized speech generation unit has the higher speech quality.

The hybrid synthesized speech generation processing means concatenates synthesized speech data units according to the determination result of the speech quality comparison/determination means to generate hybrid synthesized speech data.
In addition, the speech corpus of the hybrid speech synthesizer according to the present invention consists of speech data designed to raise the quality of the synthesized speech data generated by the waveform-concatenation synthesized speech generation unit and speech data designed to raise the quality of the synthesized speech data generated by the HMM synthesized speech generation unit.

The hybrid speech synthesizer according to the present invention determines whether the synthesized speech from the waveform-concatenation speech synthesis method or that from the HMM speech synthesis method has the higher quality, and uses the synthesized speech data of the higher-quality method for each syllable unit. The syllable unit is chosen so as to reduce both the unnaturalness caused by sound-quality differences between the two methods and the discontinuities at concatenation points. High-quality synthesized speech can therefore be generated even from a small speech corpus.

Embodiments of the present invention are described below with reference to the drawings. Identical components appearing in more than one drawing carry the same reference numerals, and their description is not repeated.

FIG. 1 shows Embodiment 1 of the hardware configuration of the hybrid speech synthesizer 100 of the present invention. Embodiment 1 is an example in which the hybrid speech synthesizer 100 is built on a general-purpose personal computer.

In FIG. 1, reference numeral 102 denotes a control memory (ROM) that stores various control data used by a central processing unit (CPU) 104. The CPU 104 executes a control program stored in a RAM 106 to control the overall operation of the hybrid speech synthesizer 100. The RAM 106 serves as a work area that temporarily holds various data while the CPU 104 executes control processes, and it also holds the control program, which is loaded from an external storage device 108 when the CPU 104 executes those processes. The external storage device 108 is, for example, a hard disk drive or an optical disc drive. The control program can be stored on an optical disc such as a DVD (Digital Versatile Disc) or CD-ROM (Compact Disc Read Only Memory) and transferred to another personal computer.

When digital data representing a speech signal is input, a D/A converter 110 converts it into an analog signal and outputs that signal to a speaker 118 to reproduce the speech. An input unit 112 includes, for example, a keyboard 120 and a pointing device such as a mouse (not shown). A display unit 114 includes a display 122 such as a CRT or liquid crystal display. A hybrid speech synthesis unit 116 generates digital synthesized speech data that matches the input text data. A bus 150 connects the parts described above.

In this configuration, the control program for controlling the hybrid speech synthesis unit 116 of Embodiment 1 is loaded from the external storage device 108 and stored in the RAM 106, and the various data used by the control program are stored in the control memory 102. Under the control of the CPU 104, these data are read into the RAM 106 over the bus 150 as needed and used in the control processing performed by the CPU 104. The D/A converter 110 converts the digital synthesized speech data, which the hybrid speech synthesis unit 116 creates when the control program is executed, into an analog signal and outputs it to the speaker 118.

FIG. 2 shows an example of the functional module configuration of the hybrid speech synthesis unit 116 of Embodiment 1, and FIG. 3 shows its operation flow. Text data is input from the input unit 112, or as a data file from the external storage device 108 (step S112). The text analysis processing unit 32 performs morphological analysis on the input text data, decomposes it into parts of speech, and assigns a reading and an accent type (step S32). The text analysis processing unit 32 outputs the reading and accent type with the syllable as one synthesized speech data unit. The prosody information generation unit 34 generates a target prosody pattern based on the reading and accent type (step S34). The synthesis processing unit 40 generates synthesized speech data that matches the target prosody pattern as closely as possible (step S40).

The assignment of reading and accent type in the text analysis processing unit 32 (step S32) and the generation of the target prosody pattern in the prosody information generation unit 34 (step S34) are the same as in conventional speech synthesizers.
The main part of the present invention is the synthesis processing unit 40. The synthesis processing unit 40 generates one synthesized speech data unit (syllable) with the waveform-concatenation speech synthesis method and one with the HMM speech synthesis method (steps S42 and S44), and determines for each syllable unit which of the two has the higher speech quality (step S50), thereby generating synthesized speech with little quality degradation.

If the unit over which the two methods are compared and selected is too long, the HMM synthesized speech, whose quality is not especially high, becomes clearly perceptible, and the quality of the mixed output falls. If the unit is too short, the number of junctions between the two methods' outputs increases, and degradation from discontinuities at those junctions becomes frequent. The syllable is therefore used as a unit in which the quality of the HMM synthesized speech itself is hard to perceive and the number of junctions between the two methods does not grow too large.
As a result, concatenating the better synthesized speech data syllable by syllable yields fewer junctions than phoneme or diphone units would, so degradation from discontinuities at junctions is less likely. In addition, suppressing long consecutive runs of HMM synthesized speech of insufficient quality prevents the quality loss that occurs when that speech becomes clearly perceptible.

The synthesis processing unit 40 consists of a speech corpus 42, a waveform-concatenation synthesized speech generation unit 44, a signal processing unit 46, an HMM synthesized speech generation unit 48, a speech quality comparison/determination means 50, and a hybrid synthesized speech generation processing means 52. The speech corpus 42 is a database built by mixing speech data that raises the quality of waveform-concatenation synthesized speech with speech data that raises the quality of HMM synthesized speech.
The waveform-concatenation synthesized speech generation unit 44 searches the speech corpus 42 and generates synthesized speech data, a sequence of speech units that is as close as possible to the target prosody pattern input from the prosody information generation unit 34.

Because synthesized speech data from the waveform-concatenation method is a concatenation of speech units, it is not guaranteed to reproduce the target prosody pattern. The signal processing unit 46 therefore applies signal processing to the prosody of the waveform-concatenation synthesized speech data so that it matches the target prosody pattern exactly. Signal processing techniques available to the signal processing unit 46 include pitch-synchronous overlap-add synthesis (the PSOLA method), which changes prosody by manipulating pitch in the time domain, and methods that change prosody by manipulating pitch in the frequency domain.

The HMM synthesized speech generation unit 48 consists of a statistical model learning unit 48b, a speech parameter conversion unit 48a, and a statistical model 48c. The statistical model learning unit 48b creates the statistical model 48c in advance from the speech data in the speech corpus 42. The speech parameter conversion unit 48a converts into synthesized speech data the speech parameters obtained by feeding the target prosody pattern from the prosody information generation unit 34 into the statistical model 48c. The way the waveform-concatenation synthesized speech generation unit 44 and the HMM synthesized speech generation unit 48 each generate synthesized speech data does not differ from conventional speech synthesizers that implement one of these methods alone.

Embodiment 1 differs from conventional speech synthesizers in two respects. First, the waveform-concatenation synthesized speech generation unit 44 and the HMM synthesized speech generation unit 48 share a single speech corpus 42. Second, the quality of the syllable-unit synthesized speech data produced by each generation unit is compared, and the higher-quality data is output.

[Speech corpus 42]
The speech corpus 42 is a database that mixes speech data designed and collected to raise the quality of waveform-concatenation synthesized speech with speech data designed and collected to raise the quality of HMM synthesized speech.
To raise the quality of waveform-concatenation synthesized speech, the speech database must preferentially contain speech data that is used frequently in synthesis. For Japanese, this means speech data that is highly likely to contain common expressions and words as well as frequently occurring syllables, phonemes, phonological environments, fundamental frequencies, power values, and so on.

To raise the quality of HMM synthesized speech, on the other hand, a reasonable amount of training data must be available for every learning unit, because an HMM is a statistical model. The learning units of the HMM are divided according to the spectrum, which is one kind of speech information, that is, according to phoneme and phonological environment. High-quality HMM synthesized speech can therefore be produced from speech data collected with attention only to the balance of phonemes and phonological environments, without regard to fundamental frequency and the like.

A method of building a speech corpus that stores such speech data is disclosed, for example, in Japanese Patent Application Publication No. 2004-246140. As a concrete way of building the speech corpus 42, for about half of the prescribed amount of speech data, for example, texts are selected with equal acoustic and linguistic weights and with some consideration of frequency in Japanese. For the remaining half, simple texts are selected so that, for example, the triphones (phoneme triples) that are frequent in Japanese occur in equal numbers. Building the speech corpus 42 in this way makes it possible, even with a relatively small database, to collect texts whose phonemes and phonological environments are balanced appropriately for the HMM speech synthesis method. Because the corpus also contains texts collected with emphasis on frequency in Japanese, it secures the prosodic and phonemic variation that matters for the waveform-concatenation speech synthesis method.
The synthesized speech data that the waveform-concatenation synthesized speech generation unit 44 and the HMM synthesized speech generation unit 48 each generate from the speech data in the speech corpus 42 are input to the speech quality comparison/determination means 50.
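The triphone balance described for the second half of the corpus can be checked by counting triphones in candidate texts. The following is a minimal sketch, assuming each utterance is already available as a list of phoneme symbols; the function name `triphone_counts` and the input format are illustrative and do not come from the source or from the cited publication.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Triphone = Tuple[str, str, str]


def triphone_counts(utterances: Iterable[List[str]]) -> Counter:
    """Count (left, center, right) phoneme triples over all utterances.

    Hypothetical helper: each utterance is a list of phoneme symbols.
    """
    counts: Counter = Counter()
    for phonemes in utterances:
        # A triphone needs a left and a right neighbor, so skip both ends.
        for i in range(1, len(phonemes) - 1):
            counts[(phonemes[i - 1], phonemes[i], phonemes[i + 1])] += 1
    return counts


counts = triphone_counts([["k", "a", "k", "a", "k"]])
```

A text selection procedure could compare these counts across candidate sentences and prefer sentences that raise the rarest frequent triphones toward an equal count; that selection step is not specified in the text and is left out here.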

[Speech quality comparison/determination means 50]
The speech quality comparison/determination means 50 determines, for each syllable unit, whether the synthesized speech data unit generated by the waveform-concatenation synthesized speech generation unit 44 or the one generated by the HMM synthesized speech generation unit 48 has the higher speech quality.
Possible ways of comparing speech quality include determination by a statistical technique such as machine learning, determination using thresholds, and determination using a designed cost function. As one example, Embodiment 1 describes a cost function that takes physical quantity parameters of the speech as input.

The speech quality comparison/determination means 50 consists of a cost value conversion unit 50a, which converts physical quantity parameters into cost values, and a comparison/determination unit 50b, which compares the cost values. The physical quantity parameters used for the determination differ by synthesis method.
In the waveform-concatenation method, large deformation of the speech waveform by signal processing leads to quality degradation, so the mean or maximum of the difference between the target prosody pattern and the prosody pattern of the speech unit sequence can serve as a parameter for judging the degree of degradation. As a physical quantity parameter for judging how indistinct a speech unit is, a value comparing the prosodic environment of the selected speech unit in the speech corpus with the prosodic environment of the input text can be used.
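The mean and maximum difference between the target prosody pattern and the prosody of the selected speech units can be computed as in the sketch below. It assumes both patterns are given as equally long lists of fundamental-frequency values sampled at the same instants; the name `prosody_deviation` and that representation are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple


def prosody_deviation(target_f0: Sequence[float],
                      unit_f0: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, max) absolute difference between two F0 contours.

    Hypothetical helper: both contours must be non-empty and aligned.
    """
    if not target_f0 or len(target_f0) != len(unit_f0):
        raise ValueError("contours must be non-empty and equally long")
    diffs = [abs(t - u) for t, u in zip(target_f0, unit_f0)]
    return mean(diffs), max(diffs)


avg_diff, max_diff = prosody_deviation([100.0, 110.0, 120.0],
                                       [100.0, 100.0, 150.0])
```

Either value can then be passed to the cost conversion as the physical quantity for the waveform-concatenation method.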

In the HMM method, on the other hand, the number of phoneme data items contained in a learning unit when the statistical model was created can be used, for example. When a learning unit contains few phoneme data items, it is difficult to build a reliable statistical model, and the quality of the HMM synthesized speech is expected to be low.

In addition, whether the statistical model is reliable can be judged by using as a parameter the proportion of the data used to train the corresponding syllable-unit HMM whose phonological environment matches that of the input text. When phonological environments do not match, the characteristics of the sound change even for the same syllable. Ideally, training would use only data whose phonological environment matches completely, but when no such data exists the HMM may be trained on data that ignores the phonological environment. Introducing this parameter makes it possible to judge the quality of the HMM synthesized speech in such cases.
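The matching proportion can be sketched as a simple ratio. The example assumes a phonological environment is represented as a (left phoneme, right phoneme) pair and that the environments of the training samples for the syllable's HMM are known; both the representation and the name `environment_match_ratio` are illustrative assumptions.

```python
from typing import Sequence, Tuple

Environment = Tuple[str, str]  # assumed form: (left phoneme, right phoneme)


def environment_match_ratio(input_env: Environment,
                            training_envs: Sequence[Environment]) -> float:
    """Fraction of training samples whose environment equals the input's.

    Returns 0.0 when there are no training samples at all.
    """
    if not training_envs:
        return 0.0
    matches = sum(1 for env in training_envs if env == input_env)
    return matches / len(training_envs)


ratio = environment_match_ratio(
    ("k", "a"),
    [("k", "a"), ("s", "a"), ("k", "a"), ("k", "o")],
)
```

A low ratio indicates that the model for this syllable was trained mostly on mismatched contexts, which the text associates with lower HMM speech quality.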

The number of consecutive selections of HMM synthesized speech that would result if HMM synthesized speech were selected for the current syllable can also be used as a parameter. In this method, which selects between waveform-concatenation and HMM synthesized speech syllable by syllable, selecting HMM synthesized speech many times in a row makes that speech, whose quality is not especially high, clearly perceptible and lowers the quality of the mixed output. This parameter is thus intended to keep HMM synthesized speech from being selected consecutively.
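The consecutive-selection count can be derived from the choices already made for earlier syllables. The sketch below assumes those choices are recorded as the strings "hmm" and "concatenative"; the labels and the name `hmm_run_length` are illustrative only.

```python
from typing import Sequence


def hmm_run_length(previous_choices: Sequence[str]) -> int:
    """Count how many of the most recent choices in a row were "hmm"."""
    run = 0
    for choice in reversed(previous_choices):
        if choice != "hmm":
            break
        run += 1
    return run


# If HMM speech were chosen for the current syllable, the run would
# become hmm_run_length(history) + 1.
history = ["concatenative", "hmm", "hmm"]
run_if_selected = hmm_run_length(history) + 1
```

Feeding this count into the HMM-side cost raises that cost as the run grows, which steers the selection back toward waveform-concatenation units.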

Because these physical quantity parameters have different dimensions, they cannot be compared directly. In Embodiment 1, each physical quantity is therefore converted into a cost value with a cost function (equation (1)) before comparison.

where
S: sub-cost function
f: sigmoid function
x: physical quantity input for the determination
θ: threshold for the physical quantity
i: type of physical quantity parameter
w: weight for each physical quantity parameter
α: slope adjustment parameter of the sigmoid function
As shown in FIG. 4, the sigmoid function has y = 0 and y = 1 as asymptotes and takes the value y = 1/2 when the exponent of e is 0. Taking the horizontal axis x as the physical quantity parameter allows physical quantity parameters of different dimensions to be converted into a single index, the cost value. In that conversion, the comparison level of each physical quantity parameter can be adjusted by manipulating the values in the exponent for each physical quantity, the threshold θ, the weight w, and the slope adjustment parameter α. These settings are designed in advance from the results of experiments and the like.
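Equation (1) itself does not appear in this text. One form consistent with the definitions above is S_i(x_i) = w_i · f(α_i (x_i − θ_i)) with f(z) = 1 / (1 + e^(−z)), which has asymptotes 0 and 1 and equals 1/2 when the exponent is 0. The sketch below implements that assumed form; the exact formula in the patent, including the sign convention of α, may differ.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sub_cost(x: float, theta: float, alpha: float, w: float = 1.0) -> float:
    """Assumed form of the sub-cost: w * sigmoid(alpha * (x - theta)).

    x: physical quantity, theta: threshold, alpha: slope, w: weight.
    A negative alpha makes the cost fall as x grows, which suits
    quantities where larger values mean better quality.
    """
    return w * sigmoid(alpha * (x - theta))


# At the threshold the sigmoid is 1/2, so the sub-cost is w / 2.
cost_at_threshold = sub_cost(5.0, theta=5.0, alpha=2.0, w=3.0)
```

With this form, θ sets where a quantity starts to count as costly, α sets how sharply the cost rises around that point, and w sets how much the quantity contributes relative to the others.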

FIG. 5 shows the operation flow of the speech quality comparison/determination process (step S50) performed by the speech quality comparison/determination means 50. The cost value conversion unit 50a in the means 50 receives, for each synthesized speech data unit, the target prosody pattern and the synthesized speech data unit (a speech unit sequence) from the waveform-concatenation synthesized speech generation unit 44. Using equation (1), it converts the difference Rd between the unit's target prosody pattern and the prosody pattern formed by the speech units into sub-cost value S_1 (step S520). It also converts the degree of match A_L of the phonological environments before and after the synthesized speech data unit into sub-cost value S_3 (step S521). Cost1 is calculated by adding sub-cost values S_1 and S_3 (step S522).

For the synthesized speech data unit output by the HMM synthesized speech generation unit 48, the cost value conversion unit 50a calculates a sub-cost value S_2 that depends on the number N of training data items contained in the synthesized speech data unit (step S523). For the HMM synthesized speech data as well, the degree of agreement A_HL of the phoneme environment before and after the synthesized speech data unit is converted into the sub-cost value S_4 (step S524). Cost2 is calculated by adding the sub-cost values S_2 and S_4 (step S525).

Cost1 and Cost2 are compared by the comparison determination unit 50b (step S526). If Cost1 ≤ Cost2 (step S526, Yes), the speech quality of the synthesized speech data unit generated by the waveform-connected speech synthesis method is judged to be better than that of the unit generated by the HMM speech synthesis method.
If Cost1 > Cost2 (step S526, No), the speech quality of the synthesized speech data unit generated by the HMM speech synthesis method is judged to be better.
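The comparison in steps S522, S525, and S526 can be sketched as follows. The function name and the string labels are hypothetical; only the sums and the Cost1 ≤ Cost2 rule come from the description.

```python
def select_method(s1, s3, s2, s4):
    """Return which method's unit to use for one synthesized speech data unit."""
    cost1 = s1 + s3  # waveform-connected synthesis (step S522)
    cost2 = s2 + s4  # HMM synthesis (step S525)
    # Step S526: a tie goes to the waveform-connected unit (Cost1 <= Cost2).
    return "waveform" if cost1 <= cost2 else "hmm"

choice_a = select_method(s1=0.2, s3=0.1, s2=0.3, s4=0.4)
choice_b = select_method(s1=0.6, s3=0.3, s2=0.2, s4=0.1)
```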

The synthesized speech data unit of the synthesis method judged by the comparison determination unit 50b to have the better speech quality is output to the hybrid synthesized speech generation processing means 52.
The hybrid synthesized speech generation processing means 52 connects the sequentially input synthesized speech data units and outputs the hybrid synthesized speech data to the bus 150 (step S529).
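A minimal sketch of this connection step is given below. It assumes that each synthesized speech data unit is a list of waveform samples and that units are joined by plain concatenation; the description does not specify any smoothing at the joins, and the function and label names are hypothetical.

```python
def build_hybrid(waveform_units, hmm_units, decisions):
    """Concatenate, unit by unit, the samples of the method chosen for each unit."""
    hybrid = []
    for wf_unit, hmm_unit, decision in zip(waveform_units, hmm_units, decisions):
        hybrid.extend(wf_unit if decision == "waveform" else hmm_unit)
    return hybrid

# Two units; the first is taken from waveform-connected synthesis, the second from HMM synthesis.
hybrid = build_hybrid([[1, 2], [3]], [[9], [8, 7]], ["waveform", "hmm"])
```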

[Modification]
Alternatively, a single cost value that depends on both the waveform-connected synthesized speech data and the HMM synthesized speech data may be calculated, and the synthesized speech data may be selected by comparing that single cost value with a threshold. FIG. 6 shows the operation flow of this modification of the speech quality comparison determination process (step S50').
The cost function used here represents the degree of barrier (cost) to using the HMM synthesized speech. It is designed to output a high cost value when the waveform-connected synthesized speech has the higher quality and a low cost value when the HMM synthesized speech has the higher quality.

Here, two processes differ from FIG. 5. The first takes as input the target prosody pattern for each synthesized speech data unit from the waveform-connected synthesized speech generation unit 44 and the synthesized speech data unit, which is a sequence of speech segments, and converts the difference Rd between the target prosody pattern of each unit and the prosody pattern formed by the speech segments into the sub-cost value S_1 using equation (1) (step S520). The second converts the degree of agreement A_L of the phoneme environment before and after the synthesized speech data unit into the sub-cost value S_3 (step S521). In steps S520 and S521, large sub-cost values S_1 and S_3 are calculated when the speech quality of the waveform-connected synthesized speech data is high, and small sub-cost values are calculated when it is low.

For the synthesized speech data unit output by the HMM synthesized speech generation unit 48, on the other hand, a smaller sub-cost value S_2 is calculated the larger the number N of training data items contained in the unit (step S523'). Likewise, the sigmoid function shown in equation (1) is designed so that the degree of agreement A_HL of the phoneme environment before and after the synthesized speech data unit of the HMM speech synthesis method is converted into a smaller sub-cost value S_4 the higher the degree of agreement (step S524').

When these sub-cost values S_1 to S_4 are calculated as a single cost value Cost1 (step S522), Cost1 (a score value) takes a small value when the speech quality of the HMM synthesized speech data is high and a large value when the speech quality of the waveform-connected synthesized speech data is high.
Comparing the cost value Cost1 designed in this way with a threshold therefore also makes it possible to determine which method's synthesized speech data has the higher speech quality.
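The modification can be sketched as below. Several points are assumptions made for illustration: the four sub-costs are combined by summation, each uses a weight of 1, the sign of the slope sets the direction of each sigmoid, and the thresholds and slopes are made-up values. The decision threshold of 2.0 is simply the midpoint of four sub-costs that each lie between 0 and 1.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sub_cost(x, theta, alpha, w=1.0):
    """Assumed sub-cost form: w * f(alpha * (x - theta))."""
    return w * sigmoid(alpha * (x - theta))

def barrier_cost(rd, a_l, n, a_hl):
    """Single cost expressing the barrier to using the HMM unit (hypothetical settings)."""
    s1 = sub_cost(rd, theta=30.0, alpha=-0.2)    # small prosody difference -> large S_1
    s3 = sub_cost(a_l, theta=0.5, alpha=10.0)    # good context match (waveform) -> large S_3
    s2 = sub_cost(n, theta=50.0, alpha=-0.1)     # many training items -> small S_2
    s4 = sub_cost(a_hl, theta=0.5, alpha=-10.0)  # good context match (HMM) -> small S_4
    return s1 + s2 + s3 + s4                     # summation is an assumption

def select_by_threshold(cost, threshold=2.0):
    """High cost favours the waveform-connected unit, low cost the HMM unit."""
    return "waveform" if cost >= threshold else "hmm"

cost_wf_good = barrier_cost(rd=5.0, a_l=0.9, n=5, a_hl=0.1)
cost_hmm_good = barrier_cost(rd=60.0, a_l=0.1, n=200, a_hl=0.9)
```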

As described above, the present invention determines which has the higher quality, the synthesized speech produced by the waveform-connected speech synthesis method or that produced by the HMM speech synthesis method, and can use the synthesized speech data of the higher-quality method for each unit of speech synthesis. High-quality synthesized speech can therefore be generated even when a relatively small speech database is used.

The description above used a cost function as the speech quality comparison method, but this is only one example and other methods are conceivable. For example, the comparison may be made using a statistical technique such as machine learning. In that case, it is determined in advance which of the synthesized speech data unit of the waveform-connected speech synthesis method and that of the HMM speech synthesis method has the better speech quality, and a binary classifier such as a support vector machine (SVM), which is one kind of machine learner, is built from those determination results and the physical quantity parameters described above. Unknown physical quantity parameters are then input to the binary classifier so that it decides which method's synthesized speech should be used.
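The idea of a learned binary decision can be illustrated with a small sketch. To stay dependency-free it trains a perceptron, which is a simpler linear binary classifier and not an SVM; the feature choice (a normalized prosody difference and a normalized training data count), the labels, and the data are all made up for illustration.

```python
def train_perceptron(samples, labels, epochs=1000):
    """Train a linear binary classifier with the perceptron rule (stand-in for an SVM)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:       # every training sample is classified correctly
            break
    return w, b

def predict(model, x):
    """Return +1 (use the waveform-connected unit) or -1 (use the HMM unit)."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical features: [normalized prosody difference, normalized training data count].
samples = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
labels = [1, 1, 1, -1, -1, -1]  # +1: waveform judged better, -1: HMM judged better
model = train_perceptron(samples, labels)
decision = predict(model, [0.05, 0.1])  # physical quantity parameters of an unseen unit
```

A real system following the description would replace the perceptron with an SVM and train it on judgments of actual synthesized units.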

The hybrid speech synthesizer of the present invention has been described using an example in which it is implemented in software on a general-purpose personal computer, but each module of the hybrid speech synthesis unit 116 shown in FIG. 2 may instead be implemented in dedicated hardware.

FIG. 1 is a diagram showing Example 1, which is an example of the hardware configuration of the hybrid speech synthesizer 100 of the present invention.
FIG. 2 is a diagram showing an example of the functional configuration of the modules of the hybrid speech synthesis unit 116 of Example 1.
FIG. 3 is a diagram showing the operation flow of the hybrid speech synthesis unit 116.
FIG. 4 is a diagram showing the sigmoid function.
FIG. 5 is a diagram showing the operation flow of the speech quality comparison determination means 50 (step S50).
FIG. 6 is a diagram showing the operation flow of the modification (step S50') of the speech quality comparison determination means 50.

Claims

In a hybrid speech synthesizer comprising:
a waveform-connected synthesized speech generation unit that generates synthesized speech data whose content matches an input text by searching for and connecting speech segments of arbitrary length included in a speech corpus; and
an HMM synthesized speech generation unit that generates a statistical model in advance by learning, with a statistical method, speech information extracted from the speech corpus, obtains speech information parameters corresponding to the input text from the statistical model, and generates synthesized speech data from the speech information parameters,
the hybrid speech synthesizer being characterized by comprising:
speech quality comparison determination means for comparing, for each syllable unit of the synthesized speech data, the synthesized speech data generated by the waveform-connected synthesized speech generation unit with the synthesized speech data generated by the HMM synthesized speech generation unit and determining which has the higher speech quality;
a signal processing unit, provided between the waveform-connected synthesized speech generation unit and the speech quality comparison determination means, that performs processing for matching the prosody of each syllable unit of the waveform-connected speech synthesis method to a target prosody pattern; and
hybrid synthesized speech generation processing means for generating hybrid synthesized speech data by connecting the synthesized speech data units according to the determination result of the speech quality comparison determination means.

The hybrid speech synthesizer according to claim 1,
wherein the speech corpus comprises speech data designed so that the quality of the synthesized speech data generated by the waveform-connected synthesized speech generation unit is high, and speech data designed so that the quality of the synthesized speech data generated by the HMM synthesized speech generation unit is high.

A hybrid speech synthesis method characterized by comprising:
a waveform-connected synthesized speech generation process of generating synthesized speech data whose content matches an input text by searching for and connecting speech segments of arbitrary length included in a speech corpus;
an HMM synthesized speech generation process of obtaining speech information parameters corresponding to the input text and generating synthesized speech data from the speech information parameters;
a signal processing process of processing the prosody of each syllable unit of the waveform-connected speech synthesis method so that it matches a target prosody pattern;
a speech quality comparison determination process of comparing, for each syllable unit, the syllable unit generated in the waveform-connected synthesized speech generation process with the syllable unit generated in the HMM synthesized speech generation process and determining which has the higher speech quality; and
a hybrid synthesized speech generation process of generating hybrid synthesized speech data by connecting the synthesized speech data units according to the determination result of the speech quality comparison determination process.

A program for causing a computer to function as each apparatus according to claim 1.

A computer-readable storage medium storing the program according to claim 4.