JP4034751B2 - Speech synthesis apparatus, speech synthesis method, and speech synthesis program

Info

Publication number: JP4034751B2
Application number: JP2004106711A
Authority: JP
Inventors: 正統田村; 竜也水谷; 岳彦籠嶋; 勝美土谷
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2004-03-31
Filing date: 2004-03-31
Publication date: 2008-01-16
Anticipated expiration: 2024-03-31
Also published as: JP2005292433A

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer that efficiently synthesizes natural, high-quality speech.

SOLUTION: The speech synthesizer comprises: an acquiring means 110 for acquiring a prosody sequence for target speech to be synthesized, for each of a plurality of segments; fused speech unit holding means 160 and 170 for holding, in association with each other, fused speech units obtained by fusing a plurality of speech units and fused speech unit prosody information indicating the prosody of the fused speech units; a held speech distortion estimating means 130 for estimating the degree of distortion between segment prosody information, which indicates the prosody of a segment obtained by the acquiring means 110, and the fused speech unit prosody information held in the fused speech unit holding means 160 and 170; a fused speech unit selecting means 140 for selecting a fused speech unit on the basis of the degree of distortion estimated by the held speech distortion estimating means 130; and a speech synthesizing means 150 for generating synthesized speech by connecting the fused speech units that the fused speech unit selecting means 140 selects for the respective segments.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a speech synthesis apparatus, a speech synthesis method, and a speech synthesis program that synthesize speech based on, for example, a prosody sequence of the speech.

Text-to-speech synthesis, a technique that artificially creates a speech signal from arbitrary text, is known. Text-to-speech synthesis is generally performed in three stages: a language processing stage, a prosody processing stage, and a speech synthesis stage.

In text-to-speech synthesis, the language processing stage first applies morphological analysis, syntactic analysis, and the like to the input text. Next, the prosody processing stage handles accent and intonation and outputs a phoneme sequence and prosodic information (fundamental frequency, phoneme duration, power, and so on). Finally, the speech signal synthesis stage synthesizes a speech signal from the phoneme sequence and prosodic information.

Text-to-speech synthesis thus synthesizes a speech signal from an arbitrary prosodic symbol string. The speech synthesis method used for it must therefore be able to synthesize an arbitrary prosodic symbol string with an arbitrary prosody.

One conventionally known method of this kind stores feature parameters of small synthesis units such as CV, CVC, and VCV (V denotes a vowel and C a consonant), referred to as representative speech units, reads them out selectively, and then connects them while controlling the fundamental frequency and duration to synthesize speech (see, for example, Patent Document 1).

As a technique based on statistical learning, an HMM-based speech synthesis method has also been disclosed (see, for example, Non-Patent Document 1). In HMM-based speech synthesis, spectral envelope parameters and fundamental frequency parameters are modeled simultaneously with hidden Markov models, and at synthesis time these parameters are generated taking into account the statistics of their static and dynamic features. The distribution corresponding to an unseen context is selected by traversing the decision tree held for each HMM state. Each node of the decision tree holds a question; the tree is traversed according to whether the input attribute information satisfies the question at each node, and the distribution at the leaf node reached is selected.
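The decision-tree lookup described above can be sketched as follows. This is a minimal illustration, not the method of Non-Patent Document 1: the `Node` and `Leaf` classes, the context keys `phoneme` and `accented`, and the string labels standing in for distributions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    # Placeholder label standing in for a distribution's parameters (assumption).
    distribution: str


@dataclass
class Node:
    # A yes/no question about the input attribute information.
    question: Callable[[Dict], bool]
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def select_distribution(tree: Union[Node, Leaf], context: Dict) -> str:
    """Follow the questions from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(context) else node.no
    return node.distribution


# Hypothetical two-question tree for a single HMM state.
tree = Node(
    question=lambda c: c["phoneme"] in {"a", "i", "u", "e", "o"},
    yes=Node(
        question=lambda c: c["accented"],
        yes=Leaf("vowel_accented"),
        no=Leaf("vowel_plain"),
    ),
    no=Leaf("consonant"),
)
```

Because every context ends at some leaf, a context that never occurred in training still receives a distribution, which is the property the passage relies on.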

Patent Document 1: Japanese Patent No. 2583074
Non-Patent Document 1: Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura, "Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis", IEICE Transactions, November 2000, Vol. J83-D-II, No. 11, pp. 2099-21-8

A speech synthesis method that uses representative speech units relies on representative speech units created in advance. The usable speech units are therefore limited to those prepared beforehand, which makes it difficult to handle the wide variation in input prosody and phonological environments.

Increasing the number of representative speech units created in advance makes it possible to handle more variation in the input prosodic environment, but it also lowers processing efficiency. In addition, the computational resources allocated to speech synthesis are limited, which in turn limits the number of representative speech units that can be prepared.

In unit-selection speech synthesis, it is difficult to formulate, as a cost function, the rules for selecting a speech unit sequence that sounds natural to a human listener. It is also difficult to exclude defective units.

The present invention has been made in view of the above, and its object is to provide a speech synthesis apparatus that can efficiently synthesize natural, high-quality speech.

To solve the above problems and achieve the object, the present invention is characterized by comprising: speech unit holding means for holding, in association with each other, a plurality of speech units for the same phonetic unit whose prosodies differ from one another and speech unit prosody information indicating the prosody of each speech unit; speech unit selection means for selecting a plurality of speech units from the speech unit holding means based on teacher speech prosody information, which indicates the prosody of preset teacher speech, and on the speech unit prosody information held in the speech unit holding means; combination determining means for determining, from the plurality of speech units selected by the speech unit selection means, a combination of speech units that satisfies a predetermined condition; fused speech unit creating means for creating a fused speech unit by fusing the plurality of speech units included in the determined combination; fused speech unit prosody information creating means for creating fused speech unit prosody information, which indicates the prosody of the fused speech unit, based on the prosody information corresponding to each of the speech units included in the determined combination; fused speech unit holding means for holding, in association with each other, the fused speech unit created by the fused speech unit creating means and the fused speech unit prosody information created by the fused speech unit prosody information creating means; acquisition means for acquiring a prosody sequence for target speech to be synthesized, for each of a plurality of segments that are the synthesis units of speech synthesis; held speech distortion estimation means for estimating the degree of distortion between segment prosody information, which indicates the prosody of a segment obtained by the acquisition means, and the fused speech unit prosody information held in the fused speech unit holding means; fused speech unit selection means for selecting a fused speech unit based on the degree of distortion estimated by the held speech distortion estimation means; and speech synthesis means for generating synthesized speech by connecting the fused speech units that the fused speech unit selection means selected for the respective segments.

The present invention is also characterized by comprising: a speech unit selection step of selecting a plurality of speech units from speech unit holding means, which holds in association with each other a plurality of speech units for the same phonetic unit whose prosodies differ from one another and speech unit prosody information indicating the prosody of each speech unit, based on the speech unit prosody information held in the speech unit holding means and on teacher speech prosody information indicating the prosody of preset teacher speech; a combination determining step of determining, from the plurality of speech units selected in the speech unit selection step, a combination of speech units that satisfies a predetermined condition; a fused speech unit creation step of creating a fused speech unit by fusing the plurality of speech units included in the determined combination; a fused speech unit prosody information creation step of creating fused speech unit prosody information, which indicates the prosody of the fused speech unit, based on the prosody information corresponding to each of the speech units included in the determined combination; a storing step of storing, in fused speech unit holding means and in association with each other, the fused speech unit created in the fused speech unit creation step and the fused speech unit prosody information created in the fused speech unit prosody information creation step; an acquisition step of acquiring a prosody sequence for target speech to be synthesized, for each of a plurality of segments that are the synthesis units of speech synthesis; a held speech distortion estimation step of estimating the degree of distortion between the fused speech unit prosody information held in the fused speech unit holding means and segment prosody information indicating the prosody of a segment obtained in the acquisition step; a fused speech unit selection step of selecting a fused speech unit based on the degree of distortion estimated in the held speech distortion estimation step; and a speech synthesis step of generating synthesized speech by connecting the fused speech units selected for the respective segments in the fused speech unit selection step.

The present invention is also a speech synthesis program that causes a computer to execute speech synthesis processing, characterized by comprising: a speech unit selection step of selecting a plurality of speech units from speech unit holding means, which holds in association with each other a plurality of speech units for the same phonetic unit whose prosodies differ from one another and speech unit prosody information indicating the prosody of each speech unit, based on the speech unit prosody information held in the speech unit holding means and on teacher speech prosody information indicating the prosody of preset teacher speech; a combination determining step of determining, from the plurality of speech units selected in the speech unit selection step, a combination of speech units that satisfies a predetermined condition; a fused speech unit creation step of creating a fused speech unit by fusing the plurality of speech units included in the determined combination; a fused speech unit prosody information creation step of creating fused speech unit prosody information, which indicates the prosody of the fused speech unit, based on the prosody information corresponding to each of the speech units included in the determined combination; a storing step of storing, in fused speech unit holding means and in association with each other, the fused speech unit created in the fused speech unit creation step and the fused speech unit prosody information created in the fused speech unit prosody information creation step; an acquisition step of acquiring a prosody sequence for target speech to be synthesized, for each of a plurality of segments that are the synthesis units of speech synthesis; a held speech distortion estimation step of estimating the degree of distortion between the fused speech unit prosody information held in the fused speech unit holding means and segment prosody information indicating the prosody of a segment obtained in the acquisition step; a fused speech unit selection step of selecting a fused speech unit based on the degree of distortion estimated in the held speech distortion estimation step; and a speech synthesis step of generating synthesized speech by connecting the fused speech units selected for the respective segments in the fused speech unit selection step.

In the speech synthesis apparatus according to the present invention, the fused speech unit holding means holds each fused speech unit in association with its fused speech unit prosody information, and speech is synthesized using the fused speech units that the fused speech unit selection means selects based on the degree of distortion estimated by the held speech distortion estimation means. Compared with creating fused speech units at synthesis time, this makes processing more efficient while still allowing natural, high-quality speech to be synthesized.

Embodiments of the speech synthesis apparatus, speech synthesis method, and speech synthesis program according to the present invention are described in detail below with reference to the drawings. The present invention is not limited to these embodiments.

(Embodiment 1)
FIG. 1 is a block diagram showing the overall configuration of a text-to-speech synthesizer according to the first embodiment of the present invention. The text-to-speech synthesizer 10 includes a text acquisition unit 11, a language processing unit 12, a prosody processing unit 13, a speech synthesis unit 14, and a speech waveform output unit 15.

The text acquisition unit 11 acquires, from outside, the text data to be synthesized. The language processing unit 12 performs morphological and syntactic analysis on the text data acquired by the text acquisition unit 11 and sends the result to the prosody processing unit 13.

Based on the language analysis result, the prosody processing unit 13 identifies the accent, intonation, and other prosodic characteristics of the text data. From these characteristics it generates a phoneme sequence (phoneme symbol string) and prosodic information for the target speech to be synthesized, and sends the phoneme sequence and prosodic information to the speech synthesis unit 14. Here, prosodic information is information indicating the fundamental frequency, phoneme duration, power, and the like.

The speech synthesis unit 14 generates a speech waveform from the phoneme sequence and prosodic information. The generated speech waveform is output by the speech waveform output unit 15.
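The data flow through units 11 to 15 can be sketched as a chain of functions. This is only an illustration of the hand-offs between stages: the function names, the whitespace tokenizer, the one-character-per-phoneme rule, the constant prosody values, and the silent placeholder waveform are all assumptions, not the processing the embodiment actually performs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prosody:
    f0_hz: float        # fundamental frequency
    duration_ms: float  # phoneme duration
    power: float


def language_processing(text: str) -> List[str]:
    """Stand-in for morphological/syntactic analysis: split on whitespace."""
    return text.split()


def prosody_processing(tokens: List[str]) -> Tuple[List[str], List[Prosody]]:
    """Stand-in: treat each character as a phoneme with constant prosody."""
    phonemes = [ch for token in tokens for ch in token]
    prosody = [Prosody(f0_hz=120.0, duration_ms=80.0, power=1.0) for _ in phonemes]
    return phonemes, prosody


def speech_synthesis(phonemes: List[str], prosody: List[Prosody]) -> List[float]:
    """Stand-in: emit one silent sample per millisecond of requested duration."""
    waveform: List[float] = []
    for p in prosody:
        waveform.extend([0.0] * int(p.duration_ms))
    return waveform


def text_to_speech(text: str) -> List[float]:
    tokens = language_processing(text)               # language processing unit 12
    phonemes, prosody = prosody_processing(tokens)   # prosody processing unit 13
    return speech_synthesis(phonemes, prosody)       # speech synthesis unit 14
```

For example, `text_to_speech("ka sa")` passes four phonemes with 80 ms each to the synthesis stage, so the placeholder waveform has 320 samples.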

FIG. 2 is a block diagram showing the detailed configuration of the speech synthesis unit 14 of FIG. 1. The speech synthesis unit 14 includes a phoneme sequence / prosodic information acquisition unit 110, a distortion estimation unit 130, a fused speech unit selection unit 140, a fused speech unit editing / connection unit 150, a fused speech unit creation unit 180, a fused speech unit storage unit 160, and a fused speech unit phoneme environment storage unit 170.

The phoneme sequence / prosodic information acquisition unit 110 acquires the phoneme sequence and prosodic information of the target speech from the prosody processing unit 13. Hereinafter, the phoneme sequence and prosodic information acquired by the phoneme sequence / prosodic information acquisition unit 110 are referred to as the input phoneme sequence and the input prosodic information, respectively. The input phoneme sequence is, for example, a sequence of phoneme symbols.

The fused speech unit storage unit 160, for its part, stores a plurality of fused speech units that have already been created. Here, a fused speech unit is a speech unit obtained by fusing a plurality of speech units for the same phonetic unit. The phonetic unit in this embodiment is a phoneme, although it is not limited to phonemes. The fused speech unit storage unit 160 stores, for the same phoneme, a plurality of speech units whose prosodies differ from one another.

The fused speech unit storage unit 160 stores the fused speech units in the unit of speech used when generating synthesized speech (the synthesis unit).

Here, a synthesis unit is a phoneme or a combination of phonemes or of parts of phonemes. Examples are half-phones, phonemes (C, V), diphones (CV, VC, VV), triphones (CVC, VCV), and syllables (CV, V), where V denotes a vowel and C a consonant. These may also be mixed, in which case the synthesis unit may be of variable length.
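The C/V notation for synthesis units can be illustrated with a short sketch. It only groups phoneme symbols into fixed-width windows and labels them; real diphones and half-phones are cut within phonemes, which this example does not model. The vowel inventory and the function names are assumptions for the example.

```python
from typing import List

# Assumed vowel inventory (romanized Japanese); everything else counts as a consonant.
VOWELS = {"a", "i", "u", "e", "o"}


def unit_type(phonemes: List[str]) -> str:
    """Label a unit with its C/V pattern, e.g. ['k', 'a'] -> 'CV'."""
    return "".join("V" if p in VOWELS else "C" for p in phonemes)


def to_units(phonemes: List[str], width: int) -> List[List[str]]:
    """Slide a window of `width` phoneme symbols over the sequence.

    width=1 yields phonemes, width=2 diphone-like pairs,
    width=3 triphone-like triples.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    return [phonemes[i:i + width] for i in range(len(phonemes) - width + 1)]
```

With the sequence `["k", "a", "s", "a"]`, a width of 2 gives the pairs `k a`, `a s`, `s a`, labeled CV, VC, CV.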

The fused speech unit phoneme environment storage unit 170 stores the fused speech unit phoneme environment for each fused speech unit stored in the fused speech unit storage unit 160.

Here, a fused speech unit phoneme environment is information corresponding to the combination of factors that form the environment of the fused speech unit. The factors include, for example, the phoneme name of the fused speech unit, the preceding phoneme, the following phoneme, the phoneme after the following phoneme, the fundamental frequency, the phoneme duration, the power, the presence or absence of stress, the position relative to the accent nucleus, the time since the last breath pause, the speaking rate, and the emotion. The fused speech unit phoneme environment thus includes fused speech unit prosody information indicating the prosody of the fused speech unit.

Each fused speech unit stored in the fused speech unit storage unit 160 is associated with its fused speech unit phoneme environment stored in the fused speech unit phoneme environment storage unit 170. Specifically, for example, each fused speech unit phoneme environment stored in the fused speech unit phoneme environment storage unit 170 may be stored together with a fused speech unit number that identifies the corresponding fused speech unit.
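One way to picture the association by fused speech unit number is a pair of maps that share the same keys. The sketch below is an assumption about layout only: the class names, the subset of environment fields, and the list-of-floats waveform are hypothetical and are not specified in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class PhonemeEnvironment:
    # A subset of the factors listed in the text, chosen for illustration.
    phoneme: str
    preceding: str
    following: str
    f0_hz: float
    duration_ms: float
    power: float


@dataclass
class FusedUnitInventory:
    # Plays the role of storage unit 160: unit number -> waveform samples.
    waveforms: Dict[int, List[float]] = field(default_factory=dict)
    # Plays the role of storage unit 170: unit number -> phoneme environment.
    environments: Dict[int, PhonemeEnvironment] = field(default_factory=dict)

    def add(self, unit_number: int, waveform: List[float],
            environment: PhonemeEnvironment) -> None:
        """Store a unit and its environment under the same number."""
        self.waveforms[unit_number] = waveform
        self.environments[unit_number] = environment

    def waveform_for(self, unit_number: int) -> List[float]:
        """Look up the waveform once an environment has been chosen."""
        return self.waveforms[unit_number]
```

Keeping the environments separate from the waveforms means selection can scan only the small environment records and fetch a waveform by number afterwards.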

Here, the fused speech unit storage unit 160 and the fused speech unit phoneme environment storage unit 170 of this embodiment constitute the fused speech unit holding means recited in the claims.

The fused speech unit creation unit 180 creates the fused speech units to be stored in the fused speech unit storage unit 160 and the fused speech unit phoneme environments to be stored in the fused speech unit phoneme environment storage unit 170. In this embodiment, the fused speech unit creation unit 180 creates the fused speech units and fused speech unit phoneme environments in advance and stores them in the fused speech unit storage unit 160 and the fused speech unit phoneme environment storage unit 170.

The distortion estimation unit 130 estimates the degree of distortion between a given segment and each fused speech unit phoneme environment stored in the fused speech unit phoneme environment storage unit 170, based on those stored fused speech unit phoneme environments and on the input prosodic information for the segment acquired from the phoneme sequence / prosodic information acquisition unit 110.

Here, the distortion estimation unit 130 of this embodiment constitutes the held speech distortion estimation means and the created speech distortion estimation means of the present invention.

The fused speech unit selection unit 140 selects a fused speech unit from the fused speech unit storage unit 160 based on the degree of distortion estimated by the distortion estimation unit 130.

Specifically, the distortion estimation unit 130 first estimates the degree of distortion between the input prosodic information for a given segment and each of the fused speech unit phoneme environments stored in the fused speech unit phoneme environment storage unit 170. The fused speech unit selection unit 140 then identifies the minimum of the degrees of distortion obtained for the fused speech unit phoneme environments and selects, from the fused speech unit storage unit 160, the fused speech unit corresponding to the environment that gives this minimum. Repeating this for each segment yields a sequence of fused speech units corresponding to the sequence of phoneme symbols in the input phoneme sequence. The method of estimating the degree of distortion is described later.
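The minimum-distortion selection can be sketched as below. The distortion measure here is a stand-in: a weighted sum of squared differences in fundamental frequency, duration, and power is assumed purely for illustration, since the actual estimation method is described elsewhere in the specification. The names `Prosody`, `distortion`, and `select_unit` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Prosody:
    f0_hz: float
    duration_ms: float
    power: float


def distortion(target: Prosody, candidate: Prosody,
               weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Assumed measure: weighted sum of squared prosodic differences."""
    w_f0, w_dur, w_pow = weights
    return (w_f0 * (target.f0_hz - candidate.f0_hz) ** 2
            + w_dur * (target.duration_ms - candidate.duration_ms) ** 2
            + w_pow * (target.power - candidate.power) ** 2)


def select_unit(target: Prosody, environments: Dict[int, Prosody]) -> int:
    """Return the fused speech unit number whose environment gives the
    smallest estimated distortion for this segment."""
    if not environments:
        raise ValueError("no candidate environments")
    return min(environments, key=lambda n: distortion(target, environments[n]))
```

In practice the candidates would first be restricted to environments for the segment's phoneme; the sketch assumes `environments` has already been filtered that way.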

The fused speech unit editing / connection unit 150 edits the sequence of fused speech units obtained for the segments as needed and connects them, thereby generating the speech waveform of the synthesized speech. The generated speech waveform is output to the outside via the speech waveform output unit 15.

FIG. 3 is a block diagram showing the detailed functional configuration of the fused speech unit creation unit 180 described with reference to FIG. 2. The fused speech unit creation unit 180 includes a speech unit storage unit 181, a fused speech unit phoneme environment storage unit 182, a speech unit combination creation unit 183, a fused speech unit creation unit 184, and a fused speech unit phoneme environment creation unit 185.

The speech unit storage unit 181 stores a large number of speech units. The fused speech unit phoneme environment storage unit 182 stores a speech unit phoneme environment for each of the speech units stored in the speech unit storage unit 181. The synthesis unit of the speech units stored in the speech unit storage unit 181 is the same as the synthesis unit of the fused speech units stored in the fused speech unit storage unit 160.

Each speech unit stored in the speech unit storage unit 181 is associated with its speech unit phoneme environment stored in the fused speech unit phoneme environment storage unit 182. Specifically, for example, each speech unit phoneme environment stored in the fused speech unit phoneme environment storage unit 182 may be stored together with a speech unit number that identifies the corresponding speech unit.

The speech unit storage unit 181 and the fused speech unit phoneme environment storage unit 182 of this embodiment constitute the speech unit holding means recited in the claims.

Based on the speech unit phoneme environments stored in the fused speech unit phoneme environment storage unit 182, the speech unit combination creation unit 183 determines, from among the speech units stored in the speech unit storage unit 181, a combination of speech units to be fused.

The fused speech unit creation unit 184 extracts from the speech unit storage unit 181 the speech units included in the combination determined by the speech unit combination creation unit 183, and creates a fused speech unit by fusing the extracted speech units. The fused speech unit creation unit 184 stores the created fused speech unit in the fused speech unit storage unit 160.

The fused speech unit phoneme environment creation unit 185 extracts from the fused speech unit phoneme environment storage unit 182 the speech unit phoneme environments of the speech units included in the combination determined by the speech unit combination creation unit 183, and creates a fused speech unit phoneme environment based on the extracted speech unit phoneme environments. The fused speech unit phoneme environment creation unit 185 stores the created fused speech unit phoneme environment in the fused speech unit phoneme environment storage unit 170.

具体的には、融合音声素片音素環境作成部１８５は、各音声素片の音声素片音素環境のセントロイドを用いて融合音声素片音素環境を作成する。 Specifically, the fused speech unit phoneme environment creation unit 185 creates a fused speech unit phoneme environment using the centroid of the speech unit phoneme environment of each speech unit.

他の例としては、音声素片組み合わせ作成部１８３によって決定された組み合わせに含まれる複数の音声素片それぞれの音声素片音素環境を、融合音声素片音素環境として作成してもよい。 As another example, a speech unit phoneme environment of each of a plurality of speech units included in the combination determined by the speech unit combination creation unit 183 may be created as a fusion speech unit phoneme environment.

ここで、本実施の形態にかかる融合音声素片作成部１８４は、特許請求の範囲に記載の音声素片選択手段と、融合音声素片作成手段とを構成する。また、本実施の形態にかかる融合音声素片音素環境作成部１８５は、特許請求の範囲に記載の音声素片選択手段と融合音声素片韻律情報作成手段とを構成する。 Here, the fused speech unit creating unit 184 according to the present embodiment constitutes a speech unit selecting unit and a fused speech unit creating unit described in the claims. Also, the fused speech unit phoneme environment creation unit 185 according to the present embodiment constitutes a speech unit selection unit and a fused speech unit prosody information creation unit described in the claims.

図４は、図３に示した音声素片組み合わせ作成部１８３の詳細な機能構成を示すブロック図である。音声素片組み合わせ作成部１８３は、音声素片組み合わせ頻度情報記憶部１８３５と、音韻系列・韻律情報取得部１８３１と、音声素片選択部１８３２と、音声素片組み合わせ頻度情報作成部１８３３と、音声素片組合せ決定部１８３４とを有している。 FIG. 4 is a block diagram showing the detailed functional configuration of the speech unit combination creation unit 183 shown in FIG. 3. The speech unit combination creation unit 183 includes a speech unit combination frequency information storage unit 1835, a phoneme sequence / prosodic information acquisition unit 1831, a speech unit selection unit 1832, a speech unit combination frequency information creation unit 1833, and a speech unit combination determination unit 1834.

音韻系列・韻律情報取得部１８３１は、文章データを解析して得られる音韻系列を合成単位で区切ることにより得られる複数のセグメントのそれぞれに対する音韻および入力韻律情報を取得する。なお、入力韻律情報等は、図１において説明した韻律処理部１３から取得する。 The phoneme sequence / prosodic information acquisition unit 1831 acquires the phoneme and the input prosodic information for each of the plurality of segments obtained by dividing, into synthesis units, the phoneme sequence obtained by analyzing text data. The input prosodic information and the like are acquired from the prosody processing unit 13 described with reference to FIG. 1.

複数音声素片選択部１８３２は、入力韻律情報と、融合音声素片音素環境記憶部１８２に格納されている融合音声素片音素環境との間の歪みの度合いを推定する。そして、歪みの度合いに基づいて、音声素片記憶部１８１に格納されている音声素片の中から複数の音声素片を選択する。選択方法は、融合音声素片選択部１４０における選択方法と同じ方法であってもよい。 The multiple speech unit selection unit 1832 estimates the degree of distortion between the input prosody information and the fused speech unit phoneme environment stored in the fused speech unit phoneme environment storage unit 182. Based on the degree of distortion, a plurality of speech units are selected from speech units stored in the speech unit storage unit 181. The selection method may be the same as the selection method in the fusion speech unit selection unit 140.

音声素片組み合わせ頻度情報作成部１８３３は、複数音声素片選択部１８３２において選択された複数の音声素片の組み合わせの使用頻度をカウントする。そして、カウントした使用頻度を音声素片組み合わせ頻度情報記憶部１８３５に格納する。 The speech unit combination frequency information creation unit 1833 counts the usage frequency of the combination of the plurality of speech units selected by the multiple speech unit selection unit 1832. The counted usage frequency is stored in the speech unit combination frequency information storage unit 1835.

音声素片組合せ決定部１８３４は、前記音声素片組み合わせ頻度情報記憶部１８３５に格納された頻度情報に基づいて、複数の音声素片の組み合わせを決定する。音声素片組合せ決定部１８３４は、例えば、選択した複数の音声素片の使用頻度が、予め定めた閾値以上となるように、複数の音声素片を選択してもよい。 The speech unit combination determination unit 1834 determines a combination of a plurality of speech units based on the frequency information stored in the speech unit combination frequency information storage unit 1835. For example, the speech element combination determination unit 1834 may select a plurality of speech elements so that the frequency of use of the selected plurality of speech elements is equal to or higher than a predetermined threshold.

また、他の例としては、複数の組み合わせのうち、使用頻度の高い融合音声素片に対応する組み合わせを選択してもよい。例えば、融合音声素片記憶部１６０に格納すべき融合音声素片の数を制限している場合等に有効である。 As another example, a combination corresponding to a fusion speech unit that is frequently used may be selected from a plurality of combinations. For example, this is effective when the number of fused speech units to be stored in the fused speech unit storage unit 160 is limited.
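As a non-limiting illustration, the counting of combination usage frequencies and the two selection policies described above (a frequency threshold, or keeping only the most frequently used combinations) can be sketched as follows. The function names, the use of `frozenset` to make a combination independent of the order of its units, and the sample unit numbers are assumptions made for illustration only.

```python
from collections import Counter


def count_combinations(selections):
    """Count how often each combination of unit numbers was selected.

    `selections` holds one tuple of unit numbers per processed segment.
    Treating a combination as an unordered set is an assumption.
    """
    counts = Counter()
    for unit_numbers in selections:
        counts[frozenset(unit_numbers)] += 1
    return counts


def choose_combinations(counts, min_count=None, max_kept=None):
    """Keep combinations used at least `min_count` times, most frequent first,
    optionally truncated to `max_kept` entries (limited storage)."""
    ranked = [combo for combo, n in counts.most_common()
              if min_count is None or n >= min_count]
    return ranked if max_kept is None else ranked[:max_kept]


# Hypothetical selections for five segments of the same phoneme.
selections = [(3, 7, 9), (9, 3, 7), (1, 2, 4), (3, 7, 9), (1, 2, 5)]
counts = count_combinations(selections)
frequent = choose_combinations(counts, min_count=2)
```

With a threshold of 2, only the combination of units 3, 7 and 9 is kept; setting `max_kept` instead bounds the number of fused speech units that would have to be stored.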

このように、融合音声素片を作成するために選択する音声素片の組み合わせの選択方法は、本実施の形態に限定されるものではなく、予め定められた条件に基づいて選択すればよい。 As described above, the method for selecting a combination of speech units to be selected for creating a fused speech unit is not limited to the present embodiment, and may be selected based on a predetermined condition.

以下、音声合成部１４の各処理について詳しく説明する。ここでは、合成単位の音声素片は音素であるとする。 Hereinafter, each process of the speech synthesis unit 14 will be described in detail. Here, it is assumed that the speech unit of the synthesis unit is a phoneme.

図５は、融合音声素片記憶部１６０のデータ構成を模式的に示している。また、図６は、融合音声素片音素環境記憶部１７０のデータ構成を模式的に示している。 FIG. 5 schematically shows the data configuration of the fused speech unit storage unit 160. FIG. 6 schematically shows the data structure of the fused speech unit phoneme environment storage unit 170.

融合音声素片記憶部１６０は、図５に示すように、各音素の音声信号をピッチ波形として格納している。さらに各音声信号を当該音素を識別するための融合音声素片番号に対応付けて格納している。 As shown in FIG. 5, the fused speech unit storage unit 160 stores the speech signal of each phoneme as a pitch waveform. Furthermore, each audio signal is stored in association with a fusion speech unit number for identifying the phoneme.

ここで、ピッチ波形とは、その長さが音声の基本周期の数倍程度までで、それ自身は基本周期を持たない比較的短い波形であって、そのスペクトルが音声信号のスペクトル包絡を表すものを意味する。 Here, a pitch waveform means a relatively short waveform whose length is at most several times the fundamental period of the speech, which has no fundamental period of its own, and whose spectrum represents the spectral envelope of the speech signal.

また、融合音声素片音素環境記憶部１７０は、図６に示すように、融合音声素片記憶部１６０に記憶されている各融合音声素片の音素環境情報を、当該音素の素片番号に対応付けて格納している。本実施の形態にかかる融合音声素片音素環境記憶部１７０は、音素環境として、音素記号（音素名）、基本周波数、音韻継続長、接続境界ケプストラムを格納している。 Further, as shown in FIG. 6, the fused speech unit phoneme environment storage unit 170 stores the phoneme environment information of each fused speech unit held in the fused speech unit storage unit 160 in association with the unit number of that fused speech unit. The fused speech unit phoneme environment storage unit 170 according to the present embodiment stores, as the phoneme environment, a phoneme symbol (phoneme name), a fundamental frequency, a phoneme duration, and a connection boundary cepstrum.
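The two storage units described above can be pictured as two tables keyed by the same unit number. The following is a minimal sketch under assumptions: the class names, field names, physical units (Hz, milliseconds) and sample values are hypothetical and are not taken from FIG. 5 or FIG. 6.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FusedUnit:
    """One entry of the fused speech unit storage: a series of pitch waveforms."""
    unit_number: int
    pitch_waveforms: List[List[float]]


@dataclass
class PhonemeEnvironment:
    """One entry of the phoneme environment storage for the same unit number."""
    unit_number: int
    phoneme: str                    # phoneme symbol (phoneme name)
    f0_hz: float                    # fundamental frequency (unit assumed)
    duration_ms: float              # phoneme duration (unit assumed)
    boundary_cepstrum: List[float]  # cepstrum at the connection boundary


unit_store: Dict[int, FusedUnit] = {
    1: FusedUnit(1, [[0.0, 0.5, 0.0], [0.0, 0.4, 0.0]]),
    2: FusedUnit(2, [[0.0, 0.3, 0.0]]),
}
env_store: Dict[int, PhonemeEnvironment] = {
    1: PhonemeEnvironment(1, "a", 120.0, 85.0, [0.1, -0.2]),
    2: PhonemeEnvironment(2, "m", 115.0, 60.0, [0.0, 0.3]),
}


def candidates(phoneme: str) -> List[int]:
    """Unit numbers whose stored phoneme symbol matches the requested phoneme."""
    return [n for n, env in env_store.items() if env.phoneme == phoneme]
```

Keeping the waveform data and the phoneme environment in separate tables lets the selection step work on the small environment records alone and fetch waveforms only for the units finally chosen.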

なお、本実施の形態においては、融合音声素片は音素単位であるが、他の例としては、半音素、ダイフォン、トライフォン、音節であってもよい。また、これらの組み合わせであってもよい。 In the present embodiment, the fusion speech unit is a phoneme unit, but other examples may be a semiphone, a diphone, a triphone, and a syllable. Moreover, these combinations may be sufficient.

次に、図２において説明した歪み推定部１３０の処理について詳述する。歪み推定部１３０は、コスト関数により算出されたコストに基づいて歪みの度合いを推定する。そして、融合音声素片選択部１４０は歪み推定部１３０によって推定されたコストに基づいて融合音声素片を選択する。 Next, the processing of the distortion estimation unit 130 described in FIG. 2 will be described in detail. The distortion estimation unit 130 estimates the degree of distortion based on the cost calculated by the cost function. Then, the fusion speech unit selection unit 140 selects a fusion speech unit based on the cost estimated by the distortion estimation unit 130.

ここで、コスト関数とは、テキストデータに含まれる全セグメントに対する歪みの度合いによって定まる関数である。 Here, the cost function is a function determined by the degree of distortion for all segments included in the text data.

以下、コスト関数について詳述する。融合音声素片を変形・接続して合成音声を生成する際に生ずる歪の要因ごとにサブコスト関数を定める。ここで、サブコスト関数とは、融合音声素片記憶部１６０に記憶されている融合音声素片を用いて合成音声を生成したときに生ずる当該合成音声の目標音声に対する歪みの度合いを推定するためのコストを算出するための関数である。 Hereinafter, the cost function will be described in detail. A sub-cost function is determined for each factor of distortion generated when a synthesized speech is generated by deforming and connecting fused speech segments. Here, the sub-cost function is used to estimate the degree of distortion of the synthesized speech with respect to the target speech that occurs when the synthesized speech is generated using the fused speech unit stored in the fused speech unit storage unit 160. This is a function for calculating the cost.

サブコスト関数をＣｎ（ｕｉ、ｕｉ−１、ｔｉ）（ｎ：１、…、Ｎ、Ｎはサブコスト関数の数）と定める。ここで、ｔｉは、入力音韻系列および入力韻律情報に対応する目標とする音声（目標音声）をｔ＝（ｔ１、…、ｔＩ）としたときのｉ番目のセグメントに対応する部分の音声素片の目標とする音素環境情報を表し、ｕｉは融合音声素片記憶部１６０に記憶されている融合音声素片のうち、ｔｉと同じ音韻の融合音声素片を表す。 The sub-cost functions are defined as Cn (ui, ui-1, ti) (n = 1, ..., N, where N is the number of sub-cost functions). Here, when the target speech corresponding to the input phoneme sequence and the input prosodic information is written as t = (t1, ..., tI), ti represents the target phoneme environment information of the speech unit for the portion corresponding to the i-th segment, and ui represents a fused speech unit, among those stored in the fused speech unit storage unit 160, that has the same phoneme as ti.

具体的には、当該コストを算出する際に、目標コストと接続コストの２種類のサブコストを用いる。ここで、目標コストとは、融合音声素片を使用することによって生じる合成音声の目標音声に対する歪みの度合いを推定するためのコストである。また、接続コストとは、融合音声素片を他の音声素片と接続したときに生じる当該合成音声の目標音声に対する歪みの度合いを推定するためのコストである。 Specifically, when calculating the cost, two types of sub-costs, a target cost and a connection cost, are used. Here, the target cost is a cost for estimating the degree of distortion of the synthesized speech with respect to the target speech generated by using the fusion speech unit. The connection cost is a cost for estimating the degree of distortion of the synthesized speech with respect to the target speech that occurs when the fusion speech unit is connected to another speech unit.

さらに、目標コストとして、基本周波数コストおよび音韻継続時間長コストを用いる。ここで、基本周波数コストとは、融合音声素片記憶部１６０に記憶されている融合音声素片の基本周波数と目標の基本周波数との違い（差）を表すコストである。また、音韻継続時間長コストとは、融合音声素片の音韻継続時間長と目標の音韻継続時間長との違い（差）を表すコストである。接続コストとしては、接続境界でのスペクトルの違い（差）を表すスペクトル接続コストを用いる。 Further, a fundamental frequency cost and a phoneme duration cost are used as the target costs. Here, the fundamental frequency cost is a cost representing the difference between the fundamental frequency of a fused speech unit stored in the fused speech unit storage unit 160 and the target fundamental frequency. The phoneme duration cost is a cost representing the difference between the phoneme duration of the fused speech unit and the target phoneme duration. As the connection cost, a spectral connection cost representing the difference in spectrum at the connection boundary is used.

具体的には、基本周波数コストは、次式によって定義される。 Specifically, the fundamental frequency cost is defined by equation (1).

ここで、ｖｉは融合音声素片記憶部１６０に記憶されている音声素片ｕｉの音素環境を、ｆは音素環境ｖｉから基本周波数を取り出す関数を表す。また、音韻継続時間長コストは、次式によって定義される。 Here, vi represents the phoneme environment of the speech unit ui stored in the fused speech unit storage unit 160, and f represents a function that extracts the fundamental frequency from the phoneme environment vi. The phoneme duration cost is defined by equation (2).

ここで、ｇは音素環境ｖｉから音韻継続時間長を取り出す関数を表す。スペクトル接続コストは、２つの音声素片間のケプストラム距離によって算出される。 Here, g represents a function that extracts the phoneme duration from the phoneme environment vi. The spectral connection cost is calculated from the cepstrum distance between two speech units.

なお、２つの音声素片間のケプストラム距離は次式によって定義される。 The cepstrum distance between two speech units is defined by equation (3).

ここで、ｈは融合音声素片ｕｉの接続境界のケプストラム係数をベクトルとして取り出す関数を表す。 Here, h represents a function that extracts the cepstrum coefficients at the connection boundary of the fused speech unit ui as a vector.
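The equations for the fundamental frequency cost, the phoneme duration cost and the cepstrum distance do not appear in this text. Purely as an assumption, forms consistent with the descriptions above (a difference in fundamental frequency, a difference in duration, and a distance between boundary cepstrum vectors) could be written as follows. The logarithmic frequency scale, the squaring and the Euclidean norm are illustrative choices and are not taken from the embodiment.

```latex
% Assumed illustrative forms; only the roles of f, g, h, v_i, t_i and u_i
% come from the description.
\begin{align}
C_1(u_i, u_{i-1}, t_i) &= \bigl\{\log f(v_i) - \log f(t_i)\bigr\}^2
  && \text{fundamental frequency cost} \\
C_2(u_i, u_{i-1}, t_i) &= \bigl\{g(v_i) - g(t_i)\bigr\}^2
  && \text{phoneme duration cost} \\
C_3(u_i, u_{i-1}, t_i) &= \bigl\lVert h(u_i) - h(u_{i-1}) \bigr\rVert
  && \text{spectral connection cost}
\end{align}
```

Under these assumed forms the two target costs depend only on the candidate unit and its target, while the connection cost depends on the candidate and the unit chosen for the preceding segment.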

これらのサブコスト関数の重み付き和を合成単位コスト関数と定義する。合成単位コスト関数は次式によって定義される。 The weighted sum of these sub-cost functions is defined as the synthesis unit cost function, which is given by the following equation.

$$C(u_i, u_{i-1}, t_i) = \sum_{n=1}^{N} w_n \, C_n(u_i, u_{i-1}, t_i) \qquad (4)$$

ここで、ｗｎはサブコスト関数の重みを表す。本実施形態では、簡単のため、ｗｎはすべて「１」とする。上記式（４）は、ある合成単位に、ある融合音声素片を当てはめた場合の当該融合音声素片の合成単位コストである。 Here, wn represents the weight of each sub-cost function. In this embodiment, for simplicity, every wn is set to 1. Equation (4) gives the synthesis unit cost of a fused speech unit when that fused speech unit is applied to a given synthesis unit.

入力音韻系列を合成単位で区切ることにより得られる複数のセグメントのそれぞれに対し、上記式（４）から合成単位コストを算出した結果を、全セグメントについて足し合わせたものをコストと呼び、当該コストを算出するためのコスト関数を次式（５）に示すように定義する。 For each of the segments obtained by dividing the input phoneme sequence into synthesis units, the synthesis unit cost is calculated from equation (4); the sum of these results over all segments is called the cost, and the cost function for calculating it is defined by the following equation (5).

$$\mathrm{Cost} = \sum_{i=1}^{I} C(u_i, u_{i-1}, t_i) \qquad (5)$$

融合音声素片選択部１４０は、（５）に示したコスト関数を使って１セグメントあたり（すなわち、１合成単位あたり）の融合音声素片を選択する。選択の際は、融合音声素片記憶部１６０に記憶されている融合音声素片群の中から、上記式（５）で算出されるコストの値が最小の融合音声素片の系列を求める。このコストが最小となる融合音声素片の組合せを最適素片系列と呼ぶこととする。すなわち、最適素片系列中の各融合音声素片は、入力音韻系列を合成単位で区切ることにより得られる複数のセグメントのそれぞれに対応し、最適素片系列中の各融合音声素片から算出された上記合成単位コストと式（５）より算出されたコストの値は、他のどの融合音声素片系列よりも小さい値である。 The fused speech unit selection unit 140 uses the cost function of equation (5) to select a fused speech unit for each segment (that is, for each synthesis unit). In the selection, the sequence of fused speech units that minimizes the cost calculated by equation (5) is found from among the fused speech units stored in the fused speech unit storage unit 160. The combination of fused speech units that minimizes this cost is called the optimal unit sequence. That is, each fused speech unit in the optimal unit sequence corresponds to one of the segments obtained by dividing the input phoneme sequence into synthesis units, and the cost obtained by equation (5) from the synthesis unit costs of the fused speech units in the optimal unit sequence is smaller than that of any other fused speech unit sequence.
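As a non-limiting illustration, the weighted sum of sub-costs per synthesis unit and its accumulation over all segments can be sketched as follows. The concrete sub-cost formulas (squared log-frequency difference, squared duration difference, Euclidean cepstrum distance), the dictionary keys and the use of a single boundary cepstrum per unit are assumptions for illustration; only the weighted-sum and per-segment summation structure is taken from the description.

```python
import math


def f0_cost(unit, target):
    # Assumed form: squared difference of log fundamental frequencies.
    return (math.log(unit["f0"]) - math.log(target["f0"])) ** 2


def duration_cost(unit, target):
    # Assumed form: squared difference of phoneme durations.
    return (unit["dur"] - target["dur"]) ** 2


def spectral_cost(unit, prev_unit):
    # Assumed form: Euclidean distance between boundary cepstrum vectors.
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(unit["cep"], prev_unit["cep"])))


def unit_cost(unit, prev_unit, target, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the sub-costs for one segment (all weights 1 by default)."""
    sub_costs = (
        f0_cost(unit, target),
        duration_cost(unit, target),
        # The first segment has no predecessor, so no connection cost.
        spectral_cost(unit, prev_unit) if prev_unit is not None else 0.0,
    )
    return sum(w * c for w, c in zip(weights, sub_costs))


def total_cost(units, targets, weights=(1.0, 1.0, 1.0)):
    """Sum of the per-segment costs over the whole unit sequence."""
    cost, prev = 0.0, None
    for unit, target in zip(units, targets):
        cost += unit_cost(unit, prev, target, weights)
        prev = unit
    return cost


# Hypothetical two-segment example: targets are matched exactly, so only the
# connection cost between the two units contributes.
units = [{"f0": 100.0, "dur": 0.1, "cep": [0.0, 0.0]},
         {"f0": 100.0, "dur": 0.1, "cep": [3.0, 4.0]}]
targets = [{"f0": 100.0, "dur": 0.1}, {"f0": 100.0, "dur": 0.1}]
example_cost = total_cost(units, targets)
```

In this example the total equals the cepstrum distance between the two units (5.0), because both target costs are zero.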

なお、最適素片系列の探索には、動的計画法（ＤＰ：ｄｙｎａｍｉｃｐｒｏｇｒａｍｍｉｎｇ）を用いてもよい。これにより、探索処理の更なる効率化を図ることができる。 Note that dynamic programming (DP) may be used for searching for the optimum segment sequence. Thereby, the efficiency of the search process can be further increased.
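The dynamic programming search mentioned above can be sketched as a Viterbi-style pass over the segments. This is a minimal sketch under assumptions: the function name `best_sequence`, the callback signatures `target_cost(i, unit)` and `join_cost(prev_unit, unit)`, and the numeric toy data are hypothetical and are not taken from the embodiment.

```python
def best_sequence(candidates, target_cost, join_cost):
    """Return (minimum cost, unit sequence) over one candidate list per segment.

    For each candidate of the current segment, only the cheapest path ending
    in that candidate is kept, so the work grows with the number of candidate
    pairs between adjacent segments rather than with the number of sequences.
    """
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for u in candidates[i]:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(i, u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


# Toy example: units are plain numbers, the target cost is the distance to a
# per-segment target value and the join cost penalises jumps between units.
targets = [1, 5, 2]
candidates = [[0, 2], [4, 9], [1, 3]]
cost, path = best_sequence(
    candidates,
    target_cost=lambda i, u: abs(u - targets[i]),
    join_cost=lambda a, b: 0.1 * abs(a - b),
)
```

For the toy data the cheapest sequence is `[2, 4, 3]`, which is also what an exhaustive enumeration of all eight sequences yields.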

次に、図７を参照しつつ、図２において説明した融合音声素片編集・接続部１５０の処理について詳述する。融合音声素片編集・接続部１５０は、融合音声素片選択部１４０で選択された最適素片系列の融合音声素片を、入力韻律情報に従って変形する。そして、変形後の融合音声素片を接続して合成音声の音声波形を生成する。 Next, the processing of the fused speech unit editing / connection unit 150 described in FIG. 2 will be described in detail with reference to FIG. 7. The fused speech unit editing / connection unit 150 transforms the fused speech units of the optimal unit sequence selected by the fused speech unit selection unit 140 according to the input prosodic information. It then connects the transformed fused speech units to generate the speech waveform of the synthesized speech.

融合音声素片記憶部１６０は、融合音声素片はピッチ波形の形で格納されている。そこで、当該融合音声素片の基本周波数、音韻継続時間長のそれぞれが、入力韻律情報に示されている目標音声の基本周波数、目標音声の音韻継続時間長になるようにピッチ波形を重畳して音声波形を生成する。 The fusion speech unit storage unit 160 stores the fusion speech units in the form of pitch waveforms. Therefore, the pitch waveform is superimposed so that the fundamental frequency and the phoneme duration length of the fusion speech unit become the fundamental frequency of the target speech and the phoneme duration length of the target speech indicated in the input prosodic information, respectively. Generate a speech waveform.

図７を参照しつつ、音素「ｍ」、「ａ」、「ｄ」、「ｏ」の各合成単位について選択された融合音声素片を変形・接続して、「まど」という音声波形を生成する場合の処理について具体的に説明する。 With reference to FIG. 7, the processing for generating the speech waveform of "mado" by transforming and connecting the fused speech units selected for the synthesis units of the phonemes "m", "a", "d", and "o" will be described concretely.

図７に示すように、入力韻律情報に示されている目標の基本周波数、目標の音韻継続時間長に応じて、セグメント（合成単位）毎に、融合された音声素片中の各ピッチ波形の基本周波数の変更（音の高さの変更）、ピッチ波形の数の増減（時間長の伸縮）を行う。その後に、セグメント内、セグメント間で、隣り合うピッチ波形を接続して合成音声を生成する。 As shown in FIG. 7, according to the target fundamental frequency and the target phoneme duration indicated in the input prosodic information, the fundamental frequency of the pitch waveforms in the fused speech unit is changed (changing the pitch of the sound) and the number of pitch waveforms is increased or decreased (lengthening or shortening the duration) for each segment (synthesis unit). After that, adjacent pitch waveforms are connected within each segment and between segments to generate the synthesized speech.
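The transformation described above can be pictured as placing pitch waveforms at a spacing equal to the target fundamental period and repeating or skipping waveforms to reach the target duration. The following is a simplified sketch under assumptions: the function name, the proportional index mapping and the placement of each waveform starting at (rather than centred on) its pitch mark are illustrative choices, not the embodiment's procedure.

```python
def overlap_add(pitch_waveforms, period, n_marks):
    """Superimpose pitch waveforms at a spacing of `period` samples.

    `period` (in samples) sets the fundamental frequency of the output, and
    `n_marks` sets its duration. Source waveforms are mapped proportionally
    onto the output marks, so they are repeated when the target is longer
    and skipped when it is shorter.
    """
    if not pitch_waveforms or n_marks <= 0:
        return []
    width = max(len(w) for w in pitch_waveforms)
    out = [0.0] * ((n_marks - 1) * period + width)
    for k in range(n_marks):
        src = k * len(pitch_waveforms) // n_marks  # proportional mapping
        for j, sample in enumerate(pitch_waveforms[src]):
            out[k * period + j] += sample          # overlapping parts add up
    return out


# Two short waveforms stretched to four pitch marks, one sample apart.
stretched = overlap_add([[1, 1], [2, 2]], period=1, n_marks=4)
```

Here each source waveform is used twice, and the overlapping samples are summed, giving `[1, 2, 3, 4, 2]`.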

なお、本実施の形態にかかる歪み推定部１３０は、コスト関数による演算結果を歪みの度合いとして利用したが、歪みの度合いを評価する値はこれに限定されるものではない。 Although the distortion estimation unit 130 according to the present embodiment uses the result calculated by the cost function as the degree of distortion, the value used to evaluate the degree of distortion is not limited to this.

次に、融合音声素片作成部１８０の処理について説明する。音声素片記憶部１８１および、融合音声素片音素環境記憶部１８２には、音声データベースを分析して得られる音声素片および、その音素環境情報が保存されている。音声素片記憶部１８１には、大量の音声素片が蓄積されており、それらの音声素片の音素環境の情報（音素環境情報）が融合音声素片音素環境記憶部１８２に蓄積されている。音声素片記憶部１８１には、合成音声を生成する際に用いる音声の単位（合成単位）の音声素片が記憶されている。音声素片の合成単位は融合音声素片と同じ単位であり、音素環境情報の種類も融合音声素片と同じものであるとする。 Next, the process of the fusion speech unit creation unit 180 will be described. The speech unit storage unit 181 and the fused speech unit phoneme environment storage unit 182 store speech units obtained by analyzing a speech database and phoneme environment information thereof. A large amount of speech units are stored in the speech unit storage unit 181, and information on phoneme environments (phoneme environment information) of these speech units is stored in the fusion speech unit phoneme environment storage unit 182. . The speech unit storage unit 181 stores speech units of speech units (synthesis units) used when generating synthesized speech. The synthesis unit of the speech unit is the same as that of the fused speech unit, and the type of phoneme environment information is also the same as that of the fused speech unit.

図８は音声素片記憶部１８１のデータ構成を模式的に示している。音声素片記憶部１８１は、各音素の音声信号の波形と当該音素を識別するための素片番号とを対応付けて格納している。また、図９は融合音声素片音素環境記憶部１８２のデータ構成を模式的に示している。融合音声素片音素環境記憶部１８２には、融合音声素片音素環境記憶部１７０と同様に、音声素片記憶部１８１に記憶されている各音声素片の音素環境情報と当該音素の素片番号とを対応付けて格納している。 FIG. 8 schematically shows the data structure of the speech unit storage unit 181. The speech unit storage unit 181 stores a speech signal waveform of each phoneme and a unit number for identifying the phoneme in association with each other. FIG. 9 schematically shows the data structure of the fused speech unit phoneme environment storage unit 182. Similar to the fused speech unit phoneme environment storage unit 170, the fused speech unit phoneme environment storage unit 182 stores the phoneme environment information of each speech unit stored in the speech unit storage unit 181 and the unit of the phoneme. Numbers are stored in association with each other.

音声素片記憶部１８１に記憶されている各音声素片は、別途収集された多数の音声データに対して音素毎にラベリングを行い、音素毎に音声波形を切り出したものを、音声素片として蓄積したものである。 Each speech unit stored in the speech unit storage unit 181 is labeled for each phoneme with respect to a large number of separately collected speech data, and a speech waveform cut out for each phoneme is used as a speech unit. Accumulated.

例えば、図１０には、音声データ１０１に対し、音素毎にラベリングを行った結果を示している。図１０には、ラベリングの境界１０２により区切られた各音素の音声データ（音声波形）について、音素記号を示している。なお、この音声データから、各音素についての音素環境の情報（例えば、音韻（この場合、音素名（音素記号））、基本周波数、音韻継続時間長など）を併せて抽出する。 For example, FIG. 10 shows the result of labeling the voice data 101 for each phoneme. FIG. 10 shows phoneme symbols for the speech data (speech waveform) of each phoneme divided by the labeling boundary 102. Note that phoneme environment information (eg, phoneme (in this case, phoneme name (phoneme symbol)), fundamental frequency, phoneme duration length, etc.) for each phoneme is also extracted from the speech data.

以上の処理により音声データ１０１から求めた各音声波形と、当該音声波形に対応する音素環境の情報に、同じ素片番号が付与される。そして、図８および図９に示すように、音声素片記憶部１８１と融合音声素片音素環境記憶部１８２にそれぞれ格納される。ここでは、音素環境情報には、音声素片の音韻とその基本周波数及び音韻継続時間長を含むものとする。 The same unit number is assigned to each voice waveform obtained from the voice data 101 by the above processing and the information of the phoneme environment corresponding to the voice waveform. Then, as shown in FIG. 8 and FIG. 9, the speech unit storage unit 181 and the fusion speech unit phoneme environment storage unit 182 respectively store them. Here, the phoneme environment information includes the phoneme of the speech unit, its fundamental frequency, and the phoneme duration.

なお、ここでは、音声素片が音素単位に抽出する場合をしめしているが、音声素片が半音素、ダイフォン、トライフォン、音節、あるいはこれらの組み合わせや可変長であっても上記同様である。 In this example, the speech unit is extracted in units of phonemes. However, the same applies to the case where the speech unit is a semi-phoneme, a diphone, a triphone, a syllable, or a combination or variable length thereof. .

融合音声素片作成部１８４は、後述する音声素片組み合わせ作成部１８３によって作成された組み合わせに含まれる複数の音声素片を音声素片記憶部１８１から取得する。そして、取得した複数の音声素片を融合して融合音声素片を作成する。なお、融合音声素片作成部１８４は、対象となる音声素片が有声音である場合と無声音である場合とで別の処理を行う。 The fused speech unit creation unit 184 acquires a plurality of speech units included in the combination created by the speech unit combination creation unit 183, which will be described later, from the speech unit storage unit 181. Then, a plurality of acquired speech units are merged to create a fused speech unit. Note that the fused speech segment creation unit 184 performs different processing depending on whether the target speech segment is a voiced sound or an unvoiced sound.

まず、有声音の場合について説明する。有声音の場合には、音声素片からピッチ波形を取り出し、ピッチ波形のレベルで融合し、新たなピッチ波形を作りだす。ピッチ波形の抽出方法としては、単に基本周期同期窓で切り出す方法、ケプストラム分析やＰＳＥ分析によって得られたパワースペクトル包絡を逆離散フーリエ変換する方法、線形予測分析によって得られたフィルタのインパルス応答によってピッチ波形を求める方法、閉ループ学習法によって合成音声のレベルで自然音声に対する歪が小さくなるようなピッチ波形を求める方法など様々なものがある。 First, the case of voiced sound will be described. In the case of voiced sound, a pitch waveform is extracted from the speech segment and fused at the level of the pitch waveform to create a new pitch waveform. The pitch waveform can be extracted by simply cutting out with the fundamental period synchronization window, by inverse discrete Fourier transform of the power spectrum envelope obtained by cepstrum analysis or PSE analysis, and by the impulse response of the filter obtained by linear prediction analysis. There are various methods such as a method for obtaining a waveform and a method for obtaining a pitch waveform that reduces distortion with respect to natural speech at the level of synthesized speech by a closed loop learning method.

本実施の形態においては、基本周期同期窓で切り出す方法を用いてピッチ波形を抽出する。図１１を参照しつつ、音声素片組み合わせ作成部１８３で決められたＭ個の音声素片を融合して１つの新たな音声素片を生成する場合の処理手順を説明する。 In the present embodiment, the pitch waveform is extracted using a method of cutting out with a basic period synchronization window. With reference to FIG. 11, a processing procedure in the case where one new speech unit is generated by fusing the M speech units determined by the speech unit combination creating unit 183 will be described.

ステップＳ１１１において、Ｍ個の音声素片のそれぞれの音声波形に、その周期間隔毎にマーク（ピッチマーク）を付与する。図１２−１には、Ｍ個の音声素片のうちの１つの音声素片の音声波形１２１に対し、その周期間隔毎にピッチマーク１２２が付けられている場合を示している。 In step S111, marks (pitch marks) are attached to the speech waveform of each of the M speech units at intervals of its period. FIG. 12-1 shows a case where pitch marks 122 are attached, at intervals of the period, to the speech waveform 121 of one of the M speech units.

ステップＳ１１２では、図１２−２に示すように、ピッチマークを基準として窓掛けを行ってピッチ波形を切り出す。窓にはハニング窓１２３を用い、その窓長は基本周期の２倍とする。そして、図１２−３に示すように、窓掛けされた波形１２４をピッチ波形として切り出す。 In step S112, as shown in FIG. 12-2, windowing is applied with the pitch marks as reference points to cut out pitch waveforms. A Hanning window 123 is used as the window, and its length is twice the fundamental period. Then, as shown in FIG. 12-3, the windowed waveform 124 is cut out as a pitch waveform.

Ｍ個の音声素片のそれぞれについて、図１２−１から図１２−３に示す処理（ステップＳ１１２の処理）を施す。その結果、Ｍ個の音声素片のそれぞれについて、複数個のピッチ波形からなるピッチ波形の系列が求まる。 The process shown in FIGS. 12-1 to 12-3 (the process of step S112) is performed on each of the M speech units. As a result, a series of pitch waveforms consisting of a plurality of pitch waveforms is obtained for each of the M speech segments.
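As a non-limiting illustration, cutting out pitch waveforms with a Hanning window centred on each pitch mark can be sketched as follows. The function names, the use of the distance to the neighbouring mark as the local period, the window length of `2 * period + 1` samples (so that the window is symmetric about the mark) and zero padding outside the signal are assumptions for illustration.

```python
import math


def hanning(n):
    """Symmetric Hanning window of n samples (zero at both ends)."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def cut_pitch_waveforms(signal, pitch_marks):
    """Cut one windowed pitch waveform around each pitch mark."""
    waveforms = []
    for i, mark in enumerate(pitch_marks):
        if i + 1 < len(pitch_marks):
            period = pitch_marks[i + 1] - mark   # distance to the next mark
        elif i > 0:
            period = mark - pitch_marks[i - 1]   # last mark: reuse previous gap
        else:
            continue                             # single mark: period unknown
        window = hanning(2 * period + 1)         # about twice the period
        frame = []
        for k, w in enumerate(window):
            idx = mark - period + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            frame.append(w * sample)
        waveforms.append(frame)
    return waveforms


# Constant test signal with pitch marks every 10 samples.
frames = cut_pitch_waveforms([1.0] * 40, [10, 20, 30])
```

Each resulting frame has 21 samples, is zero at its edges and reaches the original amplitude at the pitch mark, so adjacent frames can later be overlapped and added.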

次にステップＳ１１３に進み、当該セグメントのＭ個の音声素片のそれぞれのピッチ波形の系列のなかで、最もピッチ波形の数が多いものに合わせて、Ｍ個全てのピッチ波形の系列中のピッチ波形の数が同じになるように、（ピッチ波形の数が少ないピッチ波形の系列については）ピッチ波形を複製して、ピッチ波形の数を合わせる。 Next, the process proceeds to step S113. Among the pitch waveform series of the M speech units of the segment, the series containing the largest number of pitch waveforms is taken as the reference, and pitch waveforms are duplicated in the series that contain fewer pitch waveforms so that all M series contain the same number of pitch waveforms.

図１３は、当該セグメントのＭ個（例えば、ここでは、３個）の音声素片ｄ１〜ｄ３のそれぞれから、ステップＳ１１２で切り出されたピッチ波形の系列ｅ１〜ｅ３を示している。ピッチ波形の系列ｅ１中のピッチ波形の数は７個、ピッチ波形の系列ｅ２中のピッチ波形の数は５個、ピッチ波形の系列ｅ３中のピッチ波形の数は６個である。すなわち、ピッチ波形の系列ｅ１〜ｅ３のうち最もピッチ波形の数が多いものは、系列ｅ１である。 FIG. 13 shows a series of pitch waveforms e1 to e3 cut out in step S112 from each of M speech segments d1 to d3 of the segment (for example, three here). The number of pitch waveforms in the pitch waveform series e1 is 7, the number of pitch waveforms in the pitch waveform series e2 is 5, and the number of pitch waveforms in the pitch waveform series e3 is 6. That is, among the pitch waveform series e1 to e3, the series having the largest number of pitch waveforms is the series e1.

従って、この系列ｅ１中のピッチ波形の数（例えば、ここでは、ピッチ波形の数は、７個）に合わせる。他の系列ｅ２、ｅ３については、それぞれ、当該系列中のピッチ波形のいずれかをコピーして、ピッチ波形の数を７個にする。その結果得られた、系列ｅ２、ｅ３のそれぞれに対応する新たなピッチ波形の系列がｅ２´、ｅ３´である。 Accordingly, the number of pitch waveforms is matched to that of the series e1 (seven in this example). For each of the other series e2 and e3, some of the pitch waveforms in the series are copied so that the number of pitch waveforms becomes seven. The resulting new pitch waveform series corresponding to e2 and e3 are e2′ and e3′, respectively.
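The equalisation of the number of pitch waveforms can be sketched as follows, as a non-limiting illustration. The description only states that some pitch waveforms are copied; the proportional index mapping used here to decide which ones are copied, and the function name, are assumptions. Every series is assumed to be non-empty.

```python
def equalize_counts(series_list):
    """Duplicate pitch waveforms so that every series has as many entries as
    the longest one. Duplicates are spread evenly by proportional indexing."""
    target = max(len(series) for series in series_list)
    return [
        [series[k * len(series) // target] for k in range(target)]
        for series in series_list
    ]


# Placeholder labels stand in for pitch waveforms: 7, 5 and 6 per series.
e1 = ["a0", "a1", "a2", "a3", "a4", "a5", "a6"]
e2 = ["b0", "b1", "b2", "b3", "b4"]
e3 = ["c0", "c1", "c2", "c3", "c4", "c5"]
e1_new, e2_new, e3_new = equalize_counts([e1, e2, e3])
```

The longest series is returned unchanged, while the five-entry series becomes `b0, b0, b1, b2, b2, b3, b4`, so temporal order is preserved and no waveform is dropped.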

次に、ステップＳ１１４に進む。このステップでは、ピッチ波形ごとに処理を行う。ステップＳ１１４では、当該セグメントのＭ個のそれぞれの音声素片に対応するピッチ波形をその位置ごとに平均化し、新たなピッチ波形の系列を生成する。この生成された新たなピッチ波形の系列を融合された音声素片とする。 Next, the process proceeds to step S114. In this step, processing is performed for each pitch waveform. In step S114, the pitch waveforms corresponding to the M speech units of the segment are averaged for each position to generate a new pitch waveform sequence. The generated new pitch waveform sequence is used as a fused speech unit.

図１４は、当該セグメントのＭ個（例えば、ここでは、３個）の音声素片ｄ１〜ｄ３のそれぞれからステップＳ１１３で求めたピッチ波形の系列ｅ１、ｅ２´、ｅ３´を示している。各系列中には、７個のピッチ波形があるので、ステップＳ１１４では、１番目から７番目のピッチ波形をそれぞれ３つの音声素片で平均化し、７個の新たなピッチ波形からなる新たなピッチ波形の系列ｆ１を生成している。すなわち、例えば、系列ｅ１の１番目のピッチ波形と、系列ｅ２´の１番目のピッチ波形と、系列ｅ３´の１番目のピッチ波形のセントロイドを求めて、それを新たなピッチ波形の系列ｆ１の１番目のピッチ波形とする。新たなピッチ波形の系列ｆ１の２番目〜７番目のピッチ波形についても同様である。ピッチ波形の系列ｆ１が、上記「融合音声素片」である。 FIG. 14 shows the pitch waveform series e1, e2′, and e3′ obtained in step S113 from the M (three in this example) speech units d1 to d3 of the segment. Since each series contains seven pitch waveforms, in step S114 the first to seventh pitch waveforms are each averaged over the three speech units, and a new pitch waveform series f1 consisting of seven new pitch waveforms is generated. That is, for example, the centroid of the first pitch waveform of the series e1, the first pitch waveform of the series e2′, and the first pitch waveform of the series e3′ is obtained and used as the first pitch waveform of the new series f1. The same applies to the second to seventh pitch waveforms of the new series f1. The pitch waveform series f1 is the "fused speech unit" described above.
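The position-wise averaging of step S114 can be sketched as follows, as a non-limiting illustration. The function name is hypothetical, and the centre alignment with zero padding used when pitch waveforms at the same position differ in length is an assumption; the description only states that the centroid of the corresponding pitch waveforms is taken.

```python
def fuse_series(series_list):
    """Average pitch waveforms position by position across the M series."""
    n = len(series_list[0])
    if any(len(series) != n for series in series_list):
        raise ValueError("all series must contain the same number of pitch waveforms")
    fused = []
    for pos in range(n):
        group = [series[pos] for series in series_list]
        length = max(len(w) for w in group)
        acc = [0.0] * length
        for w in group:
            offset = (length - len(w)) // 2   # centre-align shorter waveforms
            for j, sample in enumerate(w):
                acc[offset + j] += sample
        fused.append([x / len(group) for x in acc])
    return fused


# Two series of one pitch waveform each; the shorter one is centred.
fused_example = fuse_series([[[1.0, 1.0, 1.0]], [[3.0]]])
```

The example yields a single fused pitch waveform `[0.5, 2.0, 0.5]`: the sample-wise mean of `[1, 1, 1]` and the centred, zero-padded `[0, 3, 0]`.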

一方、融合音声素片作成部１８４の処理において、無声音のセグメントの場合には、素片選択ステップＳ１１１で当該セグメントのＭ個の音声素片のうち、当該Ｍ個の音声素片から音声素片を一つ選択し、選択した音声素片の音声波形をそのまま使用する。すなわち、選択した音声素片の音声波形を融合音声素片記憶部１６０に蓄積する。なお、便宜的にこれも融合音声素片と呼ぶ。組み合わせに順位がつけられている場合は、１位の素片を選択することにより音声素片を決める。 On the other hand, for an unvoiced segment, the fused speech unit creation unit 184 selects, in the unit selection step S111, one speech unit from among the M speech units of the segment and uses the speech waveform of the selected speech unit as it is. That is, the speech waveform of the selected speech unit is stored in the fused speech unit storage unit 160. For convenience, this is also called a fused speech unit. When the speech units in the combination are ranked, the first-ranked speech unit is selected.

融合音声素片音素環境作成部１８５は、上記組み合わせの音素環境に基づいて、融合音声素片の音素環境を作成する。融合音声素片の音素環境は、各音素環境のセントロイドとして求める。この場合、融合音声素片の基本周波数ｆは、各音声素片の基本周波数をｆｍ（１≦m≦Ｍ）とすると、次式によって定義される。 The fused speech unit phoneme environment creation unit 185 creates the phoneme environment of the fused speech unit based on the phoneme environments of the combination. The phoneme environment of the fused speech unit is obtained as the centroid of the individual phoneme environments. In this case, the fundamental frequency f of the fused speech unit is defined by the following equation, where fm (1 ≦ m ≦ M) is the fundamental frequency of each speech unit.

$$f = \frac{1}{M} \sum_{m=1}^{M} f_m$$

融合音声素片の継続時間長Ｔは、各音声素片の継続時間長をＴｍ（１≦m≦Ｍ）とすると、次式によって定義される。 The duration T of the fused speech unit is defined by the following equation, where Tm (1 ≦ m ≦ M) is the duration of each speech unit.

$$T = \frac{1}{M} \sum_{m=1}^{M} T_m$$

融合音声素片の接続境界のケプストラムｃは、各音声素片の接続境界のケプストラムをｃｍ（１≦m≦Ｍ）とすると、次式によって定義される。 The cepstrum c at the connection boundary of the fused speech unit is defined by the following equation, where cm (1 ≦ m ≦ M) is the cepstrum at the connection boundary of each speech unit.

$$c = \frac{1}{M} \sum_{m=1}^{M} c_m$$

これらの処理により、融合音声素片およびその音素環境を作成する。そして、融合音声素片記憶部１６０および融合音声素片音素環境記憶部１７０に格納する。 Through these processes, a fusion speech unit and its phoneme environment are created. Then, it is stored in the fused speech unit storage unit 160 and the fused speech unit phoneme environment storage unit 170.
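As a non-limiting illustration, taking the centroid of the phoneme environments of the M speech units in a combination can be sketched as an arithmetic mean of each numeric item. The dictionary keys and sample values are hypothetical, and the phoneme symbol is simply copied because all units in a combination are assumed to share it.

```python
def centroid_environment(envs):
    """Phoneme environment of a fused unit as the centroid of M environments."""
    m = len(envs)
    dim = len(envs[0]["cep"])
    return {
        "phoneme": envs[0]["phoneme"],            # shared by the combination
        "f0": sum(e["f0"] for e in envs) / m,     # mean fundamental frequency
        "dur": sum(e["dur"] for e in envs) / m,   # mean duration
        "cep": [sum(e["cep"][k] for e in envs) / m for k in range(dim)],
    }


# Hypothetical environments of three units of the phoneme "i".
envs = [
    {"phoneme": "i", "f0": 100.0, "dur": 0.08, "cep": [0.0, 3.0]},
    {"phoneme": "i", "f0": 110.0, "dur": 0.10, "cep": [3.0, 0.0]},
    {"phoneme": "i", "f0": 120.0, "dur": 0.12, "cep": [3.0, 3.0]},
]
fused_env = centroid_environment(envs)
```

The fused environment in this example has a fundamental frequency of 110.0, a duration of about 0.10 and a boundary cepstrum of `[2.0, 2.0]`.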

次に、音声素片組み合わせ作成部１８３の処理について詳述する。音声素片組み合わせ作成部１８３は、融合音声素片作成部１８４において融合すべき音声素片の組み合わせを作成する。本実施形態では、融合音声素片選択部１４０の処理において前述したコスト関数に基づいて複数の音声素片を選択する。さらには、使用頻度に基づいて、融合する複数の音声素片の組み合わせを決定する。 Next, the processing of the speech element combination creation unit 183 will be described in detail. The speech unit combination creating unit 183 creates a combination of speech units to be merged in the fused speech unit creating unit 184. In the present embodiment, a plurality of speech units are selected based on the cost function described above in the process of the fused speech unit selection unit 140. Furthermore, a combination of a plurality of speech units to be merged is determined based on the usage frequency.

前述したコスト関数は融合音声素片の音素環境情報に基づいてコストを計算しているが、ここでは音声素片に対応する音素環境情報に基づいて計算する。まず、各音声素片の組み合わせの使用頻度を求めるための文章データを用意する。それぞれの文章データを図１のテキスト取得部１１、言語処理部１２、韻律処理部１３により処理し、音韻系列と、韻律情報とを求める。音韻系列を合成単位で区切ることにより得られる各セグメントのそれぞれに対し、前記韻律情報と、融合音声素片音素環境記憶部１８２に含まれる音素環境情報との間のコストに基づいて１セグメントあたり（すなわち、１合成単位あたり）複数の音声素片を選択する。 The cost function described above calculates the cost based on the phoneme environment information of the fusion speech unit. Here, the cost function is calculated based on the phoneme environment information corresponding to the speech unit. First, text data for obtaining the use frequency of each speech element combination is prepared. Each text data is processed by the text acquisition unit 11, the language processing unit 12, and the prosody processing unit 13 in FIG. 1 to obtain a phoneme sequence and prosodic information. For each segment obtained by dividing the phoneme sequence by synthesis unit, per segment based on the cost between the prosodic information and the phoneme environment information included in the fused phoneme phoneme environment storage unit 182 ( That is, a plurality of speech segments are selected (per synthesis unit).

図１５は、このときの処理を説明するためのフローチャートである。まず、ステップＳ１５１において最適な音声素片のパスを、融合音声素片の最適パス計算と同様に、コスト関数および、動的計画法を利用して求める。 FIG. 15 is a flowchart for explaining the processing at this time. First, in step S151, the optimal speech unit path is obtained using a cost function and dynamic programming, as in the optimal path calculation of the fused speech unit.

次に、ステップＳ１５２に進み、最適素片系列を用いて、１セグメントあたり複数の音声素片を選ぶ。ここでは、セグメントの数をJ個とし、セグメントあたりＭ個の音声素片を選ぶこととして説明する。ステップＳ１５２の詳細を説明する。 Next, proceeding to step S152, a plurality of speech segments are selected per segment using the optimal segment sequence. Here, it is assumed that the number of segments is J and M speech units are selected per segment. Details of step S152 will be described.

ステップＳ１５３およびＳ１５４では、Ｊ個のセグメントのうちの１つを注目セグメントとする。ステップＳ１５３およびＳ１５４はＪ回繰り返され、Ｊ個のセグメントが1回ずつ注目セグメントとなるように処理を行う。まず、ステップＳ１５３では、注目セグメント以外のセグメントには、それぞれ最適素片系列の音声素片を固定する。この状態で、注目セグメントに対して音声素片記憶部１８１に記憶されている音声素片を式（５）のコストの値に応じて順位付けし、上位Ｍ個を選択する。 In steps S153 and S154, one of the J segments is set as a target segment. Steps S153 and S154 are repeated J times, and processing is performed so that J segments become the target segment once. First, in step S153, speech segments of the optimal segment series are fixed to segments other than the target segment. In this state, the speech units stored in the speech unit storage unit 181 are ranked with respect to the segment of interest according to the cost value of Expression (5), and the top M pieces are selected.

For example, as shown in FIG. 16, suppose that the input phoneme sequence is "ts · i · i · s · a · ...". In this case, the synthesis units correspond to the phonemes "ts", "i", "i", "s", "a", ..., and each of these phonemes corresponds to one segment. FIG. 16 shows the case where the segment corresponding to the third phoneme "i" in the input phoneme sequence is the target segment and a plurality of speech units is obtained for it. For the segments other than the one corresponding to the third phoneme "i", the speech units 161a, 161b, 161d, 161e, ... of the optimal unit sequence are fixed.

In this state, the cost is calculated using Equation (5) for each speech unit stored in the speech unit storage unit 181 that has the same phoneme name (phoneme symbol) as the phoneme "i" of the target segment. When the cost is computed for each candidate, however, the only terms whose values change are the target cost of the target segment, the connection cost between the target segment and the preceding segment, and the connection cost between the target segment and the following segment, so only these costs need to be considered.

(Procedure 1) Among the speech units stored in the speech unit storage unit 181 that have the same phoneme name (phoneme symbol) as the phoneme "i" of the target segment, one is taken as speech unit u3. The fundamental frequency cost is calculated using Equation (1) from the fundamental frequency f(v3) of speech unit u3 and the target fundamental frequency f(t3).

(Procedure 2) The phoneme duration cost is calculated using Equation (2) from the phoneme duration g(v3) of speech unit u3 and the target phoneme duration g(t3).

(Procedure 3) The first spectral connection cost is calculated using Equation (3) from the cepstrum coefficients h(u3) of speech unit u3 and the cepstrum coefficients h(u2) of speech unit 161b (u2). Likewise, the second spectral connection cost is calculated using Equation (3) from the cepstrum coefficients h(u3) of speech unit u3 and the cepstrum coefficients h(u4) of speech unit 161d (u4).

(Procedure 4) The cost of speech unit u3 is calculated as the weighted sum of the fundamental frequency cost, the phoneme duration cost, and the first and second spectral connection costs obtained with the respective sub-cost functions in (Procedure 1) to (Procedure 3).

(Procedure 5) Once the cost has been calculated according to (Procedure 1) to (Procedure 4) for every speech unit stored in the speech unit storage unit 181 that has the same phoneme name (phoneme symbol) as the phoneme "i" of the target segment, the units are ranked so that a smaller cost gives a higher rank (step S153 in FIG. 15). The top M speech units are then selected (step S154 in FIG. 15). In FIG. 16, for example, speech unit 162a has the highest rank and speech unit 162d has the lowest.

(Procedure 1) to (Procedure 5) above are carried out for each segment. As a result, M speech units are obtained for each segment.
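(Procedure 1) to (Procedure 5) can be sketched in Python as follows. This is a sketch under assumptions, not the specification's implementation: Equations (1) to (3) and (5) are not reproduced in this passage, so the sub-cost forms used here (squared log-frequency difference, squared duration difference, Euclidean cepstral distance) and the equal default weights are assumed, and the names Unit and top_m_units are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Unit:
    number: int            # unit number in the speech unit storage
    f0: float              # fundamental frequency in Hz
    duration: float        # phoneme duration in seconds
    cepstrum: list[float]  # cepstrum coefficients (one vector per unit, a simplification)


def f0_cost(unit: Unit, target_f0: float) -> float:
    # Assumed form of the fundamental frequency sub-cost.
    return (math.log(unit.f0) - math.log(target_f0)) ** 2


def duration_cost(unit: Unit, target_duration: float) -> float:
    # Assumed form of the phoneme duration sub-cost.
    return (unit.duration - target_duration) ** 2


def spectral_cost(a: Unit, b: Unit) -> float:
    # Assumed form of the spectral connection sub-cost.
    return math.dist(a.cepstrum, b.cepstrum)


def top_m_units(candidates, prev_unit, next_unit, target_f0, target_duration,
                m, weights=(1.0, 1.0, 1.0)):
    """Rank candidates for the target segment with its neighbours fixed; keep the top M."""
    w_f0, w_dur, w_spec = weights

    def cost(u: Unit) -> float:
        return (w_f0 * f0_cost(u, target_f0)
                + w_dur * duration_cost(u, target_duration)
                + w_spec * (spectral_cost(prev_unit, u) + spectral_cost(u, next_unit)))

    return sorted(candidates, key=cost)[:m]
```

Only the target cost and the two connection costs adjacent to the target segment enter the ranking, because all other terms of the path cost are constant while the neighbouring units are fixed.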

For every segment of all the input sentences, M speech units are selected by the above procedure, and the unit numbers of the selected M speech units are passed to the speech unit combination frequency information creation unit 1833. The speech unit combination frequency information creation unit 1833 accumulates the frequency information of the unit number combinations in the multiple speech unit combination frequency information storage unit 1835.

FIG. 17 shows an example of the multiple speech unit combination frequency information stored in the multiple speech unit combination frequency information storage unit 1835. Each entry holds a combination number, the phoneme (phoneme name), the unit numbers of the speech units ranked first to M-th, and the occurrence count of that combination. If the combination of the M input unit numbers already exists in the frequency information, 1 is added to its occurrence count; if not, the combination is added with an occurrence count of 1. Performing this for the combinations of all segments yields the occurrence frequency information for the input sentences.
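The accumulation of occurrence counts can be illustrated with a short Python sketch. The function name count_combinations and the data layout are hypothetical; the sketch assumes that a combination is identified by its phoneme together with the rank-ordered tuple of unit numbers, so that the same units in a different rank order count as a different combination.

```python
from collections import Counter


def count_combinations(selected):
    """Count how often each (phoneme, ranked unit numbers) combination occurs.

    selected: iterable of (phoneme, [unit numbers ranked 1st..M-th]), one per segment.
    """
    freq = Counter()
    for phoneme, unit_numbers in selected:
        # A missing key starts at 0, so a new combination ends up with count 1.
        freq[(phoneme, tuple(unit_numbers))] += 1
    return freq
```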

Next, the speech unit combination determination unit 1834 determines the combinations of speech units that are actually fused. Several methods are possible: setting an occurrence count threshold in advance and using the combinations whose occurrence count exceeds the threshold; setting an upper limit on the number of units per phoneme and selecting combinations in descending order of occurrence frequency; or fixing the total size of the fused speech unit group and selecting combinations in descending order of occurrence frequency as long as that size is not exceeded.

In the frequency information of FIG. 17, if the occurrence count threshold is set to 30, fused speech units are created for combination No. 0 (/a/) and combination No. 2 (/i/), but not for combination No. 1 (/a/).
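Two of the selection strategies, the occurrence count threshold and the per-phoneme upper limit, can be sketched as follows. The function names are hypothetical, and the occurrence counts in the usage example are invented for illustration; they are not the values of FIG. 17.

```python
def select_by_threshold(freq, threshold):
    """Keep the combinations whose occurrence count exceeds the threshold."""
    return [combo for combo, count in freq.items() if count > threshold]


def select_top_per_phoneme(freq, max_per_phoneme):
    """Keep at most max_per_phoneme combinations per phoneme, most frequent first."""
    chosen, used = [], {}
    for combo, _ in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):
        phoneme = combo[0]
        if used.get(phoneme, 0) < max_per_phoneme:
            chosen.append(combo)
            used[phoneme] = used.get(phoneme, 0) + 1
    return chosen


# Hypothetical counts: combination -> occurrence count.
example = {
    ("a", (10, 11, 12)): 41,
    ("a", (11, 13, 14)): 7,
    ("i", (3, 4, 5)): 35,
}
fused_targets = select_by_threshold(example, 30)
```

With these invented counts and a threshold of 30, the first /a/ combination and the /i/ combination are selected for fusion and the second /a/ combination is not, which mirrors the pattern described for FIG. 17.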

The differences between the speech synthesis method of the first embodiment and conventional speech synthesis methods are now described. The COC-based method and the HMM-based method hold fused phoneme parameters and synthesize speech from them, but they use a decision tree for selection. Their selection method therefore differs from that of the present embodiment, which selects units based on the degree of distortion of the prosodic information.

Compared with methods that cluster in the form of a decision tree, the method of the present embodiment offers a higher degree of freedom at synthesis time and can easily select a fused speech unit from a large number of fused speech units. It therefore lends itself to a scalable synthesizer: synthesized speech of higher quality is obtained as the size of the fused speech unit storage unit 160 is increased.

Conventional unit-selection speech synthesis selects one speech unit per synthesis unit and concatenates the selected units. In the present embodiment, by contrast, the selected unit is not a speech waveform itself but a fused speech unit. Using fused speech units gives stable, high-quality units, so more natural, higher-quality synthesized speech can be generated. Moreover, because the fused speech unit for each synthesis unit is created in advance, the amount of processing at synthesis time is close to that of unit-selection speech synthesis, and speech can be synthesized at high speed.

FIG. 18 is a diagram showing the hardware configuration of the text-to-speech synthesizer 10 according to Example 1. As its hardware configuration, the text-to-speech synthesizer 10 includes a ROM 52 that stores, among others, the speech synthesis program that executes the speech synthesis processing of the text-to-speech synthesizer 10; a CPU 51 that controls each unit of the text-to-speech synthesizer 10 according to the programs in the ROM 52 and executes buffering time change processing and the like; a RAM 53 in which a work area is formed and which stores various data required for controlling the text-to-speech synthesizer 10; a communication I/F 57 that connects to a network for communication; and a bus 62 that connects these units.

The speech synthesis program of the text-to-speech synthesizer 10 described above may be provided as a file in an installable or executable format recorded on a computer-readable recording medium such as a CD-ROM, a floppy (R) disk (FD), or a DVD.

In this case, the speech synthesis program is read from the recording medium and executed in the text-to-speech synthesizer 10, whereby it is loaded onto the main storage device and each unit described in the software configuration above is generated on the main storage device.

The speech synthesis program of this example may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network.

(Embodiment 2)
Next, the text-to-speech synthesizer 10 according to the second embodiment is described. FIG. 19 is a block diagram showing the detailed functional configuration of the speech synthesis unit 14 of the text-to-speech synthesizer 10 according to the second embodiment.

The speech synthesis unit 14 according to the second embodiment does not have the fused speech unit phoneme environment storage unit 170. In addition, it has a fused speech unit combination storage unit 200.

The fused speech unit combination storage unit 200 stores, in association with each fused speech unit stored in the fused speech unit storage unit 160, the combination of speech units contained in that fused speech unit.

FIG. 20 schematically shows the data structure of the fused speech unit combination storage unit 200. The fused speech unit combination storage unit 200 stores the phoneme name, the rank and unit number of each speech unit in the combination, and the fused speech unit number in association with one another.

In the second embodiment, the phoneme sequence / prosodic information acquisition unit 110 sends the input prosodic sequence and input prosodic information acquired from the prosody processing unit 13 to the fused speech unit creation unit 180. The fused speech unit creation unit 180 selects a combination of a plurality of speech units based on the received input prosodic sequence and input prosodic information. The distortion estimation unit 130 then estimates the degree of distortion between the combination of speech units selected by the fused speech unit creation unit 180 and the input prosodic information acquired from the phoneme sequence / prosodic information acquisition unit 110.

The fused speech unit selection unit 140 selects the combination for which the degree of distortion estimated by the distortion estimation unit 130 is smallest, and then determines whether the selected combination is stored in the fused speech unit combination storage unit 200. If it is stored there, the fused speech unit corresponding to the combination is extracted from the fused speech unit storage unit 160. If it is not, the fused speech unit creation unit 180 is instructed to create a fused speech unit for the combination.

FIG. 21 is a block diagram showing the detailed functional configuration of the fused speech unit creation unit 180 according to the second embodiment. The fused speech unit creation unit 180 according to the second embodiment does not have the fused speech unit phoneme environment creation unit 185 and therefore does not create a fused speech unit phoneme environment.

The speech unit combination creation unit 183 creates combinations of speech units based on the input prosodic information and other data acquired from the phoneme sequence / prosodic information acquisition unit 110. It selects a plurality of speech units by the processing described with reference to FIG. 15 in the first embodiment, and stores combination information indicating the created combination in the fused speech unit combination storage unit 200. The fused speech unit creation unit 184, on instruction from the fused speech unit selection unit 140, creates a fused speech unit from the plurality of speech units specified.

FIG. 22 is a flowchart showing the processing by which the fused speech unit selection unit 140 according to the second embodiment selects fused speech units.

First, in step S212, the combination of speech units to be fused is determined based on the degree of distortion estimated by the distortion estimation unit 130. It is then determined whether the combination determined in step S212 is stored in the fused speech unit combination storage unit 200.

In the present embodiment, the fused speech unit is judged to be held in the fused speech unit storage unit 160 when the unit numbers ranked first to M-th in the combination determined in step S212 match those of a fused speech unit; if they do not match, it is judged not to be held. When the fused speech unit for the combination determined in step S212 is judged to be held in the fused speech unit storage unit 160, the process proceeds to step S213.

In step S213, the fused speech unit combination storage unit 200 is consulted to obtain the fused speech unit number corresponding to the combination. The corresponding fused speech unit is then retrieved from the fused speech unit storage unit 160 using that number.

If it is determined in step S212 that the fused speech unit does not exist in the fused speech unit storage unit 160, then in step S214 an instruction is sent to the fused speech unit creation unit 180 to create a fused speech unit from the combination of speech units determined in step S211. In step S215, the corresponding fused speech unit is obtained from the fused speech unit creation unit 180 in response to the instruction sent in step S214.

In this way, when no suitable fused speech unit is held in the fused speech unit storage unit 160, the text-to-speech synthesizer 10 according to the present embodiment creates a new fused speech unit and uses it for synthesis, so that synthesized speech of higher quality can be generated efficiently.
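The look-up-or-create flow of FIG. 22 can be sketched in Python as follows. All names are hypothetical: combination_table stands in for the fused speech unit combination storage unit 200, fused_store for the fused speech unit storage unit 160, and fuse for the fusion performed by the fused speech unit creation unit 180. The sketch assumes an exact, rank-ordered match of all M unit numbers.

```python
def get_fused_unit(phoneme, unit_numbers, combination_table, fused_store, fuse):
    """Return (fused unit, True) when stored, otherwise (newly fused unit, False).

    combination_table: dict mapping (phoneme, ranked unit numbers) -> fused unit number
    fused_store:       sequence or dict indexed by fused unit number
    fuse:              callable that fuses the given unit numbers at synthesis time
    """
    key = (phoneme, tuple(unit_numbers))
    fused_number = combination_table.get(key)
    if fused_number is not None:
        # Stored combination: fetch the unit fused in advance.
        return fused_store[fused_number], True
    # Unknown combination: fuse the M units now.
    return fuse(unit_numbers), False
```

Returning the hit flag alongside the unit lets a caller count how many segments required fusion at synthesis time.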

FIG. 23 shows an example of a fused speech unit sequence. For each of the phonemes "ts, i, i, s, a", it indicates whether a fused speech unit extracted from the fused speech unit storage unit 160 or one newly created by the fused speech unit creation unit 180 is used.

The fused speech units corresponding to ts, i, i, s, and a are denoted 221a, 221b, 221c, 221d, and 221e, respectively. Here it is assumed that 221b and 221d do not exist in the fused speech unit storage unit 160, whereas 221a, 221c, and 221e do.

In this case, three of the units have been created in advance, while the remaining two require fusion processing at synthesis time. The number of fusion operations is therefore reduced to 2/5 of that required when all units are fused at synthesis time.

Because unit fusion is computationally expensive, this speeds up processing at synthesis time. In addition, when the speech units are stored on a hard disk drive, the seek time for the individual speech units can be reduced.

Specifically, fusing at synthesis time requires M seeks, one for each of the M speech units to be fused, whereas a unit fused in advance, which already combines M units, can be retrieved with a single seek.
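The savings described above can be checked with a small calculation. The following Python sketch is illustrative only: the function name synthesis_time_work is hypothetical, M = 3 is an arbitrary choice, and the model simply counts one seek per stored fused unit and M seeks per unit fused at synthesis time.

```python
def synthesis_time_work(in_store, m):
    """Return (fusion operations, disk seeks) for one fused speech unit sequence.

    in_store: list of booleans, True where the fused unit was created in advance
    m:        number of speech units fused per segment
    """
    fusions = sum(1 for hit in in_store if not hit)
    seeks = sum(1 if hit else m for hit in in_store)
    return fusions, seeks


# The five-phoneme example: 221a, 221c, 221e stored; 221b, 221d fused at synthesis time.
fusions, seeks = synthesis_time_work([True, False, True, False, True], m=3)
```

For this example the sketch counts 2 fusion operations instead of 5, and 9 seeks instead of the 15 needed when every unit is fused at synthesis time.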

Thus, in the second embodiment, fusing some of the fused speech units used for synthesis in advance yields synthesized speech equivalent to that obtained when all units are fused at synthesis time, while allowing speech to be synthesized at high speed.

The remaining configuration and processing of the text-to-speech synthesizer 10 according to the second embodiment are the same as those of the text-to-speech synthesizer 10 according to the first embodiment.

In determination step S212 of the second embodiment described above, when all M speech units of the combination input in S211 for a segment match a combination stored in the fused speech unit combination storage unit 181, the fused speech unit corresponding to that combination is retrieved from the fused speech unit storage unit 160; otherwise, the selected M speech units are retrieved from the speech unit storage unit 181 and fused to create the fused speech unit. The determination is not limited to this, however.

For example, a lower limit N on the number of matching units may be set in advance. If at least N of the M speech units for a segment match a combination in the fused speech unit combination storage unit 181, the fused speech unit corresponding to that combination is retrieved from the fused speech unit storage unit 160. If fewer than N match, a new fused speech unit is created by fusing the M speech units.

This increases the proportion of units in the fused speech unit sequence that are retrieved from the fused speech unit storage unit 160 at synthesis time, further speeding up speech synthesis.
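The relaxed criterion with a lower limit N can be sketched as follows. The function name find_partial_match is hypothetical, and two details are assumptions because the text does not specify them: matching is counted as set overlap regardless of rank, and when several stored combinations qualify, the one with the largest overlap is used.

```python
def find_partial_match(phoneme, unit_numbers, combination_table, n):
    """Return the fused unit number of a stored combination sharing at least n units.

    combination_table: dict mapping (phoneme, ranked unit numbers) -> fused unit number
    Returns None when no stored combination of the same phoneme shares n or more units.
    """
    wanted = set(unit_numbers)
    best_key, best_overlap = None, -1
    for stored_phoneme, stored_units in combination_table:
        if stored_phoneme != phoneme:
            continue
        overlap = len(wanted & set(stored_units))
        if overlap >= n and overlap > best_overlap:
            best_key, best_overlap = (stored_phoneme, stored_units), overlap
    return None if best_key is None else combination_table[best_key]
```

A result of None corresponds to the case in which the M speech units are fused at synthesis time.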

As another example, the criterion may be whether the top N speech units of the combination of M speech units determined by the processing of FIG. 15 match the top N units of a combination stored in the fused speech unit combination storage unit 181.

If the top N speech units of the M-unit combination match the top N units of a stored combination in the fused speech unit combination storage unit 181, the fused speech unit corresponding to that combination is retrieved from the fused speech unit storage unit 160. If they do not match, a new fused speech unit is created by fusing the M speech units.

Because the top-ranked speech units match, the cost function value of the fused speech unit retrieved from the fused speech unit storage unit 160 is close to that of the selected combination of speech units, and high-quality synthesized speech can be obtained.
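The top-N criterion can be sketched in the same spirit. The function name find_top_n_match is hypothetical; the sketch assumes that "top N" means the first N entries of the rank-ordered unit numbers and returns the first stored combination whose top N entries agree.

```python
def find_top_n_match(phoneme, unit_numbers, combination_table, n):
    """Return the fused unit number of a stored combination with the same top-n units.

    combination_table: dict mapping (phoneme, ranked unit numbers as tuple) -> fused unit number
    Returns None when no stored combination of the same phoneme has the same top-n units.
    """
    prefix = tuple(unit_numbers[:n])
    for (stored_phoneme, stored_units), fused_number in combination_table.items():
        if stored_phoneme == phoneme and stored_units[:n] == prefix:
            return fused_number
    return None
```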

(Third embodiment)
Next, the text-to-speech synthesizer 10 according to the third embodiment is described. FIG. 24 is a block diagram showing the detailed functional configuration of the speech synthesis unit 14 of the text-to-speech synthesizer 10 according to the third embodiment. In the speech synthesis unit 14 according to the third embodiment, the fused speech unit selection unit 140 decides, based on the degree of distortion estimated by the distortion estimation unit 130, whether to select a fused speech unit stored in the fused speech unit storage unit 160.

More specifically, when the degree of distortion obtained from the distortion estimation unit 130 is smaller than a predetermined distortion reference value, the corresponding fused speech unit is extracted from the fused speech unit storage unit 160. When the degree of distortion is equal to or greater than the distortion reference value, the fused speech unit creation unit 180 is instead instructed to create a fused speech unit. In this respect the text-to-speech synthesizer 10 according to the third embodiment differs from those of the other embodiments.

FIG. 25 is a flowchart showing the processing by the fused speech unit selection unit 140. First, in step S242, the degree of distortion with respect to the prosodic information of each segment is obtained from the distortion estimation unit 130. The value obtained is the smallest of the degrees of distortion for the plurality of fused speech units.

Next, the processing from step S243 onward is performed for each segment. In step S243, it is determined whether the degree of distortion obtained from the distortion estimation unit 130 is smaller than the predetermined distortion reference value. If it is equal to or greater than the reference value, that is, if the distortion is too large to be acceptable, the fused speech unit creation unit 180 is instructed in step S245 to create a new fused speech unit.

Then, in step S246, the new fused speech unit is obtained in response to the instruction. In this case, the fused speech unit creation unit 180 acquires the corresponding input prosodic information and other data via the dividing unit 120, selects a plurality of speech units based on that information, and fuses the selected speech units to obtain the fused speech unit.

If, on the other hand, the degree of distortion is smaller than the distortion reference value, that is, if the distortion is small, the corresponding fused speech unit is selected from the fused speech unit storage unit 160 in step S244. This completes the processing by the fused speech unit selection unit 140.
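The per-segment decision of FIG. 25 reduces to a comparison against the distortion reference value, as the following Python sketch shows. The function name choose_source and the labels "stored" and "create" are hypothetical; the input is assumed to be the minimum degree of distortion per segment, as obtained in step S242.

```python
def choose_source(min_distortions, reference):
    """Decide per segment whether to use a stored fused unit or create a new one.

    min_distortions: smallest distortion over the stored fused units, one value per segment
    reference:       predetermined distortion reference value
    """
    # Strictly smaller than the reference -> stored unit (step S244);
    # equal or greater -> create a new fused unit (steps S245 and S246).
    return ["stored" if d < reference else "create" for d in min_distortions]
```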

The remaining configuration and processing of the text-to-speech synthesizer 10 according to the third embodiment are the same as those of the text-to-speech synthesizer 10 according to the first embodiment.

In the present embodiment, a distortion reference value, that is, a threshold, is set in advance, and whether to create a new fused speech unit is decided based on this threshold. The decision need only take the degree of distortion of the prosodic information into account, however, and is not limited to a threshold-based decision.

FIG. 26 is a diagram illustrating another example of a method for deciding whether to create a new fused speech unit. First, in step S251, the degree of distortion E1 of the input fused speech unit is obtained for each segment. In step S252, a plurality of speech units is selected from the speech unit storage unit 181 by the processing shown in FIG. 15. In step S253, the average E2 of the degrees of prosodic distortion of the speech units selected in step S252 is calculated. In step S254, it is determined whether the difference between E1 and E2 is greater than a predetermined threshold, and whether to create a new fused speech unit is decided from the result.

Specifically, if the difference between E1 and E2 is smaller than the predetermined threshold, the fused speech unit is selected from the fused speech unit storage unit 160. If the difference is equal to or greater than the threshold, an instruction to create a new fused speech unit is issued.
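The alternative decision of FIG. 26 can be sketched as follows. The function name should_create is hypothetical, and the sign of the difference is an assumption: the text speaks only of "the difference between E1 and E2", which is taken here as E1 - E2, that is, how much worse the stored fused unit is than the freshly selected speech units on average.

```python
from statistics import mean


def should_create(e1, selected_unit_distortions, threshold):
    """Return True when a new fused speech unit should be created.

    e1:                        distortion of the stored fused unit for the segment (S251)
    selected_unit_distortions: distortions of the speech units selected in S252
    threshold:                 predetermined threshold on E1 - E2 (assumed sign)
    """
    e2 = mean(selected_unit_distortions)  # step S253
    return (e1 - e2) >= threshold         # step S254
```

Unlike a fixed distortion reference value, this comparison adapts to the segment: a stored unit with a large distortion is still accepted when freshly selected units would not do noticeably better.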

In the third embodiment, as in the second embodiment, when the degree of distortion is small, a fused speech unit created in advance and held in the fused speech unit storage unit 160 can be used. Speech synthesis can therefore be performed at high speed.

A newly created fused speech unit may also be added to the fused speech unit storage unit 160 as appropriate. Compared with the text-to-speech synthesizer 10 according to the first embodiment, in which the combinations of units are fixed in advance, this provides a wider variety of fused combinations, so that high-quality synthesized speech can be obtained.

(Embodiment 4)
Next, the text-to-speech synthesizer 10 according to the fourth embodiment will be described. The text-to-speech synthesizer 10 according to the fourth embodiment updates the contents of the fused speech unit storage unit 160 and the fused speech unit phoneme environment storage unit 170.

FIG. 27 is a block diagram showing the functional configuration of the speech synthesis unit 14 according to the fourth embodiment. In addition to the configuration of the speech synthesis unit 14 according to the second embodiment, the speech synthesis unit 14 according to the fourth embodiment further includes an update unit 210. The update unit 210 acquires the combination for each segment from the fused speech unit editing/connecting unit 150 and determines whether to add the combination to the fused speech unit storage unit 160.

FIG. 28 is a flowchart showing the update process in the update unit 210. First, in step S271, the sequence of speech unit combinations used at synthesis time is acquired from the fused speech unit editing/connecting unit 150. In step S272, it is determined whether the input combination for each segment is to be added to the fused speech unit storage unit 160. For example, the determination is based on whether the combination acquired from the fused speech unit editing/connecting unit 150 is already stored in the fused speech unit storage unit 160.

If it is determined that the combination is to be added, the fused speech unit and its combination information are added to the fused speech unit storage unit 160 in step S273. If it is determined that the combination is not to be added, the combination acquired from the fused speech unit editing/connecting unit 150 is discarded. This completes the update process.
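The update process of steps S271 to S273 can be sketched as below. This is only an illustration under stated assumptions: the store is modelled as a dictionary keyed by the set of speech unit numbers in a combination (treating a combination as order-insensitive is an assumption), and the function name is hypothetical.

```python
def update_fused_store(store, used):
    """Add fused speech units whose combination is not yet stored.

    store: dict mapping frozenset of speech unit numbers -> fused speech unit.
    used: iterable of (combination, fused_unit) pairs obtained at synthesis
        time (S271).
    Returns the list of combination keys that were added.
    """
    added = []
    for combination, fused_unit in used:
        key = frozenset(combination)
        if key in store:
            # S272: already stored, so the acquired combination is discarded.
            continue
        # S273: store the fused unit together with its combination.
        store[key] = fused_unit
        added.append(key)
    return added
```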

The other configurations and processes of the text-to-speech synthesizer 10 according to the fourth embodiment are the same as those of the text-to-speech synthesizer 10 according to the second embodiment.

As a first modification of the fourth embodiment, in addition to the process of adding combinations to the fused speech unit storage unit 160, the update unit 210 may also perform a process of deleting fused speech units stored in the fused speech unit storage unit 160. For example, the update unit 210 monitors the usage frequency of each fused speech unit stored in the fused speech unit storage unit 160, and may delete a fused speech unit whose usage frequency is equal to or less than a predetermined value.

As a second modification of the fourth embodiment, the update unit 210 may decide whether to add a combination to the fused speech unit storage unit 160 according to how frequently the combination is used. The determination criterion in step S272 is thus not limited to the one described in the present embodiment.

The update unit 210 keeps, for each combination acquired from the fused speech unit editing/connecting unit 150, the number of times the combination has been acquired. When the same combination has been acquired a predetermined number of times or more, the combination may be stored in the fused speech unit storage unit 160. A combination that has not been acquired the predetermined number of times is discarded.

More specifically, the update unit 210 has a temporary combination holding unit (not shown) formed of, for example, a cache memory. The temporary holding unit holds each combination only for a predetermined period. The number of times each combination held in the temporary holding unit has been acquired is counted and held. The update unit 210 in this example constitutes the updating means and the usage frequency counting means according to the present invention.

In this way, only the fused speech units for frequently used combinations are added to the fused speech unit storage unit 160. Memory can therefore be used effectively, and the speech synthesis process can be made more efficient.
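One way to picture the temporary holding unit and the usage count is the sketch below. The class name, the use of a monotonic clock, and the rule that an expired entry restarts its count are assumptions made for illustration; the text only states that combinations are held for a predetermined period and counted.

```python
import time


class CombinationCounter:
    """Counts how often each combination is acquired within a holding period."""

    def __init__(self, min_count, hold_seconds, clock=time.monotonic):
        self.min_count = min_count        # predetermined number of times
        self.hold_seconds = hold_seconds  # predetermined holding period
        self.clock = clock
        self._seen = {}                   # combination -> (first_seen, count)

    def observe(self, combination):
        """Record one acquisition; return True once the count reaches min_count."""
        now = self.clock()
        key = frozenset(combination)
        first_seen, count = self._seen.get(key, (now, 0))
        if now - first_seen > self.hold_seconds:
            # The entry has left the temporary holding unit: start over.
            first_seen, count = now, 0
        count += 1
        self._seen[key] = (first_seen, count)
        return count >= self.min_count
```

A caller would store the fused speech unit for a combination only when `observe` returns True, and discard the combination otherwise.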

As a third modification, whereas the similarity between speech units was defined in the second modification, a similarity between fused speech units may be defined in the same manner. That is, in the present embodiment, the fused speech unit creation unit 180 can create fused speech units based on the combination frequency and the similarity.

In this example, the similarity between two fused speech units is the reciprocal of the cost between the two fused speech units. The cost between two fused speech units can be defined in the same manner as in equations (16) to (19).

FIG. 29 is a flowchart showing the fused speech unit creation process according to the third modification. First, in step S291, combinations of a plurality of speech units are input in order of usage frequency. These combinations are created by the speech unit combination creation unit 183.

Next, the following processing is performed for each combination. In step S292, the similarity between each fused speech unit in the fused speech unit storage unit 160 and the fused speech unit created from the acquired combination is obtained. If the fused speech unit storage unit 160 contains no fused speech unit for the corresponding phoneme, the similarity is set to 0. If the similarity is greater than a preset threshold, the process proceeds to step S293; if it is smaller, the process proceeds to step S294.

Step S293 corresponds to the case where a similar fused speech unit is determined to exist. In this case, the unit number of the fused speech unit with the highest similarity is added, together with the acquired combination, to the fused speech unit combination storage unit 200.

Step S294 corresponds to the case where no similar fused speech unit is determined to exist in the fused speech unit storage unit 160. In this case, the fused speech unit corresponding to the input combination is added. Then, in step S295, the corresponding combination is added to the fused speech unit combination storage unit 200. As a result, the fused speech unit storage unit 160 accumulates only fused speech units whose mutual similarity is below the predetermined threshold, which reduces memory usage.
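The flow of steps S291 to S295 can be sketched as follows for a single phoneme. The function and parameter names are hypothetical, and `fuse` and `similarity` are supplied by the caller; the text defines the similarity as the reciprocal of the cost but does not fix a concrete data layout, so the list and dictionary used here are assumptions.

```python
def register_combinations(combinations, fused_store, combo_store,
                          fuse, similarity, threshold):
    """Register combinations, reusing a stored fused unit when one is similar.

    combinations: combinations of speech units in order of usage frequency (S291).
    fused_store: list of fused speech units; the list index is the unit number.
    combo_store: dict mapping combination -> unit number (storage unit 200).
    """
    for combination in combinations:
        candidate = fuse(combination)
        # S292: find the most similar stored unit (similarity 0 if none stored).
        best_index, best_similarity = None, 0.0
        for index, stored in enumerate(fused_store):
            value = similarity(stored, candidate)
            if best_index is None or value > best_similarity:
                best_index, best_similarity = index, value
        if best_index is not None and best_similarity > threshold:
            # S293: point the combination at the existing, similar unit.
            combo_store[tuple(combination)] = best_index
        else:
            # S294: store the new fused unit; S295: record its combination.
            fused_store.append(candidate)
            combo_store[tuple(combination)] = len(fused_store) - 1
```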

As a fourth modification, whereas in the present embodiment the decision whether to use a fused speech unit held in advance in the fused speech unit storage unit 160 or to create a new fused speech unit from a plurality of speech units is made based on a predetermined condition, the condition may additionally be set in consideration of the available amount of computation, the specifications required of the speech synthesis, and the like.

That is, using fused speech units stored in advance in the fused speech unit storage unit 160 makes processing more efficient, but the sound quality may be lowered.

Specifically, for example, when a fused speech unit whose combination of speech units only partly matches the desired combination is selected from the fused speech unit storage unit 160, high-speed processing is possible because a fused speech unit created in advance is used. However, because the selected unit includes speech units that do not match, the resulting fused speech unit differs from the optimal one.

Therefore, in this example, the frequency with which fused speech units stored in the fused speech unit storage unit 160 are used is controlled from the viewpoint of the amount of computation. This makes it possible to control the process from both the viewpoint of the amount of computation and the viewpoint of the quality of the synthesized speech.

A condition determined from the viewpoint of the amount of computation or the like may be set as an initial setting value of the speech synthesis unit 14. As another example, the condition may also be changed as appropriate after the initial setting, again from the viewpoint of the amount of computation or the like.

The fused speech unit creation unit 180 may also limit the fused speech units to be stored in the fused speech unit storage unit 160 by clustering the speech units.

Specifically, the similarity between the speech units held in the speech unit storage unit 181 is first calculated, and the speech units are clustered based on the similarity. More specifically, speech units with high mutual similarity are placed in the same speech unit group. A fused speech unit is then created for each speech unit group obtained by the clustering, and a fused speech unit phoneme environment is created for each fused speech unit. The update unit 210 then stores the newly created fused speech unit and fused speech unit phoneme environment in the fused speech unit storage unit 160 in association with each other.

For example, the speech units are clustered based on the similarity between pairs of speech units. Through the clustering, only the fused speech units with the highest similarity may be held in the fused speech unit storage unit 160.

Specifically, the similarity between two speech units is first defined based on a cost function. Here, the similarity is the reciprocal of the cost between the two speech units, and clustering is performed so as to minimize the cost.

Based on the cost function described above, the cost between two speech units is a linear combination (equation (19)) of the fundamental frequency cost given by equation (16), the duration cost given by equation (17), and the average spectrum cost given by equation (18).

Here, v_i is the phoneme environment of the speech unit u_i stored in the speech unit storage unit 181, f is a function that extracts the average fundamental frequency from the phoneme environment v_i, g is a function that extracts the phoneme duration from the phoneme environment v_i, and h is a function that extracts the average cepstrum coefficients of the speech unit u_i as a vector.
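The equation images for (16) to (19) are not reproduced in this text. One form that is consistent with the definitions of f, g, and h above is sketched below; the logarithm in (16), the squared differences, the norm in (18), and the weights w_n are assumptions, not the equations of the publication.

```latex
\begin{align}
C_1(u_i, u_j) &= \{\log f(v_i) - \log f(v_j)\}^2 \tag{16} \\
C_2(u_i, u_j) &= \{g(v_i) - g(v_j)\}^2 \tag{17} \\
C_3(u_i, u_j) &= \lVert h(u_i) - h(u_j) \rVert \tag{18} \\
C(u_i, u_j) &= \sum_{n=1}^{3} w_n \, C_n(u_i, u_j) \tag{19}
\end{align}
```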

After the cost between two speech units has been obtained in this way, M speech units are selected such that the overall cost is minimized. The selected M speech units and the overall cost (total cost) are expressed as in equation (20).

The total cost is obtained by finding, for every speech unit u_i (1 ≤ i ≤ I), the candidate u_m (1 ≤ m ≤ I) with the smallest cost among the M selected speech units, and summing those costs. Clustering is performed by finding the M speech units that minimize this total cost and associating every speech unit with the lowest-cost speech unit among the M.
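The selection of M representative speech units can be illustrated with the sketch below. It searches all subsets exhaustively, which is an assumption made for clarity and is only practical for a very small inventory; the text does not state how the minimization is carried out. The function name and the caller-supplied `cost` are hypothetical.

```python
from itertools import combinations


def cluster_by_representatives(units, m, cost):
    """Pick m representatives minimising the total cost and assign every unit.

    Returns (representative indices, assignment per unit, total cost), where
    the total cost is the sum over all units of the cost to the cheapest
    representative.
    """
    best = None
    for reps in combinations(range(len(units)), m):
        # Associate each unit with its lowest-cost representative.
        assignment = [
            min(reps, key=lambda r: cost(units[i], units[r]))
            for i in range(len(units))
        ]
        total = sum(cost(units[i], units[r]) for i, r in enumerate(assignment))
        if best is None or total < best[2]:
            best = (reps, assignment, total)
    return best
```

Each group of units sharing a representative then forms one cluster to be fused.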

A fused speech unit is created by fusing the speech units of each cluster obtained by the above computation. The prosodic information of the fused speech unit is obtained as the centroid of the prosodic information of the speech units in the cluster, and is used as the fused speech unit phoneme environment information.

As another example, the DTW (dynamic time warping) distance between cepstrum parameters may be used instead of equation (18). In this case, the cepstrum corresponding to each pitch waveform is obtained, the time axis is warped by dynamic programming so that the cepstrum distance is minimized, and the resulting minimum cepstrum distance is used.
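A DTW distance between two cepstrum sequences can be sketched as follows. The Euclidean local distance and the three allowed steps (insertion, deletion, match) are common choices assumed here; the text does not specify the local distance or the path constraints.

```python
import math


def dtw_cepstral_distance(a, b):
    """Minimum accumulated distance between two sequences of cepstrum vectors.

    a, b: lists of equal-dimension vectors, one vector per pitch waveform.
    """
    ta, tb = len(a), len(b)
    acc = [[math.inf] * (tb + 1) for _ in range(ta + 1)]
    acc[0][0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean cepstrum distance
            # Dynamic programming: extend the cheapest of the three predecessors.
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[ta][tb]
```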

In this example the similarity is defined based on the cost, but the definition is not limited to this. For example, the similarity may be defined based on a cepstrum distance after simple time warping, or on the squared error between waveforms after prosodic modification. Alternatively, an HMM may be trained for each cluster and its likelihood defined as the similarity.

In this way, the fused speech units to be stored in advance in the fused speech unit storage unit can be created on a minimum-cost basis. The set of fused speech units can be created efficiently, and memory usage can be reduced.

As yet another example, a threshold may be set for the similarity of fused speech units, and whether to store a fused speech unit in the fused speech unit storage unit 160 may be decided using the threshold as a criterion. Specifically, the similarity between fused speech units is determined. If the similarity is equal to or greater than a predetermined threshold, the fused speech unit is stored in the fused speech unit storage unit 160. If the similarity is smaller than the threshold, the fused speech unit is discarded without being stored in the fused speech unit storage unit 160.

The present invention has been described above using embodiments, but various changes or improvements can be made to the above embodiments.

As one such modification, whereas in the present embodiment the fused speech unit creation unit 184 creates a voiced fused speech unit by averaging pitch waveforms, as described with reference to FIG. 11 and other figures, the method of creating a fused speech unit is not limited to this. For example, closed-loop learning may be used. With closed-loop learning, a pitch waveform sequence that is optimal at the level of the synthesized speech can be produced without extracting the pitch waveforms of the individual speech units.

Here, closed-loop learning is a method of generating a representative speech unit such that the distortion relative to natural speech is small at the level of the synthesized speech actually produced by changing the fundamental frequency and the duration. That is, closed-loop learning generates a speech unit that reduces distortion at the level of the synthesized speech (see Japanese Patent No. 3281281).

The case of fusing voiced speech units using closed-loop learning will now be described. As in the first embodiment, the speech unit obtained by fusion is obtained as a sequence of pitch waveforms. The speech unit is represented by a vector u formed by concatenating these pitch waveforms.

First, an initial value of the speech unit is prepared. The sequence of pitch waveforms obtained by the method described in the first embodiment may be used as the initial value, or random data may be used. Let r_j (j = 1, 2, ..., M) be the vectors representing the waveforms of the speech units in the combination created by the speech unit combination creation unit 183. Next, speech is synthesized using u with each r_j as the target. The generated synthesized speech segment is denoted s_j. As shown in equation (9), s_j is expressed as the product of u and a matrix A_j that represents the superposition of pitch waveforms.

The matrix A_j is determined from the mapping between the pitch marks of r_j and the pitch waveforms of u, and from the pitch mark positions of r_j. FIG. 30 shows an example of the matrix A_j. Next, the error between the synthesized speech segment s_j and r_j is evaluated. The error e_j between s_j and r_j is defined by equation (10).

Here, as shown in equations (11) and (12), g_j is a gain that corrects the difference in average power between the two waveforms so that only the waveform distortion is evaluated; the optimal gain that minimizes e_j is used.

An evaluation function E representing the sum of the errors over all vectors r_j is defined by equation (13).

The optimal vector u that minimizes E is obtained by solving equations (14) and (15), which are obtained by partially differentiating E with respect to u and setting the result to 0.

Equation (15) is a system of simultaneous equations in u, and solving it uniquely determines a new speech unit u. Because updating the vector u changes the optimal gain g_j, the above process is repeated until the value of E converges, and the vector u at the time of convergence is used as the speech unit generated by fusion.
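The equation images for (9) to (15) are not reproduced in this text. The following reconstruction follows from the verbal definitions above (product of A_j and u, gain-corrected error, sum of errors, and the stationarity condition in u); how the publication splits the content between (11) and (12) and between (14) and (15) is an assumption.

```latex
\begin{align}
s_j &= A_j u \tag{9} \\
e_j &= (r_j - g_j s_j)^{\mathsf T} (r_j - g_j s_j) \tag{10} \\
\frac{\partial e_j}{\partial g_j} &= 0 \tag{11} \\
g_j &= \frac{s_j^{\mathsf T} r_j}{s_j^{\mathsf T} s_j} \tag{12} \\
E &= \sum_{j=1}^{M} e_j \tag{13} \\
\frac{\partial E}{\partial u} &= 0 \tag{14} \\
\Bigl( \sum_{j=1}^{M} g_j^{2} A_j^{\mathsf T} A_j \Bigr) u &= \sum_{j=1}^{M} g_j A_j^{\mathsf T} r_j \tag{15}
\end{align}
```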

The pitch mark positions of r_j used when obtaining the matrix A_j may also be corrected based on the correlation between the waveforms of r_j and u.

Alternatively, the vector r_j may be divided into frequency bands, the closed-loop learning described above may be performed for each band to obtain u, and the fused speech unit may be generated by adding the u of all bands.
Using closed-loop learning to fuse speech units in this way makes it possible to generate speech units for which changing the pitch period causes little degradation of the synthesized speech.
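The iteration described for closed-loop learning (compute the optimal gains for the current u, solve the simultaneous equations for a new u, repeat until E converges) can be sketched as follows. This is an illustration under assumptions: the normal equations are the standard least-squares form implied by the text, plain Python lists are used to stay dependency-free, and the function names, tolerance, and iteration limit are hypothetical.

```python
def _dot(x, y):
    return sum(p * q for p, q in zip(x, y))


def _matvec(a, u):
    return [_dot(row, u) for row in a]


def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def closed_loop_fuse(targets, mappings, u0, tol=1e-12, max_iter=100):
    """Fuse target waveforms r_j into one unit u, given superposition matrices A_j."""
    u = list(u0)
    previous_error = None
    for _ in range(max_iter):
        n = len(u)
        lhs = [[0.0] * n for _ in range(n)]
        rhs = [0.0] * n
        error = 0.0
        for r, a in zip(targets, mappings):
            s = _matvec(a, u)               # synthesized segment s_j = A_j u
            g = _dot(s, r) / _dot(s, s)     # optimal gain g_j
            error += sum((rk - g * sk) ** 2 for rk, sk in zip(r, s))  # e_j
            for row, rk in zip(a, r):
                for p in range(n):
                    rhs[p] += g * row[p] * rk                  # g_j A_j^T r_j
                    for q in range(n):
                        lhs[p][q] += g * g * row[p] * row[q]   # g_j^2 A_j^T A_j
        if previous_error is not None and abs(previous_error - error) < tol:
            break                           # E has converged
        previous_error = error
        u = _solve(lhs, rhs)                # new u from the simultaneous equations
    return u
```

Because each gain g_j absorbs the overall level, the result is determined up to scale; only the shape of u relative to the targets is meaningful in this sketch.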

When a newly created fused speech unit is to be stored in the fused speech unit storage unit 160, its similarity to the fused speech units already stored may be calculated. Specifically, when the fused speech unit creation unit 180 creates a fused speech unit, the similarity between the created fused speech unit and the fused speech units already stored in the fused speech unit storage unit 160 is calculated. If the similarity is smaller than a predetermined value, the fused speech unit creation unit 180 newly stores the fused speech unit in the fused speech unit storage unit 160. This avoids storing fused speech units that are relatively similar to one another, so memory can be used effectively.

Block diagram showing the overall configuration of the text-to-speech synthesizer according to the first embodiment of the present invention.
Block diagram showing the detailed configuration of the speech synthesis unit 14 of FIG. 1.
Block diagram showing the detailed functional configuration of the fused speech unit creation unit 180 described with reference to FIG. 2.
Block diagram showing the detailed functional configuration of the speech unit combination creation unit 183 shown in FIG. 3.
Diagram schematically showing the data structure of the fused speech unit storage unit 160.
Diagram schematically showing the data structure of the fused speech unit phoneme environment storage unit 170.
Diagram for explaining the processing of the fused speech unit editing/connecting unit 150 described with reference to FIG. 2.
Diagram schematically showing the data structure of the speech unit storage unit 181.
Diagram schematically showing the data structure of the fused speech unit phoneme environment storage unit 182.
Diagram showing the result of labeling the speech data 101 phoneme by phoneme.
Diagram for explaining the processing procedure for fusing the M speech units determined by the speech unit combination creation unit 183 to generate one new speech unit.
Diagram for explaining the process of assigning pitch marks to the speech waveform of a speech unit in step S111.
Diagram for explaining the process of assigning pitch marks to the speech waveform of a speech unit in step S111.
Diagram for explaining the process of assigning pitch marks to the speech waveform of a speech unit in step S111.
Diagram showing the pitch waveform sequences e1 to e3 cut out in step S112 from the speech units d1 to d3, respectively.
Diagram showing the pitch waveform sequences e1, e2', and e3' obtained in step S113 from the speech units d1 to d3, respectively.
Flowchart showing the process of selecting speech units.
Diagram showing an input prosodic sequence.
Diagram showing an example of the multiple speech unit combination frequency information stored in the multiple speech unit combination frequency information storage unit 1835.
Diagram showing the hardware configuration of the text-to-speech synthesizer 10 according to Example 1.
Block diagram showing the detailed functional configuration of the speech synthesis unit 14 of the text-to-speech synthesizer 10 according to the second embodiment.
Diagram schematically showing the data structure of the fused speech unit combination storage unit 200.
Block diagram showing the detailed functional configuration of the fused speech unit creation unit 180 according to the second embodiment.
Flowchart showing the process by which the fused speech unit selection unit 140 according to the second embodiment selects a fused speech unit.
Diagram showing an example of a fused speech unit sequence.
Block diagram showing the detailed functional configuration of the speech synthesis unit 14 of the text-to-speech synthesizer 10 according to the third embodiment.
Flowchart showing the processing by the fused speech unit selection unit 140.
Diagram for explaining another example of the method for determining whether to create a new fused speech unit.
Block diagram showing the functional configuration of the speech synthesis unit 14 according to the fourth embodiment.
Flowchart showing the update process in the update unit 210.
Flowchart showing the fused speech unit creation process according to the third modification.
Diagram showing an example of the matrix A_j.

Explanation of symbols

10 Text-to-speech synthesizer
11 Text acquisition unit
12 Language processing unit
13 Language processing unit
14 Speech synthesis unit
15 Speech waveform output unit
110 Phoneme sequence and prosodic information acquisition unit
120 Division unit
130 Distortion estimation unit
140 Fused speech unit selection unit
150 Fused speech unit editing/connecting unit
160 Fused speech unit storage unit
170 Fused speech unit phoneme environment storage unit
180 Fused speech unit creation unit
181 Speech unit storage unit
182 Fused speech unit phoneme environment storage unit
183 Speech unit combination creation unit
184 Fused speech unit creation unit
185 Fused speech unit phoneme environment creation unit
200 Fused speech unit combination storage unit
210 Update unit
1831 Phoneme sequence and prosodic information acquisition unit
1832 Multiple speech unit selection unit
1833 Speech unit combination frequency information creation unit
1834 Multiple speech unit combination determination unit
1835 Speech unit combination frequency information storage unit
51 CPU
52 ROM
53 RAM
57 Communication I/F
62 Bus

Claims

A speech synthesizer comprising:
speech unit holding means for holding, in association with each other, a plurality of speech units that correspond to the same unit of speech and differ from one another in the prosody of that unit of speech, and speech unit prosody information indicating the prosody of each speech unit;
speech unit selection means for selecting a plurality of speech units from the speech unit holding means, based on teacher speech prosody information indicating a preset prosody of a teacher speech and on the speech unit prosody information held in the speech unit holding means;
combination determining means for determining, from the plurality of speech units selected by the speech unit selection means, a combination of a plurality of speech units that satisfy a predetermined condition;
fused speech unit creating means for creating a fused speech unit by fusing the plurality of speech units included in the determined combination;
fused speech unit prosody information creating means for creating fused speech unit prosody information indicating the prosody of the fused speech unit, based on the prosody information corresponding to each of the plurality of speech units included in the determined combination;
fused speech unit holding means for holding, in association with each other, the fused speech unit created by the fused speech unit creating means and the fused speech unit prosody information created by the fused speech unit prosody information creating means;
acquisition means for acquiring a prosodic sequence for a target speech to be synthesized, for each of a plurality of segments that are synthesis units of speech synthesis;
retained speech distortion estimation means for estimating the degree of distortion between segment prosody information, which indicates the prosody of the segment obtained by the acquisition means, and the fused speech unit prosody information held in the fused speech unit holding means;
fused speech unit selection means for selecting the fused speech unit based on the degree of distortion estimated by the retained speech distortion estimation means; and
speech synthesis means for generating synthesized speech by connecting the fused speech units selected by the fused speech unit selection means for the respective segments.
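
As an illustrative sketch only (not part of the claims), the run-time portion of the apparatus above, that is, estimating distortion against the held fused speech units, selecting one per segment, and connecting the selections, might look as follows in Python. The names `Prosody`, `FusedUnit`, `distortion`, `select_fused_unit`, and `synthesize`, the two-feature prosody representation, the weighted squared-error distortion, the minimum-distortion selection rule, and the plain concatenation of samples are all assumptions made for illustration; the claim itself does not fix any of them.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Prosody:
    f0_hz: float        # assumed feature: mean fundamental frequency
    duration_ms: float  # assumed feature: unit duration


@dataclass(frozen=True)
class FusedUnit:
    prosody: Prosody             # fused speech unit prosody information
    waveform: Tuple[float, ...]  # placeholder for the fused waveform


def distortion(target: Prosody, held: Prosody,
               w_f0: float = 1.0, w_dur: float = 1.0) -> float:
    """Assumed distortion measure: weighted squared prosody error."""
    return (w_f0 * (target.f0_hz - held.f0_hz) ** 2
            + w_dur * (target.duration_ms - held.duration_ms) ** 2)


def select_fused_unit(target: Prosody, held: Sequence[FusedUnit]) -> FusedUnit:
    """Pick the held fused unit with the smallest estimated distortion."""
    return min(held, key=lambda unit: distortion(target, unit.prosody))


def synthesize(segments: Sequence[Tuple[str, Prosody]],
               store: Dict[str, List[FusedUnit]]) -> List[float]:
    """Select one fused unit per segment and concatenate the waveforms."""
    samples: List[float] = []
    for label, target in segments:
        samples.extend(select_fused_unit(target, store[label]).waveform)
    return samples
```

Here `store` plays the role of the fused speech unit holding means, keyed by a hypothetical segment label. A practical synthesizer would also smooth the joins between units, which this sketch omits.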

The speech synthesizer according to claim 1, wherein the speech synthesis means performs speech synthesis using the fused speech unit selected by the fused speech unit selection means when the degree of distortion estimated by the retained speech distortion estimation means is smaller than a predetermined retained speech distortion reference value.

The speech synthesizer according to claim 2, wherein the speech synthesis means performs speech synthesis using a fused speech unit created by the fused speech unit creating means when the degree of distortion estimated by the retained speech distortion estimation means for each fused speech unit held in the fused speech unit holding means is equal to or greater than the retained speech distortion reference value.
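
The two preceding claims together describe a threshold rule: use a held fused speech unit when its distortion is below the reference value, and otherwise fall back to a newly created one. A minimal Python sketch of that rule follows. The function name `pick_unit`, the callback `create_online`, and the use of an absolute F0 difference as the distortion are hypothetical choices for illustration, not details taken from the claims.

```python
from typing import Callable, Sequence, Tuple


def pick_unit(target_f0: float,
              held: Sequence[Tuple[float, str]],
              reference: float,
              create_online: Callable[[float], str]) -> Tuple[str, str]:
    """Return ("held", unit) if a held unit is close enough, else ("created", unit).

    `held` contains (f0, unit_id) pairs; the distortion is assumed to be the
    absolute F0 difference between the target and the held unit.
    """
    if held:
        best_f0, best_id = min(held, key=lambda item: abs(item[0] - target_f0))
        if abs(best_f0 - target_f0) < reference:
            return ("held", best_id)
    # Every held unit is at or above the reference value (or none exist),
    # so a fused unit is created on demand instead.
    return ("created", create_online(target_f0))
```

Raising `reference` makes the held units apply more often, which trades some prosodic accuracy for less on-demand fusion work.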

The speech synthesizer according to any one of claims 1 to 3, wherein the fused speech unit selection means selects the fused speech unit corresponding to the minimum value of the degree of distortion estimated by the retained speech distortion estimation means.

The speech synthesizer according to any one of claims 1 to 4, further comprising combination information holding means for holding combination information indicating the combination of the plurality of speech units included in the fused speech unit held in the fused speech unit holding means,
wherein the retained speech distortion estimation means estimates, as the degree of distortion, the degree of coincidence between the combination of speech units for the speech of the segment and the combination held in the combination information holding means.

The speech synthesizer according to claim 5, wherein the fused speech unit selection means selects the fused speech unit corresponding to the combination when the retained speech distortion estimation means determines that the combination of speech units for the speech of the segment matches the combination of a fused speech unit held in the fused speech unit holding means.

The speech synthesizer according to claim 6, wherein the combinations are determined to match when a part of the combination of speech units for the speech of the segment matches a part of the combination of the fused speech unit held in the fused speech unit holding means.

The speech synthesizer according to claim 6 or 7, wherein the combination information holding means holds a priority for each combination in association with that combination, and
the fused speech unit selection means selects the fused speech unit when the retained speech distortion estimation means determines that the combination of speech units for the speech of the segment matches the combination of the fused speech unit held in the fused speech unit holding means, and the priority of that fused speech unit is equal to or higher than a predetermined priority reference value.
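
The combination-based selection described in the preceding claims (exact match, partial match, and a priority cut-off) can be sketched as below. This is a hedged illustration: the function `match_fused_unit`, the `min_overlap` parameter used as the partial-match rule, the integer speech unit ids, and the convention that larger numbers mean higher priority are all assumptions, since the claims do not specify them.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# combination of speech unit ids -> (fused unit id, priority)
Held = Dict[FrozenSet[int], Tuple[str, int]]


def match_fused_unit(combo: FrozenSet[int], held: Held,
                     min_overlap: Optional[int] = None,
                     priority_reference: int = 0) -> Optional[str]:
    """Return the id of a held fused unit whose combination matches `combo`.

    min_overlap=None demands an exact match; otherwise sharing at least
    `min_overlap` speech units counts as a match (assumed partial-match rule).
    Larger priority numbers are assumed to mean higher priority.
    """
    best: Optional[Tuple[int, int, str]] = None
    for held_combo, (fused_id, priority) in held.items():
        shared = len(combo & held_combo)
        if min_overlap is None:
            matched = combo == held_combo
        else:
            matched = shared >= min_overlap
        if matched and priority >= priority_reference:
            # Prefer more shared units, then higher priority.
            candidate = (shared, priority, fused_id)
            if best is None or candidate > best:
                best = candidate
    return None if best is None else best[2]
```

When no held combination qualifies, the function returns `None`, and the caller would have to obtain a fused speech unit some other way.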

The speech synthesizer according to claim 2, wherein the speech synthesis means performs the speech synthesis using a retained speech distortion reference value determined based on at least one of the computation amount of the speech synthesis processing and the sound quality of the synthesized speech to be synthesized.

The speech synthesizer according to any one of claims 1 to 9, further comprising frequency information creating means for counting the frequency of use of the combination,
wherein the combination determining means determines the combination whose frequency of use is equal to or higher than a predetermined threshold.
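
As a rough sketch of the frequency-based determination in the preceding claim, the Python snippet below counts how often each combination of speech units occurs and keeps those at or above a threshold. The function name `frequent_combinations`, the integer speech unit ids, and the decision to treat a combination as unordered (by sorting its ids) are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Combination = Tuple[int, ...]


def frequent_combinations(observed: Iterable[Iterable[int]],
                          threshold: int) -> List[Combination]:
    """Count how often each combination occurs and keep the frequent ones."""
    # Sorting the ids makes [3, 1, 2] and [1, 2, 3] count as one combination.
    counts = Counter(tuple(sorted(combo)) for combo in observed)
    # Keep combinations used at least `threshold` times, most frequent first.
    return [combo for combo, n in counts.most_common() if n >= threshold]
```

Only the combinations returned here would then be passed on for fusion, which bounds the number of fused speech units that must be created and held.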

The speech synthesizer according to any one of claims 1 to 10, further comprising updating means for storing, in the fused speech unit holding means, the fused speech unit created by the fused speech unit creating means.

The speech synthesizer according to claim 11, wherein the updating means further comprises usage frequency counting means for counting the usage frequency of the fused speech unit created by the fused speech unit creating means, and
the updating means stores the corresponding fused speech unit in the fused speech unit holding means when the usage frequency counting means counts a value equal to or greater than a predetermined usage frequency reference value.

The speech synthesizer according to any one of claims 1 to 10, further comprising:
similarity calculating means for calculating the similarity between the fused speech unit created by the fused speech unit creating means and the fused speech unit held in the fused speech unit holding means; and
updating means for storing the fused speech unit created by the fused speech unit creating means in the fused speech unit holding means when the similarity calculated by the similarity calculating means is smaller than a predetermined value.

The speech synthesizer according to claim 13, wherein the similarity calculating means calculates the similarity using at least one of: a time-stretched spectral distance between two speech units, a squared error of the waveforms when prosody is modified, a pitch pattern distance corresponding to the speech units, and a prosodic duration distance.
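
The similarity-gated update in the two preceding claims might be sketched as follows. This sketch uses only two of the listed measures (pitch pattern distance and duration distance), and it assumes a mapping from distance to similarity of `1 / (1 + distance)`, which the claims do not state. The names `similarity` and `maybe_store` and the `Unit` representation are likewise hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (pitch pattern sampled at a fixed number of points, duration in ms)
Unit = Tuple[Sequence[float], float]


def similarity(a: Unit, b: Unit, w_dur: float = 1.0) -> float:
    """Assumed mapping: similarity = 1 / (1 + distance), in the range (0, 1]."""
    pitch_dist = math.dist(a[0], b[0])  # Euclidean pitch pattern distance
    dur_dist = abs(a[1] - b[1])         # prosodic duration distance
    return 1.0 / (1.0 + pitch_dist + w_dur * dur_dist)


def maybe_store(new_unit: Unit, held: List[Unit], limit: float) -> bool:
    """Store `new_unit` only if it is not too similar to any held unit."""
    if all(similarity(new_unit, unit) < limit for unit in held):
        held.append(new_unit)
        return True
    return False
```

Storing only dissimilar units keeps near-duplicates out of the holding means, so its size grows with the variety of prosody encountered rather than with the number of units created.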

The speech synthesizer according to claim 1, wherein the fused speech unit prosody information creating means creates, as the fused speech unit prosody information, a centroid of the prosody information for each of the plurality of speech units.
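
A centroid of prosody information can be illustrated as a component-wise mean, assuming each speech unit's prosody is represented as a fixed-length numeric vector (for example, mean F0 and duration). That representation and the function name `prosody_centroid` are assumptions for this sketch.

```python
from statistics import fmean
from typing import Sequence, Tuple

ProsodyVector = Tuple[float, ...]


def prosody_centroid(vectors: Sequence[ProsodyVector]) -> ProsodyVector:
    """Component-wise mean of equally long prosody vectors."""
    if not vectors:
        raise ValueError("at least one prosody vector is required")
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("prosody vectors must have the same length")
    # zip(*vectors) groups the i-th component of every vector together.
    return tuple(fmean(component) for component in zip(*vectors))
```

For three units with (F0, duration) of (100, 80), (120, 100), and (110, 90), the centroid is (110, 90), which would then be held as the prosody information of the fused speech unit.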

A speech synthesis method comprising:
a speech unit selection step of selecting a plurality of speech units from speech unit holding means, which holds, in association with each other, a plurality of speech units that correspond to the same unit of speech and differ from one another in the prosody of that unit of speech and speech unit prosody information indicating the prosody of each speech unit, the selection being based on the speech unit prosody information held in the speech unit holding means and on teacher speech prosody information indicating a preset prosody of a teacher speech;
a combination determining step of determining, from the plurality of speech units selected in the speech unit selection step, a combination of a plurality of speech units that satisfy a predetermined condition;
a fused speech unit creating step of creating a fused speech unit by fusing the plurality of speech units included in the determined combination;
a fused speech unit prosody information creating step of creating fused speech unit prosody information indicating the prosody of the fused speech unit, based on the prosody information corresponding to each of the plurality of speech units included in the determined combination;
a saving step of saving, in fused speech unit holding means and in association with each other, the fused speech unit created in the fused speech unit creating step and the fused speech unit prosody information created in the fused speech unit prosody information creating step;
an acquisition step of acquiring a prosodic sequence for a target speech to be synthesized, for each of a plurality of segments that are synthesis units of speech synthesis;
a retained speech distortion estimation step of estimating the degree of distortion between the fused speech unit prosody information held in the fused speech unit holding means and segment prosody information indicating the prosody of the segment obtained in the acquisition step;
a fused speech unit selection step of selecting the fused speech unit based on the degree of distortion estimated in the retained speech distortion estimation step; and
a speech synthesis step of generating synthesized speech by connecting the fused speech units selected for the respective segments in the fused speech unit selection step.

A speech synthesis program for causing a computer to execute speech synthesis processing, the program causing the computer to execute:
a speech unit selection step of selecting a plurality of speech units from speech unit holding means, which holds, in association with each other, a plurality of speech units that correspond to the same unit of speech and differ from one another in the prosody of that unit of speech and speech unit prosody information indicating the prosody of each speech unit, the selection being based on the speech unit prosody information held in the speech unit holding means and on teacher speech prosody information indicating a preset prosody of a teacher speech;
a combination determining step of determining, from the plurality of speech units selected in the speech unit selection step, a combination of a plurality of speech units that satisfy a predetermined condition;
a fused speech unit creating step of creating a fused speech unit by fusing the plurality of speech units included in the determined combination;
a fused speech unit prosody information creating step of creating fused speech unit prosody information indicating the prosody of the fused speech unit, based on the prosody information corresponding to each of the plurality of speech units included in the determined combination;
a saving step of saving, in fused speech unit holding means and in association with each other, the fused speech unit created in the fused speech unit creating step and the fused speech unit prosody information created in the fused speech unit prosody information creating step;
an acquisition step of acquiring a prosodic sequence for a target speech to be synthesized, for each of a plurality of segments that are synthesis units of speech synthesis;
a retained speech distortion estimation step of estimating the degree of distortion between the fused speech unit prosody information held in the fused speech unit holding means and segment prosody information indicating the prosody of the segment obtained in the acquisition step;
a fused speech unit selection step of selecting the fused speech unit based on the degree of distortion estimated in the retained speech distortion estimation step; and
a speech synthesis step of generating synthesized speech by connecting the fused speech units selected for the respective segments in the fused speech unit selection step.