JP2013250486A

JP2013250486A - Speech waveform database generation device, method, and program

Info

Publication number: JP2013250486A
Application number: JP2012126349A
Authority: JP
Inventors: Yusuke Ijima; 勇祐井島; Hideyuki Mizuno; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-06-01
Filing date: 2012-06-01
Publication date: 2013-12-12
Anticipated expiration: 2032-06-01
Also published as: JP5840075B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique for constructing a speech waveform database that enables synthesis of speech with stable quality even when the amount of speech data is small.

SOLUTION: A speech synthesis model 103 is obtained by training a model capable of representing each speech synthesis unit with a plurality of states, using speech data 101 that holds the speech parameters of each uttered speech and a set of utterance information (utterance information data 102) corresponding to each uttered speech in the speech data. A speech waveform is then generated using weighted speech parameters, each expressed as the weighted sum of a speech parameter generated from the speech synthesis model 103 and the corresponding speech parameter included in the speech data 101.

Description

The present invention relates to a technique for constructing a speech waveform database used for speech synthesis.

Two speech synthesis methods have become mainstream in recent years. One is the unit-connection speech synthesis method, which can synthesize high-quality speech close to a natural voice [see, for example, Patent Document 1]. The other is the HMM (hidden Markov model) speech synthesis method, which is inferior in quality to the unit-connection method but can synthesize speech of stable quality even from a small amount of speech data [see, for example, Non-Patent Document 1].

In the unit-connection speech synthesis method, a speech waveform database holding a large number of speech segments for each speech synthesis unit (syllable, phoneme, etc.) is constructed from a large amount of prerecorded speech data, ranging from several hours to several tens of hours. At synthesis time, the speech segment that best matches the conditions of the speech to be synthesized (text, pitch, preceding and following phoneme environment, speaking rate, etc.) is selected from the database for each speech synthesis unit, and the selected segments are connected. This makes high-quality speech synthesis possible.

In the HMM speech synthesis method, on the other hand, the model parameters (spectrum, F0, etc.) obtained by modeling with HMMs the speech synthesis units (syllables, phonemes, etc.) extracted from the speech data are averaged (smoothed) for each speech synthesis unit, and one speech model (HMM) is prepared per speech synthesis unit. As a result, even when only a small amount of speech data is used, speech of stable quality can be synthesized, although it sounds less like a natural voice.

Patent Document 1: Japanese Patent No. 2761552

Non-Patent Document 1: Masuko et al., "HMM-based speech synthesis using dynamic features", IEICE Transactions, vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.

A speech waveform database for the unit-connection speech synthesis method has the problem that, when only a small amount of speech data is available for constructing it, the quality of the synthesized speech is unstable; for example, abnormal sounds occur at the points where speech segments are connected.

The speech model of the HMM speech synthesis method, on the other hand, has the problem that the quality of the synthesized speech does not improve, because of the effect of averaging, even when a large amount of speech data (several hours to several tens of hours) is available.

In view of these problems, an object of the present invention is to provide a technique for constructing a speech waveform database that enables synthesis of speech with stable quality even when the amount of speech data is small.

A speech synthesis model is obtained by training a model capable of representing each speech synthesis unit with a plurality of states, using speech data that holds the speech parameters of each uttered speech and a set of utterance information (utterance information data) corresponding to each uttered speech in the speech data. A speech waveform is then generated using weighted speech parameters, each expressed as the weighted sum of a speech parameter generated from the speech synthesis model and the corresponding speech parameter included in the speech data.

The weights may be set so that, the larger the amount of speech data, the smaller the weight A applied to the speech parameter generated from the speech synthesis model becomes relative to the weight B applied to the speech parameter included in the speech data.

For example, speech parameters corresponding to each uttered speech are generated using the utterance information included in the utterance information data and the speech parameters included in the speech synthesis model. For each uttered speech, the weighted speech parameter is obtained as the sum of the generated speech parameter multiplied by the weight A and the speech parameter included in the speech data multiplied by the weight B. A speech waveform is then generated using the weighted speech parameters and a speech synthesis filter.

The weights may be set using either the total duration of the uttered speech included in the speech data or the number of speech synthesis units included in the speech data.

According to the present invention, when only a small amount of speech data is available, a stable speech waveform can be generated in the same way as with the HMM speech synthesis method, so the quality of the synthesized speech is more stable than with an ordinary unit-connection speech synthesis method. When a large amount of speech data is available, a high-quality speech waveform close to a natural voice can be generated in the same way as with the unit-connection speech synthesis method, so the quality of the synthesized speech is better than with the HMM speech synthesis method.

FIG. 1 is a functional block diagram of the embodiment.
FIG. 2 shows an example of phoneme segmentation information.
FIG. 3 shows an example of a specific functional configuration of the speech waveform database construction unit.
FIG. 4 shows an example of a specific processing flow of the speech waveform database construction processing.
FIG. 5 shows an example of the functional configuration of a speech synthesizer that uses the speech waveform database obtained in the embodiment.

An embodiment of the present invention is described below with reference to the drawings. Components common to the figures are given the same reference numerals, and duplicate descriptions are omitted.

In the embodiment of the present invention, examples of a "speech synthesis unit" include a phoneme, a syllable, and a demisyllable. When the invention is implemented with phonemes as the speech synthesis unit, for example, "speech synthesis unit" in the following description may be read as "phoneme".

The speech waveform database generation device 1 of this embodiment includes the following (see FIG. 1): a model learning unit 201 that obtains a speech synthesis HMM 103 by training on speech data 101 and utterance information data 102; a speech waveform database construction unit 202 (hereinafter abbreviated as the speech DB construction unit) that newly generates a speech waveform database 104 using the parameters (spectrum, F0, etc.) of the trained speech synthesis HMM 103 and the parameters (spectrum, F0, etc.) of the speech data 101 used for training; and a storage unit (not shown) that stores the speech data 101, the utterance information data 102, the speech synthesis HMM 103, and the speech waveform database 104.

<Speech data>
The speech data 101 is the speech data used to construct the speech waveform database 104 and is prepared in advance.
The speech data 101 holds, for example, the speech signals of N utterances by a single speaker and the speech parameters obtained by signal processing of those speech signals, such as pitch parameters (fundamental frequency F0, etc.) and spectral parameters (cepstrum, mel-cepstrum, etc.). The speech data 101 preferably includes speech parameters (spectrum, F0, etc.) for every speech synthesis unit needed in later speech synthesis.

<Utterance information>
The utterance information data 102 is a collection of information, such as pronunciation, for each speech synthesis unit (hereinafter referred to as utterance information) assigned to each uttered speech in the speech data 101. One piece of utterance information is assigned to each uttered speech in the speech data 101. The utterance information includes at least the start time and end time of each speech synthesis unit (segmentation information; when the speech synthesis unit is a phoneme, this corresponds to "phoneme segmentation information"). The start and end times are elapsed times measured from the start of each uttered speech, which is taken as 0 [sec]. FIG. 2 shows an example of phoneme segmentation information. In addition to the phoneme segmentation information, the utterance information may include accent information (accent type, accent phrase length), part-of-speech information, and the like.
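The segmentation information described above can be pictured as one small record per speech synthesis unit. The following is a minimal sketch rather than the data format used in the patent: the class names `Segment` and `UtteranceInfo`, their field names, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One speech synthesis unit (e.g. a phoneme) with its time span."""
    label: str    # unit label, e.g. a phoneme symbol
    start: float  # start time in seconds, measured from the utterance start (0 s)
    end: float    # end time in seconds, measured from the utterance start

    @property
    def duration(self) -> float:
        """Duration of the unit: end time minus start time."""
        return self.end - self.start


@dataclass
class UtteranceInfo:
    """Utterance information attached to one uttered speech (hypothetical layout)."""
    segments: List[Segment] = field(default_factory=list)
    accent_type: Optional[int] = None  # optional extra information

    def total_duration(self) -> float:
        """End time of the last unit, or 0.0 for an empty utterance."""
        return self.segments[-1].end if self.segments else 0.0


# Illustrative values only; they are not taken from FIG. 2.
info = UtteranceInfo(segments=[
    Segment("a", 0.00, 0.12),
    Segment("k", 0.12, 0.20),
    Segment("i", 0.20, 0.35),
])
```

Keeping the start and end times relative to the utterance start mirrors the description above and lets each unit's duration be computed without any other context.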

The speech waveform database generation device 1 is not limited to using the speech data 101 and the utterance information data 102 prepared as separate data, as illustrated in FIG. 1. For example, it can also use a single set of speech-utterance information data having a data structure in which utterance information is attached to each uttered speech in the speech data 101, that is, a data structure that describes the correspondence between the speech data and the utterance information.

<Model learning>
The model learning unit 201 obtains the speech synthesis HMM 103 by training HMMs using the speech data 101 and the utterance information data 102. A conventional technique can be used as the HMM training method [see, for example, Non-Patent Document 1]. The speech synthesis HMM 103 represents each speech synthesis unit as a model having a plurality of states, and each model parameter is denoted $\vec{\mu}_{ij}$. Here $\vec{\mu}_{ij}$ is the mean vector of the speech parameters of the j-th state in the HMM of the i-th speech synthesis unit, and is usually a multidimensional vector ($j = 1, \ldots, S_i$, where $S_i$ is the number of states in the HMM representing the i-th speech synthesis unit). In addition to the mean vectors, the model parameters may also store the variances, as well as the mean vectors and variances of the dynamic parameters.
The model trained by the model learning unit 201 need not be an HMM; any model that can represent each speech synthesis unit with a plurality of states (for example, a Markov model) may be used.

<Construction of the speech waveform database>
The speech DB construction unit 202 newly generates speech waveforms using the speech parameters (spectrum, F0, etc.) generated from the speech synthesis HMM 103 obtained by the model learning unit 201 and the speech parameters (spectrum, F0, etc.) of the speech data 101 used for training, and stores these waveforms as the speech waveform database 104.
An example of the processing performed by the speech DB construction unit 202 is described below (see FIGS. 3 and 4).

(1) Generation of speech parameters
First, the speech parameter generation unit 202a of the speech DB construction unit 202 uses the i-th utterance information ($i = 1, \ldots, N$, where $N$ is the number of utterances included in the speech data 101) to generate, from the speech synthesis HMM 103 obtained by the model learning unit 201, speech parameters (spectrum, F0, etc.) that have the same segmentation information as the i-th utterance information.

To generate the speech parameters, the segmentation information in the i-th utterance information is first used to determine the number of frames of the s-th state of the model (HMM) representing the p-th speech synthesis unit in the i-th uttered speech ($p = 1, \ldots, P_i$, where $P_i$ is the number of phonemes in the i-th uttered speech). The number of frames of each state is calculated by dividing the duration of the p-th speech synthesis unit (end time minus start time) equally among the $S_p$ states.
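The equal division of a unit's duration among its states can be sketched as follows. This is only an illustration under stated assumptions: the function name `frames_per_state`, the 5 ms frame shift, and the rule of giving leftover frames to the earliest states are not specified in the source.

```python
from typing import List


def frames_per_state(start: float, end: float, num_states: int,
                     frame_shift: float = 0.005) -> List[int]:
    """Split the duration of one speech synthesis unit equally among its states.

    start, end  : unit boundaries in seconds (from the segmentation information)
    num_states  : number of states S_p of the model for this unit
    frame_shift : analysis frame shift in seconds (ASSUMPTION: 5 ms)
    """
    if num_states <= 0:
        raise ValueError("num_states must be positive")
    if frame_shift <= 0:
        raise ValueError("frame_shift must be positive")
    # Total number of frames covered by the unit (never negative).
    total = max(0, round((end - start) / frame_shift))
    # ASSUMPTION: frames that do not divide evenly go to the earliest states.
    base, extra = divmod(total, num_states)
    return [base + (1 if s < extra else 0) for s in range(num_states)]
```

For instance, a 100 ms unit with five states and a 5 ms frame shift gets four frames per state, and the frame counts always sum to the total number of frames of the unit.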

Next, the mean vector $\vec{\mu}_{ps}$ of the parameters of the s-th state of the model (HMM) representing the p-th speech synthesis unit in the i-th uttered speech is repeated for the number of frames obtained above. This is done for $s = 1, \ldots, S_p$ and $p = 1, \ldots, P_i$, and all the frames are concatenated to obtain the speech parameter sequence of the i-th uttered speech.
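The repetition and concatenation of state mean vectors can be sketched as below. The function name `build_parameter_sequence` and the nested-list layout of its arguments are assumptions made for this illustration; the source only describes the procedure in prose.

```python
from typing import List, Sequence

Vector = List[float]


def build_parameter_sequence(state_means: Sequence[Sequence[Sequence[float]]],
                             state_frames: Sequence[Sequence[int]]) -> List[Vector]:
    """Repeat each state's mean vector for its frame count and concatenate.

    state_means[p][s]  : mean vector of state s of the model for unit p
    state_frames[p][s] : number of frames assigned to that state
    (hypothetical layout, one entry per speech synthesis unit in the utterance)
    """
    if len(state_means) != len(state_frames):
        raise ValueError("one frame-count list is required per unit")
    sequence: List[Vector] = []
    for means, frames in zip(state_means, state_frames):
        if len(means) != len(frames):
            raise ValueError("one frame count is required per state")
        for mean, count in zip(means, frames):
            # Copy the mean vector once per frame so frames can be edited independently.
            sequence.extend(list(mean) for _ in range(count))
    return sequence
```

The result is a piecewise-constant parameter sequence with one vector per frame, which is why the interpolation step that follows in the text is needed.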

Finally, the speech parameter sequence of the i-th uttered speech is interpolated to obtain the speech parameters 104a corresponding to the i-th uttered speech. A general interpolation technique such as spline interpolation can be used, but it is common to use the dynamic features and variances stored in the model, as in the technique disclosed in Non-Patent Document 1.

(2) Calculation of speech parameters
Next, the speech parameter calculation unit 202b of the speech DB construction unit 202 uses the speech parameters 104a and the speech data 101 used by the model learning unit 201 to calculate new weighted speech parameters for the i-th uttered speech. Let $\vec{s}_{ij}$ be the speech parameter in the speech parameters 104a corresponding to the j-th frame of the i-th uttered speech ($j = 1, \ldots, F_i$, where $F_i$ is the number of frames in the i-th uttered speech), and let $\vec{v}_{ij}$ be the speech parameter in the speech data 101 corresponding to the j-th frame of the i-th uttered speech. The newly calculated weighted speech parameter $\vec{v}'_{ij}$ for the j-th frame of the i-th uttered speech is then obtained as the weighted sum of $\vec{s}_{ij}$ and $\vec{v}_{ij}$ given below. The weight $\alpha$ is described later.

$$\vec{v}'_{ij} = \alpha \, \vec{v}_{ij} + (1 - \alpha) \, \vec{s}_{ij} \qquad (j = 1, \ldots, F_i)$$
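The weighted sum above applies frame by frame and element by element. A minimal sketch follows; the function name `blend_frames` and the list-of-lists representation of the parameter sequences are assumptions for illustration only.

```python
from typing import List, Sequence

Vector = List[float]


def blend_frames(original: Sequence[Sequence[float]],
                 generated: Sequence[Sequence[float]],
                 alpha: float) -> List[Vector]:
    """Compute v' = alpha * v + (1 - alpha) * s for every frame.

    original  : frames v_ij taken from the speech data
    generated : frames s_ij generated from the speech synthesis model
    alpha     : weight between 0 and 1 (1 keeps the original speech data)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if len(original) != len(generated):
        raise ValueError("both sequences must have the same number of frames")
    blended: List[Vector] = []
    for v_j, s_j in zip(original, generated):
        if len(v_j) != len(s_j):
            raise ValueError("frames must have the same dimension")
        blended.append([alpha * v + (1.0 - alpha) * s for v, s in zip(v_j, s_j)])
    return blended
```

With `alpha = 1.0` the function returns the original speech parameters unchanged, and with `alpha = 0.0` it returns the model-generated parameters, which matches the two limiting cases discussed for the weight.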

(3) Speech waveform generation
Next, the speech waveform generation unit 202c of the speech DB construction unit 202 generates a speech waveform using the weighted speech parameters (spectrum, F0, etc.) $\vec{v}'_{ij}$ ($j = 1, \ldots, F_i$) of the i-th uttered speech calculated by the speech parameter calculation unit 202b and a speech synthesis filter. A conventional technique can be used for the speech synthesis filter (see, for example, Reference A).
(Reference A) Imai et al., "Mel log spectrum approximation (MLSA) filter for speech synthesis", IEICE Transactions A, vol. J66-A, no. 2, pp. 122-129, Feb. 1983.

Steps (1) to (3) above are performed for each $i$ ($i = 1, \ldots, N$), and the speech waveforms generated for all the uttered speech included in the speech data 101 are saved in the storage unit as the speech waveform database 104.

Because the utterance information is not modified at all in the above processing, the speech synthesis processing itself is unaffected. The speech waveform database 104 can therefore be used without being limited to the unit-connection speech synthesis method.

<Weight α>
The weight α is a weighting coefficient used to calculate the new speech parameters and takes a value from 0 to 1 inclusive.
When α is small, the newly generated speech parameters have almost the same features as the speech parameters generated from the speech synthesis HMM 103. Synthesized speech of stable quality can therefore be generated, as with the HMM speech synthesis method, even when the amount of speech data is as small as a few minutes.
When α is close to 1, on the other hand, speech parameters almost identical to those in the original speech data 101 are generated. As with the unit-connection speech synthesis method, high-quality synthesized speech can therefore be generated if the amount of speech data is large, on the order of several hours.
In other words, by setting α dynamically according to the amount of speech data, synthesized speech of stable quality can be generated, as with the HMM speech synthesis method, when the amount of speech data is not sufficiently large, and synthesized speech of higher quality than with the HMM speech synthesis method can be generated when the amount of speech data is sufficiently large.

Two examples of how to calculate the weight α are described below.
(a) Using the duration of the speech data
Let len [sec] be the total duration of the uttered speech included in the speech data 101 used to construct the speech waveform database 104, and let L [sec] be the duration of speech at which sufficiently good quality is obtained with the unit-connection speech synthesis method. The weight α is calculated from len and L according to an equation. [The equation is not reproduced in this text.] Here, L is a constant given in advance; for a typical unit-connection speech synthesis method, a value of 7200 to 18000 [sec] (about 2 to 5 hours) is desirable.
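Because the equation for α is not shown in this text, the following sketch rests on an explicit assumption: α is taken to grow linearly with len and to saturate at 1 once len reaches L. This choice is consistent with the described behaviour (small α for little data, α near 1 for ample data), but it is not confirmed by the source. The function name `alpha_from_duration` is likewise hypothetical.

```python
def alpha_from_duration(total_seconds: float,
                        sufficient_seconds: float = 7200.0) -> float:
    """Weight alpha from the total duration of the speech data.

    total_seconds      : len, total duration of the uttered speech in seconds
    sufficient_seconds : L, duration giving sufficiently good quality
                         (the text suggests 7200 to 18000 seconds)

    ASSUMPTION: alpha = min(len / L, 1). The actual equation in the source
    is not reproduced in this text, so this form is only a plausible stand-in.
    """
    if sufficient_seconds <= 0:
        raise ValueError("sufficient_seconds must be positive")
    if total_seconds < 0:
        raise ValueError("total_seconds must not be negative")
    return min(total_seconds / sufficient_seconds, 1.0)
```

Under this assumption, 30 minutes of speech with L set to 2 hours gives α = 0.25, so the blended parameters stay close to the model-generated ones.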

(b) Using the number of speech synthesis units included in the speech data
With the setting in (a), the same weight α is applied to all speech synthesis units, so the quality of the synthesized speech may degrade when the amount of speech data is unevenly distributed across speech synthesis units. To calculate α for each speech synthesis unit, the number of speech synthesis units included in the speech data is therefore used. Let $n_j$ be the number of occurrences of speech synthesis unit j in the speech data, and let $N_j$ be the number of occurrences at which sufficiently good quality is obtained with the unit-connection speech synthesis method. The weight α is calculated from $n_j$ and $N_j$ according to an equation. [The equation is not reproduced in this text.] Here, $N_j$ is a constant given in advance; for a typical unit-connection speech synthesis method, a value of 500 to 1000 for vowels and voiced consonants and about 100 for unvoiced consonants is desirable.
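As in (a), the equation is not shown in this text, so the sketch below assumes a clipped ratio of the observed count to the sufficient count for each unit. Both this form and the function name `alpha_per_unit` are assumptions for illustration, not the source's definition.

```python
from typing import Dict


def alpha_per_unit(counts: Dict[str, int],
                   sufficient: Dict[str, int]) -> Dict[str, float]:
    """Weight alpha for each speech synthesis unit.

    counts     : n_j, occurrences of each unit j in the speech data
    sufficient : N_j, occurrences giving sufficiently good quality
                 (the text suggests 500 to 1000 for vowels and voiced
                 consonants and about 100 for unvoiced consonants)

    ASSUMPTION: alpha_j = min(n_j / N_j, 1). The actual equation in the
    source is not reproduced in this text.
    """
    alphas: Dict[str, float] = {}
    for unit, needed in sufficient.items():
        if needed <= 0:
            raise ValueError(f"sufficient count for {unit!r} must be positive")
        # Units that never occur in the speech data get alpha = 0.
        alphas[unit] = min(counts.get(unit, 0) / needed, 1.0)
    return alphas
```

A unit that is well covered in the recordings then keeps its original parameters, while a rare unit leans on the model-generated parameters.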

<Example of speech synthesis using the speech waveform database 104>
An example of a speech synthesizer 2 that performs speech synthesis using the speech waveform database 104 is described with reference to FIG. 5. The processing of this unit-connection speech synthesis method is outlined below.

The text analysis unit 501 performs text analysis on the input text 901 to be synthesized and obtains information 902 such as the reading and accents of the text 901.
The prosody generation unit 502 generates prosody using the information 902 obtained by text analysis and a prosody model 903 given in advance, and obtains prosody parameters 904 (F0, phoneme duration, etc.).
The segment selection and connection unit 503 selects the most appropriate speech segments using the information 902 obtained by text analysis, the prosody parameters 904, and the speech waveform database 104, and connects them to generate synthesized speech 905 corresponding to the text 901.

In the segment selection processing, the segment with the minimum cost is generally selected (see, for example, Patent Document 1). The segment selection processing is outlined below.
The total cost P of a speech segment candidate is generally expressed as a weighted sum of sub-cost functions:

$$P = \sum_{i=1}^{D} w_i C_i$$

Here, $C_i$ is a sub-cost function, $w_i$ is the weight for each sub-cost function, and $D$ is the number of sub-cost functions. Commonly used sub-cost functions include the F0 mean, the F0 slope, and the phoneme duration. Examples are given below.
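The weighted sum of sub-costs described above amounts to a dot product between the weights and the sub-cost values. A minimal sketch, with the function name `total_cost` chosen only for illustration:

```python
from typing import Sequence


def total_cost(sub_costs: Sequence[float], weights: Sequence[float]) -> float:
    """Total cost P as the weighted sum of D sub-cost values.

    sub_costs : values C_1 ... C_D already evaluated for one candidate
    weights   : weights w_1 ... w_D, one per sub-cost function
    """
    if len(sub_costs) != len(weights):
        raise ValueError("one weight is required per sub-cost")
    return sum(w * c for w, c in zip(weights, sub_costs))
```

Because the sub-costs are measured in different units (for example Hz for F0 and seconds for duration), the weights also serve to put them on a comparable scale.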

- F0 mean
The sub-cost function for the F0 mean $V_p$ of the prosody parameters and the F0 mean $V_s$ of the speech segment candidate is given by the following equation.
$$C_1(V_p, V_s) = (V_p - V_s)^2$$

- F0 slope
The sub-cost function for the F0 slope $F_p$ of the prosody parameters and the F0 slope $F_s$ of the speech segment candidate is given by the following equation.
$$C_2(F_p, F_s) = (F_p - F_s)^2$$

- Phoneme duration
The sub-cost function for the phoneme duration $T_p$ of the prosody parameters and the phoneme duration $T_s$ of the speech segment candidate is given by the following equation.
$$C_3(T_p, T_s) = (T_p - T_s)^2$$
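The three squared-difference sub-costs can be combined to pick the minimum-cost candidate for one target. The sketch below is illustrative only: the `Prosody` record, the function names, and the default weights are hypothetical, and a practical unit-selection system typically also weighs how well neighbouring segments join, which is not modelled here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Prosody:
    """Prosodic features of a target or of a speech segment candidate."""
    f0_mean: float   # F0 mean
    f0_slope: float  # F0 slope
    duration: float  # phoneme duration


def candidate_cost(target: Prosody, candidate: Prosody,
                   weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three squared-difference sub-costs C1, C2, C3."""
    c1 = (target.f0_mean - candidate.f0_mean) ** 2
    c2 = (target.f0_slope - candidate.f0_slope) ** 2
    c3 = (target.duration - candidate.duration) ** 2
    w1, w2, w3 = weights
    return w1 * c1 + w2 * c2 + w3 * c3


def select_segment(target: Prosody, candidates: Sequence[Prosody],
                   weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> Prosody:
    """Return the candidate with the minimum total cost for this target."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=lambda c: candidate_cost(target, c, weights))
```

In use, `select_segment` would be called once per speech synthesis unit with the prosody parameters as the target and the database entries for that unit as the candidates.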

<Hardware configuration example of the speech waveform database generation device>
The speech waveform database generation device according to the above embodiment includes a CPU (Central Processing Unit) or DSP (Digital Signal Processor), which may have cache memory and the like; RAM (Random Access Memory) and ROM (Read Only Memory) as memory; an external storage device such as a hard disk; and a bus that connects the CPU, DSP, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the speech waveform database generation device may also be provided with a device (drive) that can read from and write to a storage medium such as a CD-ROM.

The external storage device of the speech waveform database generation device stores the program for the speech waveform database generation processing described above and the data needed by that program (speech data, utterance information data, etc.). Storage is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device. Data obtained through the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like. A storage device that stores data, the addresses of its storage areas, and the like is referred to simply as the "storage unit".

The storage unit of the speech waveform database generation device stores, among others, a program for obtaining a speech synthesis model by training, with the speech data and the utterance information data, a model capable of representing each speech synthesis unit with a plurality of states, and a program for generating a speech waveform using weighted speech parameters expressed as the weighted sum of the speech parameters generated from the speech synthesis model and the speech parameters included in the speech data.

In the speech waveform database generation device, each program stored in the storage unit and the data needed by that program are read into the RAM as necessary and are interpreted, executed, and processed by the CPU. As a result, the CPU implements the predetermined functions (model learning unit, speech DB construction unit, etc.), and the construction of the speech waveform database described above is realized.

<Supplementary notes>
The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the invention. The processing described in the above embodiment need not be executed in time series in the order described; it may also be executed in parallel or individually, according to the processing capability of the device executing it or as required.

When the processing functions of the hardware entity (the speech waveform database generation device) described in the above embodiment are implemented by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer implements the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, flexible disk, magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEPROM (Electrically Erasable Programmable Read-Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or, each time a program is transferred to the computer from the server computer, it may sequentially execute processing according to the received program. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer; however, at least part of these processing contents may instead be realized in hardware.

Claims

A speech waveform database generation device comprising: a model learning unit that obtains a speech synthesis model by learning a model capable of representing each speech synthesis unit in a plurality of states, using speech data holding speech parameters of each uttered speech and a set of utterance information (utterance information data) corresponding to each uttered speech in the speech data; and
a speech waveform database construction unit that generates a speech waveform using a weighted speech parameter represented by a weighted sum of a speech parameter generated from the speech synthesis model and a speech parameter included in the speech data.

The speech waveform database generation device according to claim 1,
wherein the speech waveform database construction unit
sets the weights such that the larger the data amount of the speech data, the smaller the weight A for the speech parameter generated from the speech synthesis model becomes relative to the weight B for the speech parameter included in the speech data.

The speech waveform database generation device according to claim 2,
wherein the speech waveform database construction unit comprises:
a speech parameter generation unit that generates a speech parameter corresponding to each uttered speech, using the utterance information included in the utterance information data and the speech parameters included in the speech synthesis model;
a speech parameter calculation unit that obtains, for each uttered speech, the weighted speech parameter represented by the sum of the speech parameter generated by the speech parameter generation unit multiplied by the weight A and the speech parameter included in the speech data multiplied by the weight B; and
a speech waveform generation unit that generates a speech waveform using the weighted speech parameter and a speech synthesis filter.
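
The weighted sum recited in this claim can be illustrated with a minimal Python sketch. The function name `weighted_speech_parameters`, the representation of speech parameters as lists of per-frame vectors, and the example values are assumptions made for illustration only; the claim does not prescribe a data layout, and the speech parameter generation step and the speech synthesis filter are not shown.

```python
from typing import List, Sequence


def weighted_speech_parameters(
    generated: Sequence[Sequence[float]],
    recorded: Sequence[Sequence[float]],
    weight_a: float,
    weight_b: float,
) -> List[List[float]]:
    """Return weight_a * generated + weight_b * recorded, frame by frame.

    `generated` stands for parameters produced from the speech synthesis
    model and `recorded` for parameters held in the speech data
    (hypothetical names, not taken from the source).
    """
    if len(generated) != len(recorded):
        raise ValueError("both sequences must have the same number of frames")
    result = []
    for gen_frame, rec_frame in zip(generated, recorded):
        if len(gen_frame) != len(rec_frame):
            raise ValueError("frames must have the same dimensionality")
        result.append(
            [weight_a * g + weight_b * r for g, r in zip(gen_frame, rec_frame)]
        )
    return result


# Hypothetical two-frame, two-dimensional parameters with equal weights.
model_params = [[1.0, 2.0], [3.0, 4.0]]
data_params = [[3.0, 2.0], [1.0, 0.0]]
blended = weighted_speech_parameters(model_params, data_params, 0.5, 0.5)
```

In this sketch the weights are passed in by the caller, so any rule for choosing A and B can be used with it.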

The speech waveform database generation device according to claim 2 or 3,
wherein the total time length of the uttered speech included in the speech data, or the number of speech synthesis units included in the speech data, is used to set each weight.
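
One possible way to derive the two weights from the total time length of the speech data is sketched below in Python. The linear ramp, the constraint that A and B sum to 1, the function name `weights_from_duration`, and the 3600-second saturation point are all assumptions chosen for illustration; the claims only require that weight A become smaller relative to weight B as the data amount grows.

```python
from typing import Tuple


def weights_from_duration(
    total_seconds: float, saturation_seconds: float = 3600.0
) -> Tuple[float, float]:
    """Return (weight_a, weight_b) for a given total speech duration.

    weight_a applies to model-generated parameters and weight_b to the
    parameters held in the speech data. The linear ramp and the default
    saturation point are hypothetical, not taken from the source.
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if saturation_seconds <= 0:
        raise ValueError("saturation_seconds must be positive")
    # More data shifts weight toward the recorded parameters, capped at 1.
    weight_b = min(total_seconds / saturation_seconds, 1.0)
    weight_a = 1.0 - weight_b
    return weight_a, weight_b


# Hypothetical corpus sizes: no data, half an hour, two hours.
small = weights_from_duration(0.0)
medium = weights_from_duration(1800.0)
large = weights_from_duration(7200.0)
```

The same shape works if the number of speech synthesis units is used in place of the duration; only the argument and the saturation point would change.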

A speech waveform database generation method comprising: a model learning step of obtaining a speech synthesis model by learning a model capable of representing each speech synthesis unit in a plurality of states, using speech data holding speech parameters of each uttered speech and a set of utterance information (utterance information data) corresponding to each uttered speech in the speech data; and
a speech waveform database construction step of generating a speech waveform using a weighted speech parameter represented by a weighted sum of a speech parameter generated from the speech synthesis model and a speech parameter included in the speech data.

A program for causing a computer to function as the speech waveform database generation device according to any one of claims 1 to 4.