JP5722295B2 - Acoustic model generation method, speech synthesis method, apparatus and program thereof

Info

Publication number: JP5722295B2
Application number: JP2012248151A
Authority: JP
Inventors: 勇祐井島; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-11-12
Filing date: 2012-11-12
Publication date: 2015-05-20
Anticipated expiration: 2032-11-12
Also published as: JP2014095851A

Description

The present invention relates to an acoustic model generation method for generating an acoustic model used in HMM (Hidden Markov Model) speech synthesis, to a speech synthesis method, and to apparatuses and programs therefor.

In recent years, HMM speech synthesis has been proposed as a speech synthesis method (for example, Non-Patent Document 1). The acoustic model (speech database) in HMM speech synthesis holds one model per synthesis unit, whose parameters are obtained by averaging the spectrum and F0 of the speech data for that unit. As a result, even a small amount of speech data yields synthesized speech of stable quality, although it sounds less like a natural human voice.

On the other hand, as disclosed in Non-Patent Document 2, introducing tone coupling types between accent phrases is known to improve the naturalness of synthesized speech.

Non-Patent Document 1: Masuko et al., "HMM-based speech synthesis using dynamic features," IEICE Transactions, vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.
Non-Patent Document 2: Hakoda et al., "A study of derivation rules for tone coupling types in sentence speech," IEICE Technical Report, SP89-5, pp. 33-38, 1989.

Conventional HMM speech synthesis cannot take tone coupling types into account during model training or during speech synthesis, which degrades the quality of the synthesized speech. However, manually annotating the training speech data with tone coupling types is costly, so HMM speech synthesis that accounts for tone coupling types has hardly come into use.

The present invention has been made in view of this problem, and its object is to provide an acoustic model generation method that can train and generate an acoustic model to which tone coupling types are assigned automatically, a speech synthesis method, and apparatuses and programs therefor.

The acoustic model generation method of the present invention comprises a model learning process, a tone-coupling-type extraction process, and a tone-coupling-type model learning process. The model learning process trains a speech synthesis HMM using as input training speech data that includes pitch parameters and spectral parameters, and utterance information that includes phoneme segmentation information and accent information for the training speech data. The tone-coupling-type extraction process generates, from the speech synthesis HMM, speech parameters having the same phoneme segmentation information as the utterance information, and uses these speech parameters together with the parameters of the training speech data to extract the tone coupling type, which affects the shape of the pitch pattern between accent phrases. The tone-coupling-type model learning process takes the training speech data, the utterance information, and the tone coupling types as input, performs model training that takes the tone coupling types into account, and generates a tone-coupling-type acoustic model.

The speech synthesis method of the present invention comprises a text analysis process, a speech parameter generation process, and a speech synthesis filter process. The text analysis process takes the text to be synthesized as input, analyzes it, and outputs text information consisting of the reading, the accent, and the tone coupling type. The speech parameter generation process generates speech parameters using the text information and the tone-coupling-type acoustic model generated by the acoustic model generation method described above. The speech synthesis filter process generates a speech waveform from the speech parameters.

According to the acoustic model generation method of the present invention, an acoustic model to which tone coupling types are assigned automatically can be generated, so the cost of realizing HMM speech synthesis that accounts for tone coupling types can be reduced.

In addition, according to the speech synthesis method of the present invention, synthesized speech is generated using an acoustic model that takes tone coupling types into account, so the quality of the synthesized speech can be improved over ordinary HMM speech synthesis.

FIG. 1 is a diagram showing an example of the functional configuration of the acoustic model generation apparatus 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the acoustic model generation apparatus 100.
FIG. 3 is a diagram showing an example of phoneme segmentation information.
FIG. 4 is a diagram showing an example of a three-state speech synthesis HMM.
FIG. 5 is a diagram showing an example of the functional configuration of the tone-coupling-type extraction unit 20.
FIG. 6 is a diagram showing the concept of the speech parameters generated by the speech parameter generation means 201.
FIG. 7 is a diagram showing the concept of the interval between accent phrases.
FIG. 8 is a diagram showing the operation flow of the tone-coupling-type extraction unit 20.
FIG. 9 is a diagram showing the operation flow of the tone-coupling-type extraction unit 20'.
FIG. 10 is a diagram showing an example of the functional configuration of the speech synthesizer 200 of the present invention.
FIG. 11 is a diagram showing the operation flow of the speech synthesizer 200.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the acoustic model generation apparatus 100 of the present invention. FIG. 2 shows its operation flow. The acoustic model generation apparatus 100 comprises a model learning unit 10, a tone-coupling-type extraction unit 20, a tone-coupling-type model learning unit 30, and a control unit 40. The acoustic model generation apparatus 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The model learning unit 10 trains a speech synthesis HMM using as input training speech data that includes pitch parameters and spectral parameters, and utterance information that includes phoneme segmentation information and accent information for the training speech data (step S10). The training speech data is a recording of N sentences uttered by the speaker for whom the speech database is being built. The training speech data includes pitch parameters (fundamental frequency: F0) and spectral parameters (cepstrum, mel-cepstrum, etc.) obtained by applying signal processing to the speech signal.

These parameters are given for each fixed time interval called a frame. One frame is the time span (10 ms) covered by a predetermined number (for example, 160) of samples of the speech signal after it has been converted to a discrete digital signal at a sampling frequency of, for example, 16 kHz. Alternatively, the speech data itself may be supplied to the model learning unit 10, and the parameters may be generated by digital signal processing.
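The relationship between sampling frequency, samples per frame, and frame length stated above can be checked with a short calculation. The following Python sketch is illustrative only; the function names are hypothetical and do not appear in the specification.

```python
def frame_duration_ms(sample_rate_hz: float, samples_per_frame: int) -> float:
    """Length of one frame in milliseconds."""
    return 1000.0 * samples_per_frame / sample_rate_hz


def frames_in(duration_ms: float, frame_ms: float) -> int:
    """Number of whole frames covering a span (rounded to the nearest frame)."""
    return int(round(duration_ms / frame_ms))


# Values taken from the example in the text: 16 kHz sampling, 160 samples per frame.
frame_ms = frame_duration_ms(16_000, 160)
print(frame_ms)                    # 10.0
print(frames_in(150.0, frame_ms))  # 15
```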

Utterance information is information attached to each utterance in the training speech data. It consists of at least phoneme segmentation information, which gives the start time and end time of each phoneme in the utterance, and accent information such as accent phrase boundaries, accent types, and accent phrase lengths. FIG. 3 shows an example of phoneme segmentation information. In FIG. 3, the first column is the phoneme name, the second column is the start time, and the third column is the end time. The start and end times are elapsed times measured from the start of each utterance, which is taken as 0 [s].

The model learning unit 10 trains an HMM for speech synthesis from the training speech data and the utterance information. This HMM is what is called a left-to-right HMM with three or five states. FIG. 4 shows an example of a three-state speech synthesis HMM. 1 denotes the first state, which is the start of the HMM; 2 denotes the second state and 3 the third state. The HMM is expressed as a probabilistic chain consisting of the self-transitions a_11, a_22, a_33 and the transitions to the next state a_12, a_23.
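As an illustration of the left-to-right topology described above, the following Python sketch builds a transition matrix in which each state has only a self-transition and a transition to the next state. The helper name and the numeric probabilities are assumptions chosen for illustration; the specification gives no concrete values, and a trained model would estimate them from data.

```python
def left_to_right_transitions(self_probs):
    """Build the transition matrix A of a left-to-right HMM.

    self_probs[s] is the self-transition probability a_ss of state s.
    The only other non-zero entry in row s is the transition to state s+1,
    set to 1 - a_ss. For the last state, the remaining mass 1 - a_ss is the
    probability of leaving the model and is not stored in the matrix.
    """
    n = len(self_probs)
    matrix = [[0.0] * n for _ in range(n)]
    for s, a_ss in enumerate(self_probs):
        if not 0.0 <= a_ss <= 1.0:
            raise ValueError("self-transition probabilities must lie in [0, 1]")
        matrix[s][s] = a_ss
        if s + 1 < n:
            matrix[s][s + 1] = 1.0 - a_ss
    return matrix


# Three-state example with hypothetical values for a_11, a_22, a_33.
A = left_to_right_transitions([0.6, 0.7, 0.8])
for row in A:
    print(row)
```

Backward and skip transitions stay at zero, which is what makes the chain strictly left-to-right.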

Output probability distributions b_1(o_t), b_2(o_t), and b_3(o_t) are associated with states 1, 2, and 3 of the HMM, respectively. The output probability distributions are model parameters that characterize the sound, such as F0 and the cepstrum. The model learning unit 10 trains the speech synthesis HMM according to the phoneme labels, using, for example, the Baum-Welch algorithm. The trained speech synthesis HMM is output to the tone-coupling-type extraction unit 20. It may also be stored in a recording device as the speech synthesis HMM 50.

The tone-coupling-type extraction unit 20 generates, from the speech synthesis HMM, speech parameters having the same phoneme segmentation information as the utterance information, and extracts the tone coupling type between accent phrases using these speech parameters and the parameters of the externally supplied training speech data (step S20). The operation of the tone-coupling-type extraction unit 20 is described in detail later.

The tone-coupling-type model learning unit 30 takes as input the externally supplied training speech data and utterance information together with the tone coupling types extracted by the tone-coupling-type extraction unit 20, performs model training that takes the tone coupling types into account, and generates a tone-coupling-type acoustic model (step S30). Model training in the tone-coupling-type model learning unit 30 differs from that in the model learning unit 10 in that the tone coupling type is added to the training. The control unit 40 controls the sequence of operations of each unit.

As described above, the acoustic model generation apparatus 100 can automatically extract tone coupling types from the training speech data and train a speech synthesis HMM with those tone coupling types added. An acoustic model that takes tone coupling types into account can therefore be provided at low cost.

FIG. 5 shows a more specific example of the functional configuration of the tone-coupling-type extraction unit 20, which is the principal part of the present invention, and its operation is described in more detail below. The tone-coupling-type extraction unit 20 comprises speech parameter generation means 201 and tone-coupling-type extraction means 202.

The speech parameter generation means 201 takes as input the speech synthesis HMM generated by the model learning unit 10 and the externally supplied utterance information, and generates speech parameters having the same phoneme segmentation information as the utterance information (step S201, FIG. 2). FIG. 6 shows the concept of the speech parameters generated by the speech parameter generation means 201.

First, the number of frames in the s-th state of the p-th phoneme of utterance i is determined. The number of frames in each state is calculated by dividing the duration of the p-th phoneme equally among the states. For example, the duration of the phoneme "o" shown in FIG. 3 is 150 [ms]. If the speech synthesis HMM has, for example, three states, 50 [ms] is allotted to each state. If one frame is, for example, 10 [ms], each state consists of five frames (third row of FIG. 6).
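The equal division of a phoneme's duration among the HMM states can be sketched in Python as follows. The function name, the start and end times used in the example, and the handling of durations that do not divide evenly (the leftover frames are given to the earliest states) are assumptions for illustration; the specification describes only the equal-division case.

```python
def frames_per_state(start_s: float, end_s: float,
                     num_states: int = 3, frame_ms: float = 10.0) -> list:
    """Allot the frames of one phoneme to the HMM states as evenly as possible.

    start_s / end_s are the phoneme's start and end times in seconds, as in
    the phoneme segmentation information. Leftover frames (when the frame
    count is not a multiple of num_states) go to the earliest states; this
    tie-breaking rule is an assumption, not part of the specification.
    """
    if end_s < start_s:
        raise ValueError("end time precedes start time")
    total_frames = int(round((end_s - start_s) * 1000.0 / frame_ms))
    base, extra = divmod(total_frames, num_states)
    return [base + (1 if s < extra else 0) for s in range(num_states)]


# A 150 ms phoneme (hypothetical start/end times) with three states and 10 ms frames.
print(frames_per_state(0.30, 0.45))  # [5, 5, 5]
```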

Next, the speech parameter generation means 201 generates the speech parameter sequence of utterance i by assigning the mean vector μ_ps of the model parameters to each frame (fourth row of FIG. 6). Finally, interpolation is applied to this speech parameter sequence. As disclosed in Non-Patent Document 1, the interpolation of the speech parameters uses the dynamic features and variances of the model parameters. A general interpolation technique such as spline interpolation may be used instead.

The tone-coupling-type extraction means 202 extracts the tone coupling type between accent phrases using the speech parameter sequence generated by the speech parameter generation means 201 and the externally supplied training speech data. FIG. 7 shows the concept of the interval between accent phrases. For example, the sentence 「今日は打ち合わせです。」 ("Today is a meeting.") consists of three accent phrases: 「今日は」, 「打ち合わせ」, and 「です」. The positions (times) of these accent phrases are obtained by referring to the utterance information.
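Returning to the frame-wise assignment performed by the speech parameter generation means 201, the step of assigning the mean μ_ps to every frame of state s can be sketched as below. This is a minimal illustration with scalar means (for example, log F0); the function name and the numeric values are hypothetical, and the subsequent interpolation (by dynamic features and variances, or by splines) is deliberately omitted.

```python
def expand_state_means(state_means, state_frames):
    """Repeat each state's mean once per frame allotted to that state.

    state_means[s]  : mean parameter of state s (a scalar here for simplicity)
    state_frames[s] : number of frames allotted to state s
    Returns the piecewise-constant parameter sequence before interpolation.
    """
    if len(state_means) != len(state_frames):
        raise ValueError("one frame count is required per state")
    sequence = []
    for mean, n_frames in zip(state_means, state_frames):
        sequence.extend([mean] * n_frames)
    return sequence


# Hypothetical log-F0 means for the three states of one phoneme.
print(expand_state_means([4.8, 5.0, 4.9], [2, 1, 2]))
# [4.8, 4.8, 5.0, 4.9, 4.9]
```

The resulting sequence is a step function over time; the interpolation described in the text is what smooths these steps into a continuous trajectory.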

The tone-coupling-type extraction means 202 obtains ms_ij, the mean of the logarithm of F0 around the boundary between the j-th and (j+1)-th accent phrases of the i-th utterance in the speech parameter sequence generated by the speech parameter generation means 201, and mo_ij, the mean of the logarithm of F0 around the same accent phrase boundary in the training speech data. It then calculates their difference d and extracts the tone coupling type as weak coupling when d is greater than a threshold α and as strong coupling when d is smaller.

In general, when the strength of the coupling between accent phrases is small (strong coupling), F0 of the training speech data near the boundary between the two accent phrases tends to be low, and when the strength of the coupling is large (weak coupling), F0 near the boundary between the two accent phrases tends to be high. The F0 of the generated speech parameter sequence, on the other hand, comes from a speech synthesis HMM trained without regard to tone coupling types, so it takes an intermediate height that reflects neither strong nor weak coupling. The tone coupling type can therefore be determined as strong coupling when the F0 of the training speech data is low compared with the F0 of the speech parameter sequence generated by the speech parameter generation means 201 (the difference d is small), and as weak coupling when it is high (the difference d is large).

The processing performed by the tone-coupling-type extraction unit 20 is shown in FIG. 7 and its operation flow in FIG. 8, and is described more concretely with reference to them. In FIG. 7, the horizontal axis is elapsed time t [ms] and the vertical axis is F0 [Hz]; the figure shows F0 at one accent boundary.

Using the speech synthesis HMM generated by the model learning unit 10 and the externally supplied utterance information, the speech parameter generation means 201 generates, for every utterance i, a speech parameter sequence having the same phoneme segmentation information as the utterance information (step S201a of loop S201, FIG. 8).

The tone-coupling-type extraction means 202 obtains ms_ij, the mean of the logarithm of F0 around the boundary between the j-th and (j+1)-th accent phrases of the i-th utterance in the speech parameter sequence, and mo_ij, the mean of the logarithm of F0 around the same accent phrase boundary in the training speech data, and calculates their difference d (step S202a of loop S202). The mean F0 near the boundary between accent phrases is calculated using F0 within t [ms] before and after the accent phrase boundary (see ms_ij and mo_ij in FIG. 7).

The tone-coupling-type extraction means 202 then judges the accent phrase to be weakly coupled if the difference d is greater than the threshold α (step S202b) and strongly coupled if d is less than or equal to α (step S202b'). This extraction of the tone coupling type is performed for every accent phrase of every utterance.
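The threshold decision of steps S202a and S202b can be sketched in Python as follows. Several points are assumptions made for illustration and are not stated in the specification: the sign convention d = mo_ij - ms_ij (chosen so that training-data F0 higher than the generated F0 gives a large d and hence weak coupling, in line with the explanation above), the exclusion of unvoiced frames (F0 = 0) from the mean, the function names, and the numeric values.

```python
import math


def mean_log_f0(f0_hz):
    """Mean of log F0 over the voiced frames in a window around the boundary.

    Frames with F0 <= 0 are treated as unvoiced and skipped (an assumption).
    """
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        raise ValueError("no voiced frames in the window")
    return sum(math.log(f) for f in voiced) / len(voiced)


def classify_coupling(generated_f0_hz, recorded_f0_hz, alpha):
    """Return 'weak' if d > alpha, otherwise 'strong' (d <= alpha)."""
    ms = mean_log_f0(generated_f0_hz)  # from the HMM-generated sequence
    mo = mean_log_f0(recorded_f0_hz)   # from the training speech data
    d = mo - ms                        # assumed sign convention
    return "weak" if d > alpha else "strong"


# Hypothetical F0 values [Hz] within t ms before and after one boundary.
print(classify_coupling([100.0, 100.0], [120.0, 120.0], alpha=0.1))  # weak
print(classify_coupling([100.0, 100.0], [90.0, 0.0, 90.0], alpha=0.1))  # strong
```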

[Modification 1]
The example above obtains the tone coupling type from the mean F0 within t [ms] before and after the accent phrase boundary, but the tone coupling type may instead be determined from the difference between the mean F0 values over all of the speech making up the j-th and (j+1)-th accent phrases. Let fs_{i,j} and fs_{i,j+1} be the means for the j-th and (j+1)-th accent phrases of the i-th utterance in the speech parameter sequence, and let fo_{i,j} and fo_{i,j+1} be the means for the same accent phrases in the training speech data. With the difference for the speech parameter sequence defined as ds = fs_{i,j} - fs_{i,j+1} and the difference for the training speech data as do = fo_{i,j} - fo_{i,j+1}, the coupling may be judged weak when the difference between the two (do - ds) is greater than the threshold α and strong when it is smaller.
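The decision rule of Modification 1 can be written compactly as in the following Python sketch. The function name and the numeric values are hypothetical. The text leaves open whether the phrase means are taken over F0 or log F0 and how the case (do - ds) = α is handled; here equality is treated as strong coupling, following the main embodiment.

```python
def classify_by_phrase_means(fs_j, fs_j1, fo_j, fo_j1, alpha):
    """Modification 1: compare the phrase-to-phrase F0 drop in both sources.

    fs_j, fs_j1 : mean F0 of accent phrases j and j+1 in the generated sequence
    fo_j, fo_j1 : mean F0 of the same accent phrases in the training speech data
    """
    ds = fs_j - fs_j1  # difference in the generated speech parameter sequence
    do = fo_j - fo_j1  # difference in the training speech data
    return "weak" if (do - ds) > alpha else "strong"


# Hypothetical phrase means (for example, in log Hz).
print(classify_by_phrase_means(5.0, 4.9, 5.0, 4.6, alpha=0.2))  # weak
print(classify_by_phrase_means(5.0, 4.9, 5.0, 4.9, alpha=0.2))  # strong
```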

[Modification 2]
The description so far extracts only two tone coupling types, weak coupling and strong coupling, but any number N of tone coupling types can be extracted. FIG. 9 shows the operation flow of the tone-coupling-type extraction unit 20 when it is configured to extract N tone coupling types.

FIG. 9 differs from FIG. 8 in that it uses multiple thresholds α_i: step S202b, which evaluates the difference, compares the difference d with each of the thresholds α_1, α_2, ..., α_{N-1} and assigns one of N coupling types. Classifying into two or more tone coupling types in this way makes it possible to produce more natural synthesized speech.
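One way to realize the comparison against N-1 thresholds is a binary search over the sorted threshold list, as in the Python sketch below. The function name, the class numbering (0 for the smallest differences, N-1 for the largest), and the threshold values are assumptions for illustration. With a single threshold the rule reduces to the two-class case: d > α gives class 1 (weak) and d <= α gives class 0 (strong).

```python
from bisect import bisect_left


def coupling_class(d, thresholds):
    """Map the difference d to one of N classes using N-1 ascending thresholds.

    Returns k such that thresholds[k-1] < d <= thresholds[k], with class 0
    for d <= thresholds[0] and class N-1 for d > thresholds[-1].
    """
    thresholds = list(thresholds)
    if thresholds != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    return bisect_left(thresholds, d)


# Three classes from two hypothetical thresholds alpha_1 = 0.1, alpha_2 = 0.3.
for d in (0.05, 0.1, 0.2, 0.5):
    print(d, coupling_class(d, [0.1, 0.3]))
```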

[Speech synthesizer]
FIG. 10 shows an example of the functional configuration of the speech synthesizer 200 of the present invention. FIG. 11 shows its operation flow. The speech synthesizer 200 comprises a text analysis unit 210, a speech parameter generation unit 220, a tone-coupling-type acoustic model 230, a speech synthesis filter unit 250, and a control unit 240. The speech synthesizer 200 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The text analysis unit 210 takes the text to be synthesized as input, analyzes it, and outputs text information consisting of the reading, the accent, and the tone coupling type (step S210). The tone-coupling-type acoustic model 230 is an acoustic model generated by the acoustic model generation apparatus 100 described above, trained with the tone coupling types taken into account.

The speech parameter generation unit 220 generates speech parameters using the tone-coupling-type acoustic model 230 and the text information (step S220). The speech synthesis filter unit 250 generates a speech waveform using the speech parameters output by the speech parameter generation unit 220 (step S250). Steps S210 to S250 are repeated until all of the text has been processed (step S240). The control unit 240 controls this repetition.

According to the speech synthesizer 200 of the present invention, speech is synthesized on the basis of an acoustic model that takes tone coupling types into account, so the quality of the synthesized speech can be improved over ordinary HMM speech synthesis.

When the processing means of the above apparatuses are realized by a computer, the processing of the functions each apparatus should have is described by a program. Executing this program on the computer realizes the processing means of each apparatus on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk drive, a flexible disk, magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a recording device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of this processing may be realized in hardware.

Claims

An acoustic model generation method comprising:
a model learning process of training and generating a speech synthesis HMM using as input training speech data including pitch parameters and spectral parameters, and utterance information including phoneme segmentation information and accent information for the training speech data;
a tone-coupling-type extraction process of generating, from the speech synthesis HMM, speech parameters having the same phoneme segmentation information as the utterance information, and extracting, using the speech parameters and the parameters of the training speech data, a tone coupling type that affects the shape of the pitch pattern between accent phrases; and
a tone-coupling-type model learning process of taking the training speech data, the utterance information, and the tone coupling type as input, performing model training that takes the tone coupling type into account, and generating a tone-coupling-type acoustic model.

The acoustic model generation method according to claim 1, wherein the tone-coupling-type extraction process comprises:
a speech parameter generation step of generating a speech parameter sequence having the same phoneme segmentation information as the utterance information, using as input the speech synthesis HMM generated in the model learning process and the utterance information; and
a tone-coupling-type extraction step of extracting the tone coupling type between accent phrases using the speech parameter sequence and the parameters of the training speech data.

A speech synthesis method using a tone-coupling-type acoustic model generated by the acoustic model generation method according to claim 1 or 2, the speech synthesis method comprising:
a text analysis process of taking text to be synthesized as input, analyzing the text, and outputting text information consisting of a reading, an accent, and a tone coupling type;
a speech parameter generation process of generating speech parameters using the tone-coupling-type acoustic model and the text information; and
a speech synthesis filter process of generating a speech waveform using the speech parameters.

An acoustic model generation apparatus comprising:
a model learning unit that trains and generates a speech synthesis HMM using as input training speech data including pitch parameters and spectral parameters, and utterance information including phoneme segmentation information and accent information for the training speech data;
a tone-coupling-type extraction unit that generates, from the speech synthesis HMM, speech parameters having the same phoneme segmentation information as the utterance information, and extracts, using the speech parameters and the parameters of the training speech data, a tone coupling type that affects the shape of the pitch pattern between accent phrases; and
a tone-coupling-type model learning unit that takes the training speech data, the utterance information, and the tone coupling type as input, performs model training that takes the tone coupling type into account, and generates a tone-coupling-type acoustic model.

The acoustic model generation apparatus according to claim 4, wherein the tone-coupling-type extraction unit comprises:
speech parameter generation means for generating a speech parameter sequence having the same phoneme segmentation information as the utterance information, using as input the speech synthesis HMM generated by the model learning unit and the utterance information; and
tone-coupling-type extraction means for extracting the tone coupling type between accent phrases using the speech parameter sequence and the parameters of the training speech data.

A speech synthesizer comprising:
a tone-coupling-type acoustic model generated by the acoustic model generation apparatus according to claim 4 or 5;
a text analysis unit that takes text to be synthesized as input, analyzes the text, and outputs text information consisting of a reading, an accent, and a tone coupling type;
a speech parameter generation unit that generates speech parameters using the tone-coupling-type acoustic model and the text information; and
a speech synthesis filter unit that generates a speech waveform using the speech parameters.

A program for causing a computer to function as the acoustic model generation apparatus according to claim 4 or the speech synthesizer according to claim 6.