JP6495781B2

JP6495781B2 - Voice parameter generation device, voice parameter generation method, program

Info

Publication number: JP6495781B2
Application number: JP2015161861A
Authority: JP
Inventors: 伸克北条; 勇祐井島; 宮崎　昇; 昇宮崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-08-19
Filing date: 2015-08-19
Publication date: 2019-04-03
Anticipated expiration: 2035-08-19
Also published as: JP2017040747A

Description

The present invention relates to speech synthesis technology, and more particularly to learning hidden Markov models for speech synthesis from speech data.

One method of generating synthetic speech from speech data is speech synthesis based on hidden Markov models (HMMs). In HMM speech synthesis, quality degrades because the statistical processing over-smooths the speech parameters. A post-filter based on the modulation spectrum is one technique for suppressing this degradation (Non-Patent Document 1).

An outline of the speech synthesis device of Non-Patent Document 1 is given below with reference to FIGS. 1 to 5. FIG. 1 is a block diagram showing the configuration of the speech synthesis device 9 of Non-Patent Document 1. FIGS. 2 and 4 are block diagrams showing the configurations of the model learning unit 901 and the speech synthesis unit 902, respectively, which constitute the speech synthesis device 9. FIGS. 3 and 5 are flowcharts showing the operations of the model learning unit 901 and the speech synthesis unit 902, respectively. As shown in FIG. 1, the speech synthesis device 9 includes a model learning unit 901, a speech synthesis unit 902, a speech data recording unit 903, and a context data recording unit 904. As shown in FIG. 2, the model learning unit 901 includes an acoustic model learning unit 910, an acoustic model recording unit 911, a speech parameter generation unit 912, a modulation spectrum correction coefficient calculation unit 913, and a modulation spectrum correction coefficient recording unit 914. As shown in FIG. 4, the speech synthesis unit 902 includes a text analysis unit 920, a speech parameter generation unit 921, a post-filter unit 922, and a speech waveform generation unit 923.

The speech data recording unit 903 stores speech data of the single speaker whose voice is to be synthesized (hereinafter, the target speaker). The context data recording unit 904 stores context data, which is information about the utterances contained in that speech data (hereinafter, utterance information). The speech data is generated from speech in which the target speaker utters a plurality of sentences, and consists of acoustic features obtained by applying signal processing to the speech signal. Examples of acoustic features include pitch parameters such as the fundamental frequency (F0) and spectral parameters such as the cepstrum and mel-cepstrum. The context data is utterance information that corresponds one-to-one with the utterances in the speech data. The utterance information includes at least phoneme information (pronunciation information) and accent information (accent type, accent phrase length), and may further include part-of-speech information.

The model learning unit 901 generates an acoustic model and modulation spectrum correction coefficients from the speech data recorded in the speech data recording unit 903 and the context data recorded in the context data recording unit 904, and records them in the acoustic model recording unit 911 and the modulation spectrum correction coefficient recording unit 914, respectively. The acoustic model learning unit 910 learns an acoustic model from the speech data and the context data (S910). More specifically, it learns the acoustic model as a hidden Markov model, using as training data the speech data of the target speaker and the context data, that is, the utterance information about the utterances in that speech data. The speech parameter generation unit 912 generates a speech parameter sequence from the acoustic model learned in S910 and the context data (S912). The modulation spectrum correction coefficient calculation unit 913 calculates the modulation spectrum correction coefficients from the speech parameter sequence generated in S912 and the speech parameter sequence of natural speech obtained from the speech data in the speech data recording unit 903 (S913).

The speech synthesis unit 902 generates a speech waveform from the text to be synthesized, using the acoustic model recorded in the acoustic model recording unit 911 and the modulation spectrum correction coefficients recorded in the modulation spectrum correction coefficient recording unit 914, and outputs synthetic speech. The text analysis unit 920 analyzes the text input to the speech synthesis unit 902 and generates context data such as the reading and accent of the text (S920). The speech parameter generation unit 921 generates a speech parameter sequence from the context data obtained in S920 and the acoustic model recorded in the acoustic model recording unit 911 (S921). The post-filter unit 922 applies a post-filter, designed from the modulation spectrum correction coefficients recorded in the modulation spectrum correction coefficient recording unit 914, to the speech parameter sequence generated in S921, and thereby generates a post-filtered speech parameter sequence (S922). The speech waveform generation unit 923 generates a speech waveform from the post-filtered speech parameter sequence and outputs synthetic speech (S923). The speech parameter generation unit 921 may have the same function as the speech parameter generation unit 912 of the model learning unit 901.

Non-Patent Document 1: Shinnosuke Takamichi, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura, "Post-filter based on modulation spectrum in HMM speech synthesis", IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers, November 2013, Vol. 113, No. 308, pp. 19-24.

The HMM speech synthesis technique of Non-Patent Document 1, which uses a post-filter based on the modulation spectrum, evaluates the plausibility of a speech parameter sequence from two different viewpoints in order to suppress quality degradation and obtain high-quality speech. It generates and then converts the speech parameter sequence according to each viewpoint in turn, and outputs synthetic speech. The details are as follows.

First, the plausibility of a speech parameter sequence is expressed from the viewpoint of the static and dynamic features of the speech parameters given the context data. That is, a likelihood under the acoustic model is formed, and the speech parameter sequence that maximizes this likelihood is generated (S921). Next, from the viewpoint of the temporal variation of the speech parameters, speech quality is improved by bringing the sequence closer to that of natural speech. That is, a post-filter based on the modulation spectrum correction coefficients converts the sequence into one close to the speech parameter sequence of natural speech (the post-filtered speech parameter sequence) (S922).

In the processing of S922, however, the speech parameter sequence is converted uniformly without taking the acoustic model likelihood into account. As a result, the sequence may be converted to values that are not valid from the viewpoint of the static and dynamic features of the speech parameters given the context data, and the naturalness of the speech may not improve sufficiently.

Furthermore, because the post-filter is applied to convert the speech parameter sequence after the sequence has already been generated once, the amount of computation increases. This lengthens the delay between input of the text to be synthesized and output of the synthetic speech.

An object of the present invention is therefore to provide a speech parameter generation device that generates, with a small amount of computation, a speech parameter sequence that is close to natural speech and valid from both viewpoints: the static and dynamic features of the speech parameters given the context data, and the temporal variation of the speech parameters.

One aspect of the present invention generates a speech parameter sequence for the text to be synthesized from context data on that text, using two models. The first is an acoustic model, which is a hidden Markov model used to generate speech parameter sequences. The second is a modulation spectrum model expressed by the linear prediction coefficients (a₁,a₂,…,a_p) obtained when a Gaussian distribution N(Ac;0,σ²I) is assumed as the distribution model of the speech parameter sequence. Here, A is a lower triangular matrix built from the linear prediction coefficients, c is the speech parameter sequence, and σ²I is the covariance matrix of the Gaussian distribution (I being the identity matrix).

According to the present invention, generating the speech parameter sequence using the acoustic model and the modulation spectrum model simultaneously makes it possible to generate a speech parameter sequence close to natural speech with a small amount of computation.

FIG. 1 is a block diagram showing the configuration of the speech synthesis device 9 of Non-Patent Document 1.
FIG. 2 is a block diagram showing the configuration of the model learning unit 901 of Non-Patent Document 1.
FIG. 3 is a flowchart showing the operation of the model learning unit 901 of Non-Patent Document 1.
FIG. 4 is a block diagram showing the configuration of the speech synthesis unit 902 of Non-Patent Document 1.
FIG. 5 is a flowchart showing the operation of the speech synthesis unit 902 of Non-Patent Document 1.
FIG. 6 is a block diagram showing the configuration of the speech synthesis device 1 of Embodiment 1.
FIG. 7 is a block diagram showing the configuration of the model learning unit 101 of Embodiment 1.
FIG. 8 is a flowchart showing the operation of the model learning unit 101 of Embodiment 1.
FIG. 9 is a block diagram showing the configuration of the speech synthesis unit 102 of Embodiment 1.
FIG. 10 is a flowchart showing the operation of the speech synthesis unit 102 of Embodiment 1.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and redundant description is omitted.

<Key points of the invention of Embodiment 1>
To express the viewpoint of the static and dynamic features of the speech parameters given the context data, the acoustic model also used in Non-Patent Document 1 is used. To express the viewpoint of the temporal variation of the speech parameters, a modulation spectrum model described by a linear prediction model is used. A speech parameter sequence that simultaneously maximizes the likelihoods of the acoustic model and the modulation spectrum model is then obtained. The algorithm for obtaining this sequence is constructed as an extension of the conventional speech parameter generation method (the technique described in Reference Non-Patent Document 3, cited later). As a result, a speech parameter sequence that is valid from both viewpoints and close to natural speech can be generated with a small amount of computation, and synthetic speech more natural than before can be synthesized.

<Specific description of Embodiment 1>
The speech synthesis device of Embodiment 1 is described below with reference to FIGS. 6 to 10. FIG. 6 is a block diagram showing the configuration of the speech synthesis device 1 of Embodiment 1. FIGS. 7 and 9 are block diagrams showing the configurations of the model learning unit 101 and the speech synthesis unit 102, respectively, which constitute the speech synthesis device 1. FIGS. 8 and 10 are flowcharts showing the operations of the model learning unit 101 and the speech synthesis unit 102, respectively. As shown in FIG. 6, the speech synthesis device 1 of Embodiment 1 includes a model learning unit 101, a speech synthesis unit 102, an acoustic model training speech data recording unit 103, an acoustic model training context data recording unit 104, a modulation spectrum model training speech data recording unit 105, and a modulation spectrum model training context data recording unit 106. As shown in FIG. 7, the model learning unit 101 includes an acoustic model learning unit 110, an acoustic model recording unit 111, a modulation spectrum model learning unit 112, and a modulation spectrum model recording unit 113. As shown in FIG. 9, the speech synthesis unit 102 includes a text analysis unit 120, a speech parameter generation unit 121, and a speech waveform generation unit 122.

The acoustic model training speech data recording unit 103 and the acoustic model training context data recording unit 104 store the same kind of data as the speech data recording unit 903 and the context data recording unit 904 of Non-Patent Document 1. The modulation spectrum model training speech data recording unit 105, on the other hand, stores modulation spectrum model training speech data, that is, the speech data used to train the modulation spectrum model. This data may consist of all or part of the target speaker's speech data for the same sentences as those used in the acoustic model training speech data. It may also include speech data of the target speaker for sentences different from those used in the acoustic model training speech data, and speech data of different sentences uttered by speakers other than the target speaker. The modulation spectrum model training context data recording unit 106 stores the context data corresponding to the speech data in the modulation spectrum model training speech data recording unit 105.

(Model learning unit 101)
The model learning unit 101 generates the two models that serve as the target speaker's models for speech synthesis, an acoustic model and a modulation spectrum model, from the data recorded in the acoustic model training speech data recording unit 103, the acoustic model training context data recording unit 104, the modulation spectrum model training speech data recording unit 105, and the modulation spectrum model training context data recording unit 106. The acoustic model learning unit 110 learns an acoustic model from the target speaker's speech data recorded in the acoustic model training speech data recording unit 103 and the context data recorded in the acoustic model training context data recording unit 104 (S110). The acoustic model learning unit 110 may be identical to the acoustic model learning unit 910 of Non-Patent Document 1.

The modulation spectrum model learning unit 112 generates a modulation spectrum model from the speech data recorded in the modulation spectrum model training speech data recording unit 105 and the context data recorded in the modulation spectrum model training context data recording unit 106 (S112). The representation of the modulation spectrum model and the method of generating it (the learning algorithm) are described below.

(1) Representation of the modulation spectrum model
Non-Patent Document 1 defines the modulation spectrum as the power spectrum of the speech parameter sequence. It attributes the quality degradation of synthetic speech in HMM speech synthesis mainly to the high-frequency components of the modulation spectrum being lower than those of natural speech, and designs a post-filter that corrects the high-frequency modulation spectrum.
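As a rough illustration of the definition used in Non-Patent Document 1, the modulation spectrum of a single parameter trajectory can be computed as the power spectrum of that trajectory. The sketch below assumes NumPy and a one-dimensional trajectory; the helper name `modulation_spectrum` is hypothetical, and the FFT length, windowing, and any normalization used in the cited work are not given in this text, so those choices are assumptions.

```python
import numpy as np


def modulation_spectrum(trajectory, n_fft=None):
    """Power spectrum of one speech-parameter trajectory (hypothetical helper).

    trajectory: 1-D array holding one parameter dimension over T frames.
    n_fft: FFT length (assumption: defaults to the trajectory length).
    """
    c = np.asarray(trajectory, dtype=float)
    n_fft = len(c) if n_fft is None else n_fft
    spectrum = np.fft.rfft(c, n=n_fft)
    return np.abs(spectrum) ** 2


if __name__ == "__main__":
    # A trajectory that oscillates exactly 5 times over 100 frames
    # concentrates its modulation-spectrum energy in frequency bin 5.
    t = np.arange(100)
    c = np.sin(2 * np.pi * 5 * t / 100)
    print(int(np.argmax(modulation_spectrum(c))))
```

An over-smoothed trajectory would show less energy in the higher bins of this spectrum than a natural one, which is the degradation the post-filter of Non-Patent Document 1 targets.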

In Embodiment 1, unlike Non-Patent Document 1, the modulation spectrum is handled in the time domain rather than the frequency domain. Specifically, the generative model of the speech parameter sequence in its temporal aspect is assumed to be the distribution model defined by the following Gaussian distribution.

N(Ac;0,σ²I)   (Formula 1)

Here, T is the time length of one sentence, c=(c₁,c₂,…,c_T) is the speech parameter sequence (the sequence of speech parameters c_t at each time t), a=(a₁,a₂,…,a_p) is the parameter that expresses the modulation spectrum model as a linear prediction model (hereinafter, the model parameter of the modulation spectrum model), P is the order of the linear prediction model, and σ² is the variance parameter of the Gaussian distribution. A is a T-row, T-column lower triangular matrix determined by a, and I is the identity matrix of order T.

Because A is a T-row, T-column lower triangular matrix as described above, this assumption about the generative model of the speech parameter sequence is equivalent to assuming an all-pole model as the generative model of the modulation spectrum.
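The matrix A itself is not reproduced in this text. One form consistent with the surrounding definitions, offered as a reconstruction rather than as the patent's own equation, is a banded lower triangular Toeplitz matrix with ones on the diagonal and the linear prediction coefficients on the first P sub-diagonals. The sign convention and the use of P as the index of the last coefficient (the text writes a_p) are assumptions.

```latex
\begin{align}
A_{ij} &=
\begin{cases}
1       & i = j \\
a_{i-j} & 1 \le i - j \le P \\
0       & \text{otherwise}
\end{cases}
\qquad (i, j = 1, \dots, T) \\
(Ac)_t &= c_t + \sum_{p=1}^{\min(P,\,t-1)} a_p\, c_{t-p} = e_t,
\qquad e \sim \mathcal{N}(0, \sigma^{2} I)
\end{align}
```

Under this reading, c is the output of the all-pole filter 1/A(z), with A(z) = 1 + a_1 z^{-1} + ... + a_P z^{-P}, driven by white Gaussian noise e. This is why the Gaussian assumption on Ac corresponds to an all-pole model of the modulation spectrum.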

(2) Method of generating the modulation spectrum model (learning algorithm)
The modulation spectrum model is learned by solving a maximum likelihood estimation problem for the speech parameter sequences c_n given by the speech data recorded in the modulation spectrum model training speech data recording unit 105 and the context data recorded in the modulation spectrum model training context data recording unit 106. The model parameter a^ of the modulation spectrum model is the value of a that maximizes the likelihood of the sequences c_n under (Formula 1). Equivalently, the estimate can be expressed as A^ in terms of the T-row, T-column lower triangular matrix A that appears in (Formula 1).

This problem has the same form as the estimation of a linear prediction model. For example, the Levinson-Durbin-Itakura algorithm (Reference Non-Patent Document 2) can be used to efficiently obtain the modulation spectrum model that solves the maximum likelihood estimation problem above.
(Reference Non-Patent Document 2: N. Levinson, "The Wiener RMS (Root Mean Square) Error Criterion in Filter Design and Prediction", J. Mathematical Phys., 25 (1947), pp. 261-278.)
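The estimation step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes NumPy, one-dimensional trajectories, the autocorrelation method (which makes the normal equations Toeplitz so that the standard Levinson-Durbin recursion applies), and the sign convention e_t = c_t + Σ a_p c_{t-p}. The function names are hypothetical.

```python
import numpy as np


def autocorrelation(c, order):
    """Autocorrelation lags r[0..order] of one trajectory (unnormalized)."""
    c = np.asarray(c, dtype=float)
    return np.array([np.dot(c[: len(c) - k], c[k:]) for k in range(order + 1)])


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations by the Levinson-Durbin recursion.

    Convention (assumption): residual e_t = c_t + sum_p a[p] * c_{t-p}.
    Returns (a, err) with a[0] == 1 and err the total residual energy.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for order i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def fit_modulation_spectrum_model(sequences, order):
    """Estimate (a_1..a_P, sigma^2) from several training trajectories c_n."""
    r = sum(autocorrelation(c, order) for c in sequences)
    a, err = levinson_durbin(r, order)
    n_frames = sum(len(c) for c in sequences)
    return a[1:], err / n_frames
```

Summing the autocorrelation lags over all training sequences before running the recursion corresponds to maximizing the joint likelihood of the sequences under a shared coefficient vector a.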

The maximum likelihood estimation of the linear prediction coefficients a=(a₁,a₂,…,a_p) described above is known to be equivalent to modeling the modulation spectrum of the speech parameters of natural speech with an all-pole model and finding the coefficients a=(a₁,a₂,…,a_p) that minimize the Itakura-Saito divergence.

(Speech synthesis unit 102)
The speech synthesis unit 102 generates a speech waveform from the text to be synthesized, using the acoustic model recorded in the acoustic model recording unit 111 and the modulation spectrum model recorded in the modulation spectrum model recording unit 113, and outputs synthetic speech. The text analysis unit 120 generates context data, such as the reading and accent of the text, from the text input to the speech synthesis unit 102 (S120). The algorithm used to generate the context data may be the same as that of Reference Non-Patent Document 3 or Reference Non-Patent Document 4.
(Reference Non-Patent Document 3) Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, Satoshi Imai, "HMM-based speech synthesis using dynamic features", Transactions of the Institute of Electronics, Information and Communication Engineers, December 1996, Vol. J79-D-II, No. 12, pp. 2184-2190.
(Reference Non-Patent Document 4) Satoshi Imai, Kazuo Sumita, Chieko Furuichi, "Mel log spectrum approximation (MLSA) filter for speech synthesis", Transactions of the Institute of Electronics, Information and Communication Engineers, February 1983, Vol. J66-A, No. 2, pp. 122-129.

The speech parameter generation unit 121 generates a speech parameter sequence from the context data obtained in S120, the acoustic model recorded in the acoustic model recording unit 111, and the modulation spectrum model recorded in the modulation spectrum model recording unit 113 (S121).

The algorithm used to generate the speech parameter sequence is obtained as an extension of the speech parameter sequence generation algorithm of Reference Non-Patent Document 3. In the algorithm of Reference Non-Patent Document 3, the speech parameter sequence c⁺ to be generated is the sequence c that maximizes the acoustic model likelihood, which is a Gaussian distribution over Wc with mean μ_q and covariance Σ_q. Here, λ denotes the parameters of the acoustic model as a hidden Markov model, w is the input text, W is the matrix that expresses the time differences of the speech parameter sequence c=(c₁,c₂,…,c_T), and μ_q and Σ_q are the mean parameter sequence and covariance matrix of the Gaussian distribution obtained from λ and w.
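The generation criterion of Reference Non-Patent Document 3 can be written out with the symbols defined above. The notation below is a reconstruction under those definitions, not the patent's own equation; the closed form follows from setting the gradient of the quadratic log-likelihood to zero.

```latex
\begin{align}
c^{+} &= \operatorname*{arg\,max}_{c}\; P(Wc \mid \lambda, w)
       = \operatorname*{arg\,max}_{c}\; \mathcal{N}(Wc;\, \mu_q, \Sigma_q) \\
      &= \left(W^{\top}\Sigma_q^{-1}W\right)^{-1} W^{\top}\Sigma_q^{-1}\mu_q
\end{align}
```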

Similarly, the speech parameter sequence c^ obtained in S121 is obtained by maximizing an objective function that combines both models, where λ^(a) is the acoustic model recorded in the acoustic model recording unit 111, a^ is the model parameter of the modulation spectrum model recorded in the modulation spectrum model recording unit 113, and w^ is the text input to the speech synthesis unit 102. The objective function is the product of two factors: the acoustic model likelihood, which is a Gaussian distribution over Wc with mean μ_q and covariance Σ_q, and the modulation spectrum model likelihood N(Ac;0,σ²I). Here, μ_q and Σ_q are the mean parameter sequence and covariance matrix of the Gaussian distribution obtained from λ^(a) and w^, and A is the lower triangular matrix corresponding to the model parameter a^ of the modulation spectrum model.

The first factor of the objective function corresponds to the acoustic model likelihood and the second factor to the modulation spectrum model likelihood. Maximizing their product simultaneously maximizes the naturalness of the speech as seen from the two viewpoints.
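The maximization of this product can be worked through as follows. The derivation is a sketch that assumes both factors are plain Gaussians with no additional weighting between them; the patent's own equations are not reproduced in this text, so the exact form may differ.

```latex
\begin{align}
\hat{c} &= \operatorname*{arg\,max}_{c}\;
  \mathcal{N}(Wc;\, \mu_q, \Sigma_q)\; \mathcal{N}(Ac;\, 0, \sigma^{2} I) \\
L(c) &= -\tfrac{1}{2}\,(Wc - \mu_q)^{\top}\Sigma_q^{-1}(Wc - \mu_q)
        - \tfrac{1}{2\sigma^{2}}\, c^{\top}A^{\top}A\, c + \text{const} \\
0 &= \nabla_c L(\hat{c})
   = -W^{\top}\Sigma_q^{-1}(W\hat{c} - \mu_q) - \tfrac{1}{\sigma^{2}}\, A^{\top}A\, \hat{c} \\
\hat{c} &= \left(W^{\top}\Sigma_q^{-1}W + \tfrac{1}{\sigma^{2}}\, A^{\top}A\right)^{-1}
           W^{\top}\Sigma_q^{-1}\mu_q
\end{align}
```

Compared with the conventional solution, the only change is the added term A^T A / σ² inside the inverse, so the sequence is still obtained by solving one linear system.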

As in the algorithm of Reference Non-Patent Document 3, the speech parameter sequence c^ is then obtained in closed form.
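A minimal sketch of such a closed-form generation step is shown below. It assumes NumPy, a single feature dimension, diagonal covariances, one delta feature with the window 0.5 * (c_{t+1} - c_{t-1}), and a matrix A with ones on the diagonal and the coefficients a_p on the sub-diagonals. All function names are hypothetical, and the dense solve is used only for clarity.

```python
import numpy as np


def delta_window_matrix(T):
    """Stack static and delta windows into W with shape (2T, T).

    Assumption: one delta feature, delta_t = 0.5 * (c_{t+1} - c_{t-1}),
    with frames outside the utterance treated as zero.
    """
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                   # static row for frame t
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5      # delta row, previous frame
        if t < T - 1:
            W[2 * t + 1, t + 1] = 0.5       # delta row, next frame
    return W


def lpc_matrix(a, T):
    """Lower triangular Toeplitz matrix A with ones on the diagonal."""
    A = np.eye(T)
    for p, a_p in enumerate(a, start=1):
        A += a_p * np.eye(T, k=-p)
    return A


def generate_parameters(mu, var, a=None, sigma2=1.0):
    """Closed-form parameter generation for one feature dimension.

    mu, var: length-2T mean and diagonal variance, ordered
             [static_0, delta_0, static_1, delta_1, ...].
    a, sigma2: modulation spectrum model; a=None gives the conventional
               acoustic-model-only solution.
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    T = len(mu) // 2
    W = delta_window_matrix(T)
    precision = W.T @ (W / var[:, None])    # W^T Sigma^{-1} W
    rhs = W.T @ (mu / var)                  # W^T Sigma^{-1} mu
    if a is not None:
        A = lpc_matrix(a, T)
        precision = precision + (A.T @ A) / sigma2
    return np.linalg.solve(precision, rhs)
```

Both terms in the system matrix are banded, so a banded symmetric solver such as `scipy.linalg.solveh_banded` could replace the dense solve to keep the cost roughly linear in T. That substitution is a design suggestion, not something stated in the source.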

The speech waveform generation unit 122 generates a speech waveform from the speech parameter sequence c^ generated in S121 and outputs synthetic speech (S122). The algorithm used for speech waveform generation may be the same as that of Reference Non-Patent Document 3 or Reference Non-Patent Document 4.

The speech synthesis device 1 of this embodiment obtains the speech parameter sequence that maximizes the product of the acoustic model likelihood and the modulation spectrum model likelihood. It can therefore generate a speech parameter sequence that is valid from both viewpoints: the static and dynamic features of the speech parameters given the context data, and the temporal variation of the speech parameters. As a result, a speech parameter sequence close to natural speech can be generated with a small amount of computation.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is an example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above, the data needed in the processing of this program, and so on. (Storage is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device.) Data and the like obtained through the processing of these programs are stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data needed for processing each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as ... unit, ... means, and so on).

The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first, for example, temporarily stores in its own storage device the program recorded on the portable recording medium or the program transferred from the server computer. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or, each time a program is transferred to the computer from the server computer, it may execute processing according to the received program in sequence. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.

Claims

A speech parameter generation device comprising:
a speech parameter generation unit that generates, from context data relating to text to be subjected to speech synthesis, a speech parameter sequence for the text, using an acoustic model, which is a hidden Markov model used to generate speech parameter sequences, and a modulation spectrum model, which is expressed by linear prediction coefficients (a₁, a₂, ..., a_p) when a Gaussian distribution N(Ac; 0, σ²I) is assumed as the distribution model of the speech parameter sequence (where A is the lower triangular matrix given by the following formula, c is the speech parameter sequence, and σ²I is the covariance matrix of the Gaussian distribution, I being the identity matrix).

The speech parameter generation device according to claim 1, wherein
the Gaussian distribution corresponding to the acoustic model is N(Wc; μ_q, Σ_q) (where c is the speech parameter sequence, W is a matrix representing the time differences of the speech parameter sequence c, and μ_q and Σ_q are, respectively, the mean parameter sequence and the covariance matrix of the Gaussian distribution), and
the speech parameter sequence ĉ generated by the speech parameter generation unit is given by the following formula.

The speech parameter generation device according to claim 1 or 2, further comprising:
an acoustic model learning unit that learns the acoustic model using, as training data, first speech data and first context data, which is speech information relating to the utterances contained in the first speech data; and
a modulation spectrum model learning unit that, using as training data second speech data and second context data, which is speech information relating to the utterances contained in the second speech data, learns A obtained by the following formula as the model parameter of the modulation spectrum model (where c_n is a speech parameter sequence derived from the second speech data and the second context data).

A speech parameter generation method comprising:
a speech parameter generation step of generating, from context data relating to text to be subjected to speech synthesis, a speech parameter sequence for the text, using an acoustic model, which is a hidden Markov model used to generate speech parameter sequences, and a modulation spectrum model, which is expressed by linear prediction coefficients (a₁, a₂, ..., a_p) when a Gaussian distribution N(Ac; 0, σ²I) is assumed as the distribution model of the speech parameter sequence (where A is the lower triangular matrix given by the following formula, c is the speech parameter sequence, and σ²I is the covariance matrix of the Gaussian distribution, I being the identity matrix).

The speech parameter generation method according to claim 4, wherein
the Gaussian distribution corresponding to the acoustic model is N(Wc; μ_q, Σ_q) (where c is the speech parameter sequence, W is a matrix representing the time differences of the speech parameter sequence c, and μ_q and Σ_q are, respectively, the mean parameter sequence and the covariance matrix of the Gaussian distribution), and
the speech parameter sequence ĉ generated in the speech parameter generation step is given by the following formula.

A program for causing a computer to function as the speech parameter generation device according to any one of claims 1 to 3.