JP6626052B2

JP6626052B2 - Acoustic model generation method, speech synthesis method, acoustic model generation device, speech synthesis device, program

Info

Publication number: JP6626052B2
Application number: JP2017153135A
Authority: JP
Inventors: 伸克北条; 勇祐井島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-08-08
Filing date: 2017-08-08
Publication date: 2019-12-25
Anticipated expiration: 2037-08-08
Also published as: JP2019032427A

Description

The present technology relates to DNN-based speech synthesis, and more specifically to an acoustic model generation method, a speech synthesis method, an acoustic model generation device, a speech synthesis device, and a program.

DNN-based techniques are one way to train a speech synthesis model from speech data and generate synthesized speech (Non-Patent Document 1). Multi-domain DNN modeling is a related technique that makes efficient use of training data from a plurality of domains to synthesize high-quality speech for each domain. Examples of domains include speakers (Non-Patent Documents 2 and 3) and dialogue act information (Non-Patent Document 4). Multi-domain modeling methods exploit model configurations such as speaker codes (Non-Patent Document 2) and a shared hidden layer (SHU) (Non-Patent Document 3). FIGS. 1 and 2 outline this approach.

As shown in FIG. 1, the acoustic model learning unit 913 of the acoustic model generation device 91 trains an acoustic model using the multi-domain speech DB stored in advance in the multi-domain speech DB storage unit 911 and the multi-domain context DB stored in advance in the multi-domain context DB storage unit 912, obtains a multi-domain acoustic model, and stores it in the multi-domain acoustic model storage unit 914 (S913).

As shown in FIG. 2, the text analysis unit 921 of the speech synthesis device 92 analyzes the input text to obtain a context (S921). The speech parameter generation unit 922 of the speech synthesis device 92 inputs the context and the number of the domain to be synthesized into the multi-domain acoustic model stored in the multi-domain acoustic model storage unit 914, and generates speech parameters (S922). The speech waveform generation unit 923 of the speech synthesis device 92 generates a speech waveform from the obtained speech parameters to obtain synthesized speech (S923).

Non-Patent Document 1: Zen et al., “Statistical parametric speech synthesis using deep neural networks,” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7962-7966, IEEE, 2013.
Non-Patent Document 2: N. Hojo, Y. Ijima, and H. Mizuno, “An investigation of DNN-based speech synthesis using speaker codes,” Interspeech 2016, pp. 2278-2282, 2016.
Non-Patent Document 3: Y. Fan, Y. Qian, F. K. Soong, and L. He, “Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4475-4479, IEEE, 2015.
Non-Patent Document 4: N. Hojo et al., “対話行為情報を表現可能な音声合成の検討” (A study of speech synthesis capable of expressing dialogue act information), Japanese Society for Artificial Intelligence 2016 National Convention, 2016.
Non-Patent Document 5: T. Nose and A. Ito, “Analysis of spectral enhancement using global variance in HMM-based speech synthesis,” INTERSPEECH, 2014.
Non-Patent Document 6: T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Transactions on Information and Systems, vol. 90, no. 5, pp. 816-824, 2007.
Non-Patent Document 7: S. Takamichi et al., “A postfilter to modify the modulation spectrum in HMM-based speech synthesis,” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2014.

Building a multi-domain speech DB incurs costs such as speech recording, and building a multi-domain context DB incurs annotation costs for phonemes, accent types, and the like. Partly because of these costs, the amount of data in a multi-domain DB can be unevenly distributed across domains. In the objective function used to train a multi-domain DNN, the share of the terms belonging to each domain is proportional to that domain's amount of data. Consequently, when a model is trained without accounting for this imbalance (heterogeneity) in data amount, the modeling accuracy of a domain with only a small amount of data (for example, the mean squared error between the speech parameters estimated for an input context and the ground-truth speech parameters) has less influence on the objective function than that of other domains. Such a domain may therefore be modeled inaccurately, degrading speech quality.
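The proportionality described above can be illustrated numerically. The following is a minimal sketch, assuming a frame-summed squared-error objective and an identical average error per frame in every domain; the function name, the domain names, and the frame counts are hypothetical and do not come from the patent.

```python
def objective_share_by_domain(frame_counts, per_frame_error=1.0):
    """Return each domain's share of a frame-summed squared-error objective.

    frame_counts: dict mapping domain name -> number of training frames.
    per_frame_error: assumed average error per frame, the same for all domains.
    """
    domain_terms = {d: n * per_frame_error for d, n in frame_counts.items()}
    total = sum(domain_terms.values())
    return {d: term / total for d, term in domain_terms.items()}


# Hypothetical frame counts: speaker_3 has far less data than the others.
shares = objective_share_by_domain(
    {"speaker_1": 600_000, "speaker_2": 300_000, "speaker_3": 100_000}
)
```

Under these assumptions, the domain with one tenth of the frames contributes about one tenth of the objective, so a large error on that domain changes the total only slightly.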

Accordingly, an object of the present invention is to provide an acoustic model generation method that generates an acoustic model achieving high-quality speech synthesis even for a domain for which only a small amount of training data is available.

Speech data is defined as the stored result of analyzing the speech parameters of the speech included in a database for DNN training. Context data is defined as the stored result of analyzing the utterance context of the speech included in a database for DNN training. A domain is defined as a categorical representation of information contained in speech other than the context. A multi-domain speech DB holds the speech data of speech in a plurality of domains, and a multi-domain context DB holds the utterance context data of speech in a plurality of domains.

The acoustic model generation method of the present invention includes three steps.

In the first step, a multi-domain homogenization context DB is generated by adding contexts to the context data of each domain whose total number of frames in the multi-domain context DB is not the maximum, and pseudo-speech data is generated for each piece of context data in the multi-domain homogenization context DB to produce a multi-domain homogenization pseudo-speech DB.

In the second step, the multi-domain speech DB and the multi-domain homogenization pseudo-speech DB are integrated to generate a multi-domain homogeneous speech DB, and the multi-domain context DB and the multi-domain homogenization context DB are integrated to generate a multi-domain homogeneous context DB.

In the third step, an acoustic model is trained using the multi-domain homogeneous speech DB and the multi-domain homogeneous context DB as training data.

According to the acoustic model generation method of the present invention, an acoustic model that achieves high-quality speech synthesis can be generated even for a domain for which only a small amount of training data is available.

FIG. 1 is a block diagram showing the configuration of a conventional acoustic model generation device.
FIG. 2 is a block diagram showing the configuration of a conventional speech synthesis device.
FIG. 3 is a block diagram showing the configuration of the acoustic model generation device of the first embodiment.
FIG. 4 is a flowchart showing the operation of the acoustic model generation device of the first embodiment.
FIG. 5 is a block diagram showing the configuration of the multi-domain homogenization DB generation unit of the first embodiment.
FIG. 6 is a flowchart showing the operation of the multi-domain homogenization DB generation unit of the first embodiment.
FIG. 7 is a block diagram showing the configuration of the multi-domain homogeneous DB generation unit of the first embodiment.
FIG. 8 is a flowchart showing the operation of the multi-domain homogeneous DB generation unit of the first embodiment.
FIG. 9 is a block diagram showing the configuration of the speech synthesis device of the first embodiment.
FIG. 10 is a flowchart showing the operation of the speech synthesis device of the first embodiment.
FIG. 11 is a block diagram showing the configuration of the acoustic model generation device of the second embodiment.
FIG. 12 is a flowchart showing the operation of the acoustic model generation device of the second embodiment.
FIG. 13 is a block diagram showing the configuration of the multi-domain homogenization post-filter pseudo-speech DB generation unit of the second embodiment.
FIG. 14 is a block diagram showing the configuration of the multi-domain homogeneous DB generation unit of the second embodiment.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate descriptions are omitted.

≪Explanation of terms≫
<Speech parameters>
The F0 information (pitch), spectral envelope information (cepstrum, mel-cepstrum, etc.), and the like of each utterance, obtained by applying signal processing to a speech signal.

<Context>
Information such as pronunciation assigned to an utterance. A context must include phoneme information (pronunciation information) and accent information (accent type, accent phrase length). It may additionally include part-of-speech information and the like. The start and end times of each phoneme (phoneme segmentation information) may also be stored.

<Speech data>
The stored result of analyzing the speech parameters of the speech included in a database for DNN training.

<Context data>
The stored result of analyzing the utterance context of the speech included in a database for DNN training.

<Domain>
A categorical representation of information contained in speech other than the context. Examples include speakers ('speaker 1', ..., 'speaker N'), emotions ('joy', 'anger', 'sadness', 'pleasure', ...), and dialogue acts ('exclamation', 'apology', ...).

<Domain number>
A number assigned to each domain used for speech synthesis. With $N$ denoting the number of domains, the domains are indexed by $n = 1, \dots, N$. For example, $n = 1$ denotes 'speaker 1', ..., and $n = N$ denotes 'speaker N'.

<Multi-domain context DB>
Holds the utterance context data of speech in a plurality of domains. Letting $K_n$ ($n = 1, \dots, N$) denote the number of utterances in domain $n$, it is expressed by

(equation omitted)

Because of constraints such as the cost of preparing data, the amount of training data (total number of frames) for each domain $n$ is in general not guaranteed to be equal ($K_n \neq K_{n'}$ for $n \neq n'$).

<Multi-domain speech DB>
Holds the speech data of speech in a plurality of domains. Letting $K_n$ ($n = 1, \dots, N$) denote the number of utterances in domain $n$, it is expressed by

(equation omitted)

As with the multi-domain context DB, the amount of training data (total number of frames) for each domain $n$ is not guaranteed to be equal.

<Multi-domain acoustic model>
A DNN acoustic model for speech synthesis that has been modeled and trained so that a single model can synthesize the speech parameters of a plurality of domains. Examples include a multi-speaker acoustic model using speaker codes (Non-Patent Document 2), a multi-speaker acoustic model using a shared hidden layer (SHU) (Non-Patent Document 3), and a multi-dialogue-act acoustic model (Non-Patent Document 4).
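One way to picture how a single model can serve several domains is to feed the domain number to the network alongside the context. The sketch below is an illustration only: it assumes a one-hot domain code concatenated to the per-frame context features, in the spirit of speaker codes. The patent text does not specify this encoding, and the function names are hypothetical.

```python
def one_hot_domain_code(domain_number, num_domains):
    """Return a one-hot list for a 1-based domain number n in 1..N."""
    if not 1 <= domain_number <= num_domains:
        raise ValueError("domain_number must be in 1..num_domains")
    return [1.0 if i == domain_number else 0.0 for i in range(1, num_domains + 1)]


def build_model_input(context_features, domain_number, num_domains):
    """Concatenate per-frame context features with the domain code (assumed layout)."""
    return list(context_features) + one_hot_domain_code(domain_number, num_domains)


# Hypothetical 3-dimensional context feature vector, domain n = 2 of N = 3.
x = build_model_input([0.2, 0.0, 1.0], domain_number=2, num_domains=3)
```

With this layout the same context yields a different input vector for each domain, which is what lets one set of network weights produce domain-dependent speech parameters.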

<Multi-domain homogenization context DB>
A context database used to equalize the amount of training data (total number of frames) of the domains included in the database. It is used as the DNN input vectors for generating pseudo data. Letting $K'_n$ ($n = 1, \dots, N$) denote the number of utterances in domain $n$, it is expressed by

(equation omitted)

<Multi-domain homogenization pseudo-speech DB>
A speech database used to equalize the amount of training data (total number of frames) of the domains included in the database. It is created by generating pseudo-speech data from the multi-domain homogenization context DB. Letting $K'_n$ ($n = 1, \dots, N$) denote the number of utterances in domain $n$, it is expressed by

(equation omitted)

The frames of the multi-domain homogenization context DB and the multi-domain homogenization pseudo-speech DB correspond one to one, so the total number of frames of each domain is the same in the two DBs.

<Multi-domain homogeneous context DB>
A context database obtained by integrating the multi-domain context DB and the multi-domain homogenization context DB. It is used for acoustic model training.

<Multi-domain homogeneous speech DB>
A speech database obtained by integrating the multi-domain speech DB and the multi-domain homogenization pseudo-speech DB. It is used for acoustic model training.

<Multi-domain homogeneous acoustic model>
An acoustic model obtained by acoustic model training using the multi-domain homogeneous context DB and the multi-domain homogeneous speech DB.

≪End of explanation of terms≫
The configuration of the acoustic model generation device 11 of the first embodiment is described below with reference to FIG. 3. As shown in the figure, the acoustic model generation device 11 of this embodiment includes a multi-domain homogenization DB generation unit 111, a multi-domain homogeneous DB generation unit 112, an acoustic model learning unit 913, and a multi-domain homogeneous acoustic model storage unit 114.

The operation of the acoustic model generation device 11 of this embodiment is described below with reference to FIG. 4.

[Multi-domain homogenization DB generation unit 111]
The multi-domain homogenization DB generation unit 111 generates a multi-domain homogenization context DB by adding contexts to the context data of each domain whose total number of frames in the multi-domain context DB is not the maximum, and generates pseudo-speech data for each piece of context data in the multi-domain homogenization context DB to produce a multi-domain homogenization pseudo-speech DB (S111).

[Multi-domain homogeneous DB generation unit 112]
The multi-domain homogeneous DB generation unit 112 integrates the multi-domain speech DB and the multi-domain homogenization pseudo-speech DB to generate a multi-domain homogeneous speech DB, and integrates the multi-domain context DB and the multi-domain homogenization context DB to generate a multi-domain homogeneous context DB (S112).

[Acoustic model learning unit 913]
The acoustic model learning unit 913 trains an acoustic model in the same manner as in the related art, except that it uses the multi-domain homogeneous speech DB in place of the multi-domain speech DB and the multi-domain homogeneous context DB in place of the multi-domain context DB as training data. That is, the acoustic model learning unit 913 trains an acoustic model using the multi-domain homogeneous speech DB and the multi-domain homogeneous context DB as training data, and stores it in the multi-domain homogeneous acoustic model storage unit 114 (S913). The acoustic model trained by the acoustic model learning unit 913 is called the multi-domain homogeneous acoustic model.

The configurations and operations of the multi-domain homogenization DB generation unit 111 and the multi-domain homogeneous DB generation unit 112 are described in more detail below with reference to FIGS. 5, 6, 7, and 8.

[Multi-domain homogenization DB generation unit 111]
As shown in FIG. 5, the multi-domain homogenization DB generation unit 111 includes a context adding unit 1111, a multi-domain homogenization context DB storage unit 1112, a speech parameter generation unit 922, and a multi-domain homogenization pseudo-speech DB storage unit 1113. The context adding unit 1111 generates the multi-domain homogenization context DB by executing step S1111, which includes, for example, the following sub-steps (a) and (b).

(a) Over the domains $n$, the context adding unit 1111 finds the domain $n^*$ whose total number of frames in the multi-domain context DB is the largest, together with that maximum frame count $F_{n^*}$:

$$ n^* = \arg\max_{n} \sum_{k=1}^{K_n} f_k^{(n)}, \qquad F_{n^*} = \sum_{k=1}^{K_{n^*}} f_k^{(n^*)} $$

Here $f_k^{(n)}$ denotes the number of frames of the $k$-th utterance ($k = 1, \dots, K_n$) of domain $n$.

(b) For each domain $n$ other than $n^*$, the context adding unit 1111 adds contexts to the context data of that domain until the domain's total number of frames $F'_n$ reaches $F'_n = F_{n^*}$, thereby generating the multi-domain homogenization context DB. Preferably, the context adding unit 1111 sets $F'_n$ to a suitable value within the range

(equation omitted)

which reduces the amount of data used for pseudo-data generation and hence the computer memory and computation time required for speech parameter generation and acoustic model training. The contexts to be added are taken, for example, from domains other than domain $n$. The context adding unit 1111 stores and holds the generated multi-domain homogenization context DB in the multi-domain homogenization context DB storage unit 1112.
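Sub-steps (a) and (b) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the DB is modeled as a dictionary from domain number to a list of (context, frame count) pairs, contexts are borrowed from the other domains in storage order, and the optional `target_frames` argument stands in for a reduced target $F'_n$. All names are hypothetical.

```python
def total_frames(utterances):
    """Sum the frame counts of a list of (context, frame_count) pairs."""
    return sum(frames for _, frames in utterances)


def build_homogenization_context_db(context_db, target_frames=None):
    """context_db: dict domain_number -> list of (context, frame_count).

    Returns (n_star, f_max, added), where added[n] lists the contexts
    added for domain n, borrowed from the other domains.
    """
    totals = {n: total_frames(utts) for n, utts in context_db.items()}
    n_star = max(totals, key=totals.get)          # sub-step (a)
    f_max = totals[n_star]
    target = f_max if target_frames is None else min(target_frames, f_max)

    added = {}
    for n in context_db:                          # sub-step (b)
        added[n] = []
        if n == n_star:
            continue
        # Candidate contexts come from every domain other than n.
        candidates = [u for m, utts in context_db.items() if m != n for u in utts]
        frames = totals[n]
        for context, f in candidates:
            if frames >= target:
                break
            added[n].append((context, f))
            frames += f
    return n_star, f_max, added


# Hypothetical toy DB: domain 1 has the most frames (220).
db = {1: [("a", 100), ("b", 120)], 2: [("c", 50)], 3: [("d", 80), ("e", 60)]}
n_star, f_max, added = build_homogenization_context_db(db)
```

Because whole utterances are added, a domain may slightly overshoot the target; the sketch also stops early if the candidates run out. Passing a smaller `target_frames` trades balance for less pseudo data to generate.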

For each context in the multi-domain homogenization context DB, the speech parameter generation unit 922 generates speech parameters using the corresponding domain number and a multi-domain acoustic model. Repeating this process yields pseudo-speech data for each piece of context data, which together form the multi-domain homogenization pseudo-speech DB (S922). The multi-domain acoustic model used here is, for example, one trained by the conventional technique. The speech parameter generation unit 922 stores and holds the generated multi-domain homogenization pseudo-speech DB in the multi-domain homogenization pseudo-speech DB storage unit 1113.
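The generation loop of step S922 can be sketched as below. The acoustic model is passed in as a callable taking a context and a domain number; this interface, and the toy stand-in model, are assumptions made for illustration and are not defined in the patent.

```python
def generate_pseudo_speech_db(homogenization_context_db, acoustic_model):
    """homogenization_context_db: dict domain_number -> list of contexts.
    acoustic_model: callable (context, domain_number) -> speech parameters.
    """
    pseudo_speech_db = {}
    for domain_number, contexts in homogenization_context_db.items():
        # One pseudo-speech entry per context, generated for that domain.
        pseudo_speech_db[domain_number] = [
            acoustic_model(context, domain_number) for context in contexts
        ]
    return pseudo_speech_db


def toy_acoustic_model(context, domain_number):
    """Hypothetical stand-in for a trained multi-domain DNN."""
    return [float(len(context)), float(domain_number)]


pseudo_db = generate_pseudo_speech_db(
    {2: ["k o N", "a"], 3: ["h a i"]}, toy_acoustic_model
)
```

Each domain ends up with exactly as many pseudo-speech entries as it has contexts, which mirrors the one-to-one correspondence between the two homogenization DBs.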

[Multi-domain homogeneous DB generation unit 112]
As shown in FIG. 7, the multi-domain homogeneous DB generation unit 112 includes a speech DB integration unit 1121, a multi-domain homogeneous speech DB storage unit 1122, a context DB integration unit 1123, and a multi-domain homogeneous context DB storage unit 1124.

The speech DB integration unit 1121 integrates the multi-domain speech DB and the multi-domain homogenization pseudo-speech DB, and stores and holds the result as the multi-domain homogeneous speech DB in the multi-domain homogeneous speech DB storage unit 1122 (S1121).

The context DB integration unit 1123 integrates the multi-domain context DB and the multi-domain homogenization context DB, and stores and holds the result as the multi-domain homogeneous context DB in the multi-domain homogeneous context DB storage unit 1124 (S1123).
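The two integration steps S1121 and S1123 amount to a per-domain merge. The sketch below assumes each DB is a dictionary from domain number to a list of entries and that integration means appending the homogenization entries after the original ones; the patent does not prescribe a data layout, so these choices and the function name are illustrative.

```python
def integrate_dbs(original_db, homogenization_db):
    """Merge two DBs per domain: original entries first, then added ones."""
    merged = {}
    for n in sorted(set(original_db) | set(homogenization_db)):
        merged[n] = list(original_db.get(n, [])) + list(homogenization_db.get(n, []))
    return merged


# Hypothetical toy DBs: domain 2 receives one pseudo utterance.
speech_db = {1: ["s1", "s2"], 2: ["s3"]}
pseudo_speech_db = {2: ["p1"]}
context_db = {1: ["c1", "c2"], 2: ["c3"]}
homogenization_context_db = {2: ["c1"]}

homogeneous_speech_db = integrate_dbs(speech_db, pseudo_speech_db)
homogeneous_context_db = integrate_dbs(context_db, homogenization_context_db)
```

Using the same merge for speech and context keeps the entries of the two resulting DBs aligned, which the acoustic model training relies on.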

[Speech synthesis device 12]
As shown in FIG. 9, the speech synthesis device 12 of this embodiment includes a text analysis unit 921, a speech parameter generation unit 922, and a speech waveform generation unit 923, which are the same as in the related art, and a multi-domain homogeneous acoustic model storage unit 114, which differs from the related art. As shown in FIG. 10, the speech synthesis device 12 of this embodiment executes steps S921, S922, and S923 in the same manner as in the related art to obtain synthesized speech. It differs from the related art in that it uses the multi-domain homogeneous acoustic model instead of the conventional multi-domain acoustic model.
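The three synthesis steps form a simple pipeline, sketched below. The three components are injected as callables so that swapping the acoustic model is a one-argument change; the callables shown are trivial stand-ins invented for illustration and do not represent real text analysis, acoustic modeling, or waveform generation.

```python
def synthesize(text, domain_number, analyze_text, acoustic_model, generate_waveform):
    """Run text analysis, speech parameter generation, and waveform generation."""
    context = analyze_text(text)                             # S921
    speech_params = acoustic_model(context, domain_number)   # S922
    return generate_waveform(speech_params)                  # S923


# Hypothetical stand-ins for the three units.
def toy_analyze_text(text):
    return text.split()


def toy_acoustic_model(context, domain_number):
    return [len(word) * domain_number for word in context]


def toy_generate_waveform(speech_params):
    return [value / 10 for value in speech_params]


wave = synthesize(
    "hello world", 2, toy_analyze_text, toy_acoustic_model, toy_generate_waveform
)
```

In this framing, the difference from the related art is confined to which model is passed as `acoustic_model`; the surrounding steps are unchanged.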

In this way, the acoustic model generation device 11 of this embodiment uses an acoustic model trained on the heterogeneous multi-domain DBs to generate pseudo-speech data, adds the pseudo-speech data to the multi-domain speech DB, and makes the corresponding additions to the multi-domain context DB, thereby equalizing the amount of data of each domain in the training data. Acoustic model training can then be performed on training data with a uniform amount of data per domain, so that high-quality synthesized speech can be obtained even for a domain for which only a small amount of data is available.

Speech parameters generated by an acoustic model are known to be excessively smoothed compared with the speech parameters of actual human utterances (natural speech). In the first embodiment, over-smoothed speech parameters are added to the training data, so the speech parameters generated from the trained acoustic model may become even smoother. In the present embodiment (the second embodiment), therefore, the pseudo-generated speech parameters are subjected to post-filter processing that brings the over-smoothed speech parameters closer to those of natural speech. This prevents the speech parameters generated from the trained acoustic model from being smoothed further.

≪Explanation of terms≫
<Multi-domain homogenization post-filter pseudo-speech DB>
The result of applying post-filter processing to each piece of speech data in the multi-domain homogenization pseudo-speech DB so that the characteristics of its speech parameters become closer to those of natural speech.

<Multi-domain homogeneous post-filter speech DB>
A speech database obtained by integrating the multi-domain homogenization post-filter pseudo-speech DB and the multi-domain speech DB.

<Multi-domain homogeneous post-filter acoustic model>
An acoustic model obtained by acoustic model training using the multi-domain homogeneous post-filter speech DB and the multi-domain homogeneous context DB.

≪End of explanation of terms≫
As shown in FIG. 11, the acoustic model generation device 21 of this embodiment includes a multi-domain homogenization DB generation unit 111, which is the same as in the first embodiment; a multi-domain homogenization post-filter pseudo-speech DB generation unit 211 and a multi-domain homogeneous DB generation unit 212, which differ from the first embodiment; an acoustic model learning unit 913, which is the same as in the first embodiment and the related art; and a multi-domain homogeneous post-filter acoustic model storage unit 214, which differs from the first embodiment.

The operation of the acoustic model generation device 21 of this embodiment is described below with reference to FIG. 12.

[Multi-domain homogenization DB generation unit 111]
Step S111 is executed in the same manner as in the first embodiment.

[Multi-domain homogenization post-filter pseudo-speech DB generation unit 211]
The multi-domain homogenization post-filter pseudo-speech DB generation unit 211 applies post-filter processing to the multi-domain homogenization pseudo-speech DB to obtain the multi-domain homogenization post-filter pseudo-speech DB (S211).

[Multi-domain homogeneous DB generation unit 212]
This unit is the same as in the first embodiment except that it uses the multi-domain homogenization post-filter pseudo-speech DB instead of the multi-domain homogenization pseudo-speech DB. To distinguish it from the first embodiment, the resulting speech DB is called the multi-domain homogeneous post-filter speech DB. That is, the multi-domain homogeneous DB generation unit 212 integrates the multi-domain speech DB and the multi-domain homogenization post-filter pseudo-speech DB to generate the multi-domain homogeneous post-filter speech DB and, as in the first embodiment, integrates the multi-domain context DB and the multi-domain homogenization context DB to generate the multi-domain homogeneous context DB (S212).

[Acoustic model learning unit 913]
The acoustic model learning unit 913 executes step S913 as in the first embodiment, except that it uses the multi-domain homogeneous post-filter speech DB as learning data instead of the multi-domain homogeneous speech DB. That is, the acoustic model learning unit 913 learns an acoustic model using the multi-domain homogeneous post-filter speech DB and the multi-domain homogeneous context DB as learning data, and stores the acoustic model in the multi-domain homogeneous post-filter acoustic model storage unit 214 (S913). The acoustic model learned by the acoustic model learning unit 913 is called the multi-domain homogeneous post-filter acoustic model to distinguish it from that of the first embodiment.

[Speech synthesis device (not shown)]
The speech synthesis device is the same as in the first embodiment, except that the multi-domain homogeneous post-filter acoustic model is used instead of the multi-domain homogeneous acoustic model.

Hereinafter, the configurations of the multi-domain homogenized post-filter pseudo-speech DB generation unit 211 and the multi-domain homogeneous DB generation unit 212 will be described in more detail with reference to FIGS. 13 and 14.

[Multi-domain homogenized post-filter pseudo-speech DB generation unit 211]
As shown in FIG. 13, the multi-domain homogenized post-filter pseudo-speech DB generation unit 211 includes a post filter 2111 and a multi-domain homogenized post-filter pseudo-speech DB storage unit 2113. The post filter 2111 executes the post-filter processing described above. The post-filter processing may be, for example, variance compensation processing for cepstral features (Non-Patent Document 5), global variance compensation processing (Non-Patent Document 6), or modulation spectrum compensation processing (Non-Patent Document 7). The post filter 2111 stores the obtained multi-domain homogenized post-filter pseudo-speech DB in the multi-domain homogenized post-filter pseudo-speech DB storage unit 2113.
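To make the idea of a variance-restoring post filter concrete, the following is a minimal sketch of a simple per-utterance variance scaling. It is not the method of Non-Patent Documents 5 to 7. It only illustrates the general principle of stretching an over-smoothed parameter trajectory around its mean so that its spread over time matches a target measured on natural speech. The function name `variance_compensation`, the argument `natural_std`, and the toy values are assumptions made for illustration.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def variance_compensation(generated: Sequence[Sequence[float]],
                          natural_std: Sequence[float]) -> List[List[float]]:
    """Rescale each feature dimension of one utterance so that its standard
    deviation over time matches a target value taken from natural speech.

    generated   : frames x dimensions, e.g. generated cepstral coefficients
    natural_std : per-dimension target standard deviation (assumed given)
    """
    num_dims = len(natural_std)
    columns = [[frame[d] for frame in generated] for d in range(num_dims)]
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]

    filtered = []
    for frame in generated:
        row = []
        for d in range(num_dims):
            # Leave constant trajectories untouched to avoid dividing by zero.
            scale = natural_std[d] / stds[d] if stds[d] > 0 else 1.0
            row.append(means[d] + scale * (frame[d] - means[d]))
        filtered.append(row)
    return filtered


# Hypothetical over-smoothed trajectory: 4 frames, 1 feature dimension.
smoothed = [[0.9], [1.0], [1.1], [1.0]]
restored = variance_compensation(smoothed, natural_std=[0.5])
```

The mean of each dimension is preserved and only the spread around it changes, which is why such a filter can be applied to the pseudo-speech DB without shifting the overall level of the generated parameters.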

[Multi-domain homogeneous DB generation unit 212]
As shown in FIG. 14, the multi-domain homogeneous DB generation unit 212 includes a speech DB integration unit 1121 similar to that of the first embodiment, a multi-domain homogeneous post-filter speech DB storage unit 2122 different from the first embodiment, a context DB integration unit 1123 similar to that of the first embodiment, and a multi-domain homogeneous context DB storage unit 1124 similar to that of the first embodiment. The speech DB integration unit 1121 stores the multi-domain homogeneous post-filter speech DB obtained by the integration in the multi-domain homogeneous post-filter speech DB storage unit 2122.
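The integration performed in step S212 can be pictured as a per-domain merge of two databases, carried out once for speech and once for context. The sketch below assumes each DB is a mapping from a domain name to a list of utterance-level entries and that the homogenized DBs hold only the added entries. The names `integrate` and `integrate_all`, the string placeholders, and the consistency check are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Tuple

DB = Dict[str, List[str]]  # domain name -> utterance-level entries (placeholders)


def integrate(original: DB, added: DB) -> DB:
    """Merge two per-domain databases, keeping the original entries first."""
    merged = {domain: list(entries) for domain, entries in original.items()}
    for domain, entries in added.items():
        merged.setdefault(domain, []).extend(entries)
    return merged


def integrate_all(speech_db: DB, pseudo_speech_db: DB,
                  context_db: DB, added_context_db: DB) -> Tuple[DB, DB]:
    """Integrate the speech side and the context side, then check that every
    domain has the same number of speech and context entries."""
    speech = integrate(speech_db, pseudo_speech_db)
    context = integrate(context_db, added_context_db)
    for domain, entries in speech.items():
        if len(entries) != len(context.get(domain, [])):
            raise ValueError(f"speech/context mismatch in domain {domain!r}")
    return speech, context


# Hypothetical toy data: domain B is smaller and receives one pseudo utterance.
speech, context = integrate_all(
    speech_db={"A": ["a1", "a2"], "B": ["b1"]},
    pseudo_speech_db={"B": ["b2_postfiltered"]},
    context_db={"A": ["ctx_a1", "ctx_a2"], "B": ["ctx_b1"]},
    added_context_db={"B": ["ctx_b2"]},
)
```

Keeping the speech and context entries in the same order per domain is a design choice of this sketch, so that the learning step can pair each context with its speech parameters by position.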

As described above, speech parameters generated by an acoustic model are known to be excessively smoothed in the time direction compared with the speech parameters of actual human utterances (natural speech). In the first embodiment, such over-smoothed speech parameters are added to the learning data, which may affect the learning of the acoustic model.

The acoustic model generation device 21 of the present embodiment applies, to the pseudo-generated speech parameters, post-filter processing that brings the over-smoothed speech parameters closer to those of natural speech. This avoids the influence of over-smoothing on the learning of the acoustic model and improves speech quality.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), a RAM and a ROM as memories, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is an example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program required to realize the functions described above, the data required for the processing of this program, and the like (the storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). Data and the like obtained by the processing of these programs are stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data required for the processing of each program are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes predetermined functions (the components referred to above as "... unit", "... means", and the like).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as necessary.

As described above, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer then realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and the acquisition of results. The program in this form includes information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this form, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Each step described in the specification and the claims corresponds to a step of a method for generating various types of information. The various types of information referred to here fall under the "program, etc." defined in Article 2, Paragraph 4 of the Patent Act (a program or other information that is provided for processing by a computer and is equivalent to a program), and therefore fall under a product as defined in Article 2, Paragraph 3, Item 1 of the Patent Act. Accordingly, it goes without saying that the method for generating various types of information described in the specification and the claims corresponds to a method of producing a product as defined in Article 2, Paragraph 3, Item 3 of the Patent Act.

Claims

An acoustic model generation method in which
speech data is data obtained by analyzing and holding the speech parameters of speech included in a database for DNN learning,
context data is data obtained by analyzing and holding the utterance context of the speech included in the database for DNN learning,
a domain is a category representing information, other than the context, that is included in the speech,
a multi-domain speech DB is a database holding the speech data for speech of a plurality of domains, and
a multi-domain context DB is a database holding the context data of the utterances for the speech of the plurality of domains,
the method including:
a first step of adding contexts to the context data of a domain whose total number of frames in the multi-domain context DB is not the maximum, thereby generating a multi-domain homogenized context DB, and of generating speech data in a pseudo manner for the context data of the multi-domain homogenized context DB, thereby generating a multi-domain homogenized pseudo-speech DB;
a second step of integrating the multi-domain speech DB and the multi-domain homogenized pseudo-speech DB to generate a multi-domain homogeneous speech DB, and of integrating the multi-domain context DB and the multi-domain homogenized context DB to generate a multi-domain homogeneous context DB; and
a third step of learning an acoustic model using the multi-domain homogeneous speech DB and the multi-domain homogeneous context DB as learning data.

An acoustic model generation method in which
speech data is data obtained by analyzing and holding the speech parameters of speech included in a database for DNN learning,
context data is data obtained by analyzing and holding the utterance context of the speech included in the database for DNN learning,
a domain is a category representing information, other than the context, that is included in the speech,
a multi-domain speech DB is a database holding the speech data for speech of a plurality of domains, and
a multi-domain context DB is a database holding the context data of the utterances for the speech of the plurality of domains,
the method including:
a first step of adding contexts to the context data of a domain whose total number of frames in the multi-domain context DB is not the maximum, thereby generating a multi-domain homogenized context DB, and of generating speech data in a pseudo manner for the context data of the multi-domain homogenized context DB, thereby generating a multi-domain homogenized pseudo-speech DB;
a second step of obtaining a multi-domain homogenized post-filter pseudo-speech DB from the multi-domain homogenized pseudo-speech DB by post-filter processing;
a third step of integrating the multi-domain speech DB and the multi-domain homogenized post-filter pseudo-speech DB to generate a multi-domain homogeneous post-filter speech DB, and of integrating the multi-domain context DB and the multi-domain homogenized context DB to generate a multi-domain homogeneous context DB; and
a fourth step of learning an acoustic model using the multi-domain homogeneous post-filter speech DB and the multi-domain homogeneous context DB as learning data.

The acoustic model generation method according to claim 1 or 2, wherein,
in the first step,
the contexts are added so that the total number of frames of the domain whose total number of frames in the multi-domain context DB is not the maximum becomes equal to the total number of frames of the domain whose total number of frames in the multi-domain context DB is the maximum.

A speech synthesis method for acquiring a synthesized speech using an acoustic model generated by the acoustic model generation method according to claim 1.

An acoustic model generation device in which
speech data is data obtained by analyzing and holding the speech parameters of speech included in a database for DNN learning,
context data is data obtained by analyzing and holding the utterance context of the speech included in the database for DNN learning,
a domain is a category representing information, other than the context, that is included in the speech,
a multi-domain speech DB is a database holding the speech data for speech of a plurality of domains, and
a multi-domain context DB is a database holding the context data of the utterances for the speech of the plurality of domains,
the device including:
a multi-domain homogenization DB generation unit that adds contexts to the context data of a domain whose total number of frames in the multi-domain context DB is not the maximum, thereby generating a multi-domain homogenized context DB, and that generates speech data in a pseudo manner for the context data of the multi-domain homogenized context DB, thereby generating a multi-domain homogenized pseudo-speech DB;
a multi-domain homogeneous DB generation unit that integrates the multi-domain speech DB and the multi-domain homogenized pseudo-speech DB to generate a multi-domain homogeneous speech DB, and that integrates the multi-domain context DB and the multi-domain homogenized context DB to generate a multi-domain homogeneous context DB; and
an acoustic model learning unit that learns an acoustic model using the multi-domain homogeneous speech DB and the multi-domain homogeneous context DB as learning data.

An acoustic model generation device in which
speech data is data obtained by analyzing and holding the speech parameters of speech included in a database for DNN learning,
context data is data obtained by analyzing and holding the utterance context of the speech included in the database for DNN learning,
a domain is a category representing information, other than the context, that is included in the speech,
a multi-domain speech DB is a database holding the speech data for speech of a plurality of domains, and
a multi-domain context DB is a database holding the context data of the utterances for the speech of the plurality of domains,
the device including:
a multi-domain homogenization DB generation unit that adds contexts to the context data of a domain whose total number of frames in the multi-domain context DB is not the maximum, thereby generating a multi-domain homogenized context DB, and that generates speech data in a pseudo manner for the context data of the multi-domain homogenized context DB, thereby generating a multi-domain homogenized pseudo-speech DB;
a multi-domain homogenized post-filter pseudo-speech DB generation unit that obtains a multi-domain homogenized post-filter pseudo-speech DB from the multi-domain homogenized pseudo-speech DB by post-filter processing;
a multi-domain homogeneous DB generation unit that integrates the multi-domain speech DB and the multi-domain homogenized post-filter pseudo-speech DB to generate a multi-domain homogeneous post-filter speech DB, and that integrates the multi-domain context DB and the multi-domain homogenized context DB to generate a multi-domain homogeneous context DB; and
an acoustic model learning unit that learns an acoustic model using the multi-domain homogeneous post-filter speech DB and the multi-domain homogeneous context DB as learning data.

A speech synthesizer for acquiring a synthesized speech using an acoustic model generated by the acoustic model generator according to claim 5.

A program for causing a computer to execute the acoustic model generation method according to claim 1.

A program for causing a computer to execute the speech synthesis method according to claim 4.