JP6680933B2

JP6680933B2 - Acoustic model learning device, speech synthesis device, acoustic model learning method, speech synthesis method, program

Info

Publication number: JP6680933B2
Application number: JP2019113938A
Authority: JP
Inventors: 伸克北条; 勇祐井島; 昇宮崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2019-06-19
Filing date: 2019-06-19
Publication date: 2020-04-15
Anticipated expiration: 2035-08-04
Also published as: JP2019179257A

Description

The present invention relates to an acoustic model learning device that learns a deep neural network acoustic model from speech data, a speech synthesis device that generates synthesized speech using the learned deep neural network acoustic model, an acoustic model learning method, a speech synthesis method, and a program.

One technique for generating synthesized speech of a target speaker from that speaker's speech data is based on a deep neural network (DNN) (Non-Patent Document 1). The configuration and operation of the acoustic model learning device and the speech synthesis device of Non-Patent Document 1 are described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the acoustic model learning device 91 of that document. FIG. 2 is a block diagram showing the configuration of the speech synthesis device 92 of that document.

As shown in FIG. 1, the acoustic model learning device 91 of Non-Patent Document 1 includes a speaker speech database 911, an acoustic model learning unit 913, and an acoustic model storage unit 914. The speaker speech database 911 includes a speech data storage unit 9111 and a context data storage unit 9112. The speech data storage unit 9111 stores the speech data (speech parameters) of the target speaker in advance. The context data storage unit 9112 stores in advance the context data corresponding to the speech data of the target speaker. As described in detail later, the context data includes at least phoneme information and accent information of the speech data.

The acoustic model learning unit 913 uses the speech data and context data of the target speaker to learn an acoustic model of the target speaker with a deep neural network (DNN), and stores the learned acoustic model (hereinafter called the DNN acoustic model or the deep neural network acoustic model) in the acoustic model storage unit 914.

As shown in FIG. 2, the speech synthesis device 92 of Non-Patent Document 1 includes a text analysis unit 921, a speech parameter generation unit 922, and a speech waveform generation unit 923.

The text analysis unit 921 analyzes the input text (the text data to be synthesized) and obtains the context data described above. The speech parameter generation unit 922 uses the deep neural network acoustic model stored in the acoustic model storage unit 914 to generate speech parameters from the context data. The speech waveform generation unit 923 generates a speech waveform from the generated speech parameters.

Non-Patent Document 1: Zen et al., "Statistical parametric speech synthesis using deep neural networks." 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 7962-7966.

Achieving high-quality speech synthesis with a DNN acoustic model requires the acoustic model learning unit 913 to be given a large amount of speech data and context data for the target speaker. In addition, a single DNN acoustic model could synthesize the voice of only one speaker.

Consequently, synthesizing the voices of multiple speakers with DNN-based speech synthesis requires a large amount of speech data and context data for each speaker, which makes training costly.

Furthermore, obtaining synthesized speech for multiple speakers requires holding one DNN acoustic model per speaker, so memory usage grows with the number of speakers.

An object of the present invention is therefore to provide an acoustic model learning device that can learn, at low cost, a DNN acoustic model that is small in size and can generate synthesized speech for multiple speakers.

The acoustic model learning device of the present invention has an acoustic model learning unit that learns a deep neural network acoustic model for generating the speech parameters needed for speech waveform synthesis, including a pitch parameter and a spectral parameter. The learning uses speech data of a plurality of speakers, context data of the plurality of speakers that includes at least phoneme information and accent information of the speech data, and data identifying the speaker or data representing characteristics of the speaker. The device is characterized in that the context data of the plurality of speakers, including at least the phoneme information and accent information of the speech data, and the data identifying the speaker or the data representing the characteristics of the speaker are input to the input layer of the neural network.

According to the acoustic model learning device of the present invention, a DNN acoustic model that is small in size and can generate synthesized speech for multiple speakers can be learned at low cost.

FIG. 1 is a block diagram showing the configuration of the acoustic model learning device of Non-Patent Document 1.
FIG. 2 is a block diagram showing the configuration of the speech synthesis device of Non-Patent Document 1.
FIG. 3 is a block diagram showing the configuration of the acoustic model learning device of Example 1.
FIG. 4 is a flowchart showing the operation of the acoustic model learning device of Example 1.
FIG. 5 is a block diagram showing the configuration of the speech synthesis device of Example 1.
FIG. 6 is a flowchart showing the operation of the speech synthesis device of Example 1.
FIG. 7 is a block diagram showing the configuration of the acoustic model learning device of Example 2.
FIG. 8 is a flowchart showing the operation of the acoustic model learning device of Example 2.
FIG. 9 is a block diagram showing the configuration of the speech synthesis device of Example 2.
FIG. 10 is a flowchart showing the operation of the speech synthesis device of Example 2.
FIG. 11 is a block diagram showing the configuration of the acoustic model learning device of Example 3.
FIG. 12 is a flowchart showing the operation of the acoustic model learning device of Example 3.
FIG. 13 is a block diagram showing the configuration of the speech synthesis device of Example 3.
FIG. 14 is a flowchart showing the operation of the speech synthesis device of Example 3.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate descriptions are omitted.

The configuration and operation of the acoustic model learning device of Example 1 are described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the acoustic model learning device 11 of this example. FIG. 4 is a flowchart showing the operation of the acoustic model learning device 11 of this example. It differs from the acoustic model learning device 91 of Non-Patent Document 1 in that the acoustic model learning device 11 of this example makes use of data identifying the speaker.

As shown in FIG. 3, the acoustic model learning device 11 of this example includes a multi-speaker speech database 111, an acoustic model learning unit 113, and an acoustic model storage unit 914. For each of a plurality of speakers (N speakers, where N is an integer of 2 or more), the multi-speaker speech database 111 includes speech data storage units 1111-1, ..., 1111-N that store each speaker's speech data, and context data storage units 1112-1, ..., 1112-N that store the context data corresponding to each speaker's speech data. The speech data is speech in which each of the N speakers, for whom the DNN acoustic model for speech synthesis is to be learned, utters multiple sentences. The context data is information such as pronunciation, assigned one item per utterance in the speech data. The context data holds the utterance information of the speech data and includes at least phoneme information (pronunciation information) and accent information (accent type, accent phrase length). The context data may also include other information such as part-of-speech information. The acoustic model storage unit 914 is the same as the component of the same name in the acoustic model learning device 91 of Non-Patent Document 1 described above.

The acoustic model learning unit 113 uses the speech data of the plurality of speakers, the corresponding context data, and data identifying the speaker to learn a DNN acoustic model for generating the speech parameters needed for speech waveform synthesis, and stores the learned DNN acoustic model in the acoustic model storage unit 914 (S113). The data identifying the speaker is information (data) that identifies the speaker who read out a given item of speech data. For example, a speaker code, which expresses the data identifying the speaker as a numerical vector, can be used. The speaker code can be a vector that expresses, in 1-of-K representation, which of the N speakers produced the utterance. A 1-of-K representation is one in which exactly one element of the vector is 1 and all other elements are 0.

That is, the acoustic model learning unit 113 learns a DNN acoustic model whose input is the concatenation of a linguistic feature vector, which expresses the context data as a numerical vector, and the speaker code, and whose output is the speech parameters corresponding to that speaker and context data (S113).
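The input construction described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `speaker_code` and `dnn_input`, the zero-based speaker index, and the 4-dimensional linguistic feature vector are hypothetical and do not come from the patent text.

```python
def speaker_code(speaker_index, num_speakers):
    """Return a 1-of-K vector: 1.0 at speaker_index, 0.0 elsewhere."""
    if not 0 <= speaker_index < num_speakers:
        raise ValueError("speaker_index must be in [0, num_speakers)")
    code = [0.0] * num_speakers
    code[speaker_index] = 1.0
    return code


def dnn_input(linguistic_features, speaker_index, num_speakers):
    """Concatenate the linguistic feature vector with the speaker code."""
    return list(linguistic_features) + speaker_code(speaker_index, num_speakers)


# Hypothetical 4-dimensional linguistic feature vector; second of N = 3 speakers.
x = dnn_input([0.0, 1.0, 0.5, 0.0], speaker_index=1, num_speakers=3)
print(x)  # [0.0, 1.0, 0.5, 0.0, 0.0, 1.0, 0.0]
```

The input dimension grows by N, one element per training speaker, which is why a speaker outside the database has no valid code under this scheme.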

The configuration and operation of the speech synthesis device 12 of this example are described below with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of the speech synthesis device 12 of this example. FIG. 6 is a flowchart showing the operation of the speech synthesis device 12 of this example. It differs from the speech synthesis device 92 of Non-Patent Document 1 in that the speech synthesis device 12 of this example makes use of data identifying the speaker.

As shown in FIG. 5, the speech synthesis device 12 of this example includes a text analysis unit 921, a speech parameter generation unit 122, and a speech waveform generation unit 923. The text analysis unit 921 and the speech waveform generation unit 923 operate in the same way as the components of the same names in the speech synthesis device 92 of Non-Patent Document 1 described above. The speech parameter generation unit 122 uses the DNN acoustic model stored in the acoustic model storage unit 914 to generate speech parameters from the context data obtained by analyzing the input text and from the data identifying the speaker (speaker code) that is input together with the input text (S122). The speech parameters include a pitch parameter (such as the fundamental frequency F0) and a spectral parameter (such as the cepstrum or mel-cepstrum). Specifically, the speech parameter generation unit 122 concatenates the context data and the speaker code to obtain the input vector for the DNN acoustic model. The speech parameter generation unit 122 then feeds the input vector to the DNN acoustic model and generates the speech parameters by forward propagation (S122). As in Non-Patent Document 1, the speech waveform generation unit 923 obtains synthesized speech from the speech parameters by speech waveform generation (S923). Before generating the speech waveform, the speech waveform generation unit 923 may obtain a speech parameter sequence smoothed in the time direction using, for example, the maximum likelihood parameter generation (MLPG) algorithm (Reference Non-Patent Document 1). For speech waveform generation, the method of Reference Non-Patent Document 2 may be used, for example.
(Reference Non-Patent Document 1: Masuko et al., "HMM-based speech synthesis using dynamic features", IEICE Transactions, vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.)
(Reference Non-Patent Document 2: Imai et al., "Mel log spectrum approximation (MLSA) filter for speech synthesis", IEICE Transactions A, vol. J66-A, no. 2, pp. 122-129, Feb. 1983.)
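To make the forward-propagation step S122 concrete, the following is a minimal sketch of a fully connected network applied to a concatenated input vector. The tanh hidden activation, the linear output layer, the layer sizes, and the hand-picked weights are all assumptions for illustration; the patent does not specify the network architecture, and a real model would use learned weights.

```python
import math


def forward(x, layers):
    """Propagate x through fully connected layers.

    layers is a list of (weights, biases); weights[j] holds the incoming
    weights of unit j. Hidden layers use tanh and the last layer is linear
    (assumed here, not taken from the patent).
    """
    h = list(x)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        h = z if i == len(layers) - 1 else [math.tanh(v) for v in z]
    return h


# Toy weights chosen by hand for illustration only.
layers = [
    ([[0.5, 0.0, 0.0, 0.5], [0.0, 0.0, 0.0, 0.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 2.0]),                      # output layer
]

context = [1.0, 0.0]  # hypothetical 2-dimensional context features
out_a = forward(context + [1.0, 0.0], layers)  # speaker code of the first speaker
out_b = forward(context + [0.0, 1.0], layers)  # speaker code of the second speaker
print(out_a, out_b)
```

With the same context features, changing only the speaker code changes the generated parameters, which is the behavior the single multi-speaker model relies on.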

Because the acoustic model learning device 11 of this example makes use of data identifying the speaker (speaker code) in addition to the context data, it can learn a DNN acoustic model that outputs speech parameters reflecting both the corresponding context data and the speaker's individuality.

This example assumes that the speech parameters contain a component that characterizes the speaker and a component that is common across speakers as Japanese speech. Specifically, the speaker code, which is the 1-of-K representation of each speaker, is used as the input corresponding to the speaker-characterizing component, and the context data is used as the input corresponding to the component common across speakers as Japanese speech. By giving speech parameters composed of the speaker-characterizing component and the speaker-common component as the teacher signal, parameter estimators corresponding to each component are learned inside the DNN. This makes it possible for a single DNN acoustic model to synthesize speech for each of the speakers used in training.

Japanese speech takes on diverse speech parameter values across diverse contexts, so a large amount of speech data has normally been needed to estimate the speech parameters accurately for those contexts. In this example, however, the speech parameters are assumed to contain a speaker-characterizing component and a component common across speakers as Japanese speech, so it is enough for a sufficient amount of speech data to exist across the speakers as a whole; there is no need to prepare a large amount of speech data for any single speaker. In other words, the speech data of multiple speakers is used efficiently to learn a single DNN acoustic model, which reduces the speech data needed for training. In addition, because a single acoustic model realizes speech synthesis reflecting the individuality of multiple speakers, a speech synthesis system handling many speakers can be realized with less memory.
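The memory argument can be illustrated with a rough parameter count. The sketch below compares N separate single-speaker networks with one shared network whose input is widened by an N-dimensional speaker code. All sizes are hypothetical, and the comparison assumes the shared model keeps the same hidden layer sizes, which the patent does not state.

```python
def mlp_parameter_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


num_speakers = 10          # hypothetical N
context_dim = 300          # hypothetical linguistic feature dimension
hidden = [512, 512, 512]   # hypothetical hidden layer sizes
output_dim = 50            # hypothetical speech parameter dimension

per_speaker = mlp_parameter_count([context_dim] + hidden + [output_dim])
separate_models = num_speakers * per_speaker
shared_model = mlp_parameter_count(
    [context_dim + num_speakers] + hidden + [output_dim])

print(separate_models, shared_model)
```

Under these assumptions the shared model adds only N times the first hidden layer width in extra weights, whereas holding separate models multiplies the whole parameter count by N.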

When a speaker code (1-of-K representation) is used as in Example 1, speech cannot be synthesized for speakers other than those included in the multi-speaker speech database 111. In Example 2, therefore, features of the spectral information of a reference utterance of the target speaker are extracted and used for model training and speech synthesis, which enables speech synthesis for any target speaker for whom a reference utterance can be obtained. The configuration and operation of the acoustic model learning device 21 of Example 2 are described below with reference to FIGS. 7 and 8. FIG. 7 is a block diagram showing the configuration of the acoustic model learning device 21 of this example. FIG. 8 is a flowchart showing the operation of the acoustic model learning device 21 of this example. It differs from the acoustic model learning device 11 of Example 1 in that the acoustic model learning device 21 of this example makes use of data representing the characteristics of the speaker (a speaker spectral feature vector).

As shown in FIG. 7, the acoustic model learning device 21 of this example includes a multi-speaker speech database 111, a spectral feature extraction unit 212, an acoustic model learning unit 213, and an acoustic model storage unit 914. The multi-speaker speech database 111 and the acoustic model storage unit 914 are the same as the components of the same names in Example 1.

The spectral feature extraction unit 212 extracts a reference utterance of each speaker from the speech data storage units 1111-1 to 1111-N and generates a speaker spectral feature vector for each speaker from that speaker's reference utterance (S212). Here, a reference utterance is an utterance by a speaker used in training or by the target speaker at synthesis time; it needs no transcription and may be a short sentence. The speaker spectral feature vector is a numerical vector representing the characteristics of the spectral information found in the speech uttered by that speaker. An i-vector, for example, may be used as the speaker spectral feature vector. The spectral feature extraction unit 212 may use an i-vector extractor, drawing for example on the findings of Reference Non-Patent Document 3.
(Reference Non-Patent Document 3: Dehak, Najim, et al. "Front-end factor analysis for speaker verification." IEEE Transactions on Audio, Speech, and Language Processing 19.4 (2011): 788-798.)

Next, the acoustic model learning unit 213 learns a DNN acoustic model using the speech data of the plurality of speakers, the context data of the plurality of speakers, and the speaker spectral feature vectors, which are the data representing the characteristics of the speakers, and stores the learned DNN acoustic model in the acoustic model storage unit 914 (S213).

The configuration and operation of the speech synthesis device 22 of this example are described below with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the configuration of the speech synthesis device 22 of this example. FIG. 10 is a flowchart showing the operation of the speech synthesis device 22 of this example. It differs from the speech synthesis device 12 of Example 1 in that the speech synthesis device 22 of this example makes use of data representing the characteristics of the speaker (a speaker spectral feature vector).

As shown in FIG. 9, the speech synthesis device 22 of this example includes a text analysis unit 921, a spectral feature extraction unit 221, a speech parameter generation unit 222, and a speech waveform generation unit 923. The text analysis unit 921 and the speech waveform generation unit 923 are the same as in Example 1. The spectral feature extraction unit 221 extracts the speaker spectral feature vector described above from a reference utterance that is input together with the text to be synthesized (S221). As noted above, the reference utterance is an utterance by the target speaker.

The speech parameter generation unit 222 uses the DNN acoustic model stored in the acoustic model storage unit 914 to generate speech parameters from the context data obtained by analyzing the input text and from the speaker spectral feature vector extracted from the reference utterance (S222).

Because the speech synthesis device 12 of Example 1 uses a speaker code, it cannot synthesize speech for a target speaker who is not included in the multi-speaker speech database 111 used for acoustic model training, since that speaker is unknown at training time. To solve this problem, this example uses a vector that represents the characteristics of the spectral information of the speech uttered by the speaker (the speaker spectral feature vector), such as the i-vector used in the fields of speech recognition and speaker identification. With this approach, even for a target speaker not included in the multi-speaker speech database 111, the speech of speakers acoustically similar to the target speaker has been modeled within the acoustic model, so if a reference utterance of the target speaker can be obtained, speech with spectral characteristics close to those of the target speaker can be synthesized. Synthesized speech can therefore be generated even for a target speaker not included in the multi-speaker speech database 111. As noted above, an i-vector, for example, can be used to generate the speaker spectral feature vector, but the way step S212 is realized is not limited to this.

In the method of Example 2, the i-vector, a representative technique for extracting a speaker information vector from an utterance, was originally proposed for speaker adaptation of models in the fields of speaker identification and speech recognition. In those fields, what mattered was that, among the individual characteristics appearing in speech, the individuality of the spectral information be represented by the vector. In speech synthesis, by contrast, when a speaker information vector is extracted in order to synthesize the speech of a target speaker, it is important that the vector represent not only the individuality of the spectral information but also the individuality of the prosodic information, and this is considered to be a difference from the speech recognition problem. In the acoustic model learning device 31 of Example 3, the data representing the characteristics of the speaker therefore also includes F0 information. The configuration and operation of the acoustic model learning device 31 of Example 3 are described below with reference to FIGS. 11 and 12. FIG. 11 is a block diagram showing the configuration of the acoustic model learning device 31 of this example. FIG. 12 is a flowchart showing the operation of the acoustic model learning device 31 of this example. It differs from the acoustic model learning device 21 of Example 2 in that the acoustic model learning device 31 of this example uses not only the speaker spectral feature vector but also a speaker prosodic feature vector as data representing the characteristics of the speaker.

As shown in FIG. 11, the acoustic model learning device 31 of this example includes a multi-speaker speech database 111, a spectral feature extraction unit 212, a prosodic feature extraction unit 312, an acoustic model learning unit 313, and an acoustic model storage unit 914. The multi-speaker speech database 111, the spectral feature extraction unit 212, and the acoustic model storage unit 914 are the same as the components of the same names in Example 2.

The prosodic feature extraction unit 312 extracts a reference utterance of each speaker from the speech data storage units 1111-1 to 1111-N and generates a speaker prosodic feature vector for each speaker from that speaker's reference utterance (S312). The speaker prosodic feature vector is a vector that represents, among the individual characteristics appearing in speech, the individuality of the prosodic information. More precisely, it is a numerical vector representing the characteristics of the prosodic information among the acoustic characteristics found in the speech uttered by that speaker.

The prosodic feature extraction unit 312 may, for example, compute the mean and variance of the F0 sequence analyzed from the reference utterance and extract this F0 feature information as the speaker prosodic feature vector. The prosodic feature extraction unit 312 may also model the prosodic features in more detail using the method of Reference Non-Patent Document 4. (Reference Non-Patent Document 4: Dehak, Najim, Pierre Dumouchel, and Patrick Kenny. "Modeling prosodic features with joint factor analysis for speaker verification." IEEE Transactions on Audio, Speech, and Language Processing 15.7 (2007): 2095-2103.)
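The simple mean-and-variance option mentioned above might look like the following sketch. Several details are assumptions not found in the patent: F0 is given per frame in Hz, unvoiced frames are marked with a value of 0 and skipped, and the population variance is used. The function name `prosody_feature_vector` is hypothetical.

```python
def prosody_feature_vector(f0_sequence):
    """Return [mean, variance] of F0 over voiced frames.

    Frames with F0 <= 0 are treated as unvoiced and skipped; this
    convention is an assumption, not something the patent specifies.
    """
    voiced = [f for f in f0_sequence if f > 0]
    if not voiced:
        raise ValueError("reference utterance has no voiced frames")
    mean = sum(voiced) / len(voiced)
    variance = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return [mean, variance]


# Hypothetical F0 track of a short reference utterance (Hz, 0.0 = unvoiced).
f0 = [0.0, 110.0, 120.0, 130.0, 0.0]
features = prosody_feature_vector(f0)
print(features)
```

A two-element vector like this captures only the overall pitch level and range of a speaker; finer prosodic habits would need the more detailed modeling the text refers to.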

Next, the acoustic model learning unit 313 learns a DNN acoustic model using the voice data of the plurality of speakers, the context data of the plurality of speakers, the speaker spectrum feature vectors, and the speaker prosody feature vectors, and stores the learned DNN acoustic model in the acoustic model storage unit 914 (S313).

The configuration and operation of the speech synthesis device 32 of this embodiment will be described below with reference to FIGS. 13 and 14. FIG. 13 is a block diagram showing the configuration of the speech synthesis device 32 of this embodiment. FIG. 14 is a flowchart showing the operation of the speech synthesis device 32 of this embodiment. The speech synthesis device 32 differs from the speech synthesis device 22 of the second embodiment in that it uses not only the speaker spectrum feature vector but also the speaker prosody feature vector as data representing the characteristics of the speaker.

As shown in FIG. 13, the speech synthesis device 32 of this embodiment includes a text analysis unit 921, a spectrum feature extraction unit 221, a prosody feature extraction unit 321, a speech parameter generation unit 322, and a speech waveform generation unit 923. The text analysis unit 921, the spectrum feature extraction unit 221, and the speech waveform generation unit 923 are the same as in the first embodiment. The prosody feature extraction unit 321 extracts the above-described speaker prosody feature vector from the reference utterance that is input together with the text for speech synthesis (S321). As described above, the reference utterance is an utterance by the target speaker.

The speech parameter generation unit 322 uses the DNN acoustic model stored in the acoustic model storage unit 914 to generate speech parameters from the context data obtained by analyzing the input text, the speaker spectrum feature vector, and the speaker prosody feature vector (S322).
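The data flow of step S322 can be illustrated with a small sketch: the context data for one frame and the two speaker vectors are supplied to the input layer, and a feed-forward network maps them to a speech parameter vector. Everything concrete here is an assumption made for illustration, not the patent's implementation: combining the inputs by simple concatenation, the layer sizes, the tanh activation, the randomly initialised weights standing in for a trained model, and all names (`build_dnn_input`, `init_layers`, `forward`).

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def build_dnn_input(context: Sequence[float],
                    spectrum_vec: Sequence[float],
                    prosody_vec: Sequence[float]) -> List[float]:
    """Concatenate one frame's context features with the speaker vectors."""
    return list(context) + list(spectrum_vec) + list(prosody_vec)


def init_layers(sizes: Sequence[int], seed: int = 0) -> List[Layer]:
    """Create randomly initialised layers (a stand-in for a trained model)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(x: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Feed-forward pass: tanh on hidden layers, linear output layer."""
    h = list(x)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        h = z if i == len(layers) - 1 else [math.tanh(v) for v in z]
    return h


# Hypothetical inputs: 4 context features, a 3-dimensional speaker spectrum
# feature vector, and a 2-dimensional speaker prosody feature vector.
context = [1.0, 0.0, 0.0, 1.0]
spectrum_vec = [0.3, -0.2, 0.5]
prosody_vec = [120.0, 50.0]

x = build_dnn_input(context, spectrum_vec, prosody_vec)
layers = init_layers([len(x), 8, 5])   # 5 = hypothetical speech parameter size
speech_params = forward(x, layers)
```

Because the speaker vectors are ordinary inputs rather than per-speaker model parameters, the same network can be driven with vectors extracted from a reference utterance of a speaker that was not in the training data, which is the property the embodiment relies on.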

The acoustic characteristics of a speaker can be classified into spectral features and prosodic features. When a speaker spectrum feature vector is used as in the second embodiment, the spectral features of that speaker are reflected in the synthesized speech, but the prosodic features of the target speaker are not. In this embodiment, by using a vector that also expresses the prosody information of the target speaker, it becomes possible to synthesize speech that also reflects the prosodic features of a speaker who is not included in the multi-speaker speech database 111.

Note that the present invention may also be realized as a single piece of hardware that includes, as components, the acoustic model learning device and the speech synthesis device described in the above embodiments as an acoustic model learning unit and a speech synthesis unit, respectively.

In addition, the speaker code, the speaker spectrum feature vector, the speaker prosody feature vector, and the like described in the above embodiments share the property of being vectors that express information about a speaker, and may therefore be collectively referred to as speaker information vectors.

<Additional notes>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) capable of reading from and writing to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program required to realize the functions described above, the data required for the processing of this program, and the like (storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). Data and the like obtained by the processing of these programs are stored in the RAM, the external storage device, or the like as appropriate.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data required for processing each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above, and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As described above, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

Claims

An acoustic model learning device comprising an acoustic model learning unit that learns, using voice data of a plurality of speakers, context data of the plurality of speakers including at least phoneme information and accent information of the voice data, and data specifying a speaker or data representing characteristics of the speaker, a deep neural network acoustic model for generating speech parameters including a pitch parameter and a spectrum parameter required for speech waveform synthesis,
wherein the context data of the plurality of speakers including at least the phoneme information and the accent information of the voice data, and the data specifying the speaker or the data representing characteristics of the speaker, are input to an input layer of the neural network.

The acoustic model learning device according to claim 1, wherein
the acoustic model learning unit
learns the deep neural network acoustic model using the voice data of the plurality of speakers, the context data of the plurality of speakers, and the data specifying the speaker expressed as a 1-of-K representation vector, and
the context data of the plurality of speakers and the data specifying the speaker expressed as the 1-of-K representation vector are input to the input layer of the neural network.

The acoustic model learning device according to claim 1, wherein
the acoustic model learning unit
learns the deep neural network acoustic model using the voice data of the plurality of speakers, the context data of the plurality of speakers, and, as the data representing characteristics of the speaker, a speaker spectrum feature vector expressed as an i-vector and a speaker prosody feature vector representing F0 feature information, and
the context data of the plurality of speakers, and, as the data representing characteristics of the speaker, the speaker spectrum feature vector expressed as the i-vector and the speaker prosody feature vector representing the F0 feature information, are input to the input layer of the neural network.

A speech synthesis device comprising:
a text analysis unit that analyzes input text and acquires context data including at least phoneme information and accent information;
a speech parameter generation unit that, using a deep neural network acoustic model learned using voice data of a plurality of speakers, context data of the voice data of the plurality of speakers, and data specifying a speaker or data representing characteristics of the speaker, generates speech parameters including a pitch parameter and a spectrum parameter from the context data obtained by analyzing the input text and from the data specifying the speaker or the data representing characteristics of the speaker that is input together with the input text; and
a speech waveform generation unit that generates a speech waveform using the generated speech parameters,
wherein the context data of the plurality of speakers including at least phoneme information and accent information of the voice data, and the data specifying the speaker or the data representing characteristics of the speaker, are input to an input layer of the neural network.

The speech synthesis device according to claim 4, wherein
the deep neural network acoustic model
is learned using the voice data of the plurality of speakers, the context data of the plurality of speakers, and the data specifying the speaker expressed as a 1-of-K representation vector, and
the context data of the plurality of speakers and the data specifying the speaker expressed as the 1-of-K representation vector are input to the input layer of the neural network.

The speech synthesis device according to claim 4, wherein
the deep neural network acoustic model
is learned using the voice data of the plurality of speakers, the context data of the plurality of speakers, and, as the data representing characteristics of the speaker, a speaker spectrum feature vector expressed as an i-vector and a speaker prosody feature vector representing F0 feature information, and
the context data of the plurality of speakers, and, as the data representing characteristics of the speaker, the speaker spectrum feature vector expressed as the i-vector and the speaker prosody feature vector representing the F0 feature information, are input to the input layer of the neural network.

An acoustic model learning method executed by an acoustic model learning device, comprising
a step of learning, using voice data of a plurality of speakers, context data of the plurality of speakers including at least phoneme information and accent information of the voice data, and data specifying a speaker or data representing characteristics of the speaker, a deep neural network acoustic model for generating speech parameters including a pitch parameter and a spectrum parameter required for speech waveform synthesis,
wherein the context data of the plurality of speakers including at least the phoneme information and the accent information of the voice data, and the data specifying the speaker or the data representing characteristics of the speaker, are input to an input layer of the neural network.

A speech synthesis method executed by a speech synthesis device, comprising:
a step of analyzing input text to acquire context data including at least phoneme information and accent information;
a step of generating, using a deep neural network acoustic model learned using voice data of a plurality of speakers, context data of the voice data of the plurality of speakers, and data specifying a speaker or data representing characteristics of the speaker, speech parameters including a pitch parameter and a spectrum parameter from the context data obtained by analyzing the input text and from the data specifying the speaker or the data representing characteristics of the speaker that is input together with the input text; and
a step of generating a speech waveform using the generated speech parameters,
wherein the context data of the plurality of speakers including at least phoneme information and accent information of the voice data, and the data specifying the speaker or the data representing characteristics of the speaker, are input to an input layer of the neural network.

A program that causes a computer to function as the acoustic model learning device according to claim 1.

A program that causes a computer to function as the speech synthesis device according to claim 4.