JP6622505B2

JP6622505B2 - Acoustic model learning device, speech synthesis device, acoustic model learning method, speech synthesis method, program

Info

Publication number: JP6622505B2
Application number: JP2015153948A
Authority: JP
Inventors: 伸克北条; 勇祐井島; 昇宮崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-08-04
Filing date: 2015-08-04
Publication date: 2019-12-18
Anticipated expiration: 2035-08-04
Also published as: JP2017032839A

Description

The present invention relates to an acoustic model learning device that learns a deep neural network acoustic model from speech data, a speech synthesizer that generates synthesized speech using the learned deep neural network acoustic model, an acoustic model learning method, a speech synthesis method, and a program.

One technique for generating synthesized speech of a target speaker from that speaker's speech data is based on a DNN (deep neural network) (Non-Patent Document 1). The configurations and operations of the acoustic model learning device and the speech synthesizer of Non-Patent Document 1 are described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the acoustic model learning device 91 of that document. FIG. 2 is a block diagram showing the configuration of the speech synthesizer 92 of that document.

As shown in FIG. 1, the acoustic model learning device 91 of Non-Patent Document 1 includes a speaker speech database 911, an acoustic model learning unit 913, and an acoustic model storage unit 914. The speaker speech database 911 includes a speech data storage unit 9111 and a context data storage unit 9112. The speech data storage unit 9111 stores the speech data (speech parameters) of the target speaker in advance. The context data storage unit 9112 stores, in advance, the context data corresponding to the target speaker's speech data. As detailed later, the context data includes at least the phoneme information and accent information of the speech data.

The acoustic model learning unit 913 uses the target speaker's speech data and context data to learn an acoustic model of the target speaker with a DNN (deep neural network), and stores the learned acoustic model (hereinafter called the DNN acoustic model or the deep neural network acoustic model) in the acoustic model storage unit 914.

As shown in FIG. 2, the speech synthesizer 92 of Non-Patent Document 1 includes a text analysis unit 921, a speech parameter generation unit 922, and a speech waveform generation unit 923.

The text analysis unit 921 analyzes the input text (the text data to be synthesized) and obtains the context data described above. The speech parameter generation unit 922 generates speech parameters from the context data using the deep neural network acoustic model stored in the acoustic model storage unit 914. The speech waveform generation unit 923 generates a speech waveform from the generated speech parameters.

Non-Patent Document 1: Zen et al., "Statistical parametric speech synthesis using deep neural networks," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 7962-7966.

To achieve high-quality speech synthesis with a DNN acoustic model, the acoustic model learning unit 913 requires a large amount of speech data and context data from the target speaker. Moreover, a single DNN acoustic model could synthesize the speech of only a single speaker.

Consequently, synthesizing the speech of multiple speakers with DNN-based speech synthesis requires a large amount of speech data and context data for each of those speakers, which makes learning costly.

In addition, obtaining synthesized speech for multiple speakers requires holding as many DNN acoustic models as there are speakers, so memory usage grows with the number of speakers.

An object of the present invention is therefore to provide an acoustic model learning device that can learn, at low cost, a DNN acoustic model that is small in size and can generate synthesized speech of multiple speakers.

The acoustic model learning device of the present invention learns a deep neural network acoustic model for generating the speech parameters required for speech waveform synthesis, using speech data of multiple speakers, context data of the multiple speakers that includes at least phoneme information and accent information of the speech data, and data that identifies a speaker or data that represents the characteristics of a speaker.

According to the acoustic model learning device of the present invention, a DNN acoustic model that is small in size and can generate synthesized speech of multiple speakers can be learned at low cost.

FIG. 1 is a block diagram showing the configuration of the acoustic model learning device of Non-Patent Document 1.
FIG. 2 is a block diagram showing the configuration of the speech synthesizer of Non-Patent Document 1.
FIG. 3 is a block diagram showing the configuration of the acoustic model learning device of the first embodiment.
FIG. 4 is a flowchart showing the operation of the acoustic model learning device of the first embodiment.
FIG. 5 is a block diagram showing the configuration of the speech synthesizer of the first embodiment.
FIG. 6 is a flowchart showing the operation of the speech synthesizer of the first embodiment.
FIG. 7 is a block diagram showing the configuration of the acoustic model learning device of the second embodiment.
FIG. 8 is a flowchart showing the operation of the acoustic model learning device of the second embodiment.
FIG. 9 is a block diagram showing the configuration of the speech synthesizer of the second embodiment.
FIG. 10 is a flowchart showing the operation of the speech synthesizer of the second embodiment.
FIG. 11 is a block diagram showing the configuration of the acoustic model learning device of the third embodiment.
FIG. 12 is a flowchart showing the operation of the acoustic model learning device of the third embodiment.
FIG. 13 is a block diagram showing the configuration of the speech synthesizer of the third embodiment.
FIG. 14 is a flowchart showing the operation of the speech synthesizer of the third embodiment.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and duplicate descriptions are omitted.

The configuration and operation of the acoustic model learning device of the first embodiment are described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the acoustic model learning device 11 of this embodiment. FIG. 4 is a flowchart showing the operation of the acoustic model learning device 11 of this embodiment. It differs from the acoustic model learning device 91 of Non-Patent Document 1 in that the acoustic model learning device 11 of this embodiment makes use of data that identifies the speaker.

As shown in FIG. 3, the acoustic model learning device 11 of this embodiment includes a multi-speaker speech database 111, an acoustic model learning unit 113, and an acoustic model storage unit 914. The multi-speaker speech database 111 includes, for each of a plurality of speakers (N speakers, where N is an integer of 2 or more), speech data storage units 1111-1, ..., 1111-N that store the speech data of each speaker, and context data storage units 1112-1, ..., 1112-N that store the context data corresponding to each speaker's speech data. The speech data is recorded speech in which the N speakers targeted for learning the DNN acoustic model for speech synthesis each utter multiple sentences. The context data is information, such as pronunciation, assigned to each utterance in the speech data, one item per utterance. The context data holds the utterance information of the speech data and includes at least phoneme information (pronunciation information) and accent information (accent type and accent phrase length). The context data may also include other information, such as part-of-speech information. The acoustic model storage unit 914 is the same as the component of the same name in the acoustic model learning device 91 of Non-Patent Document 1 described above.

The acoustic model learning unit 113 uses the speech data of the multiple speakers and the corresponding context data, together with data that identifies the speaker, to learn a DNN acoustic model for generating the speech parameters required for speech waveform synthesis, and stores the learned DNN acoustic model in the acoustic model storage unit 914 (S113). The data that identifies the speaker is information (data) identifying the speaker who uttered a given piece of speech data. For example, a speaker code, which expresses the data identifying the speaker as a numeric vector, can be used. The speaker code can be a vector that expresses, in 1-of-K representation, which of the N speakers produced the utterance. A 1-of-K representation is one in which exactly one element of the vector is 1 and all other elements are 0.

That is, the acoustic model learning unit 113 learns a DNN acoustic model whose input is the concatenation of a linguistic feature vector, which expresses the context data as a numeric vector, and the speaker code, and whose output is the speech parameters corresponding to that speaker and context data (S113).
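The input construction described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names, the 0-based speaker index, and the 4-dimensional linguistic feature vector are all assumptions made for the example.

```python
def one_hot_speaker_code(speaker_index, num_speakers):
    """Return a 1-of-K vector: 1.0 at speaker_index, 0.0 everywhere else."""
    if not 0 <= speaker_index < num_speakers:
        raise ValueError("speaker_index out of range")
    code = [0.0] * num_speakers
    code[speaker_index] = 1.0
    return code


def build_dnn_input(linguistic_features, speaker_index, num_speakers):
    """Concatenate a linguistic feature vector with the speaker code."""
    return list(linguistic_features) + one_hot_speaker_code(speaker_index, num_speakers)


# Hypothetical linguistic feature vector for one frame, uttered by
# speaker 2 (0-based) out of N = 3 training speakers.
frame_features = [0.0, 1.0, 0.5, 0.25]
dnn_input = build_dnn_input(frame_features, speaker_index=2, num_speakers=3)
print(dnn_input)  # [0.0, 1.0, 0.5, 0.25, 0.0, 0.0, 1.0]
```

With this layout, changing only the last N elements of the input switches the speaker while the linguistic part stays the same.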

The configuration and operation of the speech synthesizer 12 of this embodiment are described below with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of the speech synthesizer 12 of this embodiment. FIG. 6 is a flowchart showing the operation of the speech synthesizer 12 of this embodiment. It differs from the speech synthesizer 92 of Non-Patent Document 1 in that the speech synthesizer 12 of this embodiment makes use of data that identifies the speaker.

As shown in FIG. 5, the speech synthesizer 12 of this embodiment includes a text analysis unit 921, a speech parameter generation unit 122, and a speech waveform generation unit 923. The text analysis unit 921 and the speech waveform generation unit 923 operate in the same way as the components of the same names in the speech synthesizer 92 of Non-Patent Document 1 described above. The speech parameter generation unit 122 uses the DNN acoustic model stored in the acoustic model storage unit 914 to generate speech parameters from the context data obtained by analyzing the input text and from the data identifying the speaker (the speaker code) that is input together with the input text (S122). The speech parameters include pitch parameters (such as the fundamental frequency F0) and spectral parameters (such as cepstrum and mel-cepstrum). Specifically, the speech parameter generation unit 122 concatenates the context data and the speaker code to obtain the input vector for the DNN acoustic model. The speech parameter generation unit 122 then feeds the input vector into the DNN acoustic model and generates the speech parameters by forward propagation (S122). As in Non-Patent Document 1, the speech waveform generation unit 923 obtains synthesized speech from the speech parameters by speech waveform generation (S923). Before generating the waveform, the speech waveform generation unit 923 may obtain a speech parameter sequence smoothed in the time direction by using, for example, the maximum likelihood parameter generation (MLPG) algorithm (Reference Non-Patent Document 1). For speech waveform generation, the filter of Reference Non-Patent Document 2, for example, may be used.
(Reference Non-Patent Document 1: Masuko et al., "HMM-based speech synthesis using dynamic features," IEICE Transactions, vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.)
(Reference Non-Patent Document 2: Imai et al., "Mel log spectrum approximation (MLSA) filter for speech synthesis," IEICE Transactions A, vol. J66-A, no. 2, pp. 122-129, Feb. 1983.)
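The forward-propagation step of S122 can be illustrated with a toy feed-forward network. This is a minimal sketch under stated assumptions: the layer sizes, the tanh hidden activation, the linear output layer, and every weight value are hypothetical and are not taken from the embodiment or from Non-Patent Document 1. MLPG smoothing and waveform generation are not shown.

```python
import math


def dense(x, weights, bias):
    """Affine layer; weights holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(x, layers):
    """Forward propagation: tanh on hidden layers, linear output layer."""
    h = list(x)
    for i, (weights, bias) in enumerate(layers):
        h = dense(h, weights, bias)
        if i < len(layers) - 1:
            h = [math.tanh(v) for v in h]
    return h


# Hypothetical model: 4 inputs (2 linguistic features + 2-speaker code),
# one hidden layer of 2 units, 2 outputs standing in for speech parameters.
layers = [
    ([[0.5, -0.5, 1.0, -1.0], [0.1, 0.2, -0.3, 0.4]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]

context = [1.0, 0.0]                           # same text for both speakers
params_a = forward(context + [1.0, 0.0], layers)  # speaker code of speaker A
params_b = forward(context + [0.0, 1.0], layers)  # speaker code of speaker B
print(params_a, params_b)
```

The same context vector yields different output parameters once the speaker code changes, which is the behavior the single shared model relies on.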

According to the acoustic model learning device 11 of this embodiment, because data identifying the speaker (the speaker code) is used in addition to the context data, a DNN acoustic model can be learned that outputs speech parameters reflecting both the corresponding context data and the speaker's characteristics.

This embodiment assumes that the speech parameters contain a component that characterizes the speaker and a component that is common across speakers as Japanese speech. Specifically, the speaker code, which is the 1-of-K representation of each speaker, is used as the input corresponding to the component that characterizes the speaker, and the context data is used as the input corresponding to the component common across speakers as Japanese speech. By supplying speech parameters composed of the speaker-characterizing component and the speaker-common component as the teacher signal, parameter estimators corresponding to each component are learned inside the DNN. This makes it possible for a single DNN acoustic model to synthesize speech for each of the speakers used in learning.

Because Japanese speech takes on diverse speech parameter values across diverse contexts, a large amount of speech data has usually been required to estimate the speech parameters accurately for those contexts. In this embodiment, however, the speech parameters are assumed to contain a component that characterizes the speaker and a component that is common across speakers as Japanese speech, so it suffices to have a sufficient amount of speech data across the multiple speakers taken together, and there is no need to prepare a large amount of speech data for any single speaker. In other words, because the speech data of multiple speakers is used efficiently to learn a single DNN acoustic model, the speech data required for learning can be reduced. Furthermore, because a single acoustic model realizes speech synthesis that reflects the characteristics of multiple speakers, a speech synthesis system that handles many speakers can be realized with less memory.
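The memory argument above can be made concrete with a rough parameter count. The sketch below is illustrative only: the layer sizes (300 linguistic input dimensions, two hidden layers of 512 units, 120 output dimensions) and the choice of N = 10 speakers are assumptions for the example, not figures from this document.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


num_speakers = 10       # hypothetical N
linguistic_dim = 300    # hypothetical linguistic feature dimension
hidden = [512, 512]     # hypothetical hidden layer sizes
output_dim = 120        # hypothetical speech parameter dimension

# One single-speaker model per speaker, as in the conventional approach.
per_speaker = count_parameters([linguistic_dim] + hidden + [output_dim])
separate_total = num_speakers * per_speaker

# One shared model whose input is widened by an N-dimensional speaker code.
shared_total = count_parameters(
    [linguistic_dim + num_speakers] + hidden + [output_dim])

print(separate_total)  # 4783280
print(shared_total)    # 483448
```

Under these assumed sizes, the shared model adds only the first-layer weights attached to the speaker code, so its size stays close to that of one single-speaker model.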

When a speaker code (1-of-K representation) is used as in the first embodiment, speech cannot be synthesized for speakers other than those included in the multi-speaker speech database 111. The second embodiment therefore extracts features of the spectral information of a reference utterance by the target speaker and uses them for model learning and speech synthesis, which enables speech synthesis for any target speaker from whom a reference utterance can be obtained. The configuration and operation of the acoustic model learning device 21 of the second embodiment are described below with reference to FIGS. 7 and 8. FIG. 7 is a block diagram showing the configuration of the acoustic model learning device 21 of this embodiment. FIG. 8 is a flowchart showing the operation of the acoustic model learning device 21 of this embodiment. It differs from the acoustic model learning device 11 of the first embodiment in that the acoustic model learning device 21 of this embodiment makes use of data representing the characteristics of the speaker (a speaker spectrum feature vector).

As shown in FIG. 7, the acoustic model learning device 21 of this embodiment includes a multi-speaker speech database 111, a spectrum feature extraction unit 212, an acoustic model learning unit 213, and an acoustic model storage unit 914. The multi-speaker speech database 111 and the acoustic model storage unit 914 are the same as the components of the same names in the first embodiment.

The spectrum feature extraction unit 212 extracts a reference utterance of each speaker from the speech data storage units 1111-1 to 1111-N and generates each speaker's speaker spectrum feature vector from that speaker's reference utterance (S212). Here, a reference utterance is an utterance by a speaker used in learning, or by the target speaker at synthesis time; it requires no transcription and may be a short utterance. The speaker spectrum feature vector expresses, as a numeric vector, the features of the spectral information found in the speech uttered by that speaker. An i-vector, for example, may be used to generate the speaker spectrum feature vector. The spectrum feature extraction unit 212 may use an i-vector extractor, drawing on, for example, the findings of Reference Non-Patent Document 3.
(Reference Non-Patent Document 3: Dehak, Najim, et al. "Front-end factor analysis for speaker verification." IEEE Transactions on Audio, Speech, and Language Processing 19.4 (2011): 788-798.)

Next, the acoustic model learning unit 213 learns the DNN acoustic model using the speech data of the multiple speakers, the context data of the multiple speakers, and the speaker spectrum feature vectors, which are the data representing the characteristics of the speakers, and stores the learned DNN acoustic model in the acoustic model storage unit 914 (S213).

The configuration and operation of the speech synthesizer 22 of this embodiment are described below with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the configuration of the speech synthesizer 22 of this embodiment. FIG. 10 is a flowchart showing the operation of the speech synthesizer 22 of this embodiment. It differs from the speech synthesizer 12 of the first embodiment in that the speech synthesizer 22 of this embodiment makes use of data representing the characteristics of the speaker (a speaker spectrum feature vector).

As shown in FIG. 9, the speech synthesizer 22 of this embodiment includes a text analysis unit 921, a spectrum feature extraction unit 221, a speech parameter generation unit 222, and a speech waveform generation unit 923. The text analysis unit 921 and the speech waveform generation unit 923 are the same as in the first embodiment. The spectrum feature extraction unit 221 extracts the speaker spectrum feature vector described above from the reference utterance that is input together with the text to be synthesized (S221). As noted above, the reference utterance is an utterance by the target speaker.

The speech parameter generation unit 222 uses the DNN acoustic model stored in the acoustic model storage unit 914 to generate speech parameters from the context data obtained by analyzing the input text and from the speaker spectrum feature vector extracted from the reference utterance (S222).

Because the speech synthesizer 12 of the first embodiment uses a speaker code, a target speaker who is not included in the multi-speaker speech database 111 used for acoustic model learning is unknown at learning time, and that speaker's speech cannot be synthesized. To solve this problem, this embodiment uses a vector that expresses the features of the spectral information of the speech uttered by the speaker (a speaker spectrum feature vector), such as the i-vector used in the fields of speech recognition and speaker identification. As a result, even for a target speaker not included in the multi-speaker speech database 111, the speech of speakers acoustically similar to the target speaker is modeled within the acoustic model, so if a reference utterance of the target speaker can be obtained, speech with spectral features close to those of the target speaker can be synthesized. Synthesized speech can therefore be generated even for a target speaker who is not included in the multi-speaker speech database 111. As noted above, an i-vector, for example, can be used to generate the speaker spectrum feature vector, but the way step S212 is implemented is not limited to this.

The i-vector, a representative technique for extracting a speaker information vector from an utterance in the method of the second embodiment, was proposed for the purpose of speaker adaptation of models in the fields of speaker identification and speech recognition. In those fields, what mattered was that, of the individuality that appears in speech, the individuality of the spectral information be expressed as a vector. In the field of speech synthesis, by contrast, when a speaker information vector is extracted in order to synthesize the speech of a target speaker, it is important that the vector express not only the individuality of the spectral information but also the individuality of the prosodic information, and in this respect the problem is considered to differ from the speech recognition problem. The acoustic model learning device 31 of the third embodiment therefore also includes F0 information in the data representing the characteristics of the speaker. The configuration and operation of the acoustic model learning device 31 of the third embodiment are described below with reference to FIGS. 11 and 12. FIG. 11 is a block diagram showing the configuration of the acoustic model learning device 31 of this embodiment. FIG. 12 is a flowchart showing the operation of the acoustic model learning device 31 of this embodiment. It differs from the acoustic model learning device 21 of the second embodiment in that the acoustic model learning device 31 of this embodiment uses not only the speaker spectrum feature vector but also a speaker prosody feature vector as the data representing the characteristics of the speaker.

As shown in FIG. 11, the acoustic model learning device 31 of this embodiment includes a multi-speaker speech database 111, a spectrum feature extraction unit 212, a prosody feature extraction unit 312, an acoustic model learning unit 313, and an acoustic model storage unit 914. The multi-speaker speech database 111, the spectrum feature extraction unit 212, and the acoustic model storage unit 914 are the same as the components of the same names in the second embodiment.

The prosody feature extraction unit 312 extracts a reference utterance of each speaker from the speech data storage units 1111-1 to 1111-N and generates each speaker's speaker prosody feature vector from that speaker's reference utterance (S312). The speaker prosody feature vector is a vector that expresses, of the individuality that appears in speech, the individuality of the prosodic information. More specifically, it expresses as a numeric vector the features of the prosodic information among the acoustic features found in the speech uttered by that speaker.

The prosody feature extraction unit 312 may, for example, compute the mean and variance of the F0 sequence analyzed from the reference utterance and extract this F0 feature information as the speaker prosody feature vector. The prosody feature extraction unit 312 may also model the prosodic features in more detail using the technique of Reference Non-Patent Document 4. (Reference Non-Patent Document 4: Dehak, Najim, Pierre Dumouchel, and Patrick Kenny. "Modeling prosodic features with joint factor analysis for speaker verification." IEEE Transactions on Audio, Speech, and Language Processing 15.7 (2007): 2095-2103.)
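The mean-and-variance option mentioned above can be sketched as follows. The sketch makes several assumptions that the text does not specify: F0 is given in Hz per frame, unvoiced frames are marked with 0.0 and skipped, and the population variance is used. The function name and the sample F0 track are hypothetical.

```python
from statistics import fmean, pvariance


def speaker_prosody_vector(f0_sequence):
    """Return [mean, variance] of F0 over the voiced frames of an utterance.

    Assumption: frames with F0 <= 0.0 are unvoiced and carry no pitch.
    """
    voiced = [f0 for f0 in f0_sequence if f0 > 0.0]
    if not voiced:
        raise ValueError("reference utterance has no voiced frames")
    return [fmean(voiced), pvariance(voiced)]


# Hypothetical F0 track (Hz) analyzed from a short reference utterance.
f0_track = [0.0, 100.0, 110.0, 120.0, 0.0, 130.0]
prosody_vector = speaker_prosody_vector(f0_track)
print(prosody_vector)  # [115.0, 125.0]
```

Whether to take the statistics on linear or logarithmic F0 is a design choice left open here; the source only states that the mean and variance of the F0 sequence may be used.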

Next, the acoustic model learning unit 313 learns the DNN acoustic model using the speech data of the multiple speakers, the context data of the multiple speakers, the speaker spectrum feature vectors, and the speaker prosody feature vectors, and stores the learned DNN acoustic model in the acoustic model storage unit 914 (S313).

The configuration and operation of the speech synthesizer 32 of this embodiment are described below with reference to FIGS. 13 and 14. FIG. 13 is a block diagram showing the configuration of the speech synthesizer 32 of this embodiment. FIG. 14 is a flowchart showing the operation of the speech synthesizer 32 of this embodiment. The difference from the speech synthesizer 22 of the second embodiment is that the speech synthesizer 32 of this embodiment uses not only the speaker spectrum feature vector but also the speaker prosody feature vector as data representing the characteristics of the speaker.

As shown in FIG. 13, the speech synthesizer 32 of this embodiment includes a text analysis unit 921, a spectrum feature extraction unit 221, a prosodic feature extraction unit 321, a speech parameter generation unit 322, and a speech waveform generation unit 923. The text analysis unit 921, the spectrum feature extraction unit 221, and the speech waveform generation unit 923 are the same as those in the first embodiment. The prosodic feature extraction unit 321 extracts the above-mentioned speaker prosody feature vector from the reference utterance input together with the text for speech synthesis (S321). As described above, the reference utterance is an utterance by the target speaker.

Using the DNN acoustic model stored in the acoustic model storage unit 914, the speech parameter generation unit 322 generates speech parameters from the context data obtained by analyzing the input text, the speaker spectrum feature vector, and the speaker prosody feature vector (S322).
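The generation step can be pictured as a forward pass through a feed-forward network whose input combines the context data with the two speaker vectors. The Python sketch below is a toy illustration under stated assumptions: the layer sizes, the tanh hidden activation, the random placeholder weights, and the split of the output into one pitch value and four spectrum values are all hypothetical and stand in for the learned DNN acoustic model.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(x, layers):
    """tanh on hidden layers, linear output layer."""
    for i, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def random_layer(n_in, n_out, rng):
    """Placeholder weights; a real model would load learned parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
n_in = 3 + 2 + 2   # context + speaker spectrum + speaker prosody (hypothetical sizes)
n_out = 1 + 4      # 1 pitch value + 4 spectrum values (hypothetical sizes)
layers = [random_layer(n_in, 8, rng), random_layer(8, n_out, rng)]

frame_input = [1.0, 0.0, 0.0] + [0.3, -0.7] + [0.5, 0.5]
params = forward(frame_input, layers)
pitch, spectrum = params[0], params[1:]
```

The resulting per-frame pitch and spectrum parameters are what the speech waveform generation unit would consume.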

The acoustic features of a speaker can be classified into spectral features and prosodic features. When only the speaker spectrum feature vector is used, as in the second embodiment, the speaker's spectral features are reflected in the synthesized speech, but the target speaker's prosodic features are not. In this embodiment, using a vector that also expresses the target speaker's prosodic information makes it possible to synthesize speech that also reflects the prosodic features of a speaker who is not included in the multi-speaker speech database 111.

The acoustic model learning device and the speech synthesizer described in the above embodiments may be treated as an acoustic model learning unit and a speech synthesis unit, respectively, and the present invention may be realized as a single piece of hardware that includes these units as its constituent elements.

The speaker code, speaker spectrum feature vector, speaker prosody feature vector, and the like described in the above embodiments share the property of being vectors that express information about a speaker, and may therefore be collectively referred to as speaker information vectors.
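Of these speaker information vectors, the simplest is the speaker-identifying data that the claims describe as a 1-of-K vector. A minimal Python sketch follows; the function name `speaker_code`, the zero-based indexing, and the example sizes are assumptions for illustration.

```python
def speaker_code(speaker_index, num_speakers):
    """Return a 1-of-K vector: 1.0 at the speaker's index, 0.0 elsewhere."""
    if not 0 <= speaker_index < num_speakers:
        raise ValueError("speaker_index must be in [0, num_speakers)")
    code = [0.0] * num_speakers
    code[speaker_index] = 1.0
    return code


# Hypothetical database of K = 4 speakers; select the third speaker (index 2).
code = speaker_code(2, 4)
```

Such a vector can only identify speakers that were present at training time, whereas feature vectors extracted from a reference utterance can also describe speakers outside the database.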

<Supplementary note>
The apparatus of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include a cache memory, registers, and the like); RAM and ROM, which are memories; an external storage device, which is a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the programs needed to realize the functions described above and the data needed for the processing of those programs (the programs need not be stored in the external storage device; for example, they may be stored in a ROM, which is a read-only storage device). Data obtained through the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for the processing of each program are read into memory as necessary and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the constituent elements referred to above as ... unit, ... means, and so on).

The present invention is not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed.

As described above, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on the computer realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute a process according to that program; furthermore, each time a program is transferred from the server computer to this computer, the computer may execute a process according to the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and the acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.

Claims

An acoustic model learning device comprising an acoustic model learning unit that learns a deep neural network acoustic model for generating speech parameters, including a pitch parameter and a spectrum parameter, necessary for speech waveform synthesis, using speech data of a plurality of speakers, context data of the plurality of speakers including at least phoneme information and accent information of the speech data, and data identifying the speaker represented as a 1-of-K vector,
wherein at least the context data of the plurality of speakers including the phoneme information and accent information of the speech data and the data identifying the speaker represented as a 1-of-K vector are input to the input layer of the neural network.

A speech synthesizer comprising:
a text analysis unit that analyzes input text and obtains context data including at least phoneme information and accent information;
a speech parameter generation unit that generates speech parameters including a pitch parameter and a spectrum parameter from the context data obtained by analyzing the input text and from data identifying the speaker, represented as a 1-of-K vector and input together with the input text, using a deep neural network acoustic model that has been learned, using speech data of a plurality of speakers and context data of the speech data of the plurality of speakers, with data identifying the speaker represented as a 1-of-K vector input to the input layer of the neural network, so as to generate speech parameters including a pitch parameter and a spectrum parameter; and
a speech waveform generation unit that generates a speech waveform using the generated speech parameters,
wherein the context data obtained by analyzing the input text and the data identifying the speaker, represented as a 1-of-K vector and input together with the input text, are input to the input layer of the neural network.

The speech synthesizer according to claim 2,
wherein the deep neural network acoustic model is learned using the speech data, the context data, and data representing the characteristics of the speaker, and
the data representing the characteristics of the speaker include at least a speaker prosody feature vector, which is a vector expressing the features of the prosodic information of speech uttered by the speaker.

An acoustic model learning method executed by an acoustic model learning device, the method comprising
a step of learning a deep neural network acoustic model for generating speech parameters, including a pitch parameter and a spectrum parameter, necessary for speech waveform synthesis, using speech data of a plurality of speakers, context data of the plurality of speakers including at least phoneme information and accent information of the speech data, and data identifying the speaker represented as a 1-of-K vector,
wherein at least the context data of the plurality of speakers including the phoneme information and accent information of the speech data and the data identifying the speaker represented as a 1-of-K vector are input to the input layer of the neural network.

A speech synthesis method executed by a speech synthesizer, the method comprising:
a step of analyzing input text to obtain context data including at least phoneme information and accent information;
a step of generating speech parameters including a pitch parameter and a spectrum parameter from the context data obtained by analyzing the input text and from data identifying the speaker, represented as a 1-of-K vector and input together with the input text, using a deep neural network acoustic model that has been learned, using speech data of a plurality of speakers and context data of the speech data of the plurality of speakers, with data identifying the speaker represented as a 1-of-K vector input to the input layer of the neural network, so as to generate speech parameters including a pitch parameter and a spectrum parameter; and
a step of generating a speech waveform using the generated speech parameters,
wherein the context data obtained by analyzing the input text and the data identifying the speaker, represented as a 1-of-K vector and input together with the input text, are input to the input layer of the neural network.

A program that causes a computer to function as the speech synthesizer according to claim 2 or 3.