JP6542823B2

JP6542823B2 - Acoustic model learning device, speech synthesizer, method thereof and program

Info

Publication number: JP6542823B2
Application number: JP2017042430A
Authority: JP
Inventors: 伸克北条; 勇祐井島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-03-07
Filing date: 2017-03-07
Publication date: 2019-07-10
Anticipated expiration: 2037-03-07
Also published as: JP2018146821A

Description

The present invention relates to a speech synthesis device that synthesizes speech using spectral envelope information and fundamental frequency (hereinafter also referred to as "F₀") information, to an acoustic model learning device that learns the acoustic model used for speech synthesis, and to methods and programs therefor.

One approach to learning a speech synthesis model from speech data and generating synthesized speech is based on a deep neural network (DNN) (see Non-Patent Document 1). FIG. 1 is a functional block diagram of an acoustic model learning device 80 according to the prior art, and FIG. 2 is a functional block diagram of a speech synthesizer 90 according to the prior art.

The spectral envelope/F₀ vector data creation unit 82 creates spectral envelope/F₀ data {x₁,x₂,…,x_N} from F₀ data {f₁,f₂,…,f_N} and spectral envelope data {s₁,s₂,…,s_N}, where N is the total number of learning speech data items and n=1,2,…,N. In the figures, {f₁,f₂,…,f_N} and the like are written as f_n and the like. The language feature vector data creation unit 81 creates language feature vector data {l₁,l₂,…,l_N} from context data {t₁,t₂,…,t_N}. The spectral envelope generation model/conversion parameter learning unit 84 learns a spectral envelope/F₀ generation DNN from the spectral envelope/F₀ data {x₁,x₂,…,x_N} and the language feature vector data {l₁,l₂,…,l_N}.

In the speech synthesizer 90, the text analysis unit 91 analyzes the text tex_o to be synthesized and obtains the context t_o. The language feature vector extraction unit 92 extracts the language feature vector l_o from the context t_o. The spectral envelope generation unit 94 uses the spectral envelope/F₀ generation DNN to generate spectral envelope information s_o and F₀ information f_o from the language feature vector l_o. The speech waveform generation unit 95 generates the synthesized speech waveform z_o from the obtained spectral envelope information s_o and F₀ information f_o.

[Non-Patent Document 1] Zen et al., "Statistical parametric speech synthesis using deep neural networks", Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 7962-7966.

In human speech, spectral envelope information and F₀ information are known to depend on each other. Reproducing this dependency makes it possible to improve the quality of synthesized speech.

In the prior art, however, a single DNN takes the language feature vector obtained from the context as input and outputs both the spectral envelope information and the F₀ information, so the dependency between the spectral envelope information and the F₀ information is not modeled explicitly. Speech quality is therefore considered to have room for improvement.

An object of the present invention is to provide a speech synthesis device that explicitly models the dependency between spectral envelope information and F₀ information and synthesizes speech of higher quality than before, an acoustic model learning device that learns the acoustic model for that purpose, and methods and programs therefor.

To solve the above problem, according to one aspect of the present invention, an acoustic model learning device includes a fundamental frequency generation model learning unit and a spectral envelope generation model learning unit, where N is the total number of learning speech data items, N is an integer of 1 or more, and n=1,2,…,N. The fundamental frequency generation model learning unit uses N pieces of fundamental frequency information f_{L,n}, each indicating the fundamental frequency of one of the N learning speech data items, and N language feature vectors l_{L,n}, each representing the context of one of the N learning speech data items as a numerical vector, to learn a fundamental frequency generation model that takes a language feature vector as input and outputs the corresponding fundamental frequency information. The spectral envelope generation model learning unit uses the N pieces of fundamental frequency information f_{L,n}, the N language feature vectors l_{L,n}, and spectral envelope information s_{L,n}, each indicating the spectral envelope of one of the N learning speech data items, to learn a spectral envelope generation model that takes fundamental frequency information and a language feature vector as input and outputs spectral envelope information.

To solve the above problem, according to another aspect of the present invention, an acoustic model learning method includes a fundamental frequency generation model learning step and a spectral envelope generation model learning step, where N is the total number of learning speech data items, N is an integer of 1 or more, and n=1,2,…,N. The fundamental frequency generation model learning step uses N pieces of fundamental frequency information f_{L,n}, each indicating the fundamental frequency of one of the N learning speech data items, and N language feature vectors l_{L,n}, each representing the context of one of the N learning speech data items as a numerical vector, to learn a fundamental frequency generation model that takes a language feature vector as input and outputs the corresponding fundamental frequency information. The spectral envelope generation model learning step uses the N pieces of fundamental frequency information f_{L,n}, the N language feature vectors l_{L,n}, and spectral envelope information s_{L,n}, each indicating the spectral envelope of one of the N learning speech data items, to learn a spectral envelope generation model that takes fundamental frequency information and a language feature vector as input and outputs spectral envelope information.

The present invention has the effect of making it possible to synthesize speech of higher quality than before.

FIG. 1 is a functional block diagram of an acoustic model learning device according to the prior art.
FIG. 2 is a functional block diagram of a speech synthesizer according to the prior art.
FIG. 3 is a functional block diagram of the acoustic model learning device according to the first embodiment.
FIG. 4 shows an example of the processing flow of the acoustic model learning device according to the first embodiment.
FIG. 5 is a functional block diagram of the speech synthesizer according to the first embodiment.
FIG. 6 shows an example of the processing flow of the speech synthesizer according to the first embodiment.
FIG. 7 is a functional block diagram of the acoustic model learning device according to the second embodiment.
FIG. 8 shows an example of the processing flow of the acoustic model learning device according to the second embodiment.
FIG. 9 is a functional block diagram of the speech synthesizer according to the second embodiment.
FIG. 10 shows an example of the processing flow of the speech synthesizer according to the second embodiment.
FIG. 11 is a functional block diagram of the acoustic model learning device according to the third embodiment.
FIG. 12 shows an example of the processing flow of the acoustic model learning device according to the third embodiment.
FIG. 13 is a functional block diagram of the speech synthesizer according to the third embodiment.
FIG. 14 shows an example of the processing flow of the speech synthesizer according to the third embodiment.
FIG. 15 is a functional block diagram of the acoustic model learning device according to the fourth embodiment.
FIG. 16 shows an example of the processing flow of the acoustic model learning device according to the fourth embodiment.
FIG. 17 is a functional block diagram of the speech synthesizer according to the fourth embodiment.
FIG. 18 shows an example of the processing flow of the speech synthesizer according to the fourth embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. In the following description, processing performed on each element of a vector or matrix applies to all elements of that vector or matrix unless otherwise noted.

<Key point of the first embodiment>
In this embodiment, F₀ information is used as an input to the DNN that generates spectral envelope information. The DNN is configured to take F₀ information as synthesizer input in addition to conventional context such as reading and accent, and to output spectral envelope information that reflects the corresponding F₀ information. This configuration makes it possible to generate spectral envelope information that reflects its dependency on the F₀ information. Because the generated spectral envelope information and the F₀ information satisfy this dependency, the naturalness of the synthesized speech improves.

<Overall configuration>
This embodiment consists of an acoustic model learning device 110 and a speech synthesizer 120. FIGS. 3, 4, 5, and 6 show, respectively, a functional block diagram of the acoustic model learning device 110, its processing flow, a functional block diagram of the speech synthesizer 120, and its processing flow.

The acoustic model learning device 110 uses the F₀ data {f₁,f₂,…,f_N}, the spectral envelope data {s₁,s₂,…,s_N}, and the context data {t₁,t₂,…,t_N} to learn an F₀ generation DNN (also written DNN_f in the figures) and a spectral envelope generation DNN (also written DNN_s in the figures).

The speech synthesizer 120 first generates F₀ information f_o from the F₀ generation DNN and the language feature vector l_o, which is obtained by text analysis and language feature vector extraction on the input text tex_o. It then generates spectral envelope information s_o from the language feature vector l_o, the generated F₀ information f_o, and the spectral envelope generation DNN.

<Terminology and description of the data used>
・F₀ data and spectral envelope data
The F₀ data and the spectral envelope data hold, for all N learning speech data items, the F₀ information (pitch) f_n and the spectral envelope information (cepstrum, mel-cepstrum, or the like) s_n of each utterance. Both are obtained by applying signal processing to the speech signal of the speech data used for acoustic model learning (hereinafter also referred to as learning speech data). Using the number of utterances N in the context data, the F₀ data is written {f₁,f₂,…,f_N} and the spectral envelope data is written {s₁,s₂,…,s_N}.

For example, when the n-th of the N learning speech data items is T_n frames long, the F₀ information f_n is data holding the pitch at each frame time and is a 1×T_n-dimensional real vector. Alternatively, it may be a 2×T_n-dimensional real vector that also includes voiced/unvoiced information.
For example, the spectral envelope information s_n is data holding the phonological information at each frame time of the n-th learning speech data item; only the low-order coefficients of the extracted cepstrum or mel-cepstrum may be used. When utterance n is T_n frames long, using an M-dimensional mel-cepstrum gives, for example, an M×T_n-dimensional real vector.
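The two representations above can be sketched as follows. This is only an illustration under stated assumptions: the convention that unvoiced frames carry F₀ <= 0, the helper names `f0_with_voicing` and `keep_low_order`, and the sample values are all hypothetical and do not come from the patent. The cepstra are stored frames-first (T_n rows of M coefficients), which is the transpose of the M×T_n layout in the text.

```python
from typing import List


def f0_with_voicing(f0_track: List[float]) -> List[List[float]]:
    """Build a 2 x T_n representation: row 0 holds F0, row 1 a voiced flag.

    Assumption (not from the patent): unvoiced frames are marked by F0 <= 0.
    """
    pitch = [f if f > 0.0 else 0.0 for f in f0_track]
    voiced = [1.0 if f > 0.0 else 0.0 for f in f0_track]
    return [pitch, voiced]


def keep_low_order(cepstra: List[List[float]], m: int) -> List[List[float]]:
    """Keep only the first m (low-order) coefficients of every frame."""
    return [frame[:m] for frame in cepstra]


# T_n = 5 frames of made-up F0 values in Hz; 0.0 marks an unvoiced frame.
f_n = f0_with_voicing([0.0, 120.5, 123.0, 0.0, 118.2])

# Two made-up frames of 4 cepstral coefficients, truncated to M = 2.
s_n = keep_low_order([[1.0, 0.5, 0.1, 0.01], [0.9, 0.4, 0.2, 0.02]], m=2)
```
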

・Context data
The context data holds the context (utterance information) of the learning speech data for all N learning speech data items. For example, the context data is written {t₁,t₂,…,t_N}.
For example, the context t_n is information such as the pronunciation assigned to the n-th learning speech data item. The context must include phoneme information (pronunciation information) and accent information (accent type and accent phrase length). It may also include other information such as part-of-speech information. Information on the start time and end time of each phoneme (phoneme segmentation information) may also be stored.

・Language feature vector
The language feature vector l_n is the context t_n expressed as a numerical vector. For example, as in Non-Patent Document 1, the phoneme information and the accent information are each given a 1-of-K representation and then concatenated with numerical information such as the sentence length to obtain a numerical vector. When the utterance is T_n frames long and, for example, a K-dimensional vector is used per frame, the language feature vector l_n is a K×T_n-dimensional real vector.
The language feature vector data holds the corresponding language feature vector l_n for each utterance included in the context data {t₁,t₂,…,t_N}. Using the number of utterances N in the context data, it is written {l₁,l₂,…,l_N}.
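As a rough illustration of the 1-of-K construction described above, the sketch below encodes one frame at a time. The phoneme inventory, the accent-type classes, and the choice of sentence length as the only numerical feature are hypothetical placeholders; the patent fixes none of them.

```python
from typing import List, Sequence, Tuple

# Hypothetical inventories, chosen only for illustration.
PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "sil"]
ACCENT_TYPES = [0, 1, 2, 3]


def one_hot(value, inventory: Sequence) -> List[float]:
    """1-of-K encoding: 1.0 at the index of value, 0.0 elsewhere."""
    vec = [0.0] * len(inventory)
    vec[inventory.index(value)] = 1.0
    return vec


def frame_features(phoneme: str, accent_type: int, sentence_length: int) -> List[float]:
    """Concatenate the 1-of-K codes with a numerical feature for one frame."""
    return (one_hot(phoneme, PHONEMES)
            + one_hot(accent_type, ACCENT_TYPES)
            + [float(sentence_length)])


# A made-up context with one (phoneme, accent type, sentence length) per frame.
context: List[Tuple[str, int, int]] = [("k", 1, 12), ("a", 1, 12), ("a", 1, 12)]
l_n = [frame_features(*frame) for frame in context]  # T_n frames of K values
```

Here K = 10 + 4 + 1 = 15 and T_n = 3, so `l_n` corresponds to a K×T_n vector stored frames-first.
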

・Language feature/F₀ vector
The language feature/F₀ vector is a vector that holds the information of both the language feature vector l_n and the F₀ information f_n. For example, it is created by concatenating the language feature vector l_n and the F₀ information f_n as x_n=[l_n^T, f_n^T]^T.
The language feature/F₀ vector data is obtained by extracting the language feature/F₀ vector x_n for each of the N learning speech data items and holding the results as data. Using the number of utterances N in the context data, it is written {x₁,x₂,…,x_N}.
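The concatenation x_n=[l_n^T, f_n^T]^T can be sketched frame by frame as below. The frames-first list layout, the function name `concat_features`, and the sample values are assumptions made for illustration.

```python
from typing import List

Frames = List[List[float]]


def concat_features(l_n: Frames, f_n: Frames) -> Frames:
    """Concatenate language features and F0 information for every frame."""
    if len(l_n) != len(f_n):
        raise ValueError("l_n and f_n must cover the same number of frames")
    return [list(l_t) + list(f_t) for l_t, f_t in zip(l_n, f_n)]


l_n = [[1.0, 0.0, 12.0], [0.0, 1.0, 12.0]]  # K = 3 language features per frame
f_n = [[120.5], [123.0]]                    # one F0 value per frame
x_n = concat_features(l_n, f_n)             # K + 1 values per frame
```

Checking the frame counts up front is a design choice of this sketch: a silent `zip` truncation would otherwise hide a misalignment between the context and the F₀ track.
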

<Acoustic model learning device 110 according to the first embodiment>
The acoustic model learning device 110 performs acoustic model learning from the F₀ data, the spectral envelope data, and the context data, and outputs DNN acoustic models. It differs from the conventional algorithm in three points: (1) it creates language feature/F₀ vector data; (2) it learns an F₀ generation DNN that generates only F₀; and (3) it uses the language feature/F₀ vector data as the input of the spectral envelope generation DNN, so that F₀ information as well as the language features is used to generate the spectral envelope.

FIG. 3 shows a functional block diagram of the acoustic model learning device 110 according to the first embodiment, and FIG. 4 shows its processing flow.
For example, the acoustic model learning device 110 is a computer that includes a CPU, a RAM, and a ROM storing a program for executing the processing below, and is functionally configured as follows. The acoustic model learning device 110 includes a language feature vector data creation unit 111, a language feature/F₀ vector data creation unit 112, an F₀ generation model learning unit 113, and a spectral envelope generation model learning unit 114. The processing of each unit is described below.

<Language feature vector data creation unit 111>
The language feature vector data creation unit 111 takes the context data {t₁,t₂,…,t_N} as input, creates a language feature vector l_n for the context t_n of each utterance (S111), and holds the result as language feature vector data {l₁,l₂,…,l_N}.

<Language feature/F₀ vector data creation unit 112>
The language feature/F₀ vector data creation unit 112 takes the F₀ data {f₁,f₂,…,f_N} and the language feature vector data {l₁,l₂,…,l_N} as input. It concatenates the language feature vector l_n and the F₀ information f_n corresponding to the n-th learning speech data item to create the language feature/F₀ vector x_n=[l_n^T, f_n^T]^T, performs the same processing for all N utterances to create the language feature/F₀ vector data {x₁,x₂,…,x_N} (S112), and holds the result.

<F₀ generation model learning unit 113>
The F₀ generation model learning unit 113 takes the language feature vector data {l₁,l₂,…,l_N} and the F₀ data {f₁,f₂,…,f_N} as input. Using these data, it learns a DNN that takes a language feature vector as input and outputs the corresponding F₀ information (hereinafter also referred to as the F₀ generation model, and also written DNN_f in the figures) (S113), and holds it. Any existing technique may be used to learn the F₀ generation model. For example, apart from the vectors used as input and output, the learning method, model configuration, and so on are the same as in Non-Patent Document 1.

<Spectral envelope generation model learning unit 114>
The spectral envelope generation model learning unit 114 takes the language feature/F₀ vector data {x₁,x₂,…,x_N} and the spectral envelope data {s₁,s₂,…,s_N} as input. Using these data, it learns a spectral envelope generation DNN that takes a language feature/F₀ vector as input and outputs spectral envelope information (hereinafter also referred to as the spectral envelope generation model, and also written DNN_s in the figures) (S114). Any existing technique may be used to learn the spectral envelope generation model. For example, apart from the vectors used as input and output, the learning method, model configuration, and so on are the same as in Non-Patent Document 1.
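The data flow of steps S113 and S114 can be sketched with a deliberately tiny stand-in model. The sketch assumes a linear layer trained by stochastic gradient descent in place of the DNNs, because the patent leaves the learning method open ("any existing technique"); the class `LinearModel`, the hyperparameters, and all numeric values are hypothetical. What it is meant to show is only which inputs and targets each of the two models sees.

```python
import random
from typing import List

Vector = List[float]


class LinearModel:
    """Toy stand-in for a DNN: y = W x + b, fitted by per-frame SGD."""

    def __init__(self, n_in: int, n_out: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def forward(self, x: Vector) -> Vector:
        return [sum(wi * xi for wi, xi in zip(row, x)) + bj
                for row, bj in zip(self.w, self.b)]

    def fit(self, inputs: List[Vector], targets: List[Vector],
            lr: float = 0.05, epochs: int = 500) -> None:
        for _ in range(epochs):
            for x, y in zip(inputs, targets):
                pred = self.forward(x)
                for j, (p, t) in enumerate(zip(pred, y)):
                    err = p - t  # gradient of 0.5 * (p - t)^2 w.r.t. p
                    for i, xi in enumerate(x):
                        self.w[j][i] -= lr * err * xi
                    self.b[j] -= lr * err


def mse(model: LinearModel, inputs: List[Vector], targets: List[Vector]) -> float:
    """Mean squared error over all frames and output dimensions."""
    total, count = 0.0, 0
    for x, y in zip(inputs, targets):
        for p, t in zip(model.forward(x), y):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy frame-level data; every value is made up for illustration.
l_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # language features
f_frames = [[0.2], [0.8], [0.3], [0.7]]                      # normalized F0
s_frames = [[0.5 * f[0] + 0.1 * l[0], 0.2 * l[1] - 0.3 * f[0]]
            for l, f in zip(l_frames, f_frames)]             # envelope targets

# S113: F0 generation model, language features -> F0.
dnn_f = LinearModel(n_in=2, n_out=1)
f_loss_before = mse(dnn_f, l_frames, f_frames)
dnn_f.fit(l_frames, f_frames)
f_loss_after = mse(dnn_f, l_frames, f_frames)

# S114: spectral envelope model, [language features, F0] -> envelope.
x_frames = [l + f for l, f in zip(l_frames, f_frames)]
dnn_s = LinearModel(n_in=3, n_out=2, seed=1)
s_loss_before = mse(dnn_s, x_frames, s_frames)
dnn_s.fit(x_frames, s_frames)
s_loss_after = mse(dnn_s, x_frames, s_frames)
```

The two models are trained independently. The spectral envelope model is fitted on F₀ values taken from the learning speech data, whereas at synthesis time it receives the F₀ information produced by the F₀ generation model.
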

<Speech synthesizer 120 according to the first embodiment>
The speech synthesizer 120 generates synthesized speech z_o from the text tex_o to be synthesized. It differs from the conventional algorithm in that only the F₀ information f_o is generated from the F₀ generation DNN, and in that the F₀ information f_o is used together with the language feature vector l_o when the spectral envelope information s_o is generated from the spectral envelope generation DNN.

FIG. 5 shows a functional block diagram of the speech synthesizer 120 according to the first embodiment, and FIG. 6 shows its processing flow.

For example, the speech synthesizer 120 is a computer that includes a CPU, a RAM, and a ROM storing a program for executing the processing below, and is functionally configured as follows. The speech synthesizer 120 includes a text analysis unit 121, a language feature vector extraction unit 122, an F₀ generation unit 123, a language feature/F₀ vector creation unit 124B, a spectral envelope generation unit 124, and a speech waveform generation unit 125. The processing of each unit is described below.

<Text analysis unit 121>
The text analysis unit 121 takes the text tex_o to be synthesized as input, performs text analysis on it (S121), and obtains the context t_o.

<Language feature vector extraction unit 122>
The language feature vector extraction unit 122 takes the context t_o as input, extracts the language feature vector l_o corresponding to the context t_o (S122), and outputs it.

<F₀ generation unit 123>
The F₀ generation unit 123 receives the F₀ generation model DNN_f in advance, before speech synthesis. At synthesis time, the F₀ generation unit 123 takes the language feature vector l_o as input, performs forward propagation through the F₀ generation model DNN_f, and outputs the output vector as the F₀ information f_o (S123). The F₀ information f_o indicates the fundamental frequency of the speech waveform corresponding to the text tex_o.

<Language feature/F₀ vector creation unit 124B>
The language feature/F₀ vector creation unit 124B takes the language feature vector l_o and the F₀ information f_o as input, concatenates them to create the language feature/F₀ vector x_o=[l_o^T, f_o^T]^T (S124B), and outputs it.

<Spectral envelope generation unit 124>
The spectral envelope generation unit 124 receives the spectral envelope generation model DNN_s in advance, before speech synthesis. The spectral envelope generation unit 124 takes the language feature/F₀ vector x_o as input, performs forward propagation through the spectral envelope generation model DNN_s, and outputs the output vector as the spectral envelope information s_o (S124). The spectral envelope information s_o indicates the spectral envelope of the speech waveform corresponding to the text tex_o.

<Speech waveform generation unit 125>
The speech waveform generation unit 125 receives the F₀ information f_o and the spectral envelope information s_o, uses these values to generate the speech waveform (synthesized speech z_o) corresponding to the text tex_o (S125), and outputs it. Before waveform generation, a speech parameter sequence smoothed in the time direction may be obtained using, for example, the maximum likelihood parameter generation (MLPG) algorithm (see Reference 1). For waveform generation, the filter of Reference 2 may be used, for example.
[Reference 1] Masuko et al., "HMM-based speech synthesis using dynamic features", IEICE Transactions (信学論), vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.
[Reference 2] Imai et al., "Mel log spectrum approximation (MLSA) filter for speech synthesis", IEICE Transactions A (電子情報通信学会論文誌 A), vol. J66-A, no. 2, pp. 122-129, Feb. 1983.
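Steps S123, S124B, and S124 chain together as sketched below. The two models are passed in as plain callables, and the toy functions `toy_dnn_f` and `toy_dnn_s` are hypothetical stand-ins with made-up arithmetic, not trained DNNs. Text analysis (S121), feature extraction (S122), and waveform generation (S125) are outside the scope of the sketch.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Frames = List[Vector]
Model = Callable[[Vector], Vector]


def synthesize_parameters(l_o: Frames, dnn_f: Model, dnn_s: Model) -> Tuple[Frames, Frames]:
    """Generate F0 first, then the spectral envelope conditioned on that F0."""
    # S123: forward propagation through the F0 generation model.
    f_o = [dnn_f(l_t) for l_t in l_o]
    # S124B: frame-wise concatenation x_o = [l_o^T, f_o^T]^T.
    x_o = [l_t + f_t for l_t, f_t in zip(l_o, f_o)]
    # S124: forward propagation through the spectral envelope generation model.
    s_o = [dnn_s(x_t) for x_t in x_o]
    return f_o, s_o


def toy_dnn_f(l_t: Vector) -> Vector:
    """Hypothetical stand-in: one F0 value per frame."""
    return [100.0 + 20.0 * l_t[0]]


def toy_dnn_s(x_t: Vector) -> Vector:
    """Hypothetical stand-in: a two-coefficient envelope that depends on F0."""
    return [0.5 * x_t[-1], sum(x_t[:-1])]


l_o = [[1.0, 0.0], [0.0, 1.0]]  # made-up language feature frames
f_o, s_o = synthesize_parameters(l_o, toy_dnn_f, toy_dnn_s)
```

Because `dnn_s` only ever sees the concatenated vector, any change in the generated F₀ reaches the spectral envelope, which is the dependency this embodiment sets out to model.
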

<Effect>
With the above configuration, the spectral envelope generation DNN takes F₀ information as synthesizer input in addition to conventional context such as reading and accent, and outputs spectral envelope information that reflects the F₀ information. This makes it possible to generate spectral envelope information that satisfies the dependency on the F₀ information, which improves the quality of the synthesized speech.

<Second embodiment>
The description focuses on the parts that differ from the first embodiment.

In the first embodiment, an extremely high or extremely low F₀ may be used as the input of the spectral envelope generation DNN. In that case, the spectral envelope information generated from the spectral envelope generation DNN may become unstable, and the quality of the synthesized speech may degrade.

To address this problem, this embodiment uses, as the input of the spectral envelope generation DNN, the output value obtained by passing the F₀ information through a bounded function F(x). Because the input of the spectral envelope generation DNN is bounded, the spectral envelope information generated from it is stable, and the quality of the synthesized speech improves.

This embodiment differs from the first embodiment in that an F₀ conversion unit is placed before the language feature/F₀ vector (data) creation unit and converts the F₀ information.

<Terminology and description of the data used>
・Language feature/converted F₀ vector
The language feature/converted F₀ vector in this embodiment is a vector that holds the information of both the language feature vector l_n and the F₀ information f_n. It differs from the first embodiment in that the output of the F₀ value conversion is used. For example, it is created by concatenating the language feature vector l_n and the converted F₀ information f_n⁽¹⁾ as x_n=[l_n^T, f_n⁽¹⁾^T]^T.

＜Acoustic model learning device 210 according to the second embodiment＞
FIG. 7 is a functional block diagram of the acoustic model learning device 210 according to the second embodiment, and FIG. 8 shows its processing flow.
The acoustic model learning device 210 includes a language feature vector data creation unit 111, a language feature / F₀ vector data creation unit 112, an F₀ generation model learning unit 113, a spectral envelope generation model learning unit 114, and an F₀ conversion unit 215.

＜F₀ conversion unit 215＞
The F₀ conversion unit 215 receives the F₀ data {f₁, f₂, …, f_N} as input, converts it using a bounded vector function F(f_n) (S215), and outputs the converted F₀ data {f₁⁽¹⁾, f₂⁽¹⁾, …, f_N⁽¹⁾}. For example, when the F₀ information f_n is T_n frames long and f_n = [f_{n1}, f_{n2}, …, f_{nT_n}]^T, the function is F(f_n) = [G(f_{n1}), G(f_{n2}), …, G(f_{nT_n})]^T, where the subscript T_n denotes T with subscript n and G is a bounded scalar function. For example, the sigmoid function is used as G.

Note that the language feature / F₀ vector data creation unit 112 uses the converted F₀ data {f₁⁽¹⁾, f₂⁽¹⁾, …, f_N⁽¹⁾} in place of the F₀ data {f₁, f₂, …, f_N}. The rest of the configuration is the same as in the first embodiment.
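As a minimal sketch of the frame-wise conversion in S215, the following assumes that G is the standard logistic sigmoid 1 / (1 + exp(-x)) and that the F₀ values have already been normalized to a range around zero. Both are assumptions made for illustration, since the text only names the sigmoid as one possible bounded function.

```python
import math


def G(x):
    """Logistic sigmoid, written so that math.exp never overflows."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def F(f_n):
    """Apply G to every frame of the F0 sequence f_n."""
    return [G(v) for v in f_n]


# Assumed input: normalized F0 values, including two extreme outliers.
f_n = [-1000.0, -1.0, 0.0, 1.0, 1000.0]
f_n_1 = F(f_n)
```

Even the two outliers map into the interval [0, 1], which is the boundedness property the second embodiment relies on.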

＜Speech synthesizer 220 according to the second embodiment＞
FIG. 9 is a functional block diagram of the speech synthesizer 220 according to the second embodiment, and FIG. 10 shows its processing flow.
The speech synthesizer 220 includes a text analysis unit 121, a language feature vector extraction unit 122, an F₀ generation unit 123, a language feature / F₀ vector creation unit 124B, a spectral envelope generation unit 124, a speech waveform generation unit 125, and an F₀ conversion unit 224A.

＜F₀ conversion unit 224A＞
The F₀ conversion unit 224A receives the F₀ information f_o as input, converts it using the bounded vector function F (S224A), and outputs the converted F₀ information f_o⁽¹⁾. The conversion method should correspond to the one used in the F₀ conversion unit 215.
Note that the language feature / F₀ vector creation unit 124B uses the converted F₀ information f_o⁽¹⁾ in place of the F₀ information f_o. The rest of the configuration is the same as in the first embodiment.
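The data flow at synthesis time can be sketched as below. The two model arguments are hypothetical callables standing in for the trained F₀ generation DNN and spectral envelope generation DNN, and the logistic sigmoid is again an assumed choice of bounded function. Returning the unconverted f_o for waveform generation follows the statement that only unit 124B switches to the converted values.

```python
import math


def sigmoid(x):
    """Bounded scalar function; must match the one used during training."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def generate_acoustic_features(l_o, f0_model, envelope_model):
    """Return (f_o, s_o) for one language feature vector l_o."""
    f_o = f0_model(l_o)                  # F0 generation unit 123
    f_o_1 = [sigmoid(v) for v in f_o]    # F0 conversion unit 224A (S224A)
    x_o = list(l_o) + f_o_1              # vector creation unit 124B
    s_o = envelope_model(x_o)            # spectral envelope generation unit 124
    return f_o, s_o                      # inputs to waveform generation unit 125


# Toy stand-ins for the trained DNNs (assumptions, not real models).
def toy_f0_model(l_o):
    return [sum(l_o), -sum(l_o)]


def toy_envelope_model(x_o):
    return [sum(x_o)]


f_o, s_o = generate_acoustic_features([1.0, 2.0], toy_f0_model, toy_envelope_model)
```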

＜Effect＞
With this configuration, the same effects as in the first embodiment are obtained. In addition, the spectral envelope information generated by the spectral envelope generation DNN is stabilized, and the quality of the synthesized speech improves. Because the second embodiment restricts the F₀ data and F₀ information that are used, it can also be regarded as a restricted form of the first embodiment.

＜Third embodiment＞
The description focuses on the differences from the second embodiment.

In the second embodiment, a single fixed function is used as the vector function F(x). If, however, a vector function F(x) suited to the input of the spectral envelope generation DNN could be estimated from the training data, the dependency between the spectral envelope information and the F₀ information would be modeled more appropriately.

To this end, the present embodiment replaces the single fixed function F(x) with a function F(x; θ⁽⁰⁾) that has an F₀ value conversion parameter θ⁽⁰⁾, and estimates θ⁽⁰⁾ from the training data. Learning an appropriate vector function allows the dependency between the spectral envelope information and the F₀ information to be modeled more flexibly, which improves the quality of the synthesized speech.

＜Terminology and data used＞
・F₀ value conversion parameter
The F₀ value conversion parameter is the parameter used in the parametric F₀ value conversion and is denoted θ⁽⁰⁾.

・Parametric converted F₀ information
The parametric converted F₀ information is the real-valued output of the parametric F₀ value conversion and is denoted f_n⁽²⁾.

・Language feature / parametric converted F₀ vector
The language feature / parametric converted F₀ vector is a vector obtained from the language feature vector l_n and the parametric converted F₀ information f_n⁽²⁾, and is denoted x_n. For example, l_n and f_n⁽²⁾ are concatenated to form x_n = [l_n^T, f_n^{(2)T}]^T.

・Parametric F₀ value conversion
The parametric F₀ value conversion outputs the parametric converted F₀ information f_n⁽²⁾ from the F₀ information f_n, using the F₀ value conversion parameter θ⁽⁰⁾ and the F₀ value conversion function F(x; θ⁽⁰⁾). A vector function with a bounded range is used as F(x; θ⁽⁰⁾). In addition, so that θ⁽⁰⁾ can be learned by DNN error backpropagation, the function must be one whose output is differentiable with respect to θ⁽⁰⁾. For example, when the F₀ information f_n is T_n frames long, with f_n = [f_{n1}, f_{n2}, …, f_{nT_n}]^T and F(f_n; θ⁽⁰⁾) = [G(f_{n1}; θ⁽⁰⁾), G(f_{n2}; θ⁽⁰⁾), …, G(f_{nT_n}; θ⁽⁰⁾)]^T, a parametric sigmoid function is used as G. Alternatively, the vector function F(x; θ⁽⁰⁾) may be a neural network with x as its input vector and θ⁽⁰⁾ as its parameters.
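The exact form of the parametric sigmoid is not reproduced in this text, so the sketch below assumes one common choice, G(x; θ) = 1 / (1 + exp(-(a·x + b))) with θ = (a, b). It is meant only to show the two required properties: the output is bounded, and it is differentiable with respect to θ, so gradients for backpropagation are available in closed form.

```python
import math


def G(x, theta):
    """Assumed parametric sigmoid: 1 / (1 + exp(-(a*x + b))), theta = (a, b)."""
    a, b = theta
    z = a * x + b
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_G(x, theta):
    """Partial derivatives (dG/da, dG/db) needed by backpropagation."""
    g = G(x, theta)
    dg_dz = g * (1.0 - g)
    return (dg_dz * x, dg_dz)


def F(f_n, theta):
    """Frame-wise parametric conversion of the F0 sequence f_n."""
    return [G(v, theta) for v in f_n]


theta_0 = (1.3, -0.4)                      # arbitrary illustrative values
f_n_2 = F([-2.0, 0.5, 3.0], theta_0)
```

With this parameterization, a controls the slope and b the offset of the squashing curve, so learning θ lets the model choose which F₀ range is mapped to the sensitive middle part of the sigmoid.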

＜Acoustic model learning device 310 according to the third embodiment＞
FIG. 11 is a functional block diagram of the acoustic model learning device 310 according to the third embodiment, and FIG. 12 shows its processing flow.

The acoustic model learning device 310 differs from the second embodiment in that it performs spectral envelope generation DNN learning and F₀ value conversion parameter estimation from the spectral envelope data, the F₀ data, and the context data, and outputs the spectral envelope generation DNN and the F₀ value conversion parameter. In this learning and estimation step, the F₀ value conversion parameter for the spectral envelope generation DNN is estimated from the spectral envelope data, the F₀ data, and the language feature vector data.

The acoustic model learning device 310 includes a language feature vector data creation unit 111, a language feature / F₀ vector data creation unit 112, an F₀ generation model learning unit 113, a spectral envelope generation model / conversion parameter learning unit 314, and an F₀ conversion unit 315.

＜F₀ conversion unit 315＞
Before learning begins, the F₀ conversion unit 315 initializes the F₀ value conversion parameter θ⁽⁰⁾, for example with random numbers. When a parametric sigmoid function is used as the bounded vector function F(x; θ⁽⁰⁾), θ⁽⁰⁾ is initialized, for example, by sampling from a standard normal distribution.

During learning, the F₀ conversion unit 315 receives the F₀ data {f₁, f₂, …, f_N} as input, performs the parametric F₀ value conversion using the F₀ data and the F₀ value conversion parameter θ⁽⁰⁾ (f_n⁽²⁾ = [G(f_{n1}; θ⁽⁰⁾), G(f_{n2}; θ⁽⁰⁾), …, G(f_{nT_n}; θ⁽⁰⁾)]^T, S315), and outputs the resulting parametric converted F₀ data {f₁⁽²⁾, f₂⁽²⁾, …, f_N⁽²⁾}.
Note that the language feature / F₀ vector data creation unit 112 uses the parametric converted F₀ data {f₁⁽²⁾, f₂⁽²⁾, …, f_N⁽²⁾} in place of the converted F₀ data {f₁⁽¹⁾, f₂⁽¹⁾, …, f_N⁽¹⁾}.
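Sampling the initial value from a standard normal distribution might look like the following. The helper name `init_theta`, the number of parameters, and the seed argument are assumptions added for illustration and reproducibility.

```python
import random


def init_theta(num_params, seed=None):
    """Draw each component of theta^(0) from N(0, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_params)]


# Two parameters, e.g. a slope and an offset of an assumed parametric sigmoid.
theta_0 = init_theta(2, seed=0)
```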

＜Spectral envelope generation model / conversion parameter learning unit 314＞
The spectral envelope generation model / conversion parameter learning unit 314 receives the F₀ value conversion parameter θ⁽⁰⁾ (initial value), the language feature / F₀ vector data {x₁, x₂, …, x_N} (where x_n = [l_n^T, f_n^{(2)T}]^T), and the spectral envelope data {s₁, s₂, …, s_N} as input. Using these data, it learns the spectral envelope generation DNN, which takes a language feature / F₀ vector as input and outputs spectral envelope information, together with the F₀ value conversion parameter θ⁽⁰⁾, and outputs the learned spectral envelope generation DNN and the learned F₀ value conversion parameter θ⁽¹⁾. For example, learning proceeds as follows.

(1) The language feature / parametric converted F₀ vector x_n is given to the DNN as its input vector and propagated forward.
(2) The error between the output vector z_n (the spectral envelope information obtained for the n-th piece of training speech data) and the spectral envelope information s_n is measured and backpropagated to compute the error gradients for the DNN parameters W and the F₀ value conversion parameter θ⁽⁰⁾. The DNN parameters W are initialized with random numbers before learning begins, for example by the same method as in Non-Patent Document 1. The error function is, for example, the least-squares error between z_n and s_n.
(3) The parameters W and the F₀ value conversion parameter θ⁽⁰⁾ are updated according to the error gradients.

The processing in the F₀ conversion unit 315 (S315), the processing in the language feature / F₀ vector data creation unit 112 (S112), and steps (1) to (3) above are repeated until convergence is determined.
The parameters W and the F₀ value conversion parameter θ⁽⁰⁾ obtained when convergence is determined are output as the learned spectral envelope generation DNN and the F₀ value conversion parameter θ⁽¹⁾, respectively. Convergence is determined, for example, by whether the number of iterations has reached a threshold, whether the change in the error function between iterations has fallen below a threshold, or both.
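The loop of S315, S112, and steps (1) to (3) can be illustrated with a deliberately tiny model. In this sketch a single linear layer stands in for the spectral envelope generation DNN, each training item is one frame with a scalar target, and the parametric sigmoid takes the assumed form 1 / (1 + exp(-(a·f + b))). None of these simplifications come from the patent; they only make the joint gradient update of W and θ easy to follow.

```python
import math
import random


def G(x, a, b):
    """Assumed parametric sigmoid with theta = (a, b)."""
    z = a * x + b
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(data, theta_init, lr=0.1, max_iters=2000, tol=1e-10, seed=0):
    """Jointly fit W = (w_l, w_f, c) and theta = (a, b) by gradient descent.

    data: list of (l, f, s) with l a list of floats, f one F0 value, and
    s one spectral envelope coefficient (toy-sized assumptions).
    """
    rng = random.Random(seed)
    dim = len(data[0][0])
    w_l = [rng.gauss(0.0, 0.1) for _ in range(dim)]   # random init of W
    w_f, c = rng.gauss(0.0, 0.1), 0.0
    a, b = theta_init
    n = len(data)
    losses = []
    for _ in range(max_iters):                        # iteration-count threshold
        g_wl = [0.0] * dim
        g_wf = g_c = g_a = g_b = loss = 0.0
        for l, f, s in data:
            g = G(f, a, b)                            # S315: convert F0
            z = sum(w * v for w, v in zip(w_l, l)) + w_f * g + c  # (1) forward
            err = z - s                               # (2) error vs. target
            loss += 0.5 * err * err / n
            for i, v in enumerate(l):
                g_wl[i] += err * v / n
            g_wf += err * g / n
            g_c += err / n
            back = err * w_f * g * (1.0 - g) / n      # (2) backprop into theta
            g_a += back * f
            g_b += back
        losses.append(loss)
        # Convergence: change in the error function below a threshold.
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            break
        # (3) update W and theta along the negative gradient.
        w_l = [w - lr * gw for w, gw in zip(w_l, g_wl)]
        w_f -= lr * g_wf
        c -= lr * g_c
        a -= lr * g_a
        b -= lr * g_b
    return (w_l, w_f, c), (a, b), losses


# Synthetic training set (assumption): targets follow a known sigmoid of F0.
toy = []
for i in range(17):
    l = [1.0, 0.0] if i % 2 else [0.0, 1.0]
    f = -2.0 + 0.25 * i
    toy.append((l, f, 2.0 * G(f, 1.5, -0.5) + l[0]))

W, theta_1, losses = train(toy, theta_init=(0.5, 0.0))
```

The returned `theta_1` plays the role of θ⁽¹⁾, and `losses` records the error function per iteration so that either convergence criterion can be inspected afterwards.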

＜Speech synthesizer 320 according to the third embodiment＞
FIG. 13 is a functional block diagram of the speech synthesizer 320 according to the third embodiment, and FIG. 14 shows its processing flow.
The speech synthesizer 320 differs from the second embodiment in that it converts the F₀ information f_o using the F₀ value conversion parameter θ⁽¹⁾ obtained by the acoustic model learning device 310.
The speech synthesizer 320 includes a text analysis unit 121, a language feature vector extraction unit 122, an F₀ generation unit 123, a language feature / F₀ vector creation unit 124B, a spectral envelope generation unit 124, a speech waveform generation unit 125, and an F₀ conversion unit 324A.

＜F₀ conversion unit 324A＞
The F₀ conversion unit 324A receives the F₀ value conversion parameter θ⁽¹⁾ and the F₀ information f_o as input, performs the parametric F₀ value conversion using them (f_o⁽²⁾ = [G(f_{o1}; θ⁽¹⁾), G(f_{o2}; θ⁽¹⁾), …, G(f_{oT_o}; θ⁽¹⁾)]^T, S324A), and outputs the parametric converted F₀ information f_o⁽²⁾. The function F(x; θ⁽¹⁾) used here is the same parametric F₀ value conversion as that used in the F₀ conversion unit 315.
Note that the language feature / F₀ vector creation unit 124B uses the parametric converted F₀ information f_o⁽²⁾ in place of the converted F₀ information f_o⁽¹⁾.

＜Effect＞
With this configuration, the same effects as in the second embodiment are obtained. In addition, the dependency between the spectral envelope information and the F₀ information is modeled more flexibly, which improves the quality of the synthesized speech.

＜Fourth embodiment＞
The description focuses on the differences from the third embodiment.

In the spectral envelope generation DNN learning and F₀ value conversion parameter estimation of the third embodiment, synthesizing high-quality speech requires learning a spectral envelope generation DNN with a small parameter generation error. When an algorithm that depends on its initial values, such as a gradient method, is used, an appropriate initial value must be set in order to make the parameter generation error of the spectral envelope generation DNN sufficiently small.

The present embodiment uses, as the initial value of the F₀ value conversion parameter, the parameter θ⁽¹⁾ estimated in the spectral envelope generation DNN learning and F₀ value conversion parameter estimation of the third embodiment. Because θ⁽¹⁾ was determined under the criterion of minimizing the parameter generation error of a spectral envelope generation DNN, setting it as the initial value and performing the learning and estimation again is expected to yield a spectral envelope generation DNN with an even smaller parameter generation error. This further improves the quality of the synthesized speech.

＜Terminology and data used＞
・Re-estimated parametric F₀ value conversion parameter
The re-estimated parametric F₀ value conversion parameter is the parameter for the parametric F₀ value conversion obtained by the acoustic model learning device 410, and is denoted θ⁽²⁾. It differs from the third embodiment in that it is re-estimated using the parametric F₀ value conversion parameter θ⁽¹⁾, the learning result of the third embodiment, as the initial value.

＜Acoustic model learning device 410 according to the fourth embodiment＞
FIG. 15 is a functional block diagram of the acoustic model learning device 410 according to the fourth embodiment, and FIG. 16 shows its processing flow.
The acoustic model learning device 410 differs from the third embodiment in that it uses the F₀ value conversion parameter θ⁽¹⁾ obtained in the third embodiment as the initial value, performs spectral envelope generation DNN learning and F₀ value conversion parameter re-estimation, and outputs the spectral envelope generation DNN and the re-estimated F₀ value conversion parameter θ⁽²⁾.

The acoustic model learning device 410 includes the acoustic model learning device 310, a language feature vector data creation unit 111, a language feature / F₀ vector data creation unit 112, an F₀ generation model learning unit 113, a spectral envelope generation model / conversion parameter learning unit 414, and an F₀ conversion unit 415.
Prior to the processing of the acoustic model learning device 410, the acoustic model learning device 310 executes the processing described in the third embodiment, obtains the F₀ value conversion parameter θ⁽¹⁾ (S310), and outputs it.

＜F₀ conversion unit 415 and spectral envelope generation model / conversion parameter learning unit 414＞
Before learning begins, the F₀ conversion unit 415 and the spectral envelope generation model / conversion parameter learning unit 414 set, as the initial value, the F₀ value conversion parameter θ⁽¹⁾ output by the acoustic model learning device 310 in place of the F₀ value conversion parameter θ⁽⁰⁾. Their processing is otherwise the same as that of the F₀ conversion unit 315 and the spectral envelope generation model / conversion parameter learning unit 314, respectively (S415, S414). Note that the spectral envelope generation model / conversion parameter learning unit 414 outputs the re-estimated F₀ value conversion parameter θ⁽²⁾ in place of the F₀ value conversion parameter θ⁽¹⁾.
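The control flow of the re-estimation can be sketched as two consecutive training runs in which only the initial value of θ changes. Here `train_fn` is a hypothetical callable returning a `(model, theta)` pair, and `toy_train` is a trivial stand-in used only to make the example executable; neither is part of the patent.

```python
def estimate_then_reestimate(train_fn, data, theta_0):
    """Run a hypothetical training routine twice, reusing theta between runs."""
    _, theta_1 = train_fn(data, theta_0)       # third embodiment (S310)
    model, theta_2 = train_fn(data, theta_1)   # fourth embodiment (S415, S414)
    return model, theta_1, theta_2


# Toy stand-in for training (assumption): move theta halfway to the data mean.
def toy_train(data, theta_init):
    mean = sum(data) / len(data)
    theta = theta_init + 0.5 * (mean - theta_init)
    return {"theta": theta}, theta


model, theta_1, theta_2 = estimate_then_reestimate(toy_train, [2.0, 4.0], 0.0)
```

In this sketch the second run starts from θ⁽¹⁾ while everything else about training is unchanged, which is how the text describes units 415 and 414 relative to units 315 and 314.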

＜Speech synthesizer 420 according to the fourth embodiment＞
FIG. 17 is a functional block diagram of the speech synthesizer 420 according to the fourth embodiment, and FIG. 18 shows its processing flow.
The speech synthesizer 420 differs from the third embodiment in that it uses the re-estimated F₀ value conversion parameter θ⁽²⁾, rather than the F₀ value conversion parameter θ⁽¹⁾, when generating the spectral envelope.
The speech synthesizer 420 includes a text analysis unit 121, a language feature vector extraction unit 122, an F₀ generation unit 123, a language feature / F₀ vector creation unit 124B, a spectral envelope generation unit 124, a speech waveform generation unit 125, and an F₀ conversion unit 424A.

＜F₀ conversion unit 424A＞
The F₀ conversion unit 424A receives the re-estimated F₀ value conversion parameter θ⁽²⁾ and the F₀ information f_o as input, performs the parametric F₀ value conversion using the F₀ information f_o and the parameter θ⁽²⁾ (f_o⁽²⁾ = [G(f_{o1}; θ⁽²⁾), G(f_{o2}; θ⁽²⁾), …, G(f_{oT_o}; θ⁽²⁾)]^T, S424A), and outputs the parametric converted F₀ data f_o⁽²⁾. The function G(x; θ⁽²⁾) used here is the same parametric F₀ value conversion as that used in the F₀ conversion unit 415.
Note that the language feature / F₀ vector creation unit 124B uses the parametric converted F₀ data f_o⁽²⁾ in place of f_o⁽¹⁾.

＜Effect＞
With this configuration, the same effects as in the third embodiment are obtained. In addition, it is expected that a spectral envelope generation DNN with a smaller parameter generation error can be learned.

＜Other modifications＞
The present invention is not limited to the above embodiments and modifications. For example, the various processes described above need not be executed in the chronological order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

＜Program and recording medium＞
The various processing functions of each device described in the above embodiments and modifications may be implemented by a computer. In that case, the processing content of the functions that each device should have is described by a program, and executing this program on a computer implements those processing functions on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be stored in the storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing processing, the computer reads the program stored in its own storage unit and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer to this computer, the computer may execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) service, in which no program is transferred from the server computer to this computer and the processing functions are realized only through execution instructions and retrieval of results. The program here includes information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each device is described as being configured by executing a predetermined program on a computer, at least part of the processing content may be implemented in hardware.

Claims

An acoustic model learning device comprising:
a fundamental frequency generation model learning unit that, where N is the total number of pieces of training speech data and is any integer of 1 or more and n = 1, 2, …, N, learns a fundamental frequency generation model that takes a language feature vector as input and outputs fundamental frequency information, using N pieces of fundamental frequency information f_{L,n} indicating the fundamental frequencies of the N pieces of training speech data and N language feature vectors l_{L,n} representing, as numerical vectors, the contexts of the N pieces of training speech data; and
a spectral envelope generation model learning unit that learns a spectral envelope generation model that takes fundamental frequency information and a language feature vector as input and outputs spectral envelope information, using the N pieces of fundamental frequency information f_{L,n}, the N language feature vectors l_{L,n}, and spectral envelope information s_{L,n} indicating the spectral envelopes of the N pieces of training speech data.

The acoustic model learning device according to claim 1, further comprising
a fundamental frequency conversion unit that converts each of the N pieces of fundamental frequency information f_{L,n} using a bounded scalar function g,
wherein the N pieces of fundamental frequency information f_{L,n} used in the spectral envelope generation model learning unit are the values converted by the fundamental frequency conversion unit.

The acoustic model learning device according to claim 2, wherein
the fundamental frequency conversion unit converts each of the N pieces of fundamental frequency information f_{L,n} using the scalar function g and its parameter θ, and
the spectral envelope generation model learning unit learns the parameter θ and a spectral envelope generation model that takes converted fundamental frequency information and a language feature vector as input and outputs spectral envelope information, using the N pieces of converted fundamental frequency information f_{L,n}, the N language feature vectors l_{L,n}, and the spectral envelope information s_{L,n} indicating the spectral envelopes of the N pieces of training speech data.

The acoustic model learning device according to claim 3, wherein,
with the parameter learned by the spectral envelope generation model learning unit denoted θ⁽¹⁾, the fundamental frequency conversion unit converts each of the N pieces of fundamental frequency information f_{L,n} using the scalar function g and its parameter θ⁽¹⁾, and
the spectral envelope generation model learning unit learns the parameter θ⁽¹⁾ of the scalar function and a spectral envelope generation model that takes converted fundamental frequency information and a language feature vector as input and outputs spectral envelope information, using the N pieces of converted fundamental frequency information f_{L,n}, the N language feature vectors l_{L,n}, and the spectral envelope information s_{L,n} indicating the spectral envelopes of the N pieces of training speech data.

A speech synthesis apparatus that performs speech synthesis using the fundamental frequency generation model and the spectral envelope generation model learned by the acoustic model learning device according to any one of claims 1 to 4, comprising:
a fundamental frequency generation unit that uses the fundamental frequency generation model to generate, from a language feature vector l _O corresponding to the context obtained by text analysis of a target text, fundamental frequency information f _O indicating the fundamental frequency of the speech waveform corresponding to the target text;
a spectral envelope generation unit that uses the spectral envelope generation model to generate, from the language feature vector l _O and the fundamental frequency information f _O , spectral envelope information s _O indicating the spectral envelope of the speech waveform corresponding to the target text; and
a speech waveform generation unit that generates the speech waveform corresponding to the target text using the fundamental frequency information f _O and the spectral envelope information s _O .
Speech synthesizer.
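
The cascade of claim 5 (language feature vector, then fundamental frequency, then spectral envelope conditioned on that fundamental frequency, then waveform) can be sketched as follows. This is a toy illustration under stated assumptions: the two model functions, the frame length, the sample rate, and the sinusoidal waveform generator are hypothetical stand-ins, and a real system would use learned models and a vocoder.

```python
import math

def toy_f0_model(l_o):
    # Hypothetical stand-in for the learned fundamental frequency generation model (Hz).
    return 120.0 + 20.0 * l_o[0]

def toy_env_model(l_o, f_o):
    # Hypothetical stand-in for the learned spectral envelope generation model.
    # It returns a single gain value instead of a full envelope vector.
    return 0.5 + 0.001 * f_o

def synthesize(l_frames, f0_model, env_model, sample_rate=16000, frame_len=80):
    """Cascade: language features -> F0 -> spectral envelope -> waveform."""
    waveform, phase = [], 0.0
    for l_o in l_frames:
        f_o = f0_model(l_o)        # fundamental frequency generation
        s_o = env_model(l_o, f_o)  # envelope generation uses the generated F0
        for _ in range(frame_len):  # toy waveform generation: a scaled sinusoid
            phase += 2.0 * math.pi * f_o / sample_rate
            waveform.append(s_o * math.sin(phase))
    return waveform

# Two frames of one-dimensional language features (toy input).
wave = synthesize([[0.0], [1.0]], toy_f0_model, toy_env_model)
print(len(wave))
```

The ordering matters: the spectral envelope model receives the generated f _O as an input, so it cannot be run before the fundamental frequency generation step.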

Let N be the total number of pieces of learning speech data, where N is any integer greater than or equal to 1, and let n = 1, 2, ..., N.
A fundamental frequency generation model learning step of learning, using N pieces of fundamental frequency information f _{L, n} respectively indicating the fundamental frequencies of the N pieces of learning speech data and N language feature vectors l _{L, n} respectively representing the contexts of the N pieces of learning speech data as numerical vectors, a fundamental frequency generation model that takes a language feature vector as input and outputs fundamental frequency information; and
a spectral envelope generation model learning step of learning, using the N pieces of fundamental frequency information f _{L, n} , the N language feature vectors l _{L, n} , and N pieces of spectral envelope information s _{L, n} respectively indicating the spectral envelopes of the N pieces of learning speech data, a spectral envelope generation model that takes the fundamental frequency information and the language feature vector as input and outputs spectral envelope information.
Acoustic model learning method.
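
The two learning steps of claim 6 can be sketched as follows. This is a minimal illustration, not the claimed implementation: linear models fitted by gradient descent stand in for the generation models, the spectral envelope is reduced to a single scalar per sample, and the toy data and all names are assumptions.

```python
import random

def fit_linear(inputs, targets, lr=0.05, epochs=2000):
    """Fit y ~ w.x + b by batch gradient descent on the mean squared error."""
    dim, n = len(inputs[0]), len(inputs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(inputs, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i in range(dim):
                grad_w[i] += 2.0 * err * x[i] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def learn_acoustic_models(l_train, f_train, s_train):
    # Step 1: fundamental frequency generation model, language features -> F0.
    f0_model = fit_linear(l_train, f_train)
    # Step 2: spectral envelope generation model, (F0, language features) -> envelope.
    env_inputs = [[f] + list(l) for f, l in zip(f_train, l_train)]
    env_model = fit_linear(env_inputs, s_train)
    return f0_model, env_model

# Toy training data (assumption): N = 50 samples with 2-dimensional language features.
random.seed(0)
l_train = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(50)]
# F0 depends partly on the language features and partly on unexplained variation.
f_train = [0.5 * l[0] + random.uniform(-1, 1) for l in l_train]
# The envelope value depends on both F0 and the language features.
s_train = [0.8 * f + 0.2 * l[0] for f, l in zip(f_train, l_train)]

f0_model, env_model = learn_acoustic_models(l_train, f_train, s_train)
print(env_model)
```

In this sketch the spectral envelope model is trained on the observed fundamental frequency values f _{L, n} rather than on outputs of the fundamental frequency model, which follows the inputs listed in the spectral envelope generation model learning step.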

A speech synthesis method that performs speech synthesis using the fundamental frequency generation model and the spectral envelope generation model learned by the acoustic model learning method according to claim 6, comprising:
a fundamental frequency generation step of using the fundamental frequency generation model to generate, from a language feature vector l _O corresponding to the context obtained by text analysis of a target text, fundamental frequency information f _O indicating the fundamental frequency of the speech waveform corresponding to the target text;
a spectral envelope generation step of using the spectral envelope generation model to generate, from the language feature vector l _O and the fundamental frequency information f _O , spectral envelope information s _O indicating the spectral envelope of the speech waveform corresponding to the target text; and
a speech waveform generation step of generating the speech waveform corresponding to the target text using the fundamental frequency information f _O and the spectral envelope information s _O .
Speech synthesis method.

A program for causing a computer to function as the acoustic model learning device according to any one of claims 1 to 4 or the speech synthesis device according to claim 5.