JP2017194510A - Acoustic model learning device, voice synthesis device, methods therefor and programs

Info

Publication number: JP2017194510A
Application number: JP2016083174A
Authority: JP
Inventors: 伸克北条; Nobukatsu Hojo; 勇祐井島; Yusuke Ijima
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-04-18
Filing date: 2016-04-18
Publication date: 2017-10-26
Anticipated expiration: 2036-04-18
Also published as: JP6594251B2

Abstract

PROBLEM TO BE SOLVED: To provide an acoustic model learning technique for learning an acoustic model that enables speech synthesis taking the utterance intention into account, and a technique capable of performing such speech synthesis.

SOLUTION: An acoustic model learning device comprises: a context data storage part 11 that stores context data; a language feature vector extraction part 13 that extracts a language feature vector for each item of context data read from the context data storage part; an intention information vector storage part 16 that stores an intention information vector representing the utterance intention of each item of context data; and an acoustic model learning part 17 that generates an acoustic model by performing acoustic model learning using the extracted language feature vector of each item of context data, the voice data corresponding to each item of context data, and the intention information vector of each item of context data read from the intention information vector storage part.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech synthesis technique and to a technique for learning an acoustic model used for speech synthesis.

One technique for learning a speech synthesis model from speech data and generating synthesized speech is based on a DNN (Deep Neural Network) (see, for example, Non-Patent Document 1). An outline of this technique is shown in FIGS. 13 and 14.

Conventionally, as illustrated in FIGS. 13 and 14, a DNN acoustic model is learned from speech data and from language feature vectors generated from context data. Speech parameters are then generated from the learned DNN acoustic model and the context obtained by text analysis of the text to be synthesized, and a synthesized speech waveform is obtained from those speech parameters by speech waveform generation.

Zen et al., "Statistical parametric speech synthesis using deep neural networks", Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7962-7966.

However, a human speaker does not speak solely according to information such as the reading and accent of the text. The speaker may also convey an intention by varying prosody and the like according to that intention. In conventional speech synthesis, by contrast, speech parameters are generated only from context information such as the reading and accent of the text to be synthesized, and the intention is not taken into account. Conventional speech synthesis may therefore synthesize speech that does not match the intention of the text, which can cause misunderstanding or make the synthesized speech sound unnatural.

An object of the present invention is to provide an acoustic model learning device that learns an acoustic model enabling speech synthesis that takes the utterance intention into account, a speech synthesis device capable of such speech synthesis, and methods and programs therefor.

An acoustic model learning device according to one aspect of the present invention comprises: a context data storage unit in which context data is stored; a language feature vector extraction unit that extracts a language feature vector for each item of context data read from the context data storage unit; an intention information vector storage unit in which an intention information vector representing the utterance intention of each item of context data is stored; and an acoustic model learning unit that generates an acoustic model by performing acoustic model learning using the extracted language feature vector of each item of context data, the voice data corresponding to each item of context data, and the intention information vector of each item of context data read from the intention information vector storage unit.

A speech synthesis device according to one aspect of the present invention comprises: a text analysis unit that analyzes input text to obtain contexts; a language feature vector extraction unit that extracts a language feature vector of each context; a speech parameter generation unit that generates speech parameters using an input intention information vector representing the utterance intention, the acoustic model generated by the acoustic model learning device of claim 1, and the extracted language feature vectors; and a speech waveform generation unit that generates synthesized speech using the generated speech parameters.

An acoustic model that enables speech synthesis taking the utterance intention into account can be learned. Speech synthesis that takes the utterance intention into account can be performed.

FIG. 1 is a block diagram illustrating an example of the acoustic model learning device of the first embodiment.
FIG. 2 is a flowchart illustrating an example of the acoustic model learning method.
FIG. 3 is a flowchart illustrating an example of the processing of the intention information vector creation unit 15.
FIG. 4 is a block diagram illustrating an example of the speech synthesizer of the first embodiment.
FIG. 5 is a flowchart illustrating an example of the speech synthesis method.
FIG. 6 is a block diagram illustrating an example of the acoustic model learning device of the second embodiment.
FIG. 7 is a block diagram illustrating an example of the intention class learning unit 19.
FIG. 8 is a block diagram illustrating an example of the speech synthesizer of the second embodiment.
FIG. 9 is a block diagram illustrating an example of the acoustic model learning device of the third embodiment.
FIG. 10 is a block diagram illustrating an example of the speech synthesizer of the third embodiment.
FIG. 11 is a flowchart illustrating an example of the speech synthesis method of the third embodiment.
FIG. 12 is a flowchart illustrating an example of the speech synthesis method of the fourth embodiment.
FIG. 13 is a block diagram illustrating an example of a conventional acoustic model learning device.
FIG. 14 is a block diagram illustrating an example of a speech synthesizer.

An embodiment of the present invention is described below with reference to the drawings.

[First embodiment]
(Acoustic model learning apparatus and method)
As illustrated in FIG. 1, the acoustic model learning device of the first embodiment includes a context data storage unit 11, a voice data storage unit 12, a language feature vector extraction unit 13, an intention data storage unit 14, an intention information vector creation unit 15, an intention information vector storage unit 16, an acoustic model learning unit 17, and an acoustic model storage unit 18.

The acoustic model learning method of the first embodiment is realized by the units of the acoustic model learning device executing the processes of steps S13 to S17 shown in FIG. 2 and described below.

The acoustic model learning device and method learn an acoustic model using voice data, context data, and intention information corresponding to the intention of each utterance.

<Context data storage unit 11>
The context data storage unit 11 stores the context data. The total number of items of context data stored in the context data storage unit 11 is, for example, I, where I is a positive integer. Context data is information, such as pronunciation, assigned to each utterance in the voice data stored in the voice data storage unit 12. One item of context data is assigned to each utterance in the voice data. The context data includes, for example, phoneme information (pronunciation information) and accent information (accent type, accent phrase length). It may also include other information such as part-of-speech information.

<Voice data storage unit 12>
The voice data storage unit 12 stores the voice data used for acoustic model learning. The voice data is, for example, data such as speech parameters obtained by signal processing of a speech signal: pitch parameters (fundamental frequency (F0), etc.) and spectral parameters (cepstrum, mel-cepstrum, etc.).

When the context data storage unit 11 stores I items of context data, the voice data storage unit 12 stores I items of voice data, one corresponding to each item of context data.

<Language feature vector extraction unit 13>
The language feature vector extraction unit 13 extracts the language feature vector of each item of context data read from the context data storage unit 11 (step S13). The extracted language feature vector data is output to the acoustic model learning unit 17.

A language feature vector is a numerical-vector representation of context data. For example, as in Non-Patent Document 1, the phoneme information and the accent information are each encoded in a 1-of-K representation and then concatenated with numerical information such as sentence length to form the numerical vector.
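The construction described above can be sketched as follows. This is a minimal illustration only: the phoneme and accent-type inventories, the function names, and the choice of sentence length as the single numerical feature are assumptions made for the example, not values taken from the patent or from Non-Patent Document 1.

```python
# Hypothetical inventories; a real system would use its full phoneme and accent-type sets.
PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n"]
ACCENT_TYPES = [0, 1, 2, 3]


def one_of_k(value, inventory):
    """Return a 1-of-K list with a single 1.0 at the position of `value`."""
    if value not in inventory:
        raise ValueError(f"unknown symbol: {value!r}")
    return [1.0 if item == value else 0.0 for item in inventory]


def language_feature_vector(phoneme, accent_type, sentence_length):
    """Concatenate the 1-of-K codes with one numerical feature (sentence length)."""
    return (one_of_k(phoneme, PHONEMES)
            + one_of_k(accent_type, ACCENT_TYPES)
            + [float(sentence_length)])


x = language_feature_vector("k", 1, 12)
print(len(x), x)
```

The resulting vector has one dimension per phoneme, one per accent type, and one for the numerical feature, so its length grows with the size of the inventories.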

<Intention data storage unit 14>
The intention data storage unit 14 stores the intention data. The intention data holds the intention information assigned to each utterance contained in the voice data and context data.

The intention information is information assigned to each utterance that represents the intention of that utterance. There are N types of intention information in total, where N is an integer of 2 or more, and one item of intention information is associated with each utterance. The N types of intention information are denoted {c_1, c_2, …, c_N}. For example, the dialogue act information in Table 6 of Reference 1, which comprises 33 types in total (N = 33), can be used, with each dialogue act represented by a character string, as in {c_1 = "greeting", c_2 = "information provision", …, c_33 = "other"}.

[Reference 1] Toyomi Meguro, et al., "Analysis of listening dialogue and construction of dialogue control unit based on analysis", Transactions of Information Processing Society of Japan, 53.12, 2012, pp. 2787-2801.

The intention data is expressed, for example, as {d_1, d_2, …, d_I}, where I is the total number of utterances in the voice data and context data. One item of intention information corresponds to each utterance. For example, when the n-th intention information corresponds to the utterance with sentence number i, d_i = c_n.

<Intention information vector creation unit 15>
The intention information vector creation unit 15 uses the intention data read from the intention data storage unit 14 to create an intention information vector representing the utterance intention of each item of context data (step S15). The created intention information vectors are stored in the intention information vector storage unit 16.

The intention information vector represents each item of intention information as a K-dimensional numerical vector, where K is a positive integer. The intention information vector v_i is determined from d_i for each context data i (i = 1, 2, …, I, where I is the total number of items of context data). The intention information vector data, which is the set of intention information vectors corresponding to the items of context data, is expressed as V = {v_1, v_2, …, v_I}.

For example, the intention information vector creation unit 15 sets the dimension of the intention information vector v_i corresponding to context data i to N (K = N) and writes v_i = [v_1^i, v_2^i, …, v_N^i]. Given the intention information c_n corresponding to context data i, it creates v_i using the following 1-of-K representation of the intention information:

$$v_{n'}^{i} = \begin{cases} 1 & (n' = n) \\ 0 & (\text{otherwise}) \end{cases}$$

Here, n' = 1, 2, …, N is the index over the dimensions of the intention information vector.
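As a small sketch of this 1-of-K creation step, the code below maps intention data {d_1, …, d_I} to intention information vectors V. The three-label inventory is a shortened, hypothetical stand-in for the N = 33 dialogue acts mentioned earlier, and the function name is illustrative.

```python
# Shortened, hypothetical inventory c_1 .. c_N (N = 3 here instead of 33).
INTENTIONS = ["greeting", "information provision", "other"]


def intention_vector(intention, intentions=INTENTIONS):
    """Return the N-dimensional 1-of-K vector for one intention label."""
    n = intentions.index(intention)  # raises ValueError for an unknown label
    return [1.0 if n_prime == n else 0.0 for n_prime in range(len(intentions))]


# Intention data d_1 .. d_I, one label per utterance (I = 4 here).
intention_data = ["greeting", "other", "information provision", "greeting"]
V = [intention_vector(d) for d in intention_data]
print(V)
```

Each vector contains exactly one 1.0, at the position of the intention assigned to that utterance.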

An example of the processing of the intention information vector creation unit 15 is shown in FIG. 3.

<Intention information vector storage unit 16>
The intention information vector storage unit 16 stores the intention information vector representing the utterance intention of each item of context data.

As described above, the intention information vectors are created by the intention information vector creation unit 15 and stored in the intention information vector storage unit 16. The intention information vectors may be created in advance, before the acoustic model learning process is performed.

<Acoustic model learning unit 17>
The acoustic model learning unit 17 generates an acoustic model by performing acoustic model learning using the language feature vector of each item of context data extracted by the language feature vector extraction unit 13, the voice data corresponding to each item of context data read from the voice data storage unit 12, and the intention information vector of each item of context data read from the intention information vector storage unit 16 (step S17). The voice data corresponding to each item of context data read from the voice data storage unit 12 is, for example, speech parameters. The generated acoustic model is stored in the acoustic model storage unit 18.

The acoustic model learning device and method differ from the conventional technique in that intention information vectors are used in acoustic model learning.

For example, acoustic model learning is performed on the voice data, the language feature vector data, and the intention information vector data to learn a DNN acoustic model that takes a language feature vector and an intention information vector as input and outputs the corresponding speech parameters. As for the configuration of the DNN acoustic model, the intention information vector may simply be concatenated with the language feature vector and used as the DNN input vector. Alternatively, with a configuration similar to the model of Reference 2 from the speech recognition field, the intention information vector may be input to one or more intermediate layers of the DNN. As in Non-Patent Document 1, a conventional, general DNN learning algorithm such as error backpropagation or stochastic gradient descent can be used for learning.
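The simplest configuration mentioned above, concatenating the intention information vector with the language feature vector and training by backpropagation with stochastic gradient descent, can be sketched as follows. This is a toy illustration under stated assumptions: the network size (one tanh hidden layer of 8 units), the squared-error loss, the learning rate, and the four training examples are all hypothetical and are not taken from the patent.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def affine(w, b, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def forward(params, x):
    (w1, b1), (w2, b2) = params
    h = [math.tanh(a) for a in affine(w1, b1, x)]
    return h, affine(w2, b2, h)


def sgd_step(params, x, target, lr):
    """One stochastic gradient descent step on 0.5 * ||y - t||^2 via backpropagation."""
    (w1, b1), (w2, b2) = params
    h, y = forward(params, x)
    dy = [yi - ti for yi, ti in zip(y, target)]
    # Back-propagate to the hidden layer before the output weights are updated.
    dh = [sum(w2[k][j] * dy[k] for k in range(len(dy))) * (1.0 - h[j] ** 2)
          for j in range(len(h))]
    for k, dyk in enumerate(dy):
        for j, hj in enumerate(h):
            w2[k][j] -= lr * dyk * hj
        b2[k] -= lr * dyk
    for j, dhj in enumerate(dh):
        for i, xi in enumerate(x):
            w1[j][i] -= lr * dhj * xi
        b1[j] -= lr * dhj


# Toy data: (language feature vector, intention information vector, speech parameter).
data = [
    ([0.0, 1.0], [1.0, 0.0], [0.2]),
    ([0.0, 1.0], [0.0, 1.0], [0.8]),  # same text context, different intention
    ([1.0, 0.0], [1.0, 0.0], [0.4]),
    ([1.0, 0.0], [0.0, 1.0], [0.9]),
]
rng = random.Random(0)
params = [init_layer(4, 8, rng), init_layer(8, 1, rng)]


def total_loss():
    return sum(0.5 * (forward(params, l + v)[1][0] - t[0]) ** 2 for l, v, t in data)


loss_before = total_loss()
for _ in range(2000):
    l, v, t = rng.choice(data)
    sgd_step(params, l + v, t, lr=0.1)  # l + v concatenates the two input vectors
loss_after = total_loss()
print(loss_before, loss_after)
```

Because the intention information vector is part of the input, the network can learn different speech parameters for the same language features, which is the effect the paragraph above describes.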

[Reference 2] Xue, Shaofei, et al. "Fast adaptation of deep neural network based on discriminant codes for speech recognition.", Audio, Speech, and Language Processing, IEEE/ACM Transactions on 22.12 (2014), pp.1713-1725.

(Speech synthesizer and method)
As illustrated in FIG. 4, the speech synthesizer of the first embodiment includes a text analysis unit 21, a language feature vector extraction unit 22, an intention information vector creation unit 23, a speech parameter generation unit 24, and a speech waveform generation unit 25.

The speech synthesis method of the first embodiment is realized by the units of the speech synthesizer executing the processes of steps S21 to S25 shown in FIG. 5 and described below.

The speech synthesizer and method obtain synthesized speech from input text, intention information corresponding to the input text, and the acoustic model obtained by the acoustic model learning unit 17.

In the speech synthesizer and method, synthesized speech is generated from the text to be synthesized and the intention information. An example of the processing procedure is as follows.

The intention information is specified by the user and input through an input means such as a keyboard or mouse. Alternatively, an intention information estimator may be prepared in advance and the intention information estimated automatically from the input text. The intention information may also be input based on information available from another system, for example by using, as the intention information, the dialogue act information obtained from a dialogue system that uses the technique of Reference 1. The intention information vector extraction used in the speech synthesizer and method is assumed to be identical to that used in the acoustic model learning device and method.

<Text analysis unit 21>
The text analysis unit 21 performs text analysis on the input text and obtains the context, which is information such as the reading and accent of the text to be synthesized (step S21). The obtained context is output to the language feature vector extraction unit 22.

<Language feature vector extraction unit 22>
The language feature vector extraction unit 22 extracts the language feature vector corresponding to the input context (step S22). The extracted language feature vector is output to the speech parameter generation unit 24.

The processing of the language feature vector extraction unit 22 is the same as that of the language feature vector extraction unit 13, so a duplicate description is omitted here.

<Intention information vector creation unit 23>
The intention information vector creation unit 23 creates the intention information vector corresponding to the input intention information c_n (step S23). The created intention information vector is output to the speech parameter generation unit 24.

The processing of the intention information vector creation unit 23 is the same as that of the intention information vector creation unit 15, so a duplicate description is omitted here.

<Acoustic model storage unit 18>
The acoustic model storage unit 18 stores the acoustic model generated by the acoustic model learning device and method.

<Speech parameter generation unit 24>
The speech parameter generation unit 24 generates speech parameters using the language feature vector obtained by the language feature vector extraction unit 22, the intention information vector created by the intention information vector creation unit 23, and the acoustic model read from the acoustic model storage unit 18 (step S24). The generated speech parameters are output to the speech waveform generation unit 25.

For example, the speech parameter generation unit 24 inputs the language feature vector and the intention information vector to the acoustic model and generates the speech parameters by forward propagation.
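Forward propagation with a concatenated input can be sketched as below. The two-layer model with hand-set weights is purely hypothetical and stands in for a trained acoustic model; the function names are illustrative. The point of the example is that the same language feature vector yields different speech parameters when the intention information vector changes.

```python
import math


def dense(weights, biases, x):
    """Affine transform of one fully connected layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def generate_speech_parameters(model, language_vec, intention_vec):
    """Forward-propagate the concatenated input through tanh hidden layers and a linear output."""
    x = language_vec + intention_vec
    for weights, biases in model[:-1]:
        x = [math.tanh(a) for a in dense(weights, biases, x)]
    weights, biases = model[-1]
    return dense(weights, biases, x)


# Hand-set toy model: 3 inputs (1 language + 2 intention) -> 2 hidden units -> 1 output.
model = [
    ([[1.0, 0.5, -0.5], [0.0, -1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.1]),
]
p_a = generate_speech_parameters(model, [1.0], [1.0, 0.0])
p_b = generate_speech_parameters(model, [1.0], [0.0, 1.0])
print(p_a, p_b)
```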

<Speech waveform generation unit 25>
The speech waveform generation unit 25 obtains synthesized speech by speech waveform generation from the speech parameters generated by the speech parameter generation unit 24 (step S25).

Before speech waveform generation, a speech parameter sequence smoothed in the time direction may be obtained using, for example, the maximum likelihood parameter generation (MLPG) algorithm (see, for example, Reference 3). For speech waveform generation, the technique described in Reference 4, for example, may be used.

[Reference 3] Masuko et al., "HMM-based speech synthesis using dynamic features", IEICE Transactions, vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.
[Reference 4] Imai et al., "Mel Log Spectrum Approximation (MLSA) Filter for Speech Synthesis", IEICE Transactions A, vol. J66-A, no. 2, pp. 122-129, Feb. 1983.

In this way, intention information corresponding to the intention of each context is utilized. That is, the acoustic model is configured so that, in addition to conventional context such as reading and accent, intention information is used as input to the speech synthesizer and speech parameters reflecting that intention information are output. This makes it possible to reflect the tendencies of the speech parameters associated with each intention in the speech parameters generated from the acoustic model. By synthesizing speech that matches the intention of the text to be synthesized, the intention can be expressed through speech, which prevents misunderstanding and keeps the synthesized speech from sounding unnatural.

[Second embodiment]
In the first embodiment, multiple items of intention information may correspond to similar vocal expression. For example, when dialogue act information such as that of Reference 1 is used as the intention information, dialogue acts such as "information provision" and "self-disclosure_fact" may be uttered without strong vocal expression, in a voice close to an ordinary reading tone. In the first embodiment, therefore, a classification with an excessively large number of classes may be used to express intention by speech. As the number of classes grows, the dimensionality of the input context increases, and so does the number of parameters of the acoustic model (for example, a DNN acoustic model). In general, an acoustic model with many parameters tends to overfit the training data, which degrades the quality of the synthesized speech and the expressiveness with which speech conveys intention. Otherwise, a large amount of voice data and context data is needed to obtain sufficient synthesized speech quality and expressiveness, which increases the cost of training for the speech synthesizer and method.

Therefore, in the second embodiment, the intention information is clustered, for example on the basis of speech parameters, to obtain intention class information. Multiple intentions whose speech parameters show similar tendencies are represented by a single intention class, which is used as context. Learning with an acoustic model that has fewer parameters then prevents overfitting, improving the quality of the synthesized speech and the expressiveness with which speech conveys intention. The speech synthesizer can also be trained from a small amount of data, which reduces cost.

The following description focuses on the parts that differ from the first embodiment. Duplicate descriptions of parts that are the same as in the first embodiment are omitted.

(Acoustic model learning apparatus and method)
As illustrated in FIG. 6, the acoustic model learning device of the second embodiment further includes an intention class learning unit 19 and an intention class classification information storage unit 110. As illustrated in FIG. 7, the intention class learning unit 19 includes, for example, an intention feature vector extraction unit 191 and an intention clustering unit 192.

<Intention feature vector extraction unit 191>
For each item of intention information, the intention feature vector extraction unit 191 obtains an intention feature vector representing the characteristics of that intention information from the voice data of the corresponding utterances. The obtained intention feature vectors are output to the intention clustering unit 192.

For example, the mean and standard deviation of the F0, speech rate, and power of each item of intention information are computed and used as the intention feature vector. In this case, the intention feature vector w_n of the intention information c_n (n = 1, 2, …, N, where N is the total number of intentions) is defined, for example, as follows.

$$w_n = [\mathrm{mnF0}_n,\ \mathrm{stdF0}_n,\ \mathrm{mnPow}_n,\ \mathrm{stdPow}_n,\ \mathrm{mnSr}_n,\ \mathrm{stdSr}_n]$$

Here, mnF0_n and stdF0_n are the mean and standard deviation of the F0 of intention information c_n, mnPow_n and stdPow_n are the mean and standard deviation of its power, and mnSr_n and stdSr_n are the mean and standard deviation of its speech rate. Alternatively, spectral features such as cepstral features may be used as the intention feature vector. Statistics computed over a specific time interval rather than the whole utterance may also be used as the intention feature vector, for example the mean and standard deviation of the time-difference coefficients of F0 over the final mora of the utterance.
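The statistics described above can be computed along the following lines. This is a sketch under assumptions: each utterance is summarized by a single F0, power, and speech-rate value (the patent does not fix whether statistics are taken over frames or utterances), the population standard deviation is used, the element order of the vector is chosen for the example, and all data values and names are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical per-utterance measurements with the intention label of each utterance.
utterances = [
    {"intention": "greeting", "f0": 200.0, "power": 60.0, "speech_rate": 8.0},
    {"intention": "greeting", "f0": 220.0, "power": 64.0, "speech_rate": 9.0},
    {"intention": "information provision", "f0": 150.0, "power": 55.0, "speech_rate": 7.0},
    {"intention": "information provision", "f0": 160.0, "power": 57.0, "speech_rate": 7.0},
]


def intention_feature_vector(utterances, intention):
    """Return [mean F0, std F0, mean power, std power, mean rate, std rate] for one intention."""
    group = [u for u in utterances if u["intention"] == intention]
    if not group:
        raise ValueError(f"no utterances for intention {intention!r}")
    w = []
    for key in ("f0", "power", "speech_rate"):
        values = [u[key] for u in group]
        w.extend([mean(values), pstdev(values)])  # population standard deviation
    return w


w_greeting = intention_feature_vector(utterances, "greeting")
print(w_greeting)
```

In practice the features have very different scales (Hz versus mora per second, for instance), so normalizing each dimension before clustering is a common precaution; that step is not described in the source and is mentioned here only as a design consideration.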

<Intention clustering unit 192>
The intention clustering unit 192 clusters the N intention feature vectors obtained by the intention feature vector extraction unit 191 into an arbitrary number M of clusters (M is an integer of 2 or more and less than N), thereby obtaining intention class classification information. The obtained intention class classification information is stored in the intention class classification information storage unit 110.

A general-purpose clustering algorithm, such as the k-means method or hierarchical clustering, can be used as the clustering algorithm.

The intention class classification information indicates which intention class information each piece of intention information belongs to. For example, when each intention information c_n (n = 1, 2, ..., N, where N is the total number of pieces of intention information) is clustered into intention class information e_in (1 ≤ i_n ≤ M), the indices are held as list-format data I = [i_1, i_2, ..., i_N]. Here, the "in" in "e_in" denotes "i_n", that is, i with subscript n. This data I is an example of intention class classification information.

The intention class information represents the result of clustering the intention information. When the total number of classes is M (M is an integer of 2 or more and less than N), it is expressed, for example, as {e_1, e_2, ..., e_M}.
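As an illustration of how the list I of class indices might be produced, the following is a minimal k-means sketch in plain Python. The function names, the random initialization from the data, and the Euclidean distance are assumptions made for the example; the text only says that a general clustering algorithm such as k-means can be used.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_class_indices(vectors, num_classes, max_iters=100, seed=0):
    """Cluster N feature vectors into M classes; return I = [i_1, ..., i_N] (1-based)."""
    rng = random.Random(seed)
    # Initialize the M centroids with M distinct input vectors (an assumption).
    centroids = [list(v) for v in rng.sample(vectors, num_classes)]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignment = [
            min(range(num_classes), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for k in range(num_classes):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return [a + 1 for a in assignment]
```

The returned list plays the role of I: entry n holds the class index i_n of intention c_n. Which numeric label a given cluster receives depends on the random initialization.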

<Intention information vector creation unit 15>
When a piece of intention information is input, the intention information vector creation unit 15 of the second embodiment outputs the corresponding intention information vector based on the intention class classification information.

First, when intention information c_n corresponding to certain context data is input, the intention information vector creation unit 15 obtains, based on the intention class classification information, the intention class information e_in corresponding to the input intention information c_n. Obtaining intention class information from the intention class classification information is the point that differs from the first embodiment.

The intention information vector creation unit 15 then outputs the intention information vector v_i corresponding to the intention class information e_in.

For example, as in the first embodiment, let the dimension of the intention information vector v_i corresponding to context data i be M (K = M), and write v_i = [v_1^i, v_2^i, ..., v_M^i]. Given the intention class information e_m corresponding to context data i, the intention information vector v_i for that context data is created using the following 1-of-K representation of the intention information.

v_m'^i = 1 (if m' = m), 0 (otherwise)

Here, m' = 1, 2, ..., M is the index over the dimensions of the intention information vector.

Thus, in the second embodiment, the intention information vector of each context data is a vector representing the intention class information corresponding to the utterance intention of that context data.
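The mapping from an intention to its 1-of-K intention information vector can be sketched as below. The function and argument names are hypothetical; the list `class_indices` stands for the classification data I, with 1-based class indices as in the text.

```python
def intention_information_vector(intention_number, class_indices, num_classes):
    """Return the 1-of-K vector for intention c_n.

    intention_number : n (1-based) of the input intention c_n
    class_indices    : list I = [i_1, ..., i_N] of 1-based class indices
    num_classes      : M, the number of intention classes (K = M)
    """
    m = class_indices[intention_number - 1]  # class index i_n of intention c_n
    # Element m' is 1 when m' equals the class index, 0 otherwise.
    return [1 if m_prime == m else 0 for m_prime in range(1, num_classes + 1)]
```

For instance, with I = [2, 1, 2] and M = 2, intentions 1 and 3 share the vector [0, 1], so utterances with acoustically similar intentions are presented to the acoustic model with the same input.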

The other processing of the acoustic model learning device and method is the same as in the first embodiment.

(Speech synthesizer and method)
The acoustic model learning device of the second embodiment further includes an intention class classification information storage unit 110, as illustrated in FIG. 8.

The intention information vector creation unit 23 performs the same processing as the intention information vector creation unit 15 of the second embodiment.

That is, the intention information vector creation unit 23 of the second embodiment extracts an intention information vector from the intention information d_i of each utterance and the intention class classification information read from the intention class classification information storage unit 110, and outputs the intention information vector v_i.

The other processing of the speech synthesizer and method is the same as in the first embodiment.

[Third embodiment]
In the second embodiment, high expressiveness of spoken intention could be achieved if the intention classes and the parameters of the acoustic model (for example, a DNN acoustic model) that maximize the likelihood of the acoustic model could be learned. In the model learning of the second embodiment, the first-stage intention class learning unit determines the intention class of each utterance, and the second-stage acoustic model learning unit then uses those intention classes to determine the acoustic model parameters that maximize the likelihood of the acoustic model. However, because the intention classes and the acoustic model parameters are optimized in separate stages, the resulting intention classes and parameters may fall into a local optimum, and the likelihood of the DNN acoustic model may not become sufficiently large. As a result, the expressiveness of spoken intention may not be improved sufficiently.

The third embodiment therefore uses an algorithm that simultaneously learns the intention classes and the acoustic model parameters that maximize the likelihood of the acoustic model. This makes it possible to estimate intention classes and acoustic model parameters that achieve a larger acoustic model likelihood, further improving the expressiveness of spoken intention.

The following description focuses on the parts that differ from the first and second embodiments. Redundant description of the parts that are the same as in the first and second embodiments is omitted.

(Acoustic model learning device and method)
The acoustic model learning device of the third embodiment further includes an intention class determination unit 111 and a likelihood-criterion intention class classification information storage unit 112, as illustrated in FIG. 9.

<Acoustic model learning unit 17>
The acoustic model learning unit 17 of the third embodiment simultaneously estimates the intention class probabilities, that is, the probability that each utterance intention belongs to each intention class, and the parameters of the acoustic model. It does so from the speech data read from the speech data storage unit 12, the language feature vector data extracted by the language feature vector extraction unit 13, and the intention information vector data read from the intention information vector storage unit 16, and it outputs the acoustic model and the intention class classification information (step S17). For example, the intention class information corresponding to each piece of intention information is treated as a hidden variable, and the Generalized EM (GEM) algorithm, which applies a gradient method to the M-step of the EM algorithm, is used (see, for example, Reference 5). In the GEM algorithm, suitable initial values are given for the acoustic model parameters and the intention class probabilities, and the two are then updated alternately.

The intention class classification information obtained by the intention clustering unit 192 of the second embodiment may be used to output the same intention information vectors as in the second embodiment, which then serve as the initial values of the intention class probabilities. The initial values of the acoustic model parameters can be set to random numbers, as in Non-Patent Document 1 and similar work.

[Reference 5] Masami Miyakawa (宮川雅巳), "EM アルゴリズムとその周辺" (EM algorithm and its surroundings), 応用統計学 (Applied Statistics) 16.1, 1987, pp. 1-21.
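The alternating update can be illustrated with a deliberately small toy model. This is a sketch under strong assumptions, not the DNN training of the embodiment: a linear model y = a*x + b_m with unit-variance Gaussian noise stands in for the acoustic model, x is a scalar stand-in for the language feature, b_m is a per-class bias standing in for the effect of the intention information vector, the class prior is uniform, and all names and the learning rate are hypothetical.

```python
import math

def log_lik(samples, a, b_m):
    """Log-likelihood (up to a constant) of one intention's samples under class bias b_m."""
    return -0.5 * sum((y - a * x - b_m) ** 2 for x, y in samples)

def e_step(data, a, b):
    """Posterior p[n][m] that intention n belongs to class m, plus the data log-likelihood."""
    probs, total = [], 0.0
    for samples in data:
        ll = [log_lik(samples, a, b_m) for b_m in b]
        top = max(ll)  # subtract the maximum for numerical stability
        w = [math.exp(v - top) for v in ll]
        z = sum(w)
        probs.append([v / z for v in w])
        total += top + math.log(z / len(b))  # uniform class prior (assumed)
    return probs, total

def m_step(data, probs, a, b, lr):
    """Generalized M-step: one gradient-ascent step on the expected log-likelihood."""
    count = sum(len(samples) for samples in data)
    grad_a, grad_b = 0.0, [0.0] * len(b)
    for samples, p in zip(data, probs):
        for m, b_m in enumerate(b):
            for x, y in samples:
                r = y - a * x - b_m  # residual under class m
                grad_a += p[m] * r * x
                grad_b[m] += p[m] * r
    a = a + lr * grad_a / count
    b = [b_m + lr * g / count for b_m, g in zip(b, grad_b)]
    return a, b

def gem(data, init_probs, num_iters=300, lr=0.2):
    """Alternate gradient M-steps and E-steps from the given initial class probabilities."""
    a, b = 0.0, [0.0] * len(init_probs[0])
    probs, history = init_probs, []
    for _ in range(num_iters):
        a, b = m_step(data, probs, a, b, lr)
        probs, ll = e_step(data, a, b)
        history.append(ll)  # log-likelihood after each iteration
    return a, b, probs, history
```

Here `data[n]` holds the (x, y) pairs of all utterances of intention n, and `init_probs` could be a softened version of the 1-of-K vectors from the second embodiment. With a sufficiently small learning rate each M-step does not decrease the expected log-likelihood, which is the property that distinguishes GEM from a full M-step maximization.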

In this way, the acoustic model learning unit 17 performs acoustic model learning using the language feature vector of each context data extracted by the language feature vector extraction unit 13, the speech data corresponding to each context data, and the intention information vector of each context data read from the intention information vector storage unit 16, starting from predetermined initial values of the probability that each utterance intention belongs to each intention class. It thereby generates the acoustic model and the probability that each utterance intention belongs to each intention class.

The generated acoustic model is stored in the acoustic model storage unit 18. The generated probabilities that each utterance intention belongs to each intention class are output to the intention class determination unit 111.

<Intention class determination unit 111>
The intention class determination unit 111 determines the likelihood-criterion intention class classification information from the intention class probabilities (step S111; see FIG. 11). For example, for each intention information c_n (n = 1, 2, ..., N, where N is the total number of intentions), it outputs the index i_n = argmax_m p_nm of the intention class with the highest intention class probability, and holds the indices as list-format data I = [i_1, i_2, ..., i_N].

The intention class probability p_nm is the probability that intention information c_n (n = 1, 2, ..., N, where N is the total number of intentions) belongs to intention class information e_m (m = 1, 2, ..., M, where M is the total number of intention classes).

The determined likelihood-criterion intention class classification information is stored in the likelihood-criterion intention class classification information storage unit 112.

As described above, the likelihood-criterion intention class classification information is determined from the intention class probabilities by the intention class determination. As with the intention class classification information, when each intention information c_n (n = 1, ..., N, where N is the total number of intentions) is assigned to intention class information e_in (1 ≤ i_n ≤ M, where M is the total number of intention classes), the indices are held, for example, as list-format data I = [i_1, i_2, ..., i_N].

In this way, for each utterance intention, the intention class determination unit 111 determines the intention class with the highest probability of containing that utterance intention as the intention class to which the utterance intention belongs.
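The class decision i_n = argmax_m p_nm can be sketched in a few lines. The function name is hypothetical, and `class_probs[n][m]` is assumed to hold p_nm with 0-based list positions, while the returned indices are 1-based as in the text.

```python
def determine_intention_classes(class_probs):
    """Return I = [i_1, ..., i_N], where i_n is the 1-based index of the
    most probable intention class for intention c_n."""
    return [
        max(range(len(p)), key=lambda m: p[m]) + 1  # argmax over classes m
        for p in class_probs
    ]
```

When two classes tie, this sketch picks the lower index; the text does not specify a tie-breaking rule.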

(Speech synthesizer and method)
The speech synthesizer of the third embodiment includes a likelihood-criterion intention class classification information storage unit 112, as illustrated in FIG. 10.

<Intention information vector creation unit 23>
The intention information vector creation unit 23 of the third embodiment extracts the intention information vector using the likelihood-criterion intention class classification information read from the likelihood-criterion intention class classification information storage unit 112, instead of the intention class classification information read from the intention class classification information storage unit 110 (step S23).

[Fourth embodiment]
When an algorithm that depends on its initial values, such as the GEM algorithm, is used for the acoustic model and intention class learning of the third embodiment, it is preferable to set appropriate initial values in order to make the likelihood of the acoustic model (for example, a DNN acoustic model) sufficiently large and to sufficiently improve the expressiveness of spoken intention.

The fourth embodiment therefore alternates between two operations: setting the initial values of the intention class probabilities based on the likelihood-criterion intention class reclassification information, and updating the likelihood-criterion intention class reclassification information through intention class probability calculation and acoustic model learning. The reclassification information obtained at each iteration has been learned under a criterion that maximizes the likelihood of an acoustic model. Setting it as the initial value of the intention class probabilities and running the intention class probability calculation and acoustic model learning again is therefore expected to yield an acoustic model with an even larger likelihood. Repeating these two operations can thus raise the likelihood of the acoustic model iteratively, which further improves the expressiveness of spoken intention.

The following description focuses on the parts that differ from the third embodiment. Redundant description of the parts that are the same as in the third embodiment is omitted.

(Acoustic model learning device and method)
<Acoustic model learning unit 17>
The acoustic model learning unit 17 of the fourth embodiment performs acoustic model learning using the language feature vector of each context data extracted by the language feature vector extraction unit 13, the speech data corresponding to each context data, and the intention information vector of each context data read from the intention information vector storage unit 16, starting from predetermined initial values of the probability that each utterance intention belongs to each intention class. It thereby generates the acoustic model and the probability that each utterance intention belongs to each intention class (step S17).

The generated probabilities that each utterance intention belongs to each intention class are output to the intention class determination unit 111. The intention class determination unit 111 determines the intention class to which each utterance intention belongs by the same method as described in the third embodiment.

In the fourth embodiment, the processing of the acoustic model learning unit 17 and the intention class determination unit 111 is repeated. In each repetition, the predetermined initial values used by the acoustic model learning unit 17 are set so that the probability that an utterance intention belongs to the intention class determined for it by the intention class determination unit 111 is 1, and the probability that it belongs to any other intention class is 0.

The acoustic model likelihood data s records the likelihood of the acoustic model at each step in which the acoustic model learning unit 17 repeats the setting of the initial values of the intention class probabilities and the acoustic model learning. For example, when the likelihood of the acoustic model at the j-th step is s_j, the acoustic model likelihood data is expressed as s = [s_1, s_2, ..., s_J] (J is the total number of steps).

To carry out the repeated processing, the acoustic model learning unit 17 first initializes the acoustic model likelihood data s, for example.

The acoustic model learning unit 17 then calculates the acoustic model likelihood from the speech data read from the speech data storage unit 12, the language feature vectors extracted by the language feature vector extraction unit 13, the learned acoustic model, and the intention class probabilities, and updates the acoustic model likelihood data s. For example, it appends the acoustic model likelihood to the end of the list-format data s.

Based on the acoustic model likelihood data s, the acoustic model learning unit 17 outputs either 'end' or 'not end' for the learning steps. As the criterion, it is possible to use whether the number of learning iterations has reached a predetermined value, whether the difference between the last two entries of the list-format acoustic model likelihood data s is below a threshold s_th (s_j - s_{j-1} < s_th), or a combination of these.

As illustrated in FIG. 12, the processing of the acoustic model learning unit 17 and the intention class determination unit 111 is repeated until the end determination of the acoustic model learning unit 17 yields 'end'. When the end determination yields 'end', the most recently generated acoustic model is stored in the acoustic model storage unit 18 as the final acoustic model.
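The outer loop of the fourth embodiment can be sketched as follows. The `train` argument is a hypothetical callable standing in for one run of the acoustic model learning unit 17: it takes initial class probabilities and returns a model, the estimated class probabilities, and the acoustic model likelihood. The function names, default step limit, and default threshold are assumptions for the example.

```python
def one_hot_probs(class_indices, num_classes):
    """Probability 1 for the determined class (1-based index), 0 for all others."""
    return [
        [1.0 if m == i else 0.0 for m in range(1, num_classes + 1)]
        for i in class_indices
    ]

def should_stop(s, max_steps, s_th):
    """End when the step limit is reached or the last likelihood gain is below s_th."""
    if len(s) >= max_steps:
        return True
    return len(s) >= 2 and s[-1] - s[-2] < s_th

def iterate_training(train, init_probs, num_classes, max_steps=10, s_th=1e-3):
    """Repeat learning and class determination until the end criterion is met."""
    s = []  # acoustic model likelihood data, one entry per step
    probs = init_probs
    while True:
        model, class_probs, likelihood = train(probs)
        s.append(likelihood)
        # Intention class determination: most probable class per intention.
        classes = [max(range(len(p)), key=lambda m: p[m]) + 1 for p in class_probs]
        if should_stop(s, max_steps, s_th):
            return model, classes, s
        # Use the determined classes as the next initial probabilities.
        probs = one_hot_probs(classes, num_classes)
```

This sketch combines both criteria with a logical OR, which is one of the combinations the text allows. The model returned is the one from the last step, matching the statement that the most recently generated acoustic model is kept.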

(Speech synthesizer and method)
The speech synthesizer and method of the fourth embodiment are the same as those of the third embodiment, so redundant description is omitted here.

[Program and recording medium]
When each process in the acoustic model learning device or the speech synthesizer is implemented by a computer, the processing content of the functions that the device should have is described by a program. Executing this program on a computer implements each of those processes on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

[Modifications]
The processes described for the acoustic model learning device, the speech synthesizer, and their methods need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

For example, the process of step S15 may be performed before the process of step S13, or the processes of steps S13 and S15 may be performed in parallel. Likewise, the process of step S23 may be performed before the processes of steps S21 and S22, or the processes of steps S21 and S22 may be performed in parallel with the process of step S23.

Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

Claims

A context data storage unit storing each context data;
Using each context data read from the context data storage unit, a language feature vector extraction unit that extracts a language feature vector of each context data;
An intention information vector storage unit storing an intention information vector representing the utterance intention of each context data;
An acoustic model learning unit that generates an acoustic model by performing acoustic model learning using the extracted language feature vector of each context data, the speech data corresponding to each context data, and the intention information vector of each context data read from the intention information vector storage unit;
An acoustic model learning device.

The acoustic model learning device according to claim 1,
Intention class information, which is information on the intention class to which each utterance intention belongs, is predetermined for each utterance intention,
The intention information vector of each context data is a vector representing the intention class information corresponding to the utterance intention of that context data.
Acoustic model learning device.

The acoustic model learning device according to claim 1,
The acoustic model learning unit includes a language feature vector of each of the extracted context data, speech data corresponding to each of the context data, and an intention information vector of each of the context data read from the intention information vector storage unit. And generating an acoustic model and a probability that each utterance intention belongs to each intention class by performing acoustic model learning based on a predetermined initial value of the probability that each utterance intention belongs to each intention class,
An intention class determining unit that determines an intention class that maximizes a probability that each utterance intention belongs to each intention class as an intention class to which each utterance intention belongs;
Acoustic model learning device.

In the acoustic model learning device according to claim 3,
The processing of the acoustic model learning unit and the intention class determination unit is repeated, using as the predetermined initial value of the probability that each utterance intention belongs to each intention class a probability that is 1 for the intention class determined by the intention class determination unit as the class to which the utterance intention belongs and 0 for every other intention class.
Acoustic model learning device.

A text analysis unit that parses input text and obtains context;
A language feature vector extraction unit for extracting a language feature vector of each context;
A speech parameter generation unit that generates speech parameters using an intention information vector representing an input utterance intention, the acoustic model generated by the acoustic model learning device according to claim 1, and the extracted language feature vector;
A speech waveform generation unit that generates synthesized speech using the generated speech parameters;
A speech synthesizer.

A language feature vector extraction step in which a language feature vector extraction unit extracts a language feature vector of each context data using each context data read from a context data storage unit in which each context data is stored;
An acoustic model learning step in which an acoustic model learning unit generates an acoustic model by performing acoustic model learning using the extracted language feature vector of each context data, the speech data corresponding to each context data, and the intention information vector of each context data read from an intention information vector storage unit that stores an intention information vector representing the utterance intention of each context data;
An acoustic model learning method including the above steps.

A text analysis step in which a text analysis unit analyzes input text to obtain a context;
A language feature vector extraction step in which a language feature vector extraction unit extracts a language feature vector of each context;
A speech parameter generation step in which a speech parameter generation unit generates speech parameters using an intention information vector representing an input utterance intention, the acoustic model generated by the acoustic model learning device according to claim 1, and the extracted language feature vector;
A speech waveform generation step in which a speech waveform generation unit generates synthesized speech using the generated speech parameters;
A speech synthesis method including the above steps.

A program for causing a computer to function as each unit of the acoustic model learning device according to any one of claims 1 to 4 or of the speech synthesizer according to claim 5.