JP7228998B2 - Speech synthesizer and program

Info

Publication number: JP7228998B2
Application number: JP2018227704A
Authority: JP
Inventors: Kiyoshi Kurihara (栗原清); Nobumasa Seiyama (清山信正); Tadashi Kumano (熊野正); Atsushi Imai (今井篤)
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2018-08-27
Filing date: 2018-12-04
Publication date: 2023-02-27
Anticipated expiration: 2038-12-04
Also published as: JP2020034883A

Description

Application of Article 30, Paragraph 2 of the Patent Act. Publication: CD-ROM containing the proceedings of the Acoustical Society of Japan 2018 Autumn Meeting, issued August 29, 2018. Meeting: Acoustical Society of Japan 2018 Autumn Meeting, held September 12, 2018.

The present invention relates to a speech synthesizer and a program.

In recent years, advances in speech synthesis based on statistical models have produced techniques for synthesizing speech from text. For example, techniques have been developed that use a deep neural network (DNN) to learn features such as a speaker's voice and then synthesize speech from text (see, for example, Non-Patent Documents 1, 2, and 3). A technique has also been developed that estimates a mel spectrogram from a character string written in English and generates a speech waveform from that mel spectrogram (see Non-Patent Document 4).

To calculate acoustic features and synthesize speech, a conventional statistical speech synthesizer generates speech with a statistical model that uses a phoneme label file. The phoneme label file contains labels such as phonemes, phoneme durations, and parts of speech, and the labels are assigned on the basis of the acoustic features of the speech.

Non-Patent Document 1: Kiyoshi Kurihara et al., "Automatic generation of audio descriptions for sports programs", International Broadcasting Convention [IBC 2017], 2017
Non-Patent Document 2: Kiyoshi Kurihara, Nobumasa Seiyama, Atsushi Imai, Tohru Takagi, "Study of a DNN speech synthesis method capable of controlling speaker characteristics and emotional expression", IEICE General Conference, 2017, D-14-10, p. 150
Non-Patent Document 3: Hojo, Ijima, Miyazaki, "Study of DNN speech synthesis using speaker codes", Proceedings of the Acoustical Society of Japan, September 2015, pp. 215-218
Non-Patent Document 4: Shen et al., [online], February 2018, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions", arXiv:1712.05884, [retrieved July 11, 2018], Internet <URL: https://arxiv.org/pdf/1712.05884.pdf>

As described above, statistical speech synthesizers use a phoneme label file. However, because speech contains sounds that are ambiguous for acoustic analysis, it is sometimes difficult to correctly recognize the acoustic features corresponding to the phonemes that make up the speech, and the labels described above are sometimes not assigned correctly. In addition, when phoneme boundaries are difficult to determine correctly, generating a correct phoneme label file requires manual correction, which raises problems of labor cost and time cost. Furthermore, in the case of Japanese, a large amount of training data is required to cover sentences that mix kana and kanji, which arise from diverse combinations of kanji, hiragana, and katakana, and because the same character string can have more than one reading, training cannot be performed correctly. For these reasons, it is difficult to apply the technique disclosed in Non-Patent Document 4 as-is to Japanese sentences that mix kana and kanji.

The present invention has been made in view of these circumstances and provides a speech synthesizer and a program capable of synthesizing high-quality speech at low cost.

One aspect of the present invention is a speech synthesizer comprising an acoustic feature estimation unit and a vocoder unit. The acoustic feature estimation unit performs one of the following: a first estimation process of inputting first text data, in which a sentence representing Japanese utterance content is described by a character string using characters or character strings representing the reading of the utterance content, prosodic symbols representing prosody, and utterance style symbols representing a feature to be given to the utterance, into a first acoustic feature generation model that generates acoustic features from the first text data, and estimating the acoustic features of the speech corresponding to the utterance content; a second estimation process of inputting second text data, described by a character string using the characters or character strings representing the reading and the prosodic symbols, into a second acoustic feature generation model that generates acoustic features from the second text data, and estimating the acoustic features of the speech corresponding to the utterance content; or a third estimation process of inputting third text data, described by a character string using the characters or character strings representing the reading and the utterance style symbols, into a third acoustic feature generation model that generates acoustic features from the third text data, and estimating the acoustic features of the speech corresponding to the utterance content. The vocoder unit estimates a speech waveform using the acoustic features estimated by the acoustic feature estimation unit through the first, second, or third estimation process. The first, second, and third acoustic feature generation models each have an encoder and a decoder that use deep neural networks. The encoder uses a recurrent neural network to generate, for the utterance content indicated by the text data, a character-string feature that takes into account the character strings before and after that utterance content within the sentence. The decoder uses a recurrent neural network to generate the acoustic features of the speech corresponding to the utterance content indicated by the text data, based on the feature generated by the encoder and on acoustic features generated in the past.

One aspect of the present invention is the speech synthesizer described above, wherein the characters representing the reading are katakana, hiragana, alphabetic characters, or phonetic symbols, and the first, second, and third acoustic feature generation models each further have an attention network that uses a deep neural network. The attention network generates weights for weighting the feature output by the encoder, weights the feature with the generated weights, and inputs the result to the decoder. The decoder uses a recurrent neural network to generate the acoustic features of the speech corresponding to the utterance content indicated by the text data, based on the feature input from the attention network and on acoustic features generated in the past.

One aspect of the present invention is the speech synthesizer described above, wherein the prosodic symbols include any of a symbol specifying an accent position, a symbol specifying a phrase boundary, a symbol specifying sentence-final intonation, and a symbol specifying a pause.

One aspect of the present invention is the speech synthesizer described above, wherein the feature given to the utterance is an emotion, a speaking style, or a speaker.

One aspect of the present invention is the speech synthesizer described above, wherein the utterance to which the feature is given is the entire utterance of one or more sentences to which the utterance style symbol is attached at a predetermined position, the entire utterance of one or more sentences enclosed by utterance style symbols, or the utterance of one or more phrases (bunsetsu) enclosed by utterance style symbols.

One aspect of the present invention is a program for causing a computer to function as any of the speech synthesizers described above.

According to the present invention, high-quality speech can be synthesized at low cost.

FIG. 1 is a diagram showing an overview of a speech synthesizer according to a first embodiment of the present invention and a speech synthesizer according to the prior art.
FIG. 2 is a functional block diagram showing a configuration example of the speech synthesizer according to the same embodiment.
FIG. 3 is a diagram showing the prosodic symbols used in the intermediate language according to the same embodiment.
FIG. 4 is a flowchart showing the training process of the speech synthesizer according to the same embodiment.
FIG. 5 is a flowchart showing the speech synthesis process of the speech synthesizer according to the same embodiment.
FIG. 6 is a diagram showing the acoustic feature generation model and the training algorithm according to the same embodiment.
FIG. 7 is a diagram showing an example of the encoder according to the same embodiment.
FIG. 8 is a diagram showing an example of the decoder according to the same embodiment.
FIG. 9 is a diagram showing the speech synthesis algorithm using the acoustic feature generation model according to the same embodiment.
FIG. 10 is a diagram showing the results of an evaluation experiment according to the same embodiment.
FIG. 11 is a functional block diagram showing a configuration example of a speech synthesizer according to a second embodiment.
FIG. 12 is a diagram showing an overview of the speech synthesis process of the speech synthesizer according to the same embodiment.
FIG. 13 is a diagram showing the acoustic feature generation model and the training algorithm according to the same embodiment.
FIG. 14 is a diagram showing the speech synthesis algorithm using the acoustic feature generation model according to the same embodiment.
FIG. 15 is a diagram showing an example of the encoder according to the same embodiment.
FIG. 16 is a diagram showing the results of an evaluation experiment according to the same embodiment.
FIG. 17 is a diagram showing the results of an evaluation experiment according to the same embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings.

[First embodiment]
FIG. 1 is a diagram showing an overview of a speech synthesizer 1 according to this embodiment and a speech synthesizer 9 according to the prior art. In the prior-art speech synthesizer 9, a first language processing unit 91 estimates the kana (for example, katakana) notation and prosodic symbols of a Japanese sentence that mixes kana and kanji, and a second language processing unit 92 attaches labels such as phoneme labels and phoneme durations to the estimation result to generate a phoneme label file. An acoustic feature estimation unit 93 uses the manually corrected phoneme label file to estimate a frequency waveform as the acoustic feature, for example with a DNN (Deep Neural Network), and a vocoder unit 94 estimates a speech waveform from the estimated frequency waveform.

In contrast, the speech synthesizer 1 of this embodiment includes a language processing unit 41, an acoustic feature estimation unit 42, and a vocoder unit 43. The language processing unit 41 converts a Japanese sentence that mixes kana and kanji into an intermediate language that uses kana and prosodic symbols. This embodiment uses katakana as the kana, but hiragana, alphabetic characters, or phonetic symbols may be used instead. Symbols representing phonemes may also be used in place of kana. The prosodic symbols used in the intermediate language are characters that represent prosody. The acoustic feature estimation unit 42 takes text data written in the intermediate language as input and estimates acoustic features with a DNN. A mel spectrogram, for example, is used as the acoustic feature. The vocoder unit 43 estimates a speech waveform from the acoustic features using a DNN such as WaveNet. WaveNet is described, for example, in Reference 1: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499, 2016.

As described above, the speech synthesizer 1 of this embodiment does not need full-context labels that specify phonemes, phoneme positions, and the like in detail. Instead, it generates acoustic features directly with a DNN from intermediate-language text data written in katakana and characters representing prosodic symbols. The data used to train the DNN that generates acoustic features is therefore easy to create, and existing speech data, for example, becomes easier to use as training data. As a result, training can be performed on a large amount of data and the accuracy of speech synthesis can be improved while labor cost and time cost are reduced.

FIG. 2 is a functional block diagram showing a configuration example of the speech synthesizer 1 according to this embodiment; only the functional blocks relevant to this embodiment are shown. The speech synthesizer 1 includes a storage unit 20, a learning unit 30, and a speech synthesis unit 40.

The storage unit 20 stores an acoustic feature generation model 20-1 and a speech waveform generation model 20-2. The acoustic feature generation model 20-1 is a DNN that takes text data as input and outputs data representing acoustic features. The speech waveform generation model 20-2 is a DNN that takes acoustic feature data as input and outputs a speech waveform.

The learning unit 30 uses training data to update the acoustic feature generation model 20-1 stored in the storage unit 20. The training data consists of pairs of training speech data, which represents the speech waveform of an utterance, and training text data, which describes the content of that utterance in mixed kana and kanji. The learning unit 30 includes a correct acoustic feature calculation unit 31 and a model update unit 32.

The correct acoustic feature calculation unit 31 calculates acoustic features from the speech waveform of the training speech data included in the training data. The model update unit 32 updates the acoustic feature generation model 20-1 stored in the storage unit 20 based on the difference between the acoustic features that the correct acoustic feature calculation unit 31 calculated from the training speech data and the acoustic features that the speech synthesis unit 40 estimated from the training text data included in the training data.

The speech synthesis unit 40 takes as input intermediate-language text data written in katakana and prosodic symbols, runs the acoustic feature generation model 20-1, and obtains data representing the acoustic features of the speech for the utterance content. The speech synthesis unit 40 includes the language processing unit 41, the acoustic feature estimation unit 42, and the vocoder unit 43.

The language processing unit 41 converts the text data of a sentence that mixes kana and kanji into the intermediate language, which uses katakana and prosodic symbols. This conversion can be performed with existing techniques such as morphological analysis. The language processing unit 41 outputs text data representing the intermediate language to the acoustic feature estimation unit 42. The acoustic feature estimation unit 42 inputs the intermediate-language text data received from the language processing unit 41 into the acoustic feature generation model 20-1 stored in the storage unit 20, and thereby estimates the acoustic features of the utterance content described in the intermediate language. The vocoder unit 43 takes the acoustic features estimated by the acoustic feature estimation unit 42 as input and generates a speech waveform using the speech waveform generation model 20-2 stored in the storage unit 20.

During training of the acoustic feature generation model 20-1, the language processing unit 41 and the acoustic feature estimation unit 42 operate as the learning unit 30. The language processing unit 41 converts the training text data included in the training data into the intermediate language. The acoustic feature estimation unit 42 inputs the text data representing the converted intermediate language into the acoustic feature generation model 20-1, estimates the acoustic features, and outputs the estimation result to the model update unit 32.

The speech synthesizer 1 can be implemented on one or more computers. When it is implemented on multiple computers, any functional unit may be assigned to any computer. For example, the storage unit 20 and the learning unit 30 may be implemented on one or more server computers, and the speech synthesis unit 40 on a client terminal. The same functional unit may also be implemented across multiple computers.

FIG. 3 is a diagram showing the prosodic symbols used in the intermediate language of this embodiment. The prosodic symbols shown in FIG. 3 are a modified version of those described in Reference 2: Speech Input/Output Systems Standardization Committee, JEITA Standard IT-4006, Symbols for Japanese Text-to-Speech Synthesis, Japan Electronics and Information Technology Industries Association, 2010, pp. 4-10. The prosodic information falls into several types, including accent position, phrase boundaries, sentence-final intonation, and pauses. To specify the accent position, the prosodic symbol "'", which represents the accent rise position, is used; it indicates that the accent nucleus is on the mora immediately before the symbol. The prosodic symbol "_", which represents the accent fall position, may additionally be used to specify the accent position. To specify phrase boundaries, the prosodic symbol "/" represents an accent phrase boundary and the prosodic symbol "#" represents a phrase boundary. To specify sentence-final intonation, the prosodic symbol "=" represents an ordinary sentence ending and the prosodic symbol "?" represents an interrogative sentence ending. To specify a pause, the prosodic symbol "$%" is used. The phrase boundary symbols are optional and need not be used.
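The symbol inventory described for FIG. 3 can be sketched as a small lookup table plus a character-level tokenizer. This is only an illustration: the ASCII forms of the symbols, the label strings, the example sentence, and the choice to treat each of the two characters in "$%" as a pause mark are all assumptions, and `PROSODY_SYMBOLS` and `tokenize` are hypothetical names.

```python
# Symbol table following the description of FIG. 3.
# The ASCII forms and label strings are assumptions made for this sketch.
PROSODY_SYMBOLS = {
    "'": "accent_rise",
    "_": "accent_fall",
    "/": "accent_phrase_boundary",
    "#": "phrase_boundary",
    "=": "sentence_end",
    "?": "question_end",
    "$": "pause",  # the source writes the pause symbol as "$%";
    "%": "pause",  # treating each character as a pause mark is an assumption
}


def tokenize(intermediate):
    """Classify every character as a reading character or a prosodic symbol."""
    return [(PROSODY_SYMBOLS.get(ch, "kana"), ch) for ch in intermediate]


# Hypothetical intermediate-language string (accent placement is illustrative).
example = "キョ'ーワ/ハレマ'ス="
tokens = tokenize(example)
for kind, ch in tokens:
    print(kind, ch)
```

Keeping the mapping in a single table reflects the point made in the text that the particular characters are a matter of convenience: swapping in other characters only changes the table, provided the model is trained on text that uses the same substitution.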

The symbols above are assigned to these prosodic symbols for convenience. Equivalent behavior can be obtained by replacing each of them (the symbol for the accent rise position, the symbol for the accent fall position, the symbol for an accent phrase boundary, the symbol for a phrase boundary, the symbol for a sentence ending, the symbol for an interrogative sentence ending, and the symbol for a pause) with another symbol and training on the replaced symbols.

FIG. 4 is a flowchart showing the training process of the speech synthesizer 1.
First, in step S110, the speech synthesizer 1 receives training data. In step S120, the correct acoustic feature calculation unit 31 selects one item of training speech data that has not yet been selected from the training data and calculates acoustic features from the speech waveform of the selected training speech data. In step S130, the language processing unit 41 obtains from the training data the training text data that describes the utterance content of the selected training speech data, performs morphological analysis and the like, and converts the sentence representing the utterance content into the intermediate language, that is, a character string written with reading kana and prosodic symbols. The user may correct the intermediate language as needed. In step S140, the acoustic feature estimation unit 42 inputs the intermediate language data, which is the text data representing the intermediate language generated by the language processing unit 41 in step S130, into the acoustic feature generation model 20-1 read from the storage unit 20, and estimates the acoustic features.

In step S150, the model update unit 32 updates the acoustic feature generation model 20-1 stored in the storage unit 20 based on the difference between the acoustic features calculated by the correct acoustic feature calculation unit 31 in step S120 and the acoustic features estimated by the acoustic feature estimation unit 42 in step S140. Specifically, the model update unit 32 calculates this error as the mean squared error (MSE) and, so that the calculated difference becomes smaller, uses Adam, a stochastic gradient descent method, to update the weights and other parameters of the inputs to each unit (node) of the acoustic feature generation model 20-1. MSE is described, for example, in Reference 3: GitHub, Inc, [online], "Spectrogram Feature prediction network", [retrieved August 24, 2018], Internet <URL: https://github.com/Rayhane-mamah/Tacotron-2/wiki/Spectrogram-Feature-prediction-network#training>. Adam is described, for example, in Reference 4: Diederik P. Kingma, Jimmy Ba, [online], 2017, "ADAM: A Method for Stochastic Optimization", arXiv:1412.6980v9, [retrieved August 24, 2018], Internet <URL: https://arxiv.org/pdf/1412.6980.pdf>.
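To make the update in step S150 concrete, the following sketch computes a mean squared error and applies Adam updates to a single scalar weight. The one-weight toy model (`predicted = w * x`), the learning rate of 0.05, and the number of steps are assumptions chosen so the example runs quickly; the default hyperparameters in `adam_step` are those suggested in the Adam paper. The real model has many weights and its gradients come from backpropagation, which is not shown.

```python
import math


def mse(target, predicted):
    """Mean squared error between two equal-length feature vectors."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy stand-in for the model: predicted = w * x (an assumption for this sketch).
x = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]          # plays the role of the "correct" features
w, m, v = 0.0, 0.0, 0.0
initial_loss = mse(target, [w * xi for xi in x])

for t in range(1, 201):
    predicted = [w * xi for xi in x]
    # Gradient of the MSE with respect to w for the toy model.
    grad = sum(2 * (p - ti) * xi for p, ti, xi in zip(predicted, target, x)) / len(x)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)

final_loss = mse(target, [w * xi for xi in x])
print(initial_loss, final_loss)
```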

In step S160, the learning unit 30 determines whether the model update is finished. For example, it determines that the model update is finished when the mean squared error between the acoustic features calculated by the correct acoustic feature calculation unit 31 and the acoustic features estimated by the acoustic feature estimation unit 42 falls to or below a predetermined value. If the learning unit 30 determines that the model update is not finished (step S160: NO), it repeats the process from step S120. If it determines that the model update is finished (step S160: YES), it ends the training process.
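The loop formed by steps S120 to S160 can be sketched as follows. This is a simplified illustration under stated assumptions: `ToyModel` is a hypothetical stand-in that predicts one constant feature vector, the threshold value is arbitrary, and the stopping criterion is checked once per pass over the data rather than after every sample.

```python
def mean_squared_error(target, predicted):
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def train(samples, model, threshold, max_passes=1000):
    """Repeat estimation and update until the mean error is small enough.

    samples: list of (intermediate_text, correct_features) pairs.
    Returns the number of passes that were run.
    """
    for passes in range(1, max_passes + 1):
        total = 0.0
        for text, correct in samples:             # S120/S130: pick one sample
            estimated = model.estimate(text)      # S140: estimate features
            total += mean_squared_error(correct, estimated)
            model.update(text, correct)           # S150: update the model
        if total / len(samples) <= threshold:     # S160: finished?
            return passes
    return max_passes


class ToyModel:
    """Hypothetical stand-in: predicts a single constant feature vector."""

    def __init__(self, dim):
        self.bias = [0.0] * dim

    def estimate(self, text):
        return list(self.bias)

    def update(self, text, correct, rate=0.5):
        # Move the prediction halfway toward the correct features.
        self.bias = [b + rate * (c - b) for b, c in zip(self.bias, correct)]


model = ToyModel(dim=2)
passes = train([("ア'メ=", [1.0, -1.0])], model, threshold=1e-4)
print(passes, model.bias)
```

The `max_passes` cap is a practical safeguard added for the sketch; the flow in FIG. 4 as described only specifies the error-based criterion.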

FIG. 5 is a flowchart showing the speech synthesis process of the speech synthesizer 1.
First, in step S210, the speech synthesis unit 40 receives text data of a passage, written in mixed kana and kanji, that represents the utterance content. The passage may consist of one sentence or several. In step S220, the language processing unit 41 performs morphological analysis on the input text data and converts the passage representing the utterance content into the intermediate language, that is, a character string written with reading kana and prosodic symbols. The user may correct the intermediate language as needed.

In step S230, the acoustic feature estimation unit 42 inputs the intermediate language data, which is the text data representing the intermediate language generated in step S220, into the acoustic feature generation model 20-1 read from the storage unit 20, and estimates the acoustic features. In step S240, the vocoder unit 43 inputs the acoustic features generated in step S230 into the speech waveform generation model 20-2 read from the storage unit 20, and estimates the speech waveform. The vocoder unit 43 outputs the estimated speech waveform as speech data or through a speech output unit (not shown) such as a loudspeaker.
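The three stages of the synthesis flow (S220 to S240) compose naturally as functions. The sketch below shows only that composition; the three `toy_*` functions and the `READINGS` table are hypothetical stand-ins so the example runs, and none of them resembles a trained model or a real morphological analyzer.

```python
def synthesize(text, to_intermediate, estimate_features, vocode):
    """Run steps S220 to S240 on input text and return a waveform."""
    intermediate = to_intermediate(text)         # S220: kana + prosodic symbols
    features = estimate_features(intermediate)   # S230: acoustic features
    return vocode(features)                      # S240: speech waveform


# Hypothetical stand-ins; each would be replaced by the real unit.
READINGS = {"雨": "ア'メ="}


def toy_language_processing(text):
    # Stand-in for unit 41: dictionary lookup instead of morphological analysis.
    return "".join(READINGS.get(ch, ch) for ch in text)


def toy_acoustic_model(intermediate):
    # Stand-in for unit 42: one single-value "frame" per input character.
    return [[(ord(ch) % 7) / 7.0] for ch in intermediate]


def toy_vocoder(features, samples_per_frame=4):
    # Stand-in for unit 43: repeat each frame value to mimic upsampling.
    return [frame[0] for frame in features for _ in range(samples_per_frame)]


waveform = synthesize("雨", toy_language_processing, toy_acoustic_model, toy_vocoder)
print(len(waveform))
```

Passing the stages in as arguments mirrors the note in the text that functional units may run on different machines: each stage only needs the previous stage's output.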

FIG. 6 is a diagram showing the acoustic feature generation model and the training algorithm used by the speech synthesizer 1. The acoustic feature generation model 60 is described first. The acoustic feature generation model 60 shown in FIG. 6 is an example of the acoustic feature generation model 20-1 and is a DNN to which the technique of Non-Patent Document 4 is applied. The acoustic feature generation model 60 has an encoder 61 and a decoder 65. FIG. 7 is a diagram showing an example of the encoder 61, and FIG. 8 is a diagram showing an example of the decoder 65. Note that the attention network 651 of the decoder 65 is shown in FIG. 7. The encoder 61 and the decoder 65 are described below with reference to FIGS. 6 to 8.

Using a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network), the encoder 61 can generate, for the utterance content in the sentence indicated by the input intermediate-language text data, a character-string feature that takes into account the context before and after that utterance content within the sentence. Using an RNN, the decoder 65 generates the predicted acoustic features of the speech corresponding to the utterance content indicated by the input text data, one frame at a time, based on the feature generated by the encoder 61 and on the acoustic features generated in the past.
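The frame-by-frame generation described for the decoder can be illustrated with a short autoregressive loop. This sketch shows only the dependency structure (each frame depends on the encoder features and the previously generated frame); `toy_step` is a hypothetical placeholder for the recurrent network, and the all-zero initial frame and the fixed frame count are assumptions.

```python
def decode(encoder_features, step, num_frames, frame_dim):
    """Generate frames one at a time, feeding each frame back as input."""
    previous = [0.0] * frame_dim     # all-zero initial frame (an assumption)
    frames = []
    for _ in range(num_frames):
        frame = step(encoder_features, previous)
        frames.append(frame)
        previous = frame             # the next step sees this frame
    return frames


def toy_step(encoder_features, previous):
    """Placeholder for the RNN: blend the previous frame with the feature mean."""
    dim = len(previous)
    count = len(encoder_features)
    mean = [sum(f[d] for f in encoder_features) / count for d in range(dim)]
    return [0.5 * p + 0.5 * m for p, m in zip(previous, mean)]


frames = decode([[2.0, 4.0], [4.0, 8.0]], toy_step, num_frames=3, frame_dim=2)
for frame in frames:
    print(frame)
```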

エンコーダ６１は、文字列変換処理６１１と、畳み込みネットワーク６１２と、双方向ＬＳＴＭネットワーク６１３とにより構成される。文字列変換処理６１１では、中間言語の記述に用いられている各文字を数値に変換し、中間言語をベクトル表現に変換する。 The encoder 61 is composed of a character string conversion process 611, a convolutional network 612, and a bidirectional LSTM network 613. The character string conversion process 611 converts each character used in the description of the intermediate language into a numerical value, thereby converting the intermediate language into a vector representation.
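
The character string conversion process 611 can be illustrated with a minimal Python sketch. The function names `build_vocab` and `encode`, the reserved `<pad>` and `<unk>` entries, and the sample strings (including the apostrophe standing in for a prosodic symbol) are assumptions made for illustration and are not taken from the specification. The sketch stops at integer IDs; mapping each ID to a vector, for example through a learned embedding table, is likewise an assumption.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct character (0 = padding, 1 = unknown)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert an intermediate-language string into a list of integer IDs."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]


# Hypothetical intermediate-language strings (katakana plus an accent mark).
corpus = ["コンニチワ'", "オハヨ'ー"]
vocab = build_vocab(corpus)
ids = encode("コンニチワ'", vocab)
```

Working at the character level keeps the inventory small, which matches the stated aim of limiting the kinds of characters used as input.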

畳み込みネットワーク６１２は、複数層（例えば、３層）の畳み込みレイヤが接続されたニューラルネットワークである。各畳み込みレイヤでは、中間言語のベクトル表現に対して、所定の文字数に相当する大きさの複数のフィルタにより畳み込み処理を行い、さらに、バッチ正規化及びＲｅＬＵ（Rectified Linear Units）活性化を行う。これにより、発話内容の文脈がモデル化される。例えば、３層の畳み込みレイヤのフィルタサイズは［５，０，０］、フィルタの数は５１２である。デコーダ６５に入力する文字列の特徴量を生成するために、畳み込みネットワーク６１２の出力が双方向ＬＳＴＭネットワーク６１３に入力される。双方向ＬＳＴＭネットワーク６１３は、５１２ユニット（各方向に２５６ユニット）の単一の双方向ＬＳＴＭである。双方向ＬＳＴＭネットワーク６１３により、入力されたテキストデータに記述された文章内における前後の文脈を考慮した文字列の特徴量を生成することが可能となる。ＬＳＴＭは、ＲＮＮ（Recurrent Neural Network）の一つである。 The convolutional network 612 is a neural network in which multiple (eg, three) convolutional layers are connected. Each convolution layer performs convolution processing on the vector representation of the intermediate language using a plurality of filters having a size corresponding to a predetermined number of characters, and further performs batch normalization and ReLU (Rectified Linear Units) activation. This models the context of the utterance content. For example, the filter size of the three convolution layers is [5, 0, 0] and the number of filters is 512. The output of the convolutional network 612 is input to a bi-directional LSTM network 613 in order to generate character string features for input to the decoder 65 . Bidirectional LSTM network 613 is a single bidirectional LSTM of 512 units (256 units in each direction). The bi-directional LSTM network 613 makes it possible to generate character string feature amounts that take into consideration the contexts before and after the sentences described in the input text data. LSTM is one of RNNs (Recurrent Neural Networks).
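
One convolution layer of the kind described above can be sketched in plain Python as follows. This is a toy under stated assumptions: the leading 5 of the stated filter size is read as a width of five characters, zero padding keeps the sequence length unchanged, batch normalization is omitted, and the helper names `conv1d_same` and `relu` as well as the tiny sizes (one input channel, two filters instead of 512) are hypothetical.

```python
def conv1d_same(seq, kernels, bias):
    """1-D convolution over the character axis with zero padding.

    seq:     list of T vectors, each with C_in values
    kernels: list of C_out filters; each filter is K taps of C_in weights
    bias:    list of C_out bias values
    Returns a list of T vectors, each with C_out values.
    """
    k = len(kernels[0])
    pad = k // 2
    zero = [0.0] * len(seq[0])
    padded = [zero] * pad + list(seq) + [zero] * pad
    out = []
    for t in range(len(seq)):
        window = padded[t:t + k]  # K neighbouring characters centred on t
        frame = []
        for f, filt in enumerate(kernels):
            acc = bias[f]
            for tap, vec in zip(filt, window):
                acc += sum(w * x for w, x in zip(tap, vec))
            frame.append(acc)
        out.append(frame)
    return out


def relu(frame):
    """Rectified linear unit: negative values become zero."""
    return [max(0.0, v) for v in frame]


# Toy input: 4 characters, 1 channel; two filters of width 5.
seq = [[1.0], [2.0], [3.0], [4.0]]
kernels = [[[1.0]] * 5, [[-1.0]] * 5]
bias = [0.0, 0.0]
features = [relu(frame) for frame in conv1d_same(seq, kernels, bias)]
```

Each output position mixes information from two characters on either side, which is how the layer models local context before the bidirectional LSTM adds sentence-wide context.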

デコーダ６５は、自己回帰ＲＮＮである。デコーダ６５は、アテンションネットワーク６５１と、前処理ネットワーク６５２と、ＬＳＴＭネットワーク６５３と、第一線形変換処理６５４と、後処理ネットワーク６５５と、加算処理６５６と、第二線形変換処理６５７とにより構成される。 The decoder 65 is an autoregressive RNN. The decoder 65 is composed of an attention network 651, a preprocessing network 652, an LSTM network 653, a first linear transformation process 654, a post-processing network 655, an addition process 656, and a second linear transformation process 657.

アテンションネットワーク６５１は、自己回帰ＲＮＮにアテンション機能を追加したネットワークであり、エンコーダ６１からの出力全体を１フレームごとに要約した固定長のコンテキストベクトルを出力する。アテンションネットワーク６５１は、双方向ＬＳＴＭネットワーク６１３からの出力（エンコーダ出力）を入力する。フレームごとに、要約を生成するためにエンコーダ出力からデータを抽出するときの重みは、エンコーダ出力におけるデータ位置に応じて異なっている。アテンションネットワーク６５１は、エンコーダ出力から抽出したデータに、前のデコードのタイミングで生成したコンテキストベクトルを用いて特徴を追加したデータを用いて、今回のフレームの出力となるコンテキストベクトル（アテンションネットワーク出力）を生成する。 The attention network 651 is a network obtained by adding an attention function to the autoregressive RNN, and outputs, for each frame, a fixed-length context vector summarizing the entire output from the encoder 61. The attention network 651 receives the output of the bidirectional LSTM network 613 (the encoder output). For each frame, the weights used when extracting data from the encoder output to generate the summary differ according to the data position in the encoder output. The attention network 651 generates the context vector to be output for the current frame (the attention network output) using data obtained by adding, to the data extracted from the encoder output, features derived from the context vector generated at the previous decoding step.
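
The idea of summarizing the whole encoder output into one fixed-length context vector with position-dependent weights can be sketched as follows. The sketch uses plain dot-product scoring as a stand-in; it does not reproduce the attention mechanism of Non-Patent Document 4, which also uses information from the previous decoding step. The names `softmax` and `context_vector` and the toy values are assumptions.

```python
import math


def softmax(scores):
    """Normalize scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(encoder_outputs, query):
    """Weighted sum of encoder outputs; one weight per input position."""
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [
        sum(w * vec[d] for w, vec in zip(weights, encoder_outputs))
        for d in range(dim)
    ]
    return weights, context


# Toy encoder output for three characters, and a decoder-side query vector.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [3.0, 0.0]
weights, context = context_vector(encoder_outputs, query)
```

Whatever the input length, the context vector has the same dimension as one encoder output, which is what allows the decoder to consume it frame by frame.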

前処理ネットワーク６５２は、前回の時間ステップにおいて第一線形変換処理６５４が出力したデータを入力する。前処理ネットワーク６５２は、それぞれ２５６個の隠れＲｅＬＵユニットからなる完全結合された複数（例えば２つ）のレイヤを含んだニューラルネットワークである。ＲｅＬＵユニットからなるレイヤは、各ユニットの値がゼロよりも小さい場合はゼロを出力し、ゼロよりも大きい場合はそのままの値を出力する。ＬＳＴＭネットワーク６５３は、１０２４ユニットを有する複数（例えば、２層）の一方向ＬＳＴＭが結合されたニューラルネットワークであり、前処理ネットワーク６５２からの出力と、アテンションネットワーク６５１からの出力を結合したデータを入力する。フレームの音響特徴量は、前のフレームの音響特徴量の影響を受けるため、アテンションネットワーク６５１から出力された現在のフレームの特徴量に、前処理ネットワーク６５２からの出力を結合することにより、前のフレームの音響特徴量に基づく特徴を付加している。（詳細は非特許文献４を参照されたい。） The preprocessing network 652 receives the data output by the first linear transformation process 654 at the previous time step. The preprocessing network 652 is a neural network containing multiple (for example, two) fully connected layers of 256 hidden ReLU units each. A layer of ReLU units outputs zero for each unit whose value is less than zero and outputs the value unchanged when it is greater than zero. The LSTM network 653 is a neural network in which multiple (for example, two) layers of unidirectional LSTMs having 1024 units are connected, and it receives the concatenation of the output of the preprocessing network 652 and the output of the attention network 651. Because the acoustic feature of a frame is affected by the acoustic feature of the preceding frame, concatenating the output of the preprocessing network 652 with the current-frame feature output from the attention network 651 adds a feature based on the acoustic feature of the preceding frame. (See Non-Patent Document 4 for details.)

第一線形変換処理６５４は、ＬＳＴＭネットワーク６５３から出力されたデータを線形変換し、１フレーム分のメルスペクトログラムのデータであるコンテキストベクトルを生成する。第一線形変換処理６５４は、生成したコンテキストベクトルを、前処理ネットワーク６５２、後処理ネットワーク６５５及び加算処理６５６に出力する。 A first linear transformation process 654 linearly transforms the data output from the LSTM network 653 to generate a context vector, which is mel-spectrogram data for one frame. First linear transformation process 654 outputs the generated context vector to pre-processing network 652 , post-processing network 655 and summation process 656 .

後処理ネットワーク６５５は、複数層（例えば、５層）の畳み込みネットワークを結合したニューラルネットワークである。例えば、５層の畳み込みネットワークは、フィルタサイズが［５，０，０］、フィルタの数は１０２４である。各畳み込みネットワークでは、畳み込み処理及びバッチ正規化と、最後の層を除いてtanh活性化とを行う。後処理ネットワーク６５５からの出力は、波長変換後の全体的な品質を改善するために用いられる。加算処理６５６では、第一線形変換処理６５４が生成したコンテキストベクトルと、後処理ネットワーク６５５からの出力とを加算する。 The post-processing network 655 is a neural network that combines multiple layers (eg, 5 layers) of convolutional networks. For example, a 5-layer convolutional network has a filter size of [5,0,0] and 1024 filters. Each convolutional network performs convolution processing and batch normalization and tanh activations except for the last layer. The output from post-processing network 655 is used to improve the overall quality after wavelength conversion. Addition operation 656 adds the context vector generated by first linear transformation operation 654 and the output from post-processing network 655 .

上記のスペクトログラムフレーム予測と並行して、第二線形変換処理６５７では、ＬＳＴＭネットワーク６５３の出力とアテンションコンテキストとの連結をスカラに投影したのちシグモイド活性化を行って、出力シーケンスが完了したかの判定に用いるストップトークン（Stop Token）を出力する。 In parallel with the spectrogram frame prediction described above, the second linear transformation process 657 projects the concatenation of the output of the LSTM network 653 and the attention context onto a scalar and then applies a sigmoid activation, outputting a stop token used to determine whether the output sequence is complete.
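
The stop-token computation and its role in ending frame-by-frame generation can be sketched as follows. The threshold of 0.5, the frame limit, the function names `stop_probability` and `generate`, and the toy decoder step are assumptions introduced for illustration; the specification states only that a scalar projection followed by a sigmoid yields the stop token.

```python
import math


def stop_probability(lstm_out, attention_context, weights, bias):
    """Project the concatenated vectors onto a scalar, then apply a sigmoid."""
    features = list(lstm_out) + list(attention_context)
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def generate(step_fn, max_frames=1000, threshold=0.5):
    """Autoregressive loop: each step sees the frames generated so far."""
    frames = []
    for _ in range(max_frames):
        frame, p_stop = step_fn(frames)
        frames.append(frame)
        if p_stop > threshold:  # assumed decision rule
            break
    return frames


def toy_step(frames):
    """Hypothetical decoder step that signals completion on the third frame."""
    n = len(frames)
    return [float(n)], (0.9 if n == 2 else 0.1)


frames = generate(toy_step)
p = stop_probability([0.0, 0.0], [0.0], [1.0, 1.0, 1.0], 0.0)
```

Because the stop token is predicted per frame, the length of the output does not have to be fixed in advance.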

続いて、学習アルゴリズムについて説明する。図４に示す学習処理のステップＳ１２０において、正解音響特徴量算出部３１は、学習用音声データＡ１が示す音声波形にＦＦＴ（Fast Fourier Transform：高速フーリエ変換）を行った結果にＡＢＳ（絶対値算出処理）を行い、さらに、メルフィルタバンク処理を行ってＭＦＣＣ（Mel-Frequency Cepstrum Coefficients：メル周波数ケプストラム係数）を取得する。正解音響特徴量算出部３１は、ＭＦＣＣからメルスペクトログラムＡ２を音響特徴量として算出する。 Next, the learning algorithm will be described. In step S120 of the learning process shown in FIG. 4, the correct acoustic feature value calculation unit 31 performs an FFT (Fast Fourier Transform) on the speech waveform indicated by the learning speech data A1, applies ABS (absolute value calculation) to the result, and further performs mel filter bank processing to obtain MFCCs (Mel-Frequency Cepstrum Coefficients). The correct acoustic feature value calculation unit 31 calculates the mel-spectrogram A2 from the MFCCs as the acoustic feature.
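
The chain of Fourier transform, absolute value, and mel filter bank can be sketched for a single frame as follows. Several points are assumptions: a naive DFT replaces the FFT to keep the sketch self-contained, the Hz-to-mel formula is one common convention, the filters are triangular, windowing and log compression are omitted, and the sizes (64 samples, 8 bands) are toy values. Only the 22050 Hz sampling frequency comes from the evaluation described in this specification.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def magnitude_spectrum(frame):
    """Naive DFT followed by the absolute value (bins 0 .. n/2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_frame(frame, n_mels, sample_rate):
    """Mel-band energies of one frame."""
    mag = magnitude_spectrum(frame)
    bank = mel_filterbank(n_mels, len(frame), sample_rate)
    return [sum(w * a for w, a in zip(row, mag)) for row in bank]


# Toy frame: a sine lying exactly on DFT bin 3 (about 1034 Hz at 22050 Hz).
frame = [math.sin(2 * math.pi * 3 * t / 64) for t in range(64)]
mel = mel_frame(frame, n_mels=8, sample_rate=22050)
```

Stacking such vectors over successive frames gives a mel-spectrogram of the kind used as the training target.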

一方で、ステップＳ１４０において、音響特徴量推定部４２は、学習用テキストデータから生成された中間言語データである学習用中間言語データＢ１を音響特徴量生成モデル６０に入力し、メルスペクトログラムＢ２を推定結果として得る。ステップＳ１５０において、モデル更新部３２は、正解音響特徴量算出部３１が算出したメルスペクトログラムＡ２と、音響特徴量生成モデル６０により推定したメルスペクトログラムＢ２との差分を誤差として算出する。モデル更新部３２は、算出した誤差に基づいて、音響特徴量生成モデル６０を更新する。 Meanwhile, in step S140, the acoustic feature value estimation unit 42 inputs the learning intermediate language data B1, which is intermediate language data generated from the learning text data, to the acoustic feature value generation model 60 and obtains a mel-spectrogram B2 as the estimation result. In step S150, the model update unit 32 calculates, as the error, the difference between the mel-spectrogram A2 calculated by the correct acoustic feature value calculation unit 31 and the mel-spectrogram B2 estimated by the acoustic feature value generation model 60. The model update unit 32 updates the acoustic feature value generation model 60 based on the calculated error.

学習部３０は、複数の学習データを用いて、学習用音声データから算出したメルスペクトログラムと、学習用中間言語データから音響特徴量生成モデル６０により推定したメルスペクトログラムとの差分が小さくなるように、音響特徴量生成モデル６０を更新する。 The learning unit 30 uses a plurality of learning data so that the difference between the mel-spectrogram calculated from the learning speech data and the mel-spectrogram estimated from the learning intermediate language data by the acoustic feature value generation model 60 becomes small. The acoustic feature value generation model 60 is updated.
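
The error between the reference mel-spectrogram and the estimated one can be sketched as follows. The specification speaks only of a "difference"; the mean squared error used here is an assumed choice of error measure, and the function name `mel_error` and the toy values are hypothetical.

```python
def mel_error(reference, estimate):
    """Mean squared error between two mel-spectrograms (lists of frames)."""
    if len(reference) != len(estimate):
        raise ValueError("both spectrograms must have the same number of frames")
    total = 0.0
    count = 0
    for ref_frame, est_frame in zip(reference, estimate):
        for r, e in zip(ref_frame, est_frame):
            total += (r - e) ** 2
            count += 1
    return total / count


# Toy spectrograms: 2 frames x 2 mel bands.
a2 = [[1.0, 2.0], [3.0, 4.0]]   # computed from the learning speech data
b2 = [[1.0, 2.0], [3.0, 6.0]]   # estimated by the model
error = mel_error(a2, b2)
```

In training, this scalar would be minimized over many pairs of speech data and intermediate language data by updating the model parameters.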

図９は、音響特徴量生成モデル６０を用いた音声合成アルゴリズムを示す図である。図５のステップＳ２３０において、音響特徴量推定部４２は、仮名漢字混じりのテキストデータを基に生成された中間言語データＣ１を学習済みの音響特徴量生成モデル６０に入力し、フレーム毎の音響特徴量であるメルスペクトログラムＣ２を生成し、ボコーダ部４３に出力する。ステップＳ２４０において、ボコーダ部４３は、記憶部２０に記憶されている音声波形生成モデル２０－２にフレーム毎のメルスペクトログラムＣ２を入力し、時間領域波形に逆変換して音声波形Ｃ３を生成する。音声波形生成モデル２０－２には、例えば、多層の畳み込みネットワークを利用したWaveNetを用いる。なお、この処理には、上記以外の種類のボコーダ部を用いて実現してもよい。 FIG. 9 is a diagram showing a speech synthesis algorithm using the acoustic feature value generation model 60. In step S230 of FIG. 5, the acoustic feature value estimation unit 42 inputs the intermediate language data C1, generated from text data containing a mixture of kana and kanji, to the trained acoustic feature value generation model 60, generates a mel-spectrogram C2 as the acoustic feature of each frame, and outputs it to the vocoder unit 43. In step S240, the vocoder unit 43 inputs the frame-by-frame mel-spectrogram C2 to the speech waveform generation model 20-2 stored in the storage unit 20 and inversely transforms it into a time-domain waveform to generate a speech waveform C3. For the speech waveform generation model 20-2, WaveNet, which uses a multilayer convolutional network, is used, for example. Note that this processing may also be realized using a type of vocoder other than the above.

続いて、本実施形態の音声合成装置１によるメルスペクトログラムの推定精度に関する評価実験の結果について示す。評価実験には、女性ナレーター１名が発声した１２，５１８文（１８時間）の音声コーパスを使用した。音声データはサンプリング周波数２２０５０［Hz］、１６［ビット］量子化のＰＣＭ（pulse code modulation）である。音声コーパスのうち１２，４５２文を音響特徴量生成モデルの学習に用い、残りのデータのうち無作為に抽出した１０文を評価実験に用いた。学習回数は５３５，０００回である。 Next, the results of evaluation experiments on the accuracy of mel-spectrogram estimation by the speech synthesizer 1 of this embodiment will be described. A speech corpus of 12,518 sentences (18 hours) uttered by one female narrator was used for the evaluation experiment. Audio data is PCM (pulse code modulation) with a sampling frequency of 22050 [Hz] and quantization of 16 [bits]. 12,452 sentences from the speech corpus were used for learning the acoustic feature value generation model, and 10 sentences randomly extracted from the remaining data were used for the evaluation experiment. The number of times of learning is 535,000.

被験者への音声刺激には、４種類×１０文を用いた。この４種類は、仮名及び韻律記号により記述された中間言語データを入力に用いて音声合成装置１が生成した合成音声（本実施形態）、従来技術により原音声を分析合成した音声（分析合成）、仮名のみを入力データとして音声合成装置１が生成した合成音声（仮名のみ）、及び、原音声である。 For the speech stimuli presented to the subjects, four types × 10 sentences were used. The four types are: synthesized speech generated by the speech synthesizer 1 with intermediate language data written in kana and prosodic symbols as input (this embodiment); speech obtained by analysis-synthesis of the original speech with a conventional technique (analysis-synthesis); synthesized speech generated by the speech synthesizer 1 with kana alone as input data (kana only); and the original speech.

被験者は音声研究専門家６人である。各被験者は、ヘッドホンにより各自が聞き取りやすい音量で音声刺激を聴取し、評定を行った。被験者はランダムに提示された音声刺激に対して総合的な音質に関する５段階評価を行った。被験者全員の評価結果から平均オピニオン評点（ＭＯＳ）を求めた。 The subjects were six speech research professionals. Each subject listened to the speech stimuli at a volume that was easy for each subject to hear through headphones and rated them. Subjects rated the overall sound quality on a 5-point scale for randomly presented speech stimuli. A mean opinion score (MOS) was obtained from the evaluation results of all subjects.

図１０は、評価実験の結果を示す図である。図１０では、ＭＯＳ値と９５%信頼区間とを示している。本実施形態の音声合成装置１により合成された音声は、原音声より劣るものの、分析合成と同程度の品質であり、仮名のみを入力データに用いるよりも高く評価された。これは、韻律記号が有効に機能したものと考えられる。 FIG. 10 is a diagram showing the results of evaluation experiments. FIG. 10 shows MOS values and 95% confidence intervals. The speech synthesized by the speech synthesizer 1 of this embodiment is inferior to the original speech, but has a quality comparable to that of the analysis synthesis, and was evaluated higher than using only kana as input data. This is thought to be due to the effective functioning of prosody symbols.
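
A mean opinion score with a 95% confidence interval of the kind plotted in FIG. 10 could be computed as sketched below. The ratings in the example are made-up values for illustration and are not the results of the evaluation experiment; the normal approximation with z = 1.96 and the function name `mos_with_ci` are also assumptions, since the specification does not state how the interval was calculated.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return the mean score and an approximate 95% confidence interval."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - half_width, mean + half_width)


# Hypothetical 5-point ratings from six listeners for one stimulus type.
ratings = [4, 5, 4, 3, 4, 5]
mos, (low, high) = mos_with_ci(ratings)
```

When the intervals of two stimulus types overlap substantially, as described for this embodiment and analysis-synthesis, the two can be regarded as comparable in quality.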

本実施形態の音声合成装置１によれば、仮名と韻律記号とを用いて記述された中間言語のテキストデータから直接音響特徴量を生成し、また、その生成に用いられるモデルを学習できる。本実施形態では、日本語の音声表現の多様性と正確性を担保しつつ、入力に用いる文字列の種類を限定する事で、End-to-End音声合成に適した入力表現を得られる。日本語の漢字は、読み方が複数あることから、その文字列が必ずしも音声と一致しないが、本実施形態の音声合成装置１は、中間言語に仮名を用いることにより、日本語の正確性を担保しつつ自然な音声を合成でき、アクセントの位置やポーズ位置についても制御する事ができる。 According to the speech synthesizer 1 of this embodiment, acoustic features can be generated directly from intermediate-language text data written using kana and prosodic symbols, and the model used for that generation can be trained. In this embodiment, an input representation suitable for end-to-end speech synthesis is obtained by limiting the kinds of characters used for input while ensuring the diversity and accuracy of Japanese speech expression. Since Japanese kanji have multiple readings, a character string containing them does not necessarily correspond to the speech; however, by using kana in the intermediate language, the speech synthesizer 1 of this embodiment can synthesize natural speech while ensuring accuracy for Japanese, and can also control accent positions and pause positions.

上述した実施形態では、発話内容を表す文章を当該発話内容の仮名と韻律を表す韻律記号とを用いた文字列により記述した中間言語データを言語処理部４１において生成しているが、このような中間言語データを人手で生成してもよい。この場合、音声合成装置１は、言語処理部４１を備えなくてもよい。 In the above-described embodiment, the language processing unit 41 generates the intermediate language data in which sentences representing the content of the utterance are described by a character string using the kana of the content of the utterance and prosody symbols representing the prosody. Intermediate language data may be generated manually. In this case, the speech synthesizer 1 does not have to include the language processing section 41 .

なお、本実施形態における日本語音声合成に用いる中間言語の表記方法は、非特許文献４に記載されたエンコーダ・デコーダモデルの音声合成手法に限定せず、他のエンコーダ・デコーダモデルにも適用可能である。例えば、参考文献５「Wei Ping et al.，[online]，2018年2月，"Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning", arXiv:1710.07654，インターネット<URL: https://arxiv.org/pdf/1710.07654.pdf>」に記載のエンコーダ・デコーダモデルに適用可能である。 Note that the intermediate-language notation used for Japanese speech synthesis in this embodiment is not limited to the encoder-decoder speech synthesis method described in Non-Patent Document 4 and is also applicable to other encoder-decoder models. For example, it is applicable to the encoder-decoder model described in Reference 5: Wei Ping et al., [online], February 2018, "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning", arXiv:1710.07654, Internet <URL: https://arxiv.org/pdf/1710.07654.pdf>.

本実施形態の音声合成装置１では、音素や音素の位置等を詳しく規定したフルコンテキストラベルが不要であるため、学習データの作成が容易である。よって、既存の音声データを学習データとして活用しやすくなる。従来法で高品質な合成音を得るには、学習データに人手で音素区切り境界を付与するなど煩雑な作業を行う必要があったが、本実施形態では音素区切り境界の情報は必要なく、自動で読み仮名と韻律記号に対する境界が決定される。そのため、従来のようなＨＴＳ準拠フルコンテキストラベルを使用する場合と比較し、１音素あたりのコストは１／３程度に削減される。さらには、作業時間も大幅に短縮できるため、大量の学習データを作成して音響特徴量生成モデルの精度を向上させることができる。 Since the speech synthesizer 1 of this embodiment does not require full-context labels that specify phonemes, phoneme positions, and the like in detail, training data is easy to create. This makes it easier to use existing speech data as training data. Obtaining high-quality synthesized speech with the conventional method required laborious work such as manually assigning phoneme boundaries to the training data; in this embodiment, phoneme boundary information is unnecessary, and the boundaries for the reading kana and prosodic symbols are determined automatically. As a result, the cost per phoneme is reduced to about 1/3 compared with the conventional use of HTS-compliant full-context labels. Furthermore, since the working time can also be greatly reduced, a large amount of training data can be created to improve the accuracy of the acoustic feature value generation model.

また、既存の表記法を活用することにより、既存のフロントエンドとの接続が容易であり、既存のシステムの利用が容易となる。また、音声合成装置１は、音素境界を事前にデータとして持っていなくても、ＨＭＭ（Hidden Markov Model、隠れマルコフモデル）等による強制アライメントを実施する事なく、中間言語のみからアライメントを実施したかのように音素を学習することができる。 In addition, using an existing notation makes it easy to connect to existing front ends and to use existing systems. Moreover, even without holding phoneme boundaries as data in advance, and without performing forced alignment using an HMM (Hidden Markov Model) or the like, the speech synthesizer 1 can learn phonemes from the intermediate language alone as if alignment had been performed.

［第２の実施形態］ [Second embodiment]
番組制作の意図に沿った放送品質の音声合成を実現するためには、番組の演出要件に応じて発話スタイルを制御することが重要である。例えば、ニュース、スポーツ実況、ドキュメンタリーなど、番組によってそれぞれ異なる発話スタイルが求められる。本実施形態では、発話全体に与える特徴を文字列で表されるタグなどの発話スタイル記号により制御可能とする。発話全体に与える特徴は、例えば、発話スタイル（実況調、ニュース調）や、感情（悲しい、うれしいなど）、話者である。以下では、第１の実施形態との差分を中心に説明する。 In order to achieve broadcast-quality speech synthesis that meets the intentions of program production, it is important to control the utterance style according to the production requirements of the program. For example, different utterance styles are required depending on the program, such as news, live sports commentary, and documentaries. In this embodiment, the features given to the entire utterance can be controlled by utterance style symbols such as tags represented by character strings. The features given to the entire utterance are, for example, utterance style (live-commentary style, news style), emotion (sad, happy, etc.), and speaker. The following description focuses on the differences from the first embodiment.

図１１は、本実施形態による音声合成装置１ａの構成例を示す機能ブロック図であり、本実施形態と関係する機能ブロックのみを抽出したものである。図１１において、図２に示す第１の実施形態による音声合成装置１と同一の部分には同一の符号を付し、その説明を省略する。音声合成装置１ａは、記憶部２０と、学習部３０と、音声合成部４０ａとを備えて構成される。 FIG. 11 is a functional block diagram showing a configuration example of the speech synthesizer 1a according to this embodiment, in which only the functional blocks related to this embodiment are extracted. In FIG. 11, parts identical to those of the speech synthesizer 1 according to the first embodiment shown in FIG. 2 are given the same reference numerals, and their description is omitted. The speech synthesizer 1a includes a storage unit 20, a learning unit 30, and a speech synthesis unit 40a.

音声合成部４０ａが、第１の実施形態の音声合成部４０と異なる点は、言語処理部４１に代えて言語処理部４１ａを備える点である。言語処理部４１ａは、言語処理部４１と同様に仮名漢字混じり文のテキストデータを、カタカナ及び韻律記号を用いた中間言語に変換する。さらに、言語処理部４１ａは、カタカナ及び韻律記号を用いた中間言語に対して、発話全体に与える特徴を表す記号を付加する。以下では、発話全体に与える特徴を表す記号を「発話スタイル記号」と記載する。発話スタイル記号には、仮名（読み方を表す文字）とは異なり、かつ、韻律記号を表す文字又は文字列とも異なる文字又は文字列を使用する。 The speech synthesis unit 40a differs from the speech synthesis unit 40 of the first embodiment in that it includes a language processing unit 41a instead of the language processing unit 41. Like the language processing unit 41, the language processing unit 41a converts text data of sentences containing a mixture of kana and kanji into an intermediate language using katakana and prosodic symbols. The language processing unit 41a further adds, to the intermediate language using katakana and prosodic symbols, symbols representing features given to the entire utterance. In the following, a symbol representing a feature given to the entire utterance is referred to as an "utterance style symbol". An utterance style symbol uses a character or character string that differs both from kana (characters representing readings) and from the characters or character strings representing prosodic symbols.

なお、音声合成装置１ａは、１台以上のコンピュータ装置により実現することができる。音声合成装置１ａが複数台のコンピュータ装置により実現される場合、いずれの機能部をいずれのコンピュータ装置により実現するかは任意とすることができる。例えば、音声合成部４０ａをクライアント端末で実現し、記憶部２０及び学習部３０を１台又は複数台のサーバコンピュータにより実現してもよい。あるいは、言語処理部４１ａをクライアント端末で実現し、他の機能部をサーバコンピュータで実現してもよい。また、同一の機能部を複数台のコンピュータ装置により実現してもよい。また、音声合成装置１ａは、図示しない表示部及び入力部を備えてもよい。 Note that the speech synthesizer 1a can be realized by one or more computer devices. When the speech synthesizing device 1a is realized by a plurality of computer devices, it is possible to arbitrarily decide which functional unit is to be realized by which computer device. For example, the speech synthesizing unit 40a may be realized by a client terminal, and the storage unit 20 and the learning unit 30 may be realized by one or more server computers. Alternatively, the language processing unit 41a may be implemented by the client terminal, and the other functional units may be implemented by the server computer. Also, the same functional unit may be realized by a plurality of computer devices. The speech synthesizer 1a may also include a display unit and an input unit (not shown).

図１２は、音声合成装置１ａによる音声合成処理の流れを示す図である。以下、図１１を併用して説明を続ける。テキストＤ１は、発話内容を表す仮名漢字混じりの文章のテキストデータであり、音声合成部４０ａに入力される。言語処理部４１ａは、テキストＤ１を形態素解析するなどしてテキストＤ２を得る。テキストＤ２は、第１の実施形態において用いられる中間言語であり、読み仮名と韻律記号とを用いた文字列である。テキストＤ２に、人手で修正を加えてもよい。続いて言語処理部４１ａは、テキストＤ２に発話スタイル記号を付加し、本実施形態における中間言語となるテキストＤ３を得る。図１２では、発話タグ「＜ｔａｇ＞」を発話スタイル記号として用いている。 FIG. 12 is a diagram showing the flow of speech synthesis processing by the speech synthesizer 1a. The description continues below with reference also to FIG. 11. The text D1 is text data of sentences containing a mixture of kana and kanji that represent the utterance content, and is input to the speech synthesis unit 40a. The language processing unit 41a obtains the text D2 by, for example, performing morphological analysis on the text D1. The text D2 is the intermediate language used in the first embodiment and is a character string using reading kana and prosodic symbols. The text D2 may be corrected manually. The language processing unit 41a then adds an utterance style symbol to the text D2 to obtain the text D3, which is the intermediate language in this embodiment. In FIG. 12, the utterance tag "<tag>" is used as the utterance style symbol.

発話スタイル記号「＜ｔａｇ＞」における「ｔａｇ」の部分には、発話全体に与える特徴の種類を表す文字列を使用可能である。発話スタイル記号を表す文字列の文字数を変えてもよい。例えば、発話全体に与える特徴が悲しい感情のときには「＜ｓａｄ＞」を使用し、ニュース調のときには「＜ｎｅｗｓ＞」を使用し、話者Ａのときには「＜ｓｐｋｅｒＡ＞」を使用する。また、図１２では、発話全体に与える特徴を付与したい文を、発話スタイル記号により囲っているが、文の先頭のみに発話スタイル記号を付与してもよい。発話スタイル記号により囲む文は一文でもよく、複数文でもよい。また、文中の文節に特徴を与える場合は、特徴を与えるその文節を発話スタイル記号により囲む。このように、特徴を与える対象の発話は、発話スタイル記号が所定位置に付加された１以上の文の発話全体、発話スタイル記号に囲まれた１以上の文の発話全体、又は、発話スタイル記号により囲まれた１以上の文節の部分の発話全体とすることができる。 A character string representing the type of feature to be given to the entire utterance can be used for the "tag" portion of the utterance style symbol "<tag>". The number of characters in the character string representing the utterance style symbol may vary. For example, "<sad>" is used when the feature given to the entire utterance is a sad emotion, "<news>" when it is a news style, and "<spkerA>" when it is speaker A. In FIG. 12, the sentence to which the feature is to be given is enclosed by utterance style symbols, but the utterance style symbol may instead be added only at the beginning of the sentence. The utterance style symbols may enclose a single sentence or multiple sentences. To give a feature to a phrase within a sentence, that phrase is enclosed by utterance style symbols. Thus, the utterance to which a feature is given can be the entire utterance of one or more sentences with an utterance style symbol added at a predetermined position, the entire utterance of one or more sentences enclosed by utterance style symbols, or the entire utterance of one or more phrases enclosed by utterance style symbols.
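
Adding an utterance style symbol to an intermediate-language string can be sketched as follows. The function name `add_style`, the `enclose` flag, and the sample katakana string are hypothetical. The sketch also assumes that the same symbol is placed at both ends when a span is enclosed; the exact form of the closing symbol is not stated in this passage.

```python
def add_style(text, tag, enclose=True):
    """Attach an utterance style symbol such as <sad> to a string.

    enclose=True  : symbol before and after the span
    enclose=False : symbol only at the beginning
    """
    symbol = f"<{tag}>"
    if enclose:
        return f"{symbol}{text}{symbol}"
    return f"{symbol}{text}"


# Hypothetical intermediate-language sentence (katakana with an accent mark).
sentence = "カナシ'ー"
enclosed = add_style(sentence, "sad")
prefixed = add_style(sentence, "news", enclose=False)
```

Because the symbol is ordinary text, a phrase inside a longer sentence can be marked in the same way by applying the function to that phrase alone before concatenation.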

ここでは、発話スタイル記号として、ＸＭＬ（extensible markup language）のように人間の可読性を重視した発話タグ「＜ｔａｇ＞」を用いているが、「＊」、「－」、「＃」などの記号やそれらの組み合わせを用いてもよい。これらの記号は、半角でも全角でもよい。 Here, as an utterance style symbol, an utterance tag "<tag>" that emphasizes human readability like XML (extensible markup language) is used. or combinations thereof may be used. These symbols may be half-width or full-width.

言語処理部４１ａは、例えば、スポーツ実況の文章など、所定の目的で使用される文章を自動生成する文章生成システムからテキストＤ１を入力してもよい。この場合、文章生成システムは、自動生成された文書を記述したテキストＤ１と、その文章の目的に応じた、発話全体に与える特徴を示す情報とを、言語処理部４１ａに入力する。 The language processing unit 41a may input the text D1 from a sentence generation system that automatically generates sentences used for a predetermined purpose, such as sports commentary sentences. In this case, the text generation system inputs text D1 describing an automatically generated document and information indicating characteristics given to the entire utterance according to the purpose of the text to the language processing unit 41a.

また、発話に与える特徴をユーザが入力してもよい。この場合、表示部（図示せず）は、テキストＤ１又はテキストＤ２と、発話全体に与える特徴に対応したアイコンの一覧（各感情に対応したアイコン、各発話スタイルに対応したアイコン、各話者に対応したアイコンなど）を表示する。ユーザは、ポインティングデバイスにより、付加したい特徴を表すアイコンを選択する。言語処理部４１ａは、選択されたアイコンに対応した発話スタイル記号を、テキストＤ２に含まれる文章の前後に付加し、テキストＤ３を生成する。なお、ユーザは、表示されているテキストＤ１又はテキストＤ２の一部の文又は文節を入力部（図示せず）により選択するようにしてもよい。言語処理部４１ａは、選択された文又は文節に対応した、テキストＤ２の部分の前後に発話スタイル記号を付加する。言語処理部４１ａは、生成したテキストＤ３を音響特徴量推定部４２に出力する。 The user may also input the features to be given to the utterance. In this case, the display unit (not shown) displays the text D1 or the text D2 together with a list of icons corresponding to features that can be given to the entire utterance (an icon for each emotion, an icon for each utterance style, an icon for each speaker, and so on). The user selects, with a pointing device, the icon representing the feature to be added. The language processing unit 41a adds the utterance style symbol corresponding to the selected icon before and after the sentences included in the text D2 to generate the text D3. The user may also select some of the sentences or phrases of the displayed text D1 or text D2 using the input unit (not shown). In that case, the language processing unit 41a adds the utterance style symbols before and after the portion of the text D2 corresponding to the selected sentence or phrase. The language processing unit 41a outputs the generated text D3 to the acoustic feature value estimation unit 42.

あるいは、ユーザは、発話スタイル記号を手動で入力してもよい。具体的には、ユーザは、表示部（図示せず）に表示されているテキストＤ２に対し、マウス等のポインティングデバイスにより発話スタイル記号の入力位置を指定する。さらに、ユーザは、キーボードなどにより、発話全体に与える特徴に応じた発話スタイル記号を入力する。 Alternatively, the user may manually enter the speech style symbols. Specifically, the user designates the input position of the utterance style symbol on the text D2 displayed on the display unit (not shown) using a pointing device such as a mouse. Further, the user inputs utterance style symbols according to the characteristics given to the entire utterance using a keyboard or the like.

音響特徴量推定部４２及びボコーダ部４３は、第１の実施形態と同様の処理を行う。すなわち、音響特徴量推定部４２は、非特許文献４、参考文献５に記載の技術等を用い、ＲＮＮのＳｅｑ２Ｓｅｑ（エンコーダ・デコーダモデル）とエンコーダの出力に対して重み付けを行うための重み（アテンション）を生成するアテンションネットワークとにより音響特徴量を推定する。エンコーダは、中間言語で記述された文字列であるテキストＤ３をベクトル化してエンコードを行う。デコーダは、エンコーダの出力に重み付けを行い、自己回帰ＲＮＮによりメルスペクトログラムの音響特徴量を生成する。ボコーダ部４３は、参考文献１に記載の技術等を用いて、音響特徴量から音声波形を推定する。 The acoustic feature value estimation unit 42 and the vocoder unit 43 perform the same processing as in the first embodiment. That is, the acoustic feature value estimation unit 42 estimates the acoustic features using the techniques described in Non-Patent Document 4, Reference 5, and the like, by means of an RNN-based Seq2Seq (encoder-decoder model) and an attention network that generates the weights (attention) used to weight the encoder output. The encoder vectorizes and encodes the text D3, a character string written in the intermediate language. The decoder weights the encoder output and generates the mel-spectrogram acoustic features with an autoregressive RNN. The vocoder unit 43 estimates the speech waveform from the acoustic features using the technique described in Reference 1 or the like.

韻律記号を用いることにより、韻律（アクセントの高低）、文末の上がり下がり、ポーズなど局所的な音響的特徴が制御可能である。一方、発話スタイル記号を用いることにより、音声合成における、発話全体や一部の口調や調子、感情、話者をコントロール可能である。発話スタイル記号を用いた中間言語により、実況調やニュース調などの番組演出に対応した音声を、少量の学習データによりモデル学習できる。また、音声合成装置１ａは、複数の特徴を単一の音響特徴量生成モデル２０－１により学習させてもよい。この場合、音声合成装置１ａは、学習させた音響特徴量生成モデル２０－１を用いて、学習に用いた特徴を有する音声を合成することができる。 By using prosodic symbols, local acoustic features such as prosody (accent pitch), rising and falling intonation at the end of a sentence, and pauses can be controlled. By using utterance style symbols, on the other hand, the manner of speaking, tone, emotion, and speaker of the whole or part of an utterance can be controlled in speech synthesis. With an intermediate language that uses utterance style symbols, a model for speech suited to program production, such as live-commentary style or news style, can be trained with a small amount of training data. The speech synthesizer 1a may also train a single acoustic feature value generation model 20-1 on a plurality of features. In this case, the speech synthesizer 1a can use the trained acoustic feature value generation model 20-1 to synthesize speech having the features used in training.

音声合成装置１ａの学習処理は、図４のフロー図が示す第１の実施形態とステップＳ１３０の処理を除いて同様である。ステップＳ１３０において、音声合成装置１ａの言語処理部４１ａは、第１の実施形態の言語処理部４１と同様に学習用テキストデータを読み仮名と韻律記号とを用いた文字列に変換する。言語処理部４１ａは、変換後の文字列に、学習用音声データの発話に与える特徴を表す発話スタイル記号を付加して中間言語を生成する。 The learning process of the speech synthesizer 1a is the same as that of the first embodiment shown in the flow chart of FIG. 4 except for the process of step S130. In step S130, the language processing unit 41a of the speech synthesizer 1a converts the learning text data into a character string using reading kana and prosodic symbols, like the language processing unit 41 of the first embodiment. The language processing unit 41a generates an intermediate language by adding, to the character string after conversion, an utterance style symbol representing a feature given to the utterance of the speech data for learning.

図１３は、音声合成装置１ａの学習アルゴリズムを示す図である。音声合成装置１ａは、第１の実施形態の音響特徴量生成モデル６０の構成を変化させることなく、発話スタイル記号を学習用中間言語データに設定するのみでスタイル制御を可能とする。例えば、悲しい音声ばかりの音声コーパスを音響特徴量生成モデル６０の学習に用いる。この音声コーパスに含まれる各音声のデータを、学習用音声データＡ４とする。音声合成装置１ａの言語処理部４１ａは、学習用音声データＡ４の発話内容を形態素解析し、形態素解析の結果を、悲しい感情を表す発話タグ「＜ｓａｄ＞」で囲って学習用中間言語データＢ４を生成する。音声合成装置１ａは、音声コーパスから得られた学習用音声データＡ４と、この学習用音声データＡ４の発話内容から生成された学習用中間言語データＢ４との対を学習データに用いて、音響特徴量生成モデル６０の学習を行う。また、音声合成装置１ａは、例えば話者Ａの音声を、発話タグ「＜ｓｐｋｅｒＡ＞」を用いて学習し、話者Ｂの音声を、発話タグ「＜ｓｐｋｅｒＢ＞」を用いて学習する。音声合成装置１ａの学習アルゴリズムは、学習用音声データＡ１と学習用中間言語データＢ１の対に代えて、学習用音声データＡ４と学習用中間言語データＢ４の対を用いること以外は、図６に示す第１の実施形態による音声合成装置１の学習アルゴリズムと同様である。 FIG. 13 is a diagram showing the learning algorithm of the speech synthesizer 1a. The speech synthesizer 1a enables style control simply by setting utterance style symbols in the learning intermediate language data, without changing the configuration of the acoustic feature value generation model 60 of the first embodiment. For example, a speech corpus consisting only of sad speech is used for training the acoustic feature value generation model 60. The data of each utterance in this speech corpus is referred to as learning speech data A4. The language processing unit 41a of the speech synthesizer 1a performs morphological analysis on the utterance content of the learning speech data A4 and encloses the result of the morphological analysis in the utterance tag "<sad>", which represents a sad emotion, to generate learning intermediate language data B4. The speech synthesizer 1a trains the acoustic feature value generation model 60 using, as training data, pairs of the learning speech data A4 obtained from the speech corpus and the learning intermediate language data B4 generated from the utterance content of that learning speech data A4. Similarly, the speech synthesizer 1a learns, for example, the voice of speaker A using the utterance tag "<spkerA>" and the voice of speaker B using the utterance tag "<spkerB>". The learning algorithm of the speech synthesizer 1a is the same as the learning algorithm of the speech synthesizer 1 according to the first embodiment shown in FIG. 6, except that pairs of the learning speech data A4 and the learning intermediate language data B4 are used instead of pairs of the learning speech data A1 and the learning intermediate language data B1.

音声合成装置１ａの音声合成処理は、図５のフロー図が示す第１の実施形態とステップＳ２２０の処理を除いて同様である。ステップＳ２２０において、言語処理部４１ａは、発話内容を表す仮名漢字混じりの文章のテキストデータを、第１の実施形態の言語処理部４１と同様に読み仮名と韻律記号とを用いた文字列に変換する。言語処理部４１ａは、変換された文字列に、所望の発話スタイルを表す発話スタイル記号を付加した中間言語を生成する。 The speech synthesis process of the speech synthesizer 1a is the same as that of the first embodiment shown in the flow chart of FIG. 5, except for the process of step S220. In step S220, the language processing unit 41a converts the text data of sentences containing a mixture of kana and kanji that represent the utterance content into a character string using reading kana and prosodic symbols, in the same manner as the language processing unit 41 of the first embodiment. The language processing unit 41a then generates the intermediate language by adding an utterance style symbol representing the desired utterance style to the converted character string.

FIG. 14 is a diagram showing the speech synthesis algorithm that uses the acoustic feature generation model 60 of the speech synthesizer 1a. The speech synthesis algorithm shown in FIG. 14 differs from that of the first embodiment shown in FIG. 9 in that intermediate language data C4 is input instead of intermediate language data C1. The intermediate language data C4 is described using utterance tags (utterance style symbols), prosody symbols, and katakana. In all other respects, the speech synthesis algorithm shown in FIG. 14 is the same as that of the first embodiment shown in FIG. 9. The acoustic feature generation model 60 is a model trained by the learning algorithm shown in FIG. 13.

FIG. 15 is a diagram showing an example of the encoder 61 of this embodiment. The intermediate language data input to the encoder 61 corresponds to the training intermediate language data B4 input in FIG. 13 in the case of learning processing, and to the intermediate language data C4 input in FIG. 14 in the case of speech synthesis processing. The character string conversion process 611 converts each character and symbol used in the intermediate language into a numerical value, thereby converting the intermediate language into a vector representation. For example, the character string conversion process 611 converts the utterance tag "<tag>" into values representing "<", "t", "a", "g", and ">", respectively. The processing after the character string conversion process 611 is the same as in the encoder 61 of the first embodiment shown in FIG. 7. The decoder 65 of this embodiment is also the same as that of the first embodiment shown in FIG. 8.
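The character-by-character conversion described for the character string conversion process 611 can be sketched as follows. This is an illustrative sketch only: the vocabulary construction, the reservation of ID 0 for padding, and the function names are assumptions and are not taken from the description.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct character in the texts."""
    symbols = sorted({ch for text in texts for ch in text})
    # Assumption: ID 0 is reserved (e.g. for padding), so IDs start at 1.
    return {ch: i + 1 for i, ch in enumerate(symbols)}


def encode(text, vocab):
    """Convert a string into a list of character IDs, one per character."""
    return [vocab[ch] for ch in text]


# Hypothetical intermediate language string with a style tag at both ends.
vocab = build_vocab(["<tag>アイ<tag>"])
ids = encode("<tag>", vocab)
print(ids)  # [1, 5, 3, 4, 2]
```

The tag is not treated as a single token here: each of "<", "t", "a", "g", and ">" receives its own ID, which matches the per-character conversion described above.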

As described above, the structure of the encoder 61 is unchanged from the first embodiment. However, the utterance style symbols (utterance tags) of the intermediate language, once converted into a vector representation by the character string conversion process 611, are convolved with nearby characters in the convolutional network 612. Furthermore, in the bidirectional LSTM network 613, the utterance style symbols influence the entire utterance. Consequently, the layer of the attention network 651 that receives the output of the encoder 61 is subject to utterance style control. The structure of the attention network 651 is also unchanged from the first embodiment. When the decoder 65 estimates acoustic features with the RNN, it can therefore reproduce speech having the same characteristics as the speech corpus associated with the utterance style symbol described in the intermediate language data. Specifically, it can reproduce speech with the characteristics of sad emotion, as in the "<sad>" speech corpus, or speech with the characteristics of speaker A's voice, as in the "<spkerA>" speech corpus.

As described above, because the encoder 61 uses the bidirectional LSTM network 613, in this embodiment the utterance style symbols are placed both before and after the sentence written in prosody symbols and katakana.

In the embodiment described above, the intermediate language data is generated by the language processing unit 41a. However, the intermediate language data may instead be created manually, or generated by a device external to the speech synthesizer 1a and input to the speech synthesizer 1a. In that case, the speech synthesizer 1a need not include the language processing unit 41a.

Next, the results of an evaluation experiment using the speech synthesizer 1a of this embodiment are described. The evaluation experiment used a speech corpus of 12,518 sentences (18 hours) uttered by one female narrator. The speech data in this corpus is classified into sports commentary (hereinafter, "commentary") with 2,596 sentences (3 hours 40 minutes), sadness with 633 sentences (50 minutes), and normal reading (hereinafter, "calm") with 9,222 sentences (13 hours). The speech data is PCM with a sampling frequency of 22,050 Hz and 16-bit quantization. The technique of Non-Patent Document 4 was used for the acoustic feature generation model 60, and the technique described in Reference 1 was used for the vocoder unit 43. The mel spectrograms used in the model learning processing and the speech synthesis processing each have 80 dimensions, with a window of 1,024 points and a frame shift of 11.6 ms.
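The analysis settings quoted above can be related to sample counts with a short calculation. This sketch assumes that the frame shift is rounded to the nearest whole number of samples; the description gives the frame shift only in milliseconds, so the hop size in samples is an inference, not a stated value.

```python
SAMPLE_RATE_HZ = 22_050   # sampling frequency from the description
WINDOW_POINTS = 1_024     # window length in samples from the description
FRAME_SHIFT_MS = 11.6     # frame shift from the description

# Assumption: the frame shift corresponds to a whole number of samples.
hop_samples = round(SAMPLE_RATE_HZ * FRAME_SHIFT_MS / 1000)

# Window duration and exact hop duration in milliseconds.
window_ms = WINDOW_POINTS / SAMPLE_RATE_HZ * 1000
hop_ms = hop_samples / SAMPLE_RATE_HZ * 1000

# Number of 80-dimensional mel frames produced per second of audio.
frames_per_second = SAMPLE_RATE_HZ / hop_samples

print(hop_samples, round(window_ms, 2), round(hop_ms, 2), round(frames_per_second, 2))
```

Under this assumption the hop is 256 samples, a quarter of the 1,024-point window, and the window spans roughly 46.4 ms.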

The acoustic feature generation model 60 was trained using training data that pairs training speech data A4, which is the speech data in the female narrator's speech corpus described above, with training intermediate language data B4 created from the mixed kana and kanji sentences of that corpus. The training intermediate language data B4 used in the experiment was generated by manually correcting the kana and prosody symbols obtained through linguistic analysis of the mixed kana and kanji sentences of the speech corpus, and then adding utterance style symbols. The number of training iterations was 310,000. The vocoder unit 43 was trained directly on mel spectrograms calculated from 12,451 sentences (18 hours) of speech data. The number of training iterations was 1,220,000.

In the evaluation experiment, intermediate language data was generated by adding three types of utterance style symbols (commentary, calm, and sadness) to the kana and prosody symbols of 10 sentences not included in the speech corpus. Thirty speech samples were synthesized by inputting the mel spectrograms that the acoustic feature estimation unit 42 estimated from this intermediate language data into the vocoder unit 43. These synthesized samples (hereinafter also referred to as "synthesized speech with utterance style"), with their volume adjusted based on the average loudness value, were used as the speech stimuli. The experiment was conducted in a soundproof room, with each subject listening through headphones at a comfortable volume. There were 13 subjects. The subjects rated the speech stimuli, which were presented in random order.

FIG. 16 is a diagram showing the DMOS (Degradation Mean Opinion Score) values on a five-point scale and the 95% confidence intervals obtained as the evaluation of how well the utterance style is reproduced in the synthesized speech with utterance style produced by this embodiment. DMOS is described, for example, in Reference 6: Nippon Telegraph and Telephone Corporation, [online], "Speech Quality Evaluation Methods, 3. Subjective Evaluation of Speech Quality, 3.2. DMOS (Degradation Mean Opinion Score)", Internet <URL: http://www.ntt.co.jp/qos/technology/sound/03_2.html>. In this experiment on whether the utterance style is reproduced, a reference speech sample (recorded speech with utterance style) and an evaluation target speech sample (synthesized speech with utterance style) produced by the speech synthesizer 1a of this embodiment were played back in succession. The similarity of their utterance styles (for example, whether the tone is sad or commentary-like) was rated on a five-point scale, and the mean ratings were compiled. So that each sentence would be rated five times for each of the three utterance styles (commentary, calm, and sadness), five reference speech samples were prepared for each of the 10 sentences not included in the speech corpus. Each of the 30 synthesized speech samples with utterance style was then combined with five reference speech samples, giving a total of 150 speech stimuli per subject. The subjects rated the similarity of utterance style for each speech stimulus on a five-point scale. As shown in FIG. 16, high reproducibility was obtained for every utterance style, and commentary was rated significantly higher than the others. There was no significant difference between sadness and calm. Commentary speech is fast and clearly articulated, and these characteristics are easier to recognize than those of calm or sad speech. The accurate reproduction of these characteristics is considered to be the reason for the higher rating.
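The stimulus count and the form of the reported statistics can be illustrated with a short sketch. The description does not state how the 95% confidence intervals were computed, so the normal approximation below (mean plus or minus 1.96 standard errors) is an assumption, and the ratings are hypothetical values used only to exercise the function; they are not experimental data.

```python
import math
import statistics


def mean_with_ci95(scores):
    """Return the mean score and the half-width of a 95% confidence interval."""
    mean = statistics.fmean(scores)
    # Assumption: normal approximation using the sample standard deviation.
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, 1.96 * standard_error


# Stimulus count from the description: 3 styles x 10 sentences x 5 references.
n_styles, n_sentences, n_references = 3, 10, 5
n_synthesized = n_styles * n_sentences       # 30 synthesized samples
n_trials = n_synthesized * n_references      # 150 ratings per subject

# Hypothetical five-point ratings, for illustration only.
example_ratings = [4, 5, 3, 4, 4]
mean, half_width = mean_with_ci95(example_ratings)
print(n_trials, mean, round(half_width, 3))
```

With 13 subjects, each utterance style would receive 13 x 50 = 650 ratings under this design, which is the sample over which a per-style mean and interval would be computed.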

FIG. 17 is a diagram showing the MOS values and 95% confidence intervals obtained as the evaluation of the naturalness of the synthesized speech with utterance style produced by this embodiment. There were 13 subjects. A total of 30 sentences of speech stimuli were used for the evaluation, 10 sentences for each of the three utterance styles (commentary, calm, and sadness). Each subject rated naturalness on a five-point scale five times per speech stimulus, for a total of 150 ratings per subject. As shown in FIG. 17, naturalness was rated highest for calm, followed by commentary and then sadness. This result is considered to reflect the amount of data available in the speech corpus for each utterance style.

The first embodiment realizes control of local acoustic features, namely prosody, and uses symbols to reproduce the acoustic characteristics of Japanese that arise from accent and cannot be expressed by reading kana alone. This embodiment realizes control of the overall acoustic features of a spoken utterance, making it possible to reproduce characteristics that extend over the entire utterance.

According to the speech synthesizer 1a of this embodiment, the emotion, utterance style, and speaker of the synthesized speech can be controlled by a simple notation, both in the training text data and in the text data input at the time of speech synthesis.

This embodiment can be applied not only to Japanese but also to other languages. In that case, characters or character strings representing the reading in that language are used instead of Japanese kana. In this embodiment, kana are used as the characters representing the reading, together with prosody symbols, in order to synthesize Japanese speech. In other languages such as English, however, the spelling of a word (its character string) may itself serve as both the reading and the prosody symbols. For such a language, it suffices to input to the acoustic feature estimation unit 42 intermediate language text data in which the sentence representing the utterance content is described using characters or character strings representing the reading and characters or character strings representing the feature to be imparted to the entire utterance.

Text data that contains kana and utterance style symbols but no prosody symbols may also be input to the acoustic feature estimation unit 42. With such an intermediate language, accuracy decreases for local, word-level features, but the feature imparted to the utterance as a whole can still be controlled accurately.

Conventionally, it was necessary either to swap the acoustic feature generation model for each feature to be imparted to the utterance, or to give the encoder an additional input for controlling switching according to that feature. According to the speech synthesizer 1a of this embodiment, by using an intermediate language in which utterance style symbols are described, a single acoustic feature generation model can learn speech with multiple features (emotion, utterance style, speaker) and can synthesize speech of arbitrary utterance content having the feature represented by an utterance style symbol used during training.

Note that the speech synthesizers 1 and 1a described above each contain a computer system. The steps of operation of the speech synthesizers 1 and 1a are stored in a computer-readable recording medium in the form of a program, and the processing described above is performed when the computer system reads and executes this program. The computer system referred to here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The "computer system" also includes a home page providing environment (or display environment) if a WWW system is used.
The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. The "computer-readable recording medium" further includes media that hold a program dynamically for a short time, such as a communication line used when a program is transmitted over a network such as the Internet or a communication line such as a telephone line, and media that hold a program for a certain period, such as the volatile memory inside a computer system acting as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

Reference Signs List
1, 1a ... Speech synthesizer
20 ... Storage unit
20-1 ... Acoustic feature generation model
20-2 ... Speech waveform generation model
30 ... Learning unit
31 ... Correct acoustic feature calculation unit
32 ... Model updating unit
40, 40a ... Speech synthesis unit
41, 41a ... Language processing unit
42 ... Acoustic feature estimation unit
43 ... Vocoder unit
60 ... Acoustic feature generation model

Claims

1. A speech synthesizer comprising:
an acoustic feature estimation unit that performs any one of: a first estimation process of inputting first text data, in which a sentence representing the content of a Japanese utterance is described by a character string using characters or character strings representing the reading of the utterance content, prosody symbols representing prosody, and utterance style symbols representing a feature to be imparted to the utterance, to a first acoustic feature generation model that generates acoustic features from the first text data, and estimating acoustic features of speech corresponding to the utterance content; a second estimation process of inputting second text data, described by a character string using the characters or character strings representing the reading and the prosody symbols, to a second acoustic feature generation model that generates acoustic features from the second text data, and estimating acoustic features of speech corresponding to the utterance content; and a third estimation process of inputting third text data, described by a character string using the characters or character strings representing the reading and the utterance style symbols, to a third acoustic feature generation model that generates acoustic features from the third text data, and estimating acoustic features of speech corresponding to the utterance content; and
a vocoder unit that estimates a speech waveform using the acoustic features estimated by the acoustic feature estimation unit through any one of the first estimation process, the second estimation process, and the third estimation process,
wherein the first acoustic feature generation model, the second acoustic feature generation model, and the third acoustic feature generation model each have an encoder and a decoder using deep neural networks,
the encoder uses a recurrent neural network to generate character string features that take into account the preceding and following character strings within the sentence of the utterance content indicated by the text data, and
the decoder uses a recurrent neural network to generate acoustic features of speech corresponding to the utterance content indicated by the text data, based on the features generated by the encoder and on previously generated acoustic features.

2. The speech synthesizer according to claim 1, wherein:
the characters representing the reading are katakana, hiragana, alphabetic characters, or phonetic symbols;
the first acoustic feature generation model, the second acoustic feature generation model, and the third acoustic feature generation model each further have an attention network using a deep neural network;
the attention network generates weights for the features output from the encoder, weights the features with the generated weights, and inputs the weighted features to the decoder; and
the decoder uses a recurrent neural network to generate acoustic features of speech corresponding to the utterance content indicated by the text data, based on the features input from the attention network and on previously generated acoustic features.

3. The speech synthesizer according to claim 1, wherein the prosody symbols include any one of a symbol designating an accent position, a symbol designating a phrase or clause boundary, a symbol designating sentence-final intonation, and a symbol designating a pause.

4. The speech synthesizer according to any one of claims 1 to 3, wherein the feature imparted to the utterance is an emotion, an utterance style, or a speaker.

5. The speech synthesizer according to any one of claims 1 to 4, wherein the utterance to which the feature is imparted is an entire utterance of one or more sentences to which the utterance style symbol is added at a predetermined position, an entire utterance of one or more sentences enclosed by the utterance style symbols, or an utterance of one or more phrases enclosed by the utterance style symbols.

6. A program for causing a computer to function as the speech synthesizer according to any one of claims 1 to 5.