JP2022081691A - Voice synthesis device and program

Info

Publication number: JP2022081691A
Application number: JP2022049374A
Authority: JP
Inventors: Nobumasa Seiyama; Kiyoshi Kurihara; Tadashi Kumano; Atsushi Imai; Toru Tsugi
Original assignee: Nippon Hoso Kyokai NHK; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2018-06-14
Filing date: 2022-03-25
Publication date: 2022-05-31
Anticipated expiration: 2038-06-14
Also published as: JP2019215468A; JP7362976B2; JP7126384B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer and a program that yield a high-quality synthesized speech signal when generating a synthesized speech signal in which the reading of a specific portion of a text is adjusted.

SOLUTION: A language analysis unit 20 of a speech synthesizer 2 linguistically analyzes the text to be synthesized to obtain linguistic features, and an adjustment amount addition unit 21 adds adjustment amount information for adjustment parameters to the linguistic features. An acoustic feature estimation unit 22 estimates acoustic features from the linguistic features with the added adjustment amount information, using a statistical model trained in advance. A voice generation unit 23 synthesizes a speech signal from the acoustic features and outputs a speech signal in which the adjustment specified by the adjustment parameters has been applied to the text.

SELECTED DRAWING: Figure 10

Description

The present invention relates to a speech synthesizer and a program that synthesize a speech signal using a statistical model for synthesizing speech signals from text.

Conventionally, techniques based on deep learning (DL) using a deep neural network (DNN) are known as a method of training a statistical model on text and the corresponding speech signals and obtaining synthesized speech for arbitrary text (see, for example, Non-Patent Document 1).

Meanwhile, a technique based on speech analysis and generation processing is known as a method of adjusting how a speech signal is read out (see, for example, Non-Patent Document 2).

FIG. 15 is an explanatory diagram showing the conventional learning method and synthesis method described in Non-Patent Document 1. A learning device implementing this learning method uses the text of a speech corpus prepared in advance and the corresponding speech signals. It extracts linguistic features from the text by language analysis processing (step S1501) and extracts acoustic features from the speech signals by speech analysis processing (step S1502).

The learning device time-aligns the linguistic features with the acoustic features (step S1503) and trains the statistical model using the linguistic features and the acoustic features (step S1504).

A speech synthesizer implementing this synthesis method receives arbitrary text and extracts linguistic features by language analysis processing of the text (step S1505). The speech synthesizer then estimates acoustic features from the linguistic features using the statistical model trained by the learning device (step S1506), and obtains a speech signal waveform from the acoustic features by speech generation processing (step S1507). This yields a synthesized speech signal corresponding to the arbitrary text.

FIG. 16 is an explanatory diagram showing the conventional speech signal adjustment method described in Non-Patent Document 2. A speech adjustment device implementing this method extracts per-frame acoustic features from a speech signal by speech analysis processing (step S1601) and, based on adjustment parameters, applies the desired adjustment to the desired portion of the acoustic features (step S1602).

The speech adjustment device generates a speech signal from the adjusted per-frame acoustic features by speech generation processing (step S1603). This yields an adjusted speech signal.

Non-Patent Document 1: Zhizheng Wu, Oliver Watts, Simon King, "Merlin: An Open Source Neural Network Speech Synthesis System", in Proc. 9th ISCA Speech Synthesis Workshop (SSW9), September 2016, Sunnyvale, CA, USA.
Non-Patent Document 2: M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications", IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877-1884, 2016.

For example, when synthesized speech signals are used to produce content such as broadcast programs, a synthesized speech signal in which the reading of a specific portion of the text is adjusted may be required for dramatic effect.

The method of Non-Patent Document 1 obtains a synthesized speech signal for arbitrary text, and always yields the same synthesized speech signal for the same text. The method of Non-Patent Document 2 adjusts how a speech signal is read out.

One conceivable way to obtain a synthesized speech signal in which the reading of a specific portion of the text is adjusted is therefore to combine Non-Patent Documents 1 and 2.

FIG. 17 is an explanatory diagram showing a hypothetical example that combines the prior art of Non-Patent Documents 1 and 2. The learning method of this hypothetical example (steps S1701 to S1704) is the same as steps S1501 to S1504 shown in FIG. 15.

The synthesis method of this hypothetical example inserts the processing of step S1602 shown in FIG. 16 into the processing of steps S1505 to S1507 shown in FIG. 15. Specifically, the speech synthesizer extracts linguistic features from arbitrary text (step S1705) and estimates acoustic features from the linguistic features using the statistical model (step S1706).

Based on the adjustment parameters, the speech synthesizer applies the desired adjustment to the desired portion of the acoustic features (step S1707). The speech synthesizer then generates a speech signal from the adjusted per-frame acoustic features by speech generation processing (step S1708). This yields a synthesized speech signal corresponding to the arbitrary text.

In this hypothetical example, however, the acoustic features estimated from the linguistic features with the statistical model in step S1706 differ from acoustic features extracted from an actual speech signal by speech analysis processing: they are temporally smoothed. Consequently, when the acoustic features estimated with the statistical model are adjusted in step S1707 and a synthesized speech signal is obtained from the adjusted per-frame acoustic features in step S1708, the sound quality of the synthesized speech signal degrades.

Thus, the hypothetical example shown in FIG. 17 has the problem that a high-quality synthesized speech signal cannot be obtained. A new method has therefore been desired for obtaining a high-quality synthesized speech signal in which the reading of a specific portion of the text is adjusted.

The present invention has been made to solve the above problem. Its object is to provide a speech synthesizer and a program capable of obtaining a high-quality synthesized speech signal when generating a synthesized speech signal in which the reading of a specific portion of the text is adjusted.

To solve the above problem, the speech synthesizer of claim 1 comprises: a language analysis unit that linguistically analyzes the text to be synthesized and obtains linguistic features; an adjustment amount addition unit that adds, to the linguistic features obtained by the language analysis unit, adjustment amount information for adjustment parameters used to adjust acoustic characteristics; an acoustic feature estimation unit that estimates acoustic features, using a statistical model trained in advance, based on the linguistic features to which the adjustment amount information has been added by the adjustment amount addition unit; and a voice generation unit that synthesizes a speech signal based on the acoustic features estimated by the acoustic feature estimation unit and outputs a speech signal in which the adjustment specified by the adjustment parameters has been applied to the text.
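The four units of claim 1 can be read as a simple pipeline. The following Python sketch is illustrative only: the function name `synthesize`, the dict-based feature representation, and the caller-supplied stand-in components are assumptions made for this example and do not appear in the source.

```python
from typing import Callable, Dict, List, Sequence

Features = Dict[str, float]


def synthesize(
    text: str,
    adjustment: Features,
    analyze: Callable[[str], List[Features]],
    estimate: Callable[[List[Features]], List[List[float]]],
    generate: Callable[[List[List[float]]], Sequence[float]],
) -> Sequence[float]:
    """Run the claim-1 pipeline with caller-supplied (hypothetical) components."""
    # Language analysis unit: text -> per-phoneme linguistic features.
    linguistic = analyze(text)
    # Adjustment amount addition unit: append the adjustment amounts to each
    # phoneme's features before the statistical model is applied.
    adjusted = [{**phoneme, **adjustment} for phoneme in linguistic]
    # Acoustic feature estimation unit: pre-trained statistical model.
    acoustic = estimate(adjusted)
    # Voice generation unit: acoustic features -> waveform samples.
    return generate(acoustic)
```

The point of the arrangement is that the adjustment enters on the input side of the statistical model, in contrast to the hypothetical example of FIG. 17, where already-estimated acoustic features are modified afterwards.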

The speech synthesizer of claim 2 is the speech synthesizer according to claim 1, wherein the statistical model consists of a duration model (time-length model) and an acoustic model, each configured as a neural network, and the acoustic feature estimation unit: uses the duration model, with the per-phoneme linguistic features as its input data, to estimate the per-phoneme duration that is the output data of the duration model; generates per-frame durations from the per-phoneme durations; and uses the acoustic model, with the per-frame linguistic features and the per-frame durations as its input data, to estimate the per-frame acoustic features that are the output data of the acoustic model.
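The two-stage estimation of claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the two models are passed in as plain callables, the frame-level input is formed by appending the phoneme's frame count to its linguistic feature vector, and durations are clamped to at least one frame. None of these encoding details are specified in this section.

```python
from typing import Callable, List

Vector = List[float]


def estimate_acoustic_features(
    phoneme_features: List[Vector],
    duration_model: Callable[[Vector], float],
    acoustic_model: Callable[[Vector], Vector],
) -> List[Vector]:
    """Estimate per-frame acoustic features from per-phoneme linguistic features."""
    frames: List[Vector] = []
    for features in phoneme_features:
        # Duration model: per-phoneme features -> duration in frames.
        # Clamping to >= 1 frame is an assumption of this sketch.
        n_frames = max(1, round(duration_model(features)))
        for _ in range(n_frames):
            # Frame-level input: linguistic features plus duration information
            # (hypothetical encoding: the frame count is appended).
            frame_input = features + [float(n_frames)]
            # Acoustic model: frame-level input -> per-frame acoustic features.
            frames.append(acoustic_model(frame_input))
    return frames
```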

The speech synthesizer of claim 3 is the speech synthesizer according to claim 1 or 2, wherein the adjustment parameters are any one, or a combination of two or more, of four parameters: speech rate or duration, power, pitch, and intonation.

The speech synthesizer of claim 4 is the speech synthesizer according to claim 1 or 2, wherein the adjustment parameters are the four parameters of speech rate or duration, power, pitch, and intonation; the adjustment amount of any one of the four parameters is specified as an arbitrary value within a predetermined range, and fixed values are used for the adjustment amounts of the other three parameters.

The speech synthesizer of claim 5 is the speech synthesizer according to claim 1 or 2, wherein the adjustment parameters are the four parameters of speech rate or duration, power, pitch, and intonation, and the adjustment amount of each of the four parameters is specified as an arbitrary value within its respective predetermined range.

The program of claim 6 causes a computer to function as the speech synthesizer according to any one of claims 1 to 5.

As described above, the present invention makes it possible to obtain a high-quality synthesized speech signal when generating a synthesized speech signal in which the reading of a specific portion of the text is adjusted.

FIG. 1 is a block diagram showing a configuration example of a learning device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing an example of pre-learning processing by the learning device.
FIG. 3 is a diagram illustrating an example of the data structure of linguistic features.
FIG. 4 is a flowchart showing an example of speech analysis processing by the voice analysis unit.
FIG. 5 is a diagram illustrating an example of the data structure of acoustic features.
FIG. 6 is a diagram illustrating an example of the data structure of linguistic features to which time information has been added.
FIG. 7 is a diagram illustrating an example of the data structure of linguistic features to which adjustment amount information has been added.
FIG. 8 is a diagram illustrating an example of training the duration model.
FIG. 9 is a diagram illustrating an example of training the acoustic model.
FIG. 10 is a block diagram showing a configuration example of a speech synthesizer according to an embodiment of the present invention.
FIG. 11 is a flowchart showing an example of speech synthesis processing by the speech synthesizer.
FIG. 12 is a diagram illustrating an example of duration estimation using the duration model.
FIG. 13 is a diagram illustrating an example of acoustic feature estimation using the acoustic model.
FIG. 14 is a diagram illustrating an example of speech synthesis processing by the voice generation unit.
FIG. 15 is an explanatory diagram showing the conventional learning method and synthesis method described in Non-Patent Document 1.
FIG. 16 is an explanatory diagram showing the conventional speech signal adjustment method described in Non-Patent Document 2.
FIG. 17 is an explanatory diagram showing a hypothetical example that combines the prior art of Non-Patent Documents 1 and 2.

Embodiments for carrying out the present invention are described in detail below with reference to the drawings.
[Learning device]
First, a learning device according to an embodiment of the present invention is described. FIG. 1 is a block diagram showing a configuration example of the learning device, and FIG. 2 is a flowchart showing an example of pre-learning processing by the learning device.

The learning device 1 includes storage units 10 and 17, a language analysis unit 11, a voice analysis unit 12, an association unit 13, an adjustment amount addition unit 14, an acoustic feature adjustment unit 15, and a learning unit 16. The speech signal is assumed to be monaural, sampled at a sampling frequency of 48 kHz with 16 bits per sample.

The storage unit 10 stores a preset speech corpus. The speech corpus consists of texts and the corresponding speech signals. For example, when the 503 phoneme-balanced sentences created by ATR (Advanced Telecommunications Research Institute International) are used, the corpus consists of 503 pairs of a text and the speech signal of that text read aloud. For the speech corpus, refer to the following document.
Kenichi Iso, Takao Watanabe, Hisao Kuwabara, "Design of a Sentence Set for a Speech Database", Proceedings of the Acoustical Society of Japan Spring Meeting, pp. 89-90 (March 1988)

The language analysis unit 11 reads each text of the speech corpus from the storage unit 10, performs known language analysis processing on the text, and obtains linguistic features consisting of predetermined information for each phoneme (step S201). The language analysis unit 11 then outputs the per-phoneme linguistic features to the association unit 13.

Specifically, by language analysis processing, the language analysis unit 11 obtains, for each phoneme constituting a sentence, phoneme information, accent information, part-of-speech information, accent phrase information, breath group information, and total count information, and obtains linguistic features consisting of these items.

As the language analysis processing, for example, the morphological analysis described in the following is used.
"MeCab: Yet Another Part-of-Speech and Morphological Analyzer", Internet <URL: http://taku910.github.io/mecab/>
In addition, as the language analysis processing, for example, the dependency analysis described in the following is used.
"CaboCha/南瓜: Yet Another Japanese Dependency Structure Analyzer", Internet <URL: https://taku910.github.io/cabocha/>

FIG. 3 is a diagram illustrating an example of the data structure of the linguistic features. As shown in FIG. 3, the linguistic features consist, for each phoneme, of phoneme information, accent information, part-of-speech information, accent phrase information, breath group information, and total count information.

Returning to FIGS. 1 and 2, the voice analysis unit 12 reads from the storage unit 10 each speech signal corresponding to each text of the speech corpus, cuts out the speech signal frame by frame, and performs known acoustic analysis processing on the speech signal of each frame. The voice analysis unit 12 then obtains acoustic features consisting of predetermined information for each frame (step S202) and outputs the per-frame acoustic features to the association unit 13. As described later, the acoustic features consist of 199-dimensional data.

As the acoustic analysis processing, for example, the acoustic analysis described in the following is used.
"A high-quality speech analysis, manipulation and synthesis system", Internet <URL: https://github.com/mmorise/World>
In addition, as the acoustic analysis processing, for example, the speech signal processing described in the following is used.
"Speech Signal Processing Toolkit (SPTK) Version 3.11 December 25, 2017", Internet <URL: http://sp-tk.sourceforge.net/>
"REFERENCE MANUAL for Speech Signal Processing Toolkit Ver. 3.9"

FIG. 4 is a flowchart showing an example of speech analysis processing by the voice analysis unit 12. The voice analysis unit 12 reads each speech signal of the speech corpus from the storage unit 10 and cuts out speech signal frames of length 25 ms at a frame shift of 5 ms (step S401). The voice analysis unit 12 then performs acoustic analysis processing on the speech signal of each frame and obtains the spectrum, the pitch frequency, and the aperiodic component (step S402).
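With the 48 kHz sampling frequency stated for the learning device, a 25 ms frame is 1200 samples and a 5 ms shift is 240 samples. The sketch below illustrates the frame extraction of step S401 in plain Python; the function name and the decision to keep only complete frames are assumptions, since the source does not state how the signal edges are handled.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 48_000
FRAME_LENGTH_MS = 25
FRAME_SHIFT_MS = 5


def split_into_frames(samples: Sequence[float]) -> List[Sequence[float]]:
    """Cut a mono signal into 25 ms frames taken every 5 ms."""
    frame_len = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000   # 1200 samples
    frame_shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000  # 240 samples
    frames = []
    start = 0
    # Keep only complete frames (edge padding is not specified in the source).
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames
```

Because the frame length is five times the frame shift, consecutive frames overlap by 20 ms.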

The voice analysis unit 12 performs mel-cepstral analysis on the spectrum to obtain the mel-cepstral coefficients MGC (step S403). The voice analysis unit 12 also obtains voiced/unvoiced decision information VUV from the pitch frequency, takes the logarithm of the pitch frequency in voiced sections, and interpolates unvoiced and silent sections using information from the preceding and following voiced sections, thereby obtaining the log pitch frequency LF0 (step S404). Further, the voice analysis unit 12 performs mel-cepstral analysis on the aperiodic component to obtain the band aperiodicity BAP (step S405).
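Step S404 can be illustrated with a short sketch. The assumptions here are not from the source: unvoiced and silent frames are marked by a pitch value of 0, interpolation between voiced frames is linear in the log domain, and frames before the first or after the last voiced frame take the nearest voiced value.

```python
import math
from typing import List, Tuple


def log_f0_with_interpolation(f0_hz: List[float]) -> Tuple[List[float], List[int]]:
    """Return (LF0, VUV) from a per-frame pitch track where 0 marks unvoiced frames."""
    vuv = [1 if f > 0.0 else 0 for f in f0_hz]
    voiced = [i for i, v in enumerate(vuv) if v]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    # Logarithm of the pitch in voiced frames; placeholders elsewhere.
    lf0 = [math.log(f) if f > 0.0 else 0.0 for f in f0_hz]
    for i, v in enumerate(vuv):
        if v:
            continue
        prev = max((j for j in voiced if j < i), default=None)
        nxt = min((j for j in voiced if j > i), default=None)
        if prev is None:
            lf0[i] = lf0[nxt]          # before the first voiced frame
        elif nxt is None:
            lf0[i] = lf0[prev]         # after the last voiced frame
        else:
            w = (i - prev) / (nxt - prev)
            lf0[i] = (1.0 - w) * lf0[prev] + w * lf0[nxt]
    return lf0, vuv
```

Keeping VUV as a separate stream lets the interpolated LF0 contour stay continuous while the voicing decision is still available at synthesis time.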

As a result, the mel-cepstral coefficients MGC, the voiced/unvoiced decision information VUV, the log pitch frequency LF0, and the band aperiodicity BAP are obtained for each frame as the static acoustic features.

The voice analysis unit 12 calculates the first-order difference Δ of the mel-cepstral coefficients MGC to obtain the first-order difference mel-cepstral coefficients ΔMGC (step S406), and calculates the second-order difference Δ² to obtain the second-order difference mel-cepstral coefficients Δ²MGC (step S407).

The voice analysis unit 12 calculates the first-order difference Δ of the log pitch frequency LF0 to obtain the first-order difference log pitch frequency ΔLF0 (step S408), and calculates the second-order difference Δ² to obtain the second-order difference log pitch frequency Δ²LF0 (step S409).

The voice analysis unit 12 calculates the first-order difference Δ of the band aperiodicity BAP to obtain the first-order difference band aperiodicity ΔBAP (step S410), and calculates the second-order difference Δ² to obtain the second-order difference band aperiodicity Δ²BAP (step S411).

As a result, the first-order difference mel-cepstral coefficients ΔMGC, the second-order difference mel-cepstral coefficients Δ²MGC, the first-order difference log pitch frequency ΔLF0, the second-order difference log pitch frequency Δ²LF0, the first-order difference band aperiodicity ΔBAP, and the second-order difference band aperiodicity Δ²BAP are obtained for each frame as the dynamic acoustic features.
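The source does not give the difference formulas used in steps S406 to S411. As one common choice, the sketch below assumes a central difference for Δ, x[t+1] minus x[t-1] halved, and the second difference x[t-1] - 2x[t] + x[t+1] for Δ², with edge values replicated. Both the formulas and the function name are assumptions of this example.

```python
from typing import List, Tuple


def add_dynamic_features(x: List[float]) -> Tuple[List[float], List[float]]:
    """Return (delta, delta-delta) tracks for one static coefficient track."""
    if not x:
        return [], []
    # Replicate the edge values so every frame has both neighbours.
    padded = [x[0]] + list(x) + [x[-1]]
    n = len(x)
    # Assumed first-order window: (-0.5, 0, 0.5).
    delta = [0.5 * (padded[t + 2] - padded[t]) for t in range(n)]
    # Assumed second-order window: (1, -2, 1).
    delta2 = [padded[t] - 2.0 * padded[t + 1] + padded[t + 2] for t in range(n)]
    return delta, delta2
```

The same function would be applied independently to each dimension of MGC, LF0, and BAP.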

The voice analysis unit 12 outputs, to the association unit 13, acoustic features consisting of the predetermined static and dynamic information for each frame.

FIG. 5 is a diagram illustrating an example of the data structure of the acoustic features. As shown in FIG. 5, the acoustic features consist, for each frame, of the static mel-cepstral coefficients MGC, log pitch frequency LF0, and band aperiodicity BAP; the dynamic first-order difference mel-cepstral coefficients ΔMGC, first-order difference log pitch frequency ΔLF0, first-order difference band aperiodicity ΔBAP, second-order difference mel-cepstral coefficients Δ²MGC, second-order difference log pitch frequency Δ²LF0, and second-order difference band aperiodicity Δ²BAP; and the static voiced/unvoiced decision information VUV. As described later, these acoustic features consist of 199-dimensional data.
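The per-frame layout of FIG. 5 can be sketched as a simple concatenation. This section defers the dimensional breakdown to a later passage, so the static orders below (60 for MGC, 1 for LF0, 5 for BAP) are assumptions chosen only because 3 × (60 + 1 + 5) + 1 = 199 matches the stated total; the function name is likewise hypothetical.

```python
from typing import Dict, List

# Assumed static orders (not stated in this section); they are used here only
# because 3 * (60 + 1 + 5) + 1 equals the stated total of 199 dimensions.
STATIC_DIMS: Dict[str, int] = {"MGC": 60, "LF0": 1, "BAP": 5}


def frame_vector(
    static: Dict[str, List[float]],
    delta: Dict[str, List[float]],
    delta2: Dict[str, List[float]],
    vuv: float,
) -> List[float]:
    """Concatenate one frame in the order listed for FIG. 5."""
    vec: List[float] = []
    # Static block, then first-order block, then second-order block.
    for part in (static, delta, delta2):
        for name in ("MGC", "LF0", "BAP"):
            if len(part[name]) != STATIC_DIMS[name]:
                raise ValueError(f"unexpected size for {name}")
            vec.extend(part[name])
    # The voiced/unvoiced flag is the final element.
    vec.append(vuv)
    return vec
```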

Returning to FIGS. 1 and 2, the association unit 13 receives the per-phoneme linguistic features from the language analysis unit 11 and the per-frame acoustic features from the voice analysis unit 12. Using a known phoneme alignment technique, the association unit 13 then time-aligns the per-phoneme linguistic features with the per-frame acoustic features, thereby calculating the time in the speech signal at which each phoneme constituting the sentence of the text is located (step S203).

For each phoneme, the association unit 13 generates time information consisting of the corresponding start frame number and end frame number, adds the time information to the predetermined per-phoneme information constituting the linguistic features, and obtains the duration of the phoneme (the number of frames). The association unit 13 then outputs the linguistic features, with the aligned per-phoneme time information added, to the adjustment amount addition unit 14. The association unit 13 also includes the per-phoneme duration in the acoustic features and outputs the aligned per-frame acoustic features (with the duration as per-phoneme data) to the acoustic feature adjustment unit 15.

Here, the time information added to the linguistic features is expressed in milliseconds. The per-phoneme duration is used as the output data of the duration model in the statistical model described later. Its value is the length of the phoneme in milliseconds divided by the 5 ms frame shift, that is, the number of 5 ms frames in the phoneme.
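The conversion from millisecond time information to a frame count is a single division. The sketch below is a minimal illustration; the function name is hypothetical, and truncating a remainder with integer division is an assumption, since the source does not say how durations that are not multiples of 5 ms are handled.

```python
FRAME_SHIFT_MS = 5


def duration_in_frames(start_ms: int, end_ms: int) -> int:
    """Convert a phoneme's start and end times in milliseconds to a frame count."""
    if end_ms < start_ms:
        raise ValueError("end time precedes start time")
    # Length in milliseconds divided by the 5 ms frame shift.
    return (end_ms - start_ms) // FRAME_SHIFT_MS
```

For example, a phoneme spanning 120 ms to 200 ms lasts 80 ms, which corresponds to 16 frames.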

As the phoneme alignment technique, for example, the speech recognition processing described in the following is used.
"The Hidden Markov Model Toolkit (HTK)", Internet <URL: http://htk.eng.cam.ac.uk>
"Speech Signal Processing Toolkit (SPTK) Version 3.11 December 25, 2017"

Note that, after time-aligning the linguistic features and the acoustic features, the association unit 13 removes the silent sections at the beginning and end of each sentence.

FIG. 6 is a diagram illustrating an example of the data structure of the linguistic features to which time information has been added. As shown in FIG. 6, these linguistic features are formed by adding time information to the linguistic features shown in FIG. 3. Specifically, they consist, for each phoneme, of time information, phoneme information, accent information, part-of-speech information, accent phrase information, breath group information, and total count information.

Returning to FIGS. 1 and 2, the adjustment amount addition unit 14 receives the per-phoneme linguistic features from the association unit 13 and also receives predetermined adjustment parameters. The adjustment amount addition unit 14 then adds the adjustment amount information of the adjustment parameters to the predetermined per-phoneme information constituting the linguistic features (step S204), and outputs the linguistic features with the added per-phoneme adjustment amount information to the learning unit 16.

The predetermined adjustment parameters are parameters for adjusting the voice signal (that is, its acoustic characteristics). They are any one of the speaking speed R_ST, the power R_PW, the pitch R_PT, and the intonation R_PD, or a combination of these, as selected by the user. The adjustment parameters are also used as part of the learning data in the learning unit 16.

The speaking speed R_ST indicates the adjustment amount of the speaking speed, the power R_PW indicates the adjustment amount of the power (loudness of the voice), the pitch R_PT indicates the adjustment amount of the pitch (height of the voice), and the intonation R_PD indicates the adjustment amount of the intonation (range of variation of the voice pitch). A time length may be used instead of the speaking speed.

The range of the speaking speed R_ST (the adjustment range of the speaking speed) is, for example, as follows.
(Slow) 0.5 <= R_ST <= 4.0 (Fast)
This means that, within the range from 0.5 to 4.0, the speaking speed R_ST is slower the closer it is to 0.5 and faster the closer it is to 4.0.

The range of the power R_PW (the adjustment range of the power) is, for example, as follows.
(Small) 1.0E-5 <= R_PW <= 2.0 (Large)
This means that, within the range from 1.0E-5 to 2.0, the power R_PW is smaller the closer it is to 1.0E-5 and larger the closer it is to 2.0.

The range of the pitch R_PT (the adjustment range of the pitch) is, for example, as follows.
(Low) 0.5 <= R_PT <= 2.0 (High)
This means that, within the range from 0.5 to 2.0, the pitch R_PT is lower the closer it is to 0.5 and higher the closer it is to 2.0.

The range of the intonation R_PD (the adjustment range of the intonation) is, for example, as follows.
(Small) 1.0E-5 <= R_PD <= 2.0 (Large)
This means that, within the range from 1.0E-5 to 2.0, the intonation R_PD is smaller the closer it is to 1.0E-5 and larger the closer it is to 2.0. The standard values of the speaking speed R_ST, the power R_PW, the pitch R_PT, and the intonation R_PD are all 1.0.

Each of these adjustment parameters is selected from, for example, the 11 values shown below. That is, one of these 11 values is used for each of the adjustment parameters of the speaking speed R_ST, the power R_PW, the pitch R_PT, and the intonation R_PD in the learning device 1.
[Math. 1]

Here, the four adjustment parameters are expressed by the following adjustment vector.
R = (R_ST, R_PW, R_PT, R_PD)

When the original speaking speed, power, and so on are maintained without any adjustment, the adjustment vector is as follows.
R = (1.0, 1.0, 1.0, 1.0)

If one value is selected from the 11 values for each of the four adjustment parameters, the total number of combinations is 11^4 = 14,641. Training the statistical model over all of these combinations would require an enormous amount of data, which would increase both the learning load and the learning time.

Therefore, in the embodiment of the present invention, the user may select one value from the 11 values in the predetermined range for one of the four adjustment parameters, and the standard value 1.0 may be used as a fixed value for the other three adjustment parameters. The same applies to the acoustic feature amount adjusting unit 15 and to the voice synthesis device 2 of FIG. 10 described later.

For example, suppose the user selects one value from the 11 values for the speaking speed R_ST and uses the standard value 1.0 as a fixed value for the power R_PW, the pitch R_PT, and the intonation R_PD. The adjustment vector is then as follows.
R = (R_ST, 1.0, 1.0, 1.0)

In this case, the adjustment amount addition unit 14 receives, as the adjustment parameters, the speaking speed R_ST for which the user has selected one of the 11 values, together with the power R_PW, the pitch R_PT, and the intonation R_PD fixed at the standard value 1.0.

Selecting one value from the 11 values for one of the four adjustment parameters while fixing the other three at the standard value 1.0 in this way is equivalent to plotting adjustment amounts only along the axis of a single element of the adjustment vector R. The number of combinations in this case is 10 × 4 + 1 = 41: ten non-standard values on each of the four axes, plus the single vector in which all four parameters take the standard value. This reduces the amount of learning data needed to train the statistical model, which in turn reduces the load and the time of the learning process.
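
The two counts above can be checked with a short enumeration. The following is a minimal sketch, not part of the described device: the 11 grid values are hypothetical placeholders (the actual values of [Math. 1] are not reproduced here), and the only property the count relies on is that the grid contains the standard value 1.0.

```python
from itertools import product

# Hypothetical 11-point grid (assumption); it only needs to contain 1.0.
GRID = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
STANDARD = 1.0
NUM_PARAMS = 4  # R_ST, R_PW, R_PT, R_PD


def full_grid():
    """Every combination of one grid value per adjustment parameter."""
    return set(product(GRID, repeat=NUM_PARAMS))


def axis_aligned():
    """Vectors in which at most one parameter differs from the standard value."""
    vectors = set()
    for axis in range(NUM_PARAMS):
        for value in GRID:
            vec = [STANDARD] * NUM_PARAMS
            vec[axis] = value
            vectors.add(tuple(vec))  # the all-standard vector is stored only once
    return vectors


print(len(full_grid()))     # 14641
print(len(axis_aligned()))  # 41
```

The set removes the three extra copies of the all-standard vector, which is why the count is 4 × 11 − 3 = 41 rather than 44.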

As another example in the embodiment of the present invention, the user may select the four adjustment parameters in 11 steps in which the parameters are linked to one another. The same applies to the acoustic feature amount adjusting unit 15 and to the voice synthesis device 2 of FIG. 10 described later.

In this case, the adjustment amount addition unit 14 receives, as the adjustment parameters, the speaking speed R_ST, the power R_PW, the pitch R_PT, and the intonation R_PD of whichever of the 11 preset patterns the user has selected. The adjustment vectors of the 11 patterns are as follows.
R = (a1, b1, c1, d1), (a2, b2, c2, d2), ..., (a11, b11, c11, d11)

Here, a1, b1, ..., c11, d11 are values within the adjustment range of the corresponding adjustment parameter.

The number of combinations in this case is 11. This further reduces the amount of learning data needed to train the statistical model, and thus further reduces the load and the time of the learning process.

The adjustment amount addition unit 14 may receive different adjustment parameters for each sentence, each exhalation paragraph, or each accent phrase. The same applies to the acoustic feature amount adjusting unit 15 and to the voice synthesis device 2 described later.

FIG. 7 is a diagram illustrating an example of the data structure of a language feature amount to which adjustment amount information has been added. As shown in FIG. 7, this language feature amount is formed by adding the adjustment amount information of the adjustment parameters to the language feature amount shown in FIG. 6. Specifically, for each phoneme, it consists of time information, phoneme information, accent information, part-of-speech information, accent phrase information, exhalation paragraph information, total number information, and adjustment amount information.

The adjustment amount information is information that reflects the adjustment amounts of the adjustment parameters, namely the speaking speed R_ST, the power R_PW, the pitch R_PT, and the intonation R_PD.

As described above, the adjustment amount addition unit 14 receives the adjustment parameters of any one of the speaking speed R_ST, the power R_PW, the pitch R_PT, and the intonation R_PD, or of a combination of these. For example, when only the speaking speed R_ST is received as an adjustment parameter, the adjustment amount addition unit 14 adds to the language feature amount the adjustment amount information of the received speaking speed R_ST together with the power R_PW, the pitch R_PT, and the intonation R_PD fixed at the standard value 1.0. Likewise, when the speaking speed R_ST and the power R_PW are received as adjustment parameters, it adds to the language feature amount the adjustment amount information of the received speaking speed R_ST and power R_PW together with the pitch R_PT and the intonation R_PD fixed at the standard value 1.0.
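
The defaulting rule described above can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary layout of a per-phoneme feature, and the key `adjustment` are assumptions made for the example and do not come from the source.

```python
STANDARD = 1.0
PARAM_NAMES = ("R_ST", "R_PW", "R_PT", "R_PD")  # speed, power, pitch, intonation


def make_adjustment_info(**selected):
    """Return (R_ST, R_PW, R_PT, R_PD); unspecified parameters fall back to 1.0."""
    unknown = set(selected) - set(PARAM_NAMES)
    if unknown:
        raise ValueError(f"unknown adjustment parameters: {sorted(unknown)}")
    return tuple(float(selected.get(name, STANDARD)) for name in PARAM_NAMES)


def add_adjustment_info(phoneme_features, adjustment):
    """Attach the same adjustment tuple to every phoneme (hypothetical dict layout)."""
    return [dict(feat, adjustment=adjustment) for feat in phoneme_features]


# Only the speaking speed is given; the other three stay at the standard value.
info = make_adjustment_info(R_ST=1.5)
print(info)  # (1.5, 1.0, 1.0, 1.0)
print(add_adjustment_info([{"phoneme": "a"}, {"phoneme": "k"}], info))
```

Applying one tuple to every phoneme corresponds to a single setting for the whole input; per-sentence or per-accent-phrase settings would call `add_adjustment_info` on each unit separately.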

Returning to FIGS. 1 and 2, the acoustic feature amount adjusting unit 15 receives from the association unit 13 the acoustic feature amount for each frame (for the time length, data for each phoneme) corresponding to the language feature amount for each phoneme that is input to the adjustment amount addition unit 14. The acoustic feature amount adjusting unit 15 also receives the same predetermined adjustment parameters as the adjustment amount addition unit 14.

The acoustic feature amount adjusting unit 15 adjusts the acoustic feature amount for each frame according to the adjustment parameters, and outputs the adjusted acoustic feature amount for each frame (for the time length, data for each phoneme) to the learning unit 16.

When the speaking speed is adjusted according to the adjustment parameter of the speaking speed R_ST, the acoustic feature amount adjusting unit 15 adjusts the time length as shown in the following equation: it multiplies the time length DUR received from the association unit 13 by the reciprocal of the speaking speed R_ST and converts the product to an integer to obtain the new time length DUR'.
[Math. 2]
DUR' = int(DUR × 1 / R_ST) ... (2)
Here, DUR is the time length received from the association unit 13, and DUR' is the adjusted time length.

When the time length is adjusted according to a time length adjustment parameter R_DR (= 1 / R_ST) instead of the speaking speed R_ST, the acoustic feature amount adjusting unit 15 multiplies the time length DUR received from the association unit 13 by the time length adjustment parameter R_DR instead of by the reciprocal of the speaking speed R_ST, and converts the product to an integer to obtain the new time length DUR'.

The acoustic feature amount adjusting unit 15 then adjusts the acoustic feature amount by repeating or thinning out the frames of the acoustic feature amount received from the association unit 13 according to the adjusted time length, so that the number of frames matches that time length. In this way, the number of frames of the acoustic feature amount is aligned with the adjusted time length of each phoneme.
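
The time length adjustment of equation (2) and the subsequent repetition or thinning of frames can be sketched as below. This is a minimal illustration under stated assumptions: `int` is taken to mean truncation, and frames are picked by nearest lower index, which is one simple way to repeat or thin frames; the source does not specify the selection rule, and the function names are hypothetical.

```python
def adjust_duration(dur, r_st):
    """Equation (2): DUR' = int(DUR * 1 / R_ST), with int() truncating (assumption)."""
    return int(dur / r_st)


def resample_frames(frames, new_len):
    """Repeat or drop frames so that a phoneme has exactly new_len frames.

    Index mapping by integer scaling is an assumption; interpolation between
    neighbouring frames would be a possible refinement.
    """
    old_len = len(frames)
    if new_len <= 0 or old_len == 0:
        return []
    return [frames[min(old_len - 1, i * old_len // new_len)] for i in range(new_len)]


frames = ["f0", "f1", "f2", "f3"]               # 4 frames of one phoneme
faster = adjust_duration(len(frames), 2.0)      # R_ST = 2.0 halves the length
slower = adjust_duration(len(frames), 0.5)      # R_ST = 0.5 doubles the length
print(resample_frames(frames, faster))  # ['f0', 'f2']
print(resample_frames(frames, slower))  # ['f0', 'f0', 'f1', 'f1', 'f2', 'f2', 'f3', 'f3']
```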

When repeating or thinning out the acoustic feature amounts of the corresponding frames according to the adjusted time length, the acoustic feature amount adjusting unit 15 may interpolate using the acoustic feature amounts of the preceding and following frames. This yields high-quality acoustic feature amounts. When the speaking speed is adjusted together with other adjustment parameters, the acoustic feature amount adjusting unit 15 performs the adjustments for the other adjustment parameters before adjusting the speaking speed.

When the power of the voice is adjusted according to the adjustment parameter of the power R_PW, the acoustic feature amount adjusting unit 15 adds the logarithm of the power R_PW to MGC[0], the value of the 0th dimension of the static mel-cepstrum coefficients MGC included in the acoustic feature amount received from the association unit 13.

As shown in the following equation, the acoustic feature amount adjusting unit 15 compares the sum with 0 and takes the larger value as MGC[0]', the new value of the 0th dimension of the static mel-cepstrum coefficients MGC, thereby adjusting the acoustic feature amount.
[Math. 3]
MGC[0]' = max(0, MGC[0] + log R_PW) ... (3)
Here, MGC[0] is the value of the 0th dimension of the static mel-cepstrum coefficients MGC included in the acoustic feature amount received from the association unit 13, and MGC[0]' is the adjusted value.

When the pitch frequency of the voice is adjusted according to the adjustment parameter of the pitch R_PT, the acoustic feature amount adjusting unit 15 adds the logarithm of the pitch R_PT to LF0[0], the value of the 0th dimension of the static logarithmic pitch frequency LF0 included in the acoustic feature amount received from the association unit 13.

As shown in the following equation, the acoustic feature amount adjusting unit 15 compares the sum with 0 and takes the larger value as LF0[0]', the new value of the 0th dimension of the static logarithmic pitch frequency LF0, thereby adjusting the acoustic feature amount.
[Math. 4]
LF0[0]' = max(0, LF0[0] + log R_PT) ... (4)
Here, LF0[0] is the value of the 0th dimension of the static logarithmic pitch frequency LF0 included in the acoustic feature amount received from the association unit 13, and LF0[0]' is the adjusted value.
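
Equations (3) and (4) share the same form, so both can be sketched with one pair of small functions. This is an illustrative sketch: the base of the logarithm is not stated in the text, and the natural logarithm is assumed here; the function names are hypothetical.

```python
import math


def adjust_power(mgc0, r_pw):
    """Equation (3): MGC[0]' = max(0, MGC[0] + log R_PW); natural log assumed."""
    return max(0.0, mgc0 + math.log(r_pw))


def adjust_pitch(lf0, r_pt):
    """Equation (4): LF0[0]' = max(0, LF0[0] + log R_PT); natural log assumed."""
    return max(0.0, lf0 + math.log(r_pt))


print(adjust_power(2.0, 1.0))     # 2.0, because log 1.0 = 0 leaves the value unchanged
print(adjust_pitch(5.0, 2.0))     # 5.0 + log 2, i.e. the pitch frequency is doubled
print(adjust_power(0.1, 1.0e-5))  # 0.0, the sum is negative and is clipped at 0
```

Adding log R in the logarithmic domain corresponds to multiplying by R in the linear domain, which is why the standard value 1.0 leaves the feature unchanged.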

When the intonation of the voice is adjusted according to the adjustment parameter of the intonation R_PD, the acoustic feature amount adjusting unit 15 subtracts the mean value μ_LF0, calculated in advance, from the static logarithmic pitch frequency LF0 included in the acoustic feature amount received from the association unit 13. It then divides the result of the subtraction by the standard deviation Σ_LF0, also calculated in advance. The mean value μ_LF0 is the mean of the static logarithmic pitch frequency LF0 included in the acoustic feature amount received from the association unit 13, and Σ_LF0 is its standard deviation.

The acoustic feature amount adjusting unit 15 calculates in advance, for each sentence, the mean value μ_LF0 and the standard deviation Σ_LF0 of the static logarithmic pitch frequency LF0 included in the acoustic feature amount received from the association unit 13, as shown in the following equations. N is the number of frames corresponding to the sentence.
[Math. 5]

[Math. 6]

The acoustic feature amount adjusting unit 15 adds the logarithm of the intonation R_PD to the standard deviation Σ_LF0, compares the sum with 0, and takes the larger value. It then multiplies the above division result by this larger value and adds the mean value μ_LF0 to the product.

The acoustic feature amount adjusting unit 15 compares the resulting value with 0 and takes the larger value as the new static logarithmic pitch frequency LF0'. The arithmetic processing performed by the acoustic feature amount adjusting unit 15 is expressed by the following equation.
[Math. 7]
LF0' = max(0, ((LF0 - μ_LF0) / Σ_LF0) × max(0, Σ_LF0 + log R_PD) + μ_LF0) ... (7)
Here, LF0 is the static logarithmic pitch frequency included in the acoustic feature amount received from the association unit 13, μ_LF0 is its mean value, Σ_LF0 is its standard deviation, and LF0' is the adjusted static logarithmic pitch frequency.
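
The intonation adjustment of equation (7) can be sketched as follows. Several points are assumptions made for this illustration: the per-sentence standard deviation is computed as a population standard deviation (divisor N), the logarithm is the natural logarithm, a sentence with zero variance is returned unchanged to avoid division by zero, and the function names are hypothetical.

```python
import math
import statistics


def sentence_stats(lf0_frames):
    """Per-sentence mean and standard deviation of LF0 (population form assumed)."""
    mu = statistics.fmean(lf0_frames)
    sigma = statistics.pstdev(lf0_frames, mu)
    return mu, sigma


def adjust_intonation(lf0_frames, r_pd):
    """Equation (7) applied to every frame of one sentence."""
    mu, sigma = sentence_stats(lf0_frames)
    if sigma == 0.0:
        return list(lf0_frames)  # flat contour: nothing to scale (assumption)
    scale = max(0.0, sigma + math.log(r_pd))
    return [max(0.0, (x - mu) / sigma * scale + mu) for x in lf0_frames]


lf0 = [4.8, 5.0, 5.2]
print(adjust_intonation(lf0, 1.0))     # approximately the input, since log 1.0 = 0
print(adjust_intonation(lf0, 2.0))     # wider spread around the sentence mean
print(adjust_intonation(lf0, 1.0e-5))  # every frame collapses to the sentence mean
```

The normalized contour (LF0 − μ_LF0) / Σ_LF0 keeps the shape of the pitch movement, while the rescaling term changes only its width around the sentence mean.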

The acoustic feature amount adjusting unit 15 calculates the first-order difference Δ of the new static characteristics obtained according to each adjustment parameter as described above, giving the new first-order dynamic characteristics. It likewise calculates the second-order difference Δ² to obtain the new second-order dynamic characteristics. In this way, the acoustic feature amount adjusting unit 15 adjusts the acoustic feature amount.
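
Recomputing the dynamic characteristics from an adjusted static trajectory can be sketched as below. The text does not state the difference windows, so the coefficients used here (a centered first difference and a three-point second difference, with the edge frames replicated) are an assumption chosen as a common convention, not the method of the source.

```python
def delta(seq):
    """First-order difference: 0.5 * (x[t+1] - x[t-1]), edges replicated (assumption)."""
    n = len(seq)
    return [0.5 * (seq[min(n - 1, t + 1)] - seq[max(0, t - 1)]) for t in range(n)]


def delta2(seq):
    """Second-order difference: x[t-1] - 2*x[t] + x[t+1], edges replicated (assumption)."""
    n = len(seq)
    return [seq[max(0, t - 1)] - 2.0 * seq[t] + seq[min(n - 1, t + 1)] for t in range(n)]


adjusted_lf0 = [0.0, 1.0, 2.0, 3.0]  # static values after adjustment
print(delta(adjusted_lf0))   # [0.5, 1.0, 1.0, 0.5]
print(delta2(adjusted_lf0))  # [1.0, 0.0, 0.0, -1.0]
```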

The adjustment of the acoustic feature amount by the acoustic feature amount adjusting unit 15 is linked to the addition of the adjustment amount information to the language feature amount by the adjustment amount addition unit 14.

The learning unit 16 receives the language feature amount for each phoneme from the adjustment amount addition unit 14 and the acoustic feature amount for each frame (for the time length, data for each phoneme) from the acoustic feature amount adjusting unit 15. The learning unit 16 then standardizes these data and learns the statistical models, namely the time length model and the acoustic model.

(Learning of the time length model)
Next, the learning processing of the time length model by the learning unit 16 will be described. FIG. 8 is a diagram illustrating an example of the learning processing of the time length model. Based on the language feature amount for each phoneme received from the adjustment amount addition unit 14, the learning unit 16 generates 312-dimensional binary values and 13-dimensional numerical data representing the linguistic features, as well as one-dimensional adjustment data. The one-dimensional adjustment data is the speaking speed data, so the language feature amount has 326 dimensions.

Here, the 312-dimensional binary values and the 13-dimensional numerical data of the language feature amount are generated based on the phoneme information, accent information, part-of-speech information, accent phrase information, exhalation paragraph information, and total number information included in the language feature amount. The one-dimensional adjustment data of the language feature amount is generated based on the adjustment amount of the speaking speed, out of the adjustment amount information included in the language feature amount (the adjustment amounts of the speaking speed, power, pitch, and intonation).

The learning unit 16 treats the 326-dimensional data, consisting of the 312-dimensional binary values, the 13-dimensional numerical data, and the one-dimensional adjustment data (speaking speed data) of the language feature amount, as the input data of the time length model (step S801).

Using all of the 326-dimensional data of the language feature amount, the learning unit 16 obtains the maximum value and the minimum value for each dimension and stores them in the storage unit 17. It also standardizes each item of data using the per-dimension maximum and minimum values (step S802).

The learning unit 16 also takes the time length for each phoneme from the acoustic feature amount for each frame (for the time length, data for each phoneme) received from the acoustic feature amount adjusting unit 15, and treats this one-dimensional time length data as the output data of the time length model (step S803). The time length is expressed as a number of frames in units of 5 ms and is a one-dimensional integer value for each phoneme representing the text.

Using all of the one-dimensional time length data, the learning unit 16 obtains the mean value and the standard deviation and stores them in the storage unit 17. It also standardizes each item of data using the mean value and the standard deviation (step S804).
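
The two kinds of standardization in steps S802 and S804 can be sketched as below. This is a minimal illustration under assumptions: the target range of the min-max scaling is not stated in the text and [0, 1] is assumed, a dimension whose maximum equals its minimum is mapped to 0.0, the standard deviation uses the population form, and all function names are hypothetical.

```python
import statistics


def fit_min_max(rows):
    """Per-dimension minimum and maximum over all input vectors (step S802)."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(row, mins, maxs):
    """Scale each dimension to [0, 1] (assumed range); constant dimensions map to 0.0."""
    return [(x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)]


def fit_mean_std(values):
    """Mean and standard deviation of the one-dimensional outputs (step S804)."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_mean_std(x, mean, std):
    """Zero-mean, unit-variance scaling; a zero deviation maps to 0.0."""
    return (x - mean) / std if std > 0 else 0.0


inputs = [[0.0, 1.0, 5.0], [1.0, 1.0, 15.0]]  # toy 3-dimensional inputs
mins, maxs = fit_min_max(inputs)
print(apply_min_max([1.0, 1.0, 10.0], mins, maxs))  # [1.0, 0.0, 0.5]

durations = [10, 20, 30]                      # toy per-phoneme frame counts
mean, std = fit_mean_std(durations)
print(apply_mean_std(20, mean, std))          # 0.0
```

Storing the fitted statistics alongside the model matters because the same values are needed later to scale new inputs and to map predicted outputs back to frame counts.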

Proceeding from steps S802 and S804, the learning unit 16 learns the time length model for each phoneme, using the standardized 326-dimensional data of the language feature amount as the input data and the standardized one-dimensional data of the time length as the output data (step S805). The learning unit 16 then stores the learned time length model in the storage unit 17.

When learning the time length model in step S805, the technique described at the following site is used.
“CSTR-Edinburgh/merlin”, Internet <URL: https://github.com/CSTR-Edinburgh/merlin>
The same applies to the learning of the acoustic model in step S905 of FIG. 9, described later.

The time length model is, for example, a feed-forward neural network with a 326-dimensional input layer, six hidden layers of 1024 dimensions each, and a one-dimensional output layer. The hyperbolic tangent function is used as the activation function in the hidden layers, and the mean squared error is used as the loss function. Training uses a mini-batch size of 64, 100 epochs, a dropout rate of 0.5, stochastic gradient descent as the optimization method, and an initial learning rate of 0.01; after the 10th epoch the learning rate is decayed exponentially at each epoch, and learning is performed by error backpropagation. If, after the 15th epoch, the evaluation error does not decrease for 5 consecutive epochs, learning is terminated early.
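
The learning rate schedule and the early termination rule described above can be sketched without any training framework. This is an illustrative sketch under assumptions: the decay factor `decay=0.5` is hypothetical because the text only says the decay is exponential, "does not decrease" is interpreted as "does not improve on the best evaluation error so far", and the function names are invented for the example.

```python
def learning_rate(epoch, initial=0.01, constant_epochs=10, decay=0.5):
    """Constant for the first 10 epochs, then exponential decay (decay factor assumed)."""
    if epoch <= constant_epochs:
        return initial
    return initial * decay ** (epoch - constant_epochs)


def stopping_epoch(eval_errors, min_epochs=15, patience=5, max_epochs=100):
    """Return the epoch at which training ends for a given evaluation error history."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement on the best error
    for epoch, err in enumerate(eval_errors[:max_epochs], start=1):
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
        if epoch > min_epochs and stale >= patience:
            return epoch
    return min(len(eval_errors), max_epochs)


# The error improves for 10 epochs and then stays flat.
history = [1.0 / e for e in range(1, 11)] + [0.2] * 30
print(stopping_epoch(history))  # 16
print(learning_rate(10), learning_rate(12))
```

In this example the stall begins at epoch 11, but the rule only takes effect once epoch 15 has passed, so training ends at epoch 16.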

As a result, the time length model is stored in the storage unit 17 as a statistical model. The storage unit 17 also stores, as part of the statistical model, the per-dimension maximum and minimum values of the 326-dimensional input data of the time length model, which consists of the 312-dimensional binary values, the 13-dimensional numerical data, and the one-dimensional adjustment data (speaking speed data) of the language feature amount. In addition, the storage unit 17 stores, as part of the statistical model, the mean value and the standard deviation of the one-dimensional time length data that is the output data of the time length model.

(Learning of the acoustic model)
Next, the learning processing of the acoustic model by the learning unit 16 will be described. FIG. 9 is a diagram illustrating an example of the learning processing of the acoustic model. Based on the language feature amount for each phoneme received from the adjustment amount addition unit 14, the learning unit 16 generates 312-dimensional binary values and 13-dimensional numerical data representing the linguistic features, 4-dimensional time data, and 3-dimensional adjustment data.

The 4-dimensional time data consists of the number of frames of the phoneme to which the frame belongs (one-dimensional data) and the position of the frame within the phoneme (3-dimensional data). The 3-dimensional adjustment data consists of power data, pitch data, and intonation data. These adjustment data are generated based on the adjustment amounts of the power, pitch, and intonation, out of the adjustment amount information included in the language feature amount (the adjustment amounts of the speaking speed, power, pitch, and intonation). The language feature amount therefore has 332 dimensions.

From the 332-dimensional data of the language feature amount for each phoneme, consisting of the 312-dimensional binary values, the 13-dimensional numerical data, the 4-dimensional time data, and the 3-dimensional adjustment data (power data, pitch data, and intonation data), the learning unit 16 generates the 332-dimensional data of the language feature amount for each frame.

For the language feature amount for each frame, the learning unit 16 treats the 332-dimensional data, consisting of the 312-dimensional binary values, the 13-dimensional numerical data, the 4-dimensional time data, and the 3-dimensional adjustment data (power data, pitch data, and intonation data), as the input data of the acoustic model (step S901).

Using all of the 332-dimensional data of the language feature amount, the learning unit 16 obtains the maximum value and the minimum value for each dimension and stores them in the storage unit 17. It also standardizes each item of data using the per-dimension maximum and minimum values (step S902).

The learning unit 16 also takes the acoustic feature amount excluding the time length from the acoustic feature amount for each frame (for the time length, data for each phoneme) received from the acoustic feature amount adjusting unit 15, and treats this 199-dimensional data as the output data of the acoustic model (step S903).

Here, as described above, the acoustic feature amount excluding the time length consists of the static mel-cepstrum coefficients MGC, logarithmic pitch frequency LF0, and band aperiodicity BAP; the dynamic first-order difference mel-cepstrum coefficients ΔMGC, first-order difference logarithmic pitch frequency ΔLF0, and first-order difference band aperiodicity ΔBAP; the second-order difference mel-cepstrum coefficients Δ²MGC, second-order difference logarithmic pitch frequency Δ²LF0, and second-order difference band aperiodicity Δ²BAP; and the static voiced/unvoiced determination information VUV.

Specifically, the acoustic feature amount excluding the time length consists of 66-dimensional static data, which combines the 60-dimensional mel-cepstrum coefficients, the one-dimensional logarithmic pitch frequency, and the 5-dimensional band aperiodicity; 132-dimensional dynamic data, obtained by taking the first-order and second-order differences of these static data; and one-dimensional voiced/unvoiced determination data. That is, the acoustic feature amount excluding the time length has 199 dimensions.
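
The dimension counts quoted in this section can be verified with simple arithmetic. The short sketch below only restates numbers given in the text; the variable names are chosen for the example.

```python
# Static acoustic features per frame, as listed in the text.
STATIC_DIMS = {"MGC": 60, "LF0": 1, "BAP": 5}

static_dim = sum(STATIC_DIMS.values())            # 60 + 1 + 5
dynamic_dim = 2 * static_dim                      # first- and second-order differences
acoustic_output_dim = static_dim + dynamic_dim + 1  # plus voiced/unvoiced flag

# Language feature inputs: binary + numerical + (time) + adjustment data.
duration_input_dim = 312 + 13 + 1                 # adjustment data: speaking speed
acoustic_input_dim = 312 + 13 + 4 + 3             # time data + power, pitch, intonation

print(static_dim, dynamic_dim, acoustic_output_dim)  # 66 132 199
print(duration_input_dim, acoustic_input_dim)        # 326 332
```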

Using all of the 199-dimensional acoustic feature data, the learning unit 16 obtains the mean and standard deviation for each dimension and stores them in the storage unit 17, and standardizes every data item using the mean and standard deviation of its dimension (step S904).
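A minimal sketch of the per-dimension standardization in step S904 might look as follows. It assumes ordinary z-scoring with the population standard deviation; the function names and the `eps` guard for constant dimensions are illustrative and not taken from the description.

```python
import math


def fit_mean_std(frames):
    """Return per-dimension mean and (population) standard deviation."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means, stds


def standardize(frames, means, stds, eps=1e-8):
    """z-score every value; eps (an assumption) avoids division by zero."""
    return [
        [(x - m) / max(s, eps) for x, m, s in zip(f, means, stds)]
        for f in frames
    ]


frames = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_mean_std(frames)
scaled = standardize(frames, means, stds)
```

The means and standard deviations would be stored alongside the model so that the same statistics can be reused at synthesis time.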

Proceeding from steps S902 and S904, the learning unit 16 trains the acoustic model for each frame, using the 332-dimensional standardized language feature data as input data and the 199-dimensional standardized acoustic feature data as output data (step S905). The learning unit 16 then stores the trained acoustic model in the storage unit 17.

The acoustic model is, for example, a feed-forward neural network with a 332-dimensional input layer, six hidden layers of 1024 dimensions each, and a 199-dimensional output layer. The hidden layers use the hyperbolic tangent function as the activation function, and the loss function is the mean squared error. The model is trained by error backpropagation with a mini-batch size of 256, 100 epochs, a dropout rate of 0.5, stochastic gradient descent as the optimization method, and an initial learning rate of 0.001 that is decayed exponentially every epoch after the 10th epoch. After the 15th epoch, training is terminated early if the evaluation error does not decrease for 5 consecutive epochs.
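The learning-rate schedule and early-stopping rule can be sketched as below. The description gives the initial rate (0.001), the epoch at which decay starts (10), and the early-stopping condition (5 epochs without improvement after epoch 15). The decay factor of 0.9 is a hypothetical value, and "does not decrease" is interpreted here as "does not improve on the best error so far", which is also an assumption.

```python
def learning_rate(epoch, base_lr=0.001, decay_start=10, decay=0.9):
    """Learning rate for a 1-based epoch; `decay` is a hypothetical factor."""
    if epoch <= decay_start:
        return base_lr
    return base_lr * decay ** (epoch - decay_start)


def should_stop(val_errors, min_epochs=15, patience=5):
    """True once `patience` consecutive epochs after `min_epochs` fail to improve."""
    if len(val_errors) <= min_epochs:
        return False
    best = min(val_errors[:min_epochs])
    bad = 0
    for err in val_errors[min_epochs:]:
        if err < best:
            best = err
            bad = 0
        else:
            bad += 1
            if bad >= patience:
                return True
    return False
```

A training loop would call `learning_rate(epoch)` at the start of each epoch and `should_stop(history)` after each evaluation pass.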

As a result, the acoustic model is stored in the storage unit 17 as a statistical model. The storage unit 17 also stores, as part of the statistical model, the per-dimension maximum and minimum values of the 332-dimensional input data of the acoustic model, which consists of the 312-dimensional binary values, 13-dimensional numerical data, 4-dimensional time data and 3-dimensional adjustment data (power data, pitch data and intonation data) of the language feature amount. The storage unit 17 further stores, as part of the statistical model, the per-dimension mean and standard deviation of the 199-dimensional acoustic feature data, which is the output data of the acoustic model.

As described above, in the learning device 1 of the embodiment of the present invention, the language analysis unit 11 performs known language analysis processing on the text of the speech corpus and obtains a language feature amount for each phoneme. The voice analysis unit 12 cuts out the voice signal corresponding to the text of the speech corpus frame by frame, performs known acoustic analysis processing on the voice signal of each frame, and obtains an acoustic feature amount for each frame.

Using a known phoneme alignment technique, the association unit 13 temporally associates the language feature amount for each phoneme with the acoustic feature amount for each frame and obtains the time length of each phoneme. The association unit 13 then generates the language feature amount for each phoneme with time information added, and the associated acoustic feature amount for each frame (the time length being data for each phoneme).

The adjustment amount addition unit 14 adds the adjustment amount information of the adjustment parameters to the language feature amount for each phoneme to which the time information has been added. The acoustic feature amount adjustment unit 15 adjusts the acoustic feature amount for each frame (the time length being data for each phoneme) according to the adjustment parameters.

Based on the 326-dimensional data consisting of the 312-dimensional binary values, 13-dimensional numerical data and 1-dimensional adjustment data (speech rate data) of the language feature amount, the learning unit 16 obtains the maximum and minimum values for each dimension and standardizes every data item. The learning unit 16 also obtains the mean and standard deviation of the 1-dimensional time length data and standardizes that data.

For each phoneme, the learning unit 16 trains the time length model using the 326-dimensional standardized language feature data as input data and the 1-dimensional standardized time length data as output data.

Based on the 332-dimensional data consisting of the 312-dimensional binary values, 13-dimensional numerical data, 4-dimensional time data and 3-dimensional adjustment data (power data, pitch data and intonation data) of the language feature amount, the learning unit 16 obtains the maximum and minimum values for each dimension and standardizes every data item. The learning unit 16 also obtains the mean and standard deviation for each dimension of the 199-dimensional acoustic feature data and standardizes every data item.

For each frame, the learning unit 16 trains the acoustic model using the 332-dimensional standardized language feature data as input data and the 199-dimensional standardized acoustic feature data as output data.

As a result, the storage unit 17 stores, as the trained statistical model, the time length model and the acoustic model, both of which reflect the adjustment amount information of the adjustment parameters, together with the maximum values and other statistics.

The voice synthesis device 2 described later then uses the trained model reflecting the adjustment amount information of the adjustment parameters to estimate the acoustic feature amount from the language feature amount to which the adjustment amount information has been added, and generates a synthesized voice signal from the acoustic feature amount for each frame.

In the hypothetical example shown in FIG. 17, which combines the prior art of Non-Patent Documents 1 and 2, the adjustment is applied to acoustic feature amounts that have already been temporally smoothed by estimation with the trained model, and the synthesized voice signal is generated from the adjusted acoustic feature amount for each frame. The sound quality of the synthesized voice signal therefore deteriorates. Moreover, because the adjustment is applied only to the acoustic feature amounts corresponding to a specific part of the input sentence, a discontinuity arises in the synthesized voice signal at the junction between the adjusted part and the adjacent unadjusted part.

In contrast, the voice synthesis device 2 according to the embodiment of the present invention estimates the acoustic feature amount and generates the synthesized voice signal using a trained model that reflects the adjustment amount information of the adjustment parameters, so there is no need to adjust acoustic feature amounts that have been temporally smoothed by estimation. In addition, because the language feature amount adjusted for the specific part of the input sentence is input to the trained model to obtain the acoustic feature amount and generate the synthesized voice signal, no discontinuity arises in the synthesized voice signal at the junction between the adjusted part and the adjacent unadjusted part.

Therefore, a high-quality synthesized voice signal can be obtained when generating a synthesized voice signal in which the reading of a specific part of the text is adjusted.

In the embodiment of the present invention, the adjustment parameters are any one of, or a combination of, the speech rate R_ST, the power R_PW, the pitch R_PT and the intonation R_PD, and are selected by the user. In this case, the user selects, for example, one of 11 values for one of the four adjustment parameters and uses the standard value 1.0 as a fixed value for the other three. Alternatively, the user selects, for example, the four adjustment parameters in conjunction with one another in 11 steps.
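The two selection modes can be sketched as follows. The 11 steps, the four parameter names and the standard value 1.0 come from the description; the grid bounds 0.5 and 1.5 used in the example are hypothetical, since the actual adjustment ranges are defined elsewhere in the specification.

```python
PARAMS = ("R_ST", "R_PW", "R_PT", "R_PD")  # speech rate, power, pitch, intonation


def make_grid(low, high, steps=11):
    """Evenly spaced candidate values; low/high are hypothetical bounds."""
    return [low + (high - low) * i / (steps - 1) for i in range(steps)]


def single_parameter_settings(name, grid, default=1.0):
    """Vary one parameter over its grid and fix the other three at 1.0."""
    return [{p: (v if p == name else default) for p in PARAMS} for v in grid]


def linked_settings(grids):
    """Move all four parameters together through their steps."""
    steps = len(grids[PARAMS[0]])
    return [{p: grids[p][i] for p in PARAMS} for i in range(steps)]


grid = make_grid(0.5, 1.5)                      # hypothetical range
pitch_only = single_parameter_settings("R_PT", grid)
linked = linked_settings({p: grid for p in PARAMS})
```

Restricting training to such a small set of settings is what keeps the amount of training data low.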

Limiting the selectable range of the adjustment parameters in this way reduces the amount of training data needed to train the statistical model, so the statistical model can be trained with a low load and in a short time.

[Voice synthesis device]
Next, the voice synthesis device according to the embodiment of the present invention will be described. FIG. 10 is a block diagram showing a configuration example of the voice synthesis device, and FIG. 11 is a flowchart showing an example of the voice synthesis processing performed by the voice synthesis device.

The voice synthesis device 2 includes a language analysis unit 20, an adjustment amount addition unit 21, an acoustic feature amount estimation unit 22, a storage unit 17 and a voice generation unit 23. The storage unit 17 corresponds to the storage unit 17 shown in FIG. 1 and stores the time length model, the acoustic model, the maximum values and other statistics as the statistical model trained by the learning device 1.

The statistical model trained by the learning device 1 may be read from the storage unit 17 of the learning device 1 and stored in the storage unit 17 of the voice synthesis device 2. Alternatively, the voice synthesis device 2 may directly access the storage unit 17 of the learning device 1 via the Internet.

The language analysis unit 20 receives the text to be voice-synthesized, performs known language analysis processing on the text in the same manner as the language analysis unit 11 shown in FIG. 1, and obtains a language feature amount consisting of predetermined information for each phoneme (step S1101). The language analysis unit 20 then outputs the language feature amount for each phoneme to the adjustment amount addition unit 21.

The adjustment amount addition unit 21 receives the language feature amount for each phoneme from the language analysis unit 20 and also receives predetermined adjustment parameters. In the same manner as the adjustment amount addition unit 14 shown in FIG. 1, the adjustment amount addition unit 21 adds the adjustment amount information of the adjustment parameters to the predetermined information for each phoneme that constitutes the language feature amount (step S1102). The adjustment amount addition unit 21 outputs the language feature amount for each phoneme, with the adjustment amount information added, to the acoustic feature amount estimation unit 22.
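Step S1102 can be pictured as appending the four adjustment amounts to each phoneme's feature vector. The sketch below assumes a 325-dimensional per-phoneme vector (312 binary values plus 13 numerical values) and appends the amounts at the end in a fixed order; the vector layout, the ordering and the default of 1.0 for unspecified parameters are illustrative assumptions.

```python
ORDER = ("R_ST", "R_PW", "R_PT", "R_PD")  # speech rate, power, pitch, intonation


def add_adjustments(phoneme_features, adjustments):
    """Append the adjustment amounts to every per-phoneme feature vector.

    Parameters missing from `adjustments` fall back to 1.0 (no adjustment).
    """
    extra = [adjustments.get(name, 1.0) for name in ORDER]
    return [list(vec) + extra for vec in phoneme_features]


features = [[0.0] * 325 for _ in range(3)]       # dummy phoneme vectors
adjusted = add_adjustments(features, {"R_PT": 1.2})
```

To adjust only part of a sentence, the function could be called separately on each breath group or accent phrase with different settings.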

As described above, the predetermined adjustment parameters are any one of, or a combination of two or more of, the speech rate R_ST, the power R_PW, the pitch R_PT and the intonation R_PD, and are specified by the user. The value of each adjustment parameter may be any real number within the adjustment range described above.

The predetermined adjustment parameters may be the speech rate R_ST, the power R_PW, the pitch R_PT and the intonation R_PD, with the adjustment amount of any one of these four parameters specified as an arbitrary value within a predetermined range and fixed values used for the other three. Alternatively, the adjustment amount of each of the four parameters may be specified as an arbitrary value within its own predetermined range.

In the same manner as the adjustment amount addition unit 14 shown in FIG. 1, the adjustment amount addition unit 21 may receive different adjustment parameters for each sentence, each breath group (exhalation paragraph) or each accent phrase.

The acoustic feature amount estimation unit 22 receives the language feature amount for each phoneme from the adjustment amount addition unit 21, performs standardization and inverse standardization using the maximum values and other statistics stored in the storage unit 17, and estimates the time length of each phoneme using the time length model.

The acoustic feature amount estimation unit 22 likewise performs standardization and inverse standardization using the maximum values and other statistics stored in the storage unit 17, and estimates the acoustic feature amount for each frame using the acoustic model (step S1103). The acoustic feature amount estimation unit 22 outputs the acoustic feature amount for each frame to the voice generation unit 23.

(Estimation of time length using the time length model)
Next, the time length estimation processing performed by the acoustic feature amount estimation unit 22 using the time length model will be described. FIG. 12 is a diagram illustrating an example of the time length estimation processing using the time length model. Based on the language feature amount for each phoneme input from the adjustment amount addition unit 21, the acoustic feature amount estimation unit 22 generates the 312-dimensional binary values and 13-dimensional numerical data representing the language features, and the 1-dimensional adjustment data (speech rate data). The language feature amount thus has 326 dimensions.

The acoustic feature amount estimation unit 22 treats the 326-dimensional data consisting of the 312-dimensional binary values, 13-dimensional numerical data and 1-dimensional adjustment data (speech rate data) of the language feature amount as the input data of the time length model (step S1201).

The acoustic feature amount estimation unit 22 reads from the storage unit 17 the per-dimension maximum and minimum values of the 326-dimensional input data of the time length model, which consists of the 312-dimensional binary values, 13-dimensional numerical data and 1-dimensional adjustment data (speech rate data) of the language feature amount. The acoustic feature amount estimation unit 22 then standardizes each item of the 326-dimensional language feature data using the maximum and minimum values of its dimension (step S1202).
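The input-side standardization in step S1202 can be sketched as min-max scaling with the stored training statistics. The target range of [0, 1] and the handling of constant dimensions are assumptions; the description only states that the per-dimension maximum and minimum are used.

```python
def min_max_scale(vec, mins, maxs):
    """Scale each dimension using the stored training minimum and maximum.

    Dimensions whose minimum equals their maximum map to 0.0 (an assumption).
    """
    out = []
    for x, lo, hi in zip(vec, mins, maxs):
        out.append(0.0 if hi == lo else (x - lo) / (hi - lo))
    return out


scaled = min_max_scale([1.0, 0.75, 3.0], [0.0, 0.5, 3.0], [1.0, 1.0, 3.0])
```

Values at synthesis time that fall outside the training range would scale outside [0, 1]; whether to clip them is left open here.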

Using the time length model stored in the storage unit 17, the acoustic feature amount estimation unit 22 takes the 326-dimensional standardized language feature data as the input data of the time length model and estimates the 1-dimensional standardized time length data, which is the output data of the time length model (step S1203).

The acoustic feature amount estimation unit 22 reads from the storage unit 17 the mean and standard deviation of the 1-dimensional time length data, which is the output data of the time length model. The acoustic feature amount estimation unit 22 then applies inverse standardization, using the mean and standard deviation, to the 1-dimensional standardized time length data estimated in step S1203 (step S1204) and obtains the 1-dimensional time length data (step S1205).
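Inverse standardization in step S1204 is the inverse of z-scoring. The sketch below assumes the model output was standardized as (x - mean) / std during training; the function name and the numeric values are illustrative. The same function applies per dimension to the 199-dimensional acoustic feature data.

```python
def destandardize(values, means, stds):
    """Invert z-scoring per dimension: x = z * std + mean."""
    return [z * s + m for z, m, s in zip(values, means, stds)]


# One-dimensional time length example with made-up statistics.
duration = destandardize([0.5], [80.0], [20.0])
```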

In this way, the 1-dimensional time length data for each phoneme is obtained from the 326-dimensional language feature data for each phoneme, using the time length model stored in the storage unit 17, the per-dimension maximum and minimum values of the 326-dimensional language feature data that is the input data of the time length model, and the mean and standard deviation of the 1-dimensional time length data that is the output data of the time length model.

(Estimation of acoustic feature amount using the acoustic model)
Next, the acoustic feature amount estimation processing performed by the acoustic feature amount estimation unit 22 using the acoustic model will be described. FIG. 13 is a diagram illustrating an example of the acoustic feature amount estimation processing using the acoustic model. Based on the 1-dimensional time length data for each phoneme obtained in step S1205, the acoustic feature amount estimation unit 22 generates 4-dimensional time data for each of the frames corresponding to the phoneme, in the same manner as step S901 of FIG. 9 (step S1301).

Based on the language feature amount for each phoneme input from the adjustment amount addition unit 21, the acoustic feature amount estimation unit 22 generates the 312-dimensional binary values and 13-dimensional numerical data representing the language features, and the 3-dimensional adjustment data (power data, pitch data and intonation data). From this 328-dimensional language feature data for each phoneme, the acoustic feature amount estimation unit 22 then generates 328-dimensional language feature data for each frame.

The acoustic feature amount estimation unit 22 treats the 328-dimensional data for each frame, consisting of the 312-dimensional binary values, 13-dimensional numerical data and 3-dimensional adjustment data (power data, pitch data and intonation data) of the language feature amount, together with the 4-dimensional time data generated in step S1301, as the input data of the acoustic model (step S1302).
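Steps S1301 and S1302 amount to repeating each phoneme's 328-dimensional vector once per frame and attaching 4-dimensional time data to every frame. In the sketch below the time length is assumed to be already expressed as a frame count, and `example_time_features` is a hypothetical placeholder: the actual definition of the four time values is given in step S901 and is not reproduced here.

```python
def expand_to_frames(phoneme_vectors, frame_counts, time_features):
    """Build per-frame input vectors from per-phoneme vectors.

    time_features(i, n) must return the 4 time values for frame i of n.
    """
    frames = []
    for vec, n in zip(phoneme_vectors, frame_counts):
        for i in range(n):
            frames.append(list(vec) + list(time_features(i, n)))
    return frames


def example_time_features(i, n):
    """Hypothetical placeholder, NOT the definition used in step S901."""
    return [i, n - 1 - i, n, (i + 0.5) / n]


vectors = [[0.0] * 328, [1.0] * 328]
frames = expand_to_frames(vectors, [2, 3], example_time_features)
```

Each resulting frame vector has 328 + 4 = 332 dimensions, matching the input layer of the acoustic model.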

The acoustic feature amount estimation unit 22 reads from the storage unit 17 the per-dimension maximum and minimum values of the 332-dimensional input data of the acoustic model, which consists of the 312-dimensional binary values, 13-dimensional numerical data, 4-dimensional time data and 3-dimensional adjustment data (power data, pitch data and intonation data) of the language feature amount. The acoustic feature amount estimation unit 22 then standardizes each item of the 332-dimensional data, consisting of the 328-dimensional language feature data and the 4-dimensional time data, using the maximum and minimum values of its dimension (step S1303).

Using the acoustic model stored in the storage unit 17, the acoustic feature amount estimation unit 22 takes the 332-dimensional standardized data, consisting of the 328-dimensional standardized language feature data and the 4-dimensional standardized time data, as the input data of the acoustic model and estimates the 199-dimensional standardized acoustic feature data, which is the output data of the acoustic model (step S1304).

The acoustic feature amount estimation unit 22 reads from the storage unit 17 the per-dimension mean and standard deviation of the 199-dimensional acoustic feature data, which is the output data of the acoustic model. The acoustic feature amount estimation unit 22 then applies inverse standardization, using the mean and standard deviation of each dimension, to the 199-dimensional standardized acoustic feature data estimated in step S1304 (step S1305), and generates the 199-dimensional acoustic feature data for each frame (step S1306).

The acoustic feature amounts estimated and inverse-standardized in this way take discrete values from frame to frame. The acoustic feature amount estimation unit 22 therefore applies maximum likelihood estimation or a moving average to the 199-dimensional acoustic feature data of consecutive frames and obtains new 199-dimensional acoustic feature data for each frame. As a result, the acoustic feature amount varies smoothly from frame to frame.
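The moving-average option can be sketched as a centered window applied independently to each dimension. The window length of 3 and the shrinking window at the sequence edges are assumptions; the maximum likelihood alternative is not shown.

```python
def moving_average(frames, window=3):
    """Centered moving average per dimension; the window shrinks at the edges."""
    half = window // 2
    n = len(frames)
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        span = frames[lo:hi]
        # zip(*span) groups the values of each dimension across the window.
        smoothed.append([sum(col) / len(span) for col in zip(*span)])
    return smoothed


result = moving_average([[0.0], [3.0], [6.0]])
```

In practice, binary dimensions such as VUV would probably be excluded from smoothing, although the description does not say so.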

In this way, the 199-dimensional acoustic feature data for each frame is obtained from the 328-dimensional language feature data and the 4-dimensional time data for each frame, using the acoustic model stored in the storage unit 17, the per-dimension maximum and minimum values of the 332-dimensional language feature data that is the input data of the acoustic model, and the per-dimension mean and standard deviation of the 199-dimensional acoustic feature data that is the output data of the acoustic model.

Returning to FIGS. 10 and 11, the voice generation unit 23 receives the acoustic feature amount for each frame from the acoustic feature amount estimation unit 22 and synthesizes a voice signal based on it (step S1104). The voice generation unit 23 then outputs a voice signal in which the text to be voice-synthesized has been adjusted according to the adjustment parameters.

FIG. 14 is a diagram illustrating an example of the voice synthesis processing performed by the voice generation unit 23. From the acoustic feature amount for each frame input from the acoustic feature amount estimation unit 22, the voice generation unit 23 selects the static acoustic features, namely the mel-cepstrum coefficients MGC, the log pitch frequency LF0 and the band aperiodic component BAP of each frame (step S1401).

The voice generation unit 23 applies a mel-cepstrum-to-spectrum conversion to the mel-cepstrum coefficients MGC to obtain the spectrum (step S1402). The voice generation unit 23 also obtains the voiced/unvoiced determination information VUV from the log pitch frequency LF0, exponentiates LF0 in voiced sections, sets unvoiced and silent sections to zero, and thereby obtains the pitch frequency (step S1403). The voice generation unit 23 further applies a mel-cepstrum-to-spectrum conversion to the band aperiodic component BAP to obtain the aperiodic component (step S1404).
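Step S1403 can be sketched as follows. The description does not state how VUV is derived from LF0, so the rule used here (frames whose LF0 is at or below a threshold count as unvoiced) and the threshold value are hypothetical; the exponentiation of voiced frames and the zeroing of unvoiced frames follow the description.

```python
import math


def lf0_to_f0(lf0_frames, unvoiced_threshold=0.0):
    """Return (vuv, f0) for a sequence of log pitch frequencies.

    The thresholding rule is a hypothetical way to derive VUV from LF0.
    """
    vuv = [lf0 > unvoiced_threshold for lf0 in lf0_frames]
    f0 = [math.exp(lf0) if voiced else 0.0 for lf0, voiced in zip(lf0_frames, vuv)]
    return vuv, f0


vuv, f0 = lf0_to_f0([math.log(200.0), -1e10, math.log(100.0)])
```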

Using the per-frame spectrum obtained in step S1402, the per-frame pitch frequency obtained in step S1403 and the per-frame aperiodic component obtained in step S1404, the voice generation unit 23 generates a continuous voice waveform (step S1405) and outputs the voice signal (step S1406).

In this way, a voice signal is obtained in which the text to be voice-synthesized has been adjusted according to the predetermined adjustment parameters.

As described above, in the voice synthesis device 2 of the embodiment of the present invention, the language analysis unit 20 performs known language analysis processing on the text to be voice-synthesized and obtains a language feature amount for each phoneme, and the adjustment amount addition unit 21 adds the adjustment amount information of the adjustment parameters to the language feature amount for each phoneme.

The acoustic feature amount estimation unit 22 standardizes the 326-dimensional data consisting of the 312-dimensional binary values, 13-dimensional numerical data and 1-dimensional adjustment data (speech rate data) of the language feature amount, using the maximum values and other statistics stored in the storage unit 17. Using the time length model stored in the storage unit 17, the acoustic feature amount estimation unit 22 then takes this standardized data as input data and estimates the 1-dimensional standardized time length data as output data.

The acoustic feature amount estimation unit 22 applies inverse standardization to the 1-dimensional standardized time length data using the mean and other statistics stored in the storage unit 17, and obtains the time data for each frame. The acoustic feature amount estimation unit 22 standardizes, using the maximum values and other statistics stored in the storage unit 17, the 4-dimensional time data together with the 328-dimensional data that consists of the 312-dimensional binary values, 13-dimensional numerical data and 3-dimensional adjustment data (power data, pitch data and intonation data) out of the 329-dimensional language feature data. Using the acoustic model stored in the storage unit 17, the acoustic feature amount estimation unit 22 then takes this standardized data as input data and estimates the 199-dimensional standardized acoustic feature data as output data.

The acoustic feature amount estimation unit 22 applies inverse standardization to the 199-dimensional standardized acoustic feature data using the mean and other statistics stored in the storage unit 17, and obtains the acoustic feature amount for each frame. The voice generation unit 23 then synthesizes a voice signal based on the acoustic feature amount for each frame and generates the synthesized voice signal.

In the assumed example shown in FIG. 17, which combines the prior art of Non-Patent Documents 1 and 2, adjustment is applied to acoustic feature amounts that have been temporally smoothed by estimation with the learning model, and the synthesized voice signal is generated from the adjusted acoustic feature amounts for each frame. As a result, the sound quality of the synthesized voice signal is degraded. Furthermore, because the adjustment is applied to the acoustic feature amounts corresponding to a specific portion of the input sentence and the synthesized voice signal is generated from the adjusted acoustic feature amounts for each frame, a discontinuity arises in the synthesized voice signal at the junction between the adjusted portion and the adjacent unadjusted portion.

In contrast, the speech synthesizer 2 according to the embodiment of the present invention estimates the acoustic feature amount using a learning model in which the adjustment amount information of the adjustment parameters is reflected, and generates the synthesized voice signal from that estimate. There is therefore no need to apply adjustment to acoustic feature amounts that have been temporally smoothed by estimation with the learning model. In addition, because the language feature amount corresponding to a specific portion of the input sentence is adjusted before being input to the learning model to obtain the acoustic feature amount and generate the synthesized voice signal, no discontinuity arises in the synthesized voice signal at the junction between the adjusted portion and the adjacent unadjusted portion.
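To make the idea of adjusting only a specific portion on the input side concrete, the following Python sketch appends four adjustment amounts to each phoneme's language features, using specified values inside a chosen span and a neutral value elsewhere. The function name, the span representation, the parameter ordering, and the neutral value of 1.0 are all assumptions for illustration; the text does not state how the adjustment amounts are encoded.

```python
from typing import Dict, List, Sequence, Tuple

# The four adjustment parameters named in the specification; the order is assumed.
PARAMS = ("rate", "power", "pitch", "intonation")
NEUTRAL = 1.0  # hypothetical "no adjustment" value

Span = Tuple[int, int, Dict[str, float]]  # (start, end exclusive, amounts)


def add_adjustments(ling_features: Sequence[Sequence[float]],
                    spans: Sequence[Span]) -> List[List[float]]:
    """Append per-phoneme adjustment amounts to the language features.

    Phonemes outside every span receive the neutral value, so adjusted and
    unadjusted portions pass through the same model in a single sequence.
    """
    out: List[List[float]] = []
    for i, feats in enumerate(ling_features):
        adj = {p: NEUTRAL for p in PARAMS}
        for start, end, amounts in spans:
            if start <= i < end:
                adj.update(amounts)
        out.append(list(feats) + [adj[p] for p in PARAMS])
    return out
```

With 325-dimensional inputs (312 binary values plus 13 numerical values), each output vector has 329 dimensions, matching the size of the language feature amount data with adjustment amounts described above.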

Although the present invention has been described above with reference to an embodiment, the present invention is not limited to that embodiment and can be modified in various ways without departing from its technical idea.

An ordinary computer can be used as the hardware configuration of the learning device 1 and the speech synthesizer 2 according to the embodiment of the present invention. The learning device 1 and the speech synthesizer 2 are each constituted by a computer provided with a CPU, a volatile storage medium such as a RAM, a non-volatile storage medium such as a ROM, an interface, and the like.

The functions of the storage units 10 and 17, the language analysis unit 11, the voice analysis unit 12, the association unit 13, the adjustment amount addition unit 14, the acoustic feature amount adjustment unit 15, and the learning unit 16 provided in the learning device 1 are each realized by causing the CPU to execute a program describing those functions. Likewise, the functions of the language analysis unit 20, the adjustment amount addition unit 21, the acoustic feature amount estimation unit 22, the storage unit 17, and the voice generation unit 23 provided in the speech synthesizer 2 are each realized by causing the CPU to execute a program describing those functions.

These programs are stored in the storage medium described above and are read out and executed by the CPU. These programs can also be stored and distributed on storage media such as magnetic disks (floppy (registered trademark) disks, hard disks, etc.), optical disks (CD-ROM, DVD, etc.), and semiconductor memories, and can be transmitted and received via a network.

1 Learning device
2 Speech synthesizer
10, 17 Storage unit
11, 20 Language analysis unit
12 Voice analysis unit
13 Association unit
14, 21 Adjustment amount addition unit
15 Acoustic feature amount adjustment unit
16 Learning unit
22 Acoustic feature amount estimation unit
23 Voice generation unit

Claims

A speech synthesizer comprising:
a language analysis unit that linguistically analyzes a text to be speech-synthesized to obtain a language feature amount;
an adjustment amount addition unit that adds, to the language feature amount obtained by the language analysis unit, adjustment amount information of an adjustment parameter for adjusting acoustic characteristics;
an acoustic feature amount estimation unit that estimates an acoustic feature amount using a statistical model learned in advance, based on the language feature amount to which the adjustment amount information has been added by the adjustment amount addition unit; and
a voice generation unit that synthesizes a voice signal based on the acoustic feature amount estimated by the acoustic feature amount estimation unit and outputs the voice signal in which the text has been adjusted by the adjustment parameter.

The speech synthesizer according to claim 1, wherein
the statistical model consists of a time length model and an acoustic model composed of a neural network, and
the acoustic feature amount estimation unit:
uses the time length model, with the language feature amount for each phoneme as input data of the time length model, to estimate the time length for each phoneme, which is the output data of the time length model;
generates the time length for each frame from the time length for each phoneme; and
uses the acoustic model, with the language feature amount for each frame and the time length for each frame as input data, to estimate the acoustic feature amount for each frame, which is the output data of the acoustic model.

The speech synthesizer according to claim 1 or 2, wherein
the adjustment parameter is any one of, or a combination of two or more of, four parameters: speech rate or time length, power, pitch, and intonation.

The speech synthesizer according to claim 1 or 2, wherein
the adjustment parameters are four parameters: speech rate or time length, power, pitch, and intonation, and
an arbitrary value within a predetermined range is specified as the adjustment amount of any one of the four parameters, and the adjustment amounts of the other three parameters are fixed values.

The speech synthesizer according to claim 1 or 2, wherein
the adjustment parameters are four parameters: speech rate or time length, power, pitch, and intonation, and
an arbitrary value within a predetermined range is specified as the adjustment amount of each of the four parameters.

A program for operating a computer as the speech synthesizer according to any one of claims 1 to 5.