JP2023171025A - Training device, training method, and training program

Info

Publication number: JP2023171025A
Application number: JP2022083206A
Authority: JP
Inventors: Yusuke Ijima (井島 勇祐); Daisuke Saito (齋藤 大輔)
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2023-12-01

Abstract

PROBLEM TO BE SOLVED: To generate synthesized speech of an arbitrary speaker at an arbitrary speech skill without changing the speaker's individuality.

SOLUTION: A training device that trains a speech synthesis model acquires the speech skill of each speaker from the speech data of a plurality of speakers. The training device then generates, from the speech data and utterance information of the plurality of speakers, speech data and utterance information in which the speech skill of each speaker is lowered (pseudo speech data and pseudo utterance information). Using training data that includes the generated pseudo speech data and pseudo utterance information, the training device trains the speech synthesis model, which outputs, for input text, the speech features of synthesized speech of a designated speaker and speech skill.

SELECTED DRAWING: Figure 6

Description

The present invention relates to a learning device, a learning method, and a learning program for training a speech synthesis model.

Conventionally, in the field of speech synthesis, techniques that perform speech synthesis using a deep neural network (DNN) have been proposed (see Non-Patent Document 1). Within this line of work, a technique has also been proposed in which the speech features (e.g., spectrum and pitch) of multiple speakers are modeled with a single DNN (see Patent Document 1). With this technique, high-quality synthesized speech can be generated even when the amount of speech data per speaker is small.

Patent Document 1: Japanese Patent No. 6622505

Non-Patent Document 1: Heiga Zen, et al., "Statistical Parametric Speech Synthesis Using Deep Neural Networks", Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
Non-Patent Document 2: Ozuru Takuya, et al., "Are you professional?: Analysis of prosodic features between a newscaster and amateur speakers through partial substitution by DNN-TTS", Proc. 10th International Conference on Speech Prosody 2020, 2020.
Non-Patent Document 3: Masuko et al., "Speech synthesis based on HMM using dynamic features", IEICE Transactions, vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.
Non-Patent Document 4: Imai et al., "Mel log spectrum approximation (MLSA) filter for speech synthesis", IEICE Transactions A, vol. J66-A, no. 2, pp. 122-129, Feb. 1983.

However, the speech skill (how easy the speech is to listen to) of the synthesized speech generated by the above techniques depends on the quality of the recorded speech. Consequently, if the speech skill of the recorded speech is low, the speech skill of the generated synthesized speech is also low.

To generate synthesized speech of a user's desired speaker at an arbitrary speech skill, one conceivable approach is to input an identifier indicating the speaker of the synthesized speech, together with a speech skill, to the DNN.

However, because this approach assigns only one speech skill to each speaker when the DNN is trained, changing the speech skill that the user inputs to the DNN also changes the speaker characteristics of the synthesized speech.

Therefore, an object of the present invention is to generate synthesized speech of an arbitrary speaker at an arbitrary speech skill without changing the speaker characteristics.

To solve the above problem, the present invention comprises: an acquisition unit that acquires voice information of a plurality of speakers and the speech skill of each of the speakers; a generation unit that generates, for each piece of voice information of a single speaker, voice information in which the speech skill is converted to a lower level; and a learning unit that trains a speech synthesis model, which outputs, for input text, the speech features of synthesized speech of a specified speaker and speech skill, using as learning data the voice information of the plurality of speakers, the speech skill of each of the speakers, and the voice information of the speakers in which the speech skill has been converted to a lower level.

According to the present invention, synthesized speech of an arbitrary speaker at an arbitrary speech skill can be generated without changing the speaker characteristics.

FIG. 1 is a diagram for explaining an overview of the speech synthesis model trained by the learning device of this embodiment.
FIG. 2 is a diagram showing an example of phoneme segmentation information.
FIG. 3 is a diagram showing an example of phoneme segmentation information to which a speaker identifier and a speech skill have been attached.
FIG. 4 is a diagram showing an example of phoneme segmentation information to which a speaker identifier and a speech skill have been attached.
FIG. 5 is a diagram showing a configuration example of the learning device.
FIG. 6 is a diagram for explaining an example of the procedure by which the learning device trains the speech synthesis model.
FIG. 7 is a diagram for explaining an example of the procedure for generating synthesized speech using the trained speech synthesis model.
FIG. 8 is a flowchart showing an example of the procedure by which the learning device trains the speech synthesis model.
FIG. 9 is a flowchart showing an example of the procedure by which the learning device generates synthesized speech using the trained speech synthesis model.
FIG. 10 is a diagram showing a configuration example of a computer that executes the learning program.

Hereinafter, modes for carrying out the present invention (embodiments) will be described with reference to the drawings. The present invention is not limited to these embodiments.

[Overview]
First, an overview of the learning device of this embodiment is given with reference to FIG. 1. The learning device trains a speech synthesis model that generates synthesized speech for input text.

As shown in FIG. 1, for example, this speech synthesis model receives as input the reading and accent of the text to be synthesized, a speaker identifier, and a speech skill, and outputs the speech features of synthesized speech in which that speaker utters the text at that speech skill. The speech synthesis model is implemented by, for example, a DNN (deep neural network).

The learning device trains the speech synthesis model using, as learning data, speech data, the identifier of the speaker of the speech data, and the speech skill of that speaker.

Here, the learning device generates, from the learning data, data with a lowered speech skill (the pseudo speech data, pseudo speech skill, and pseudo utterance information in FIG. 2). The learning device then trains the speech synthesis model using learning data that includes the generated data.

As a result, when training the speech synthesis model, the learning device can learn multiple speech skills for a single speaker. Consequently, the trained speech synthesis model can output speech features of synthesized speech whose speaker characteristics change as little as possible, whatever speaker and speech skill it receives as input.

[Speech data]
Next, the speech data and utterance information used to train the speech synthesis model are described, beginning with the speech data.

The speech data used to train the speech synthesis model consists of speech data from N speakers (N is 2 or more). It includes speech features obtained by applying signal processing to the speech signal. Examples of speech features are pitch parameters (such as the fundamental frequency) and spectral parameters (such as the cepstrum and mel-cepstrum) of the speech signal.

The speech data used to train the speech synthesis model should preferably include speech data from speakers with a range of speech skills. For example, it should preferably include both speech data from speakers with high speech skill (e.g., announcers and voice actors) and speech data from speakers with low speech skill (e.g., speakers other than announcers and voice actors).

[Utterance information]
Next, the utterance information used to train the speech synthesis model is described. Utterance information is, for example, information indicating the pronunciation and the like of each utterance that makes up the speech data. It includes, for example, phoneme segmentation information.

Phoneme segmentation information is, for example, information indicating the start time and end time of each phoneme that makes up an utterance, as shown in FIG. 2. The start and end times are elapsed times measured from the start of the utterance, which is taken as 0 seconds.

In addition to phoneme segmentation information, the utterance information may include accent information (accent type, accent phrase length), part-of-speech information, and the like.

The learning device also attaches to the utterance information the identifier of its speaker (speaker identifier) and the speech skill of that speaker. For example, as shown in FIGS. 3 and 4, the learning device attaches the speaker identifier and speech skill information to the phoneme segmentation information of an utterance. The speech skill information is expressed, for example, as a one-dimensional vector ranging from 0 (low speech skill) to 1 (high speech skill). How the speech skill is assigned is described later.
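The annotated utterance information described above can be pictured as a small record type. The following is only an illustrative sketch: the class names `PhonemeSegment` and `Utterance`, the field names, and the sample phonemes and times are assumptions made here, not the format shown in FIGS. 2 to 4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeSegment:
    phoneme: str
    start: float  # seconds from the start of the utterance (0 s)
    end: float    # seconds from the start of the utterance (0 s)


@dataclass
class Utterance:
    speaker_id: str      # speaker identifier attached to the utterance
    speech_skill: float  # 0.0 (low speech skill) to 1.0 (high speech skill)
    segments: List[PhonemeSegment]

    def durations(self) -> List[float]:
        """Return the duration of each phoneme in seconds."""
        return [seg.end - seg.start for seg in self.segments]


# Hypothetical example values, chosen only for illustration.
utt = Utterance(
    speaker_id="A",
    speech_skill=0.8,
    segments=[
        PhonemeSegment("k", 0.00, 0.05),
        PhonemeSegment("o", 0.05, 0.15),
        PhonemeSegment("N", 0.15, 0.25),
    ],
)
```

Keeping the speaker identifier and speech skill on the utterance record, rather than on each phoneme, is one possible layout; the source only states that both are attached to the phoneme segmentation information.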

[Configuration example]
Next, a configuration example of the learning device is described with reference to FIG. 5. The learning device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.

The input/output unit 11 is an interface that handles the input and output of various data. For example, the input/output unit 11 receives the learning data for the speech synthesis model (speech data, utterance information of the speech data, and speaker identifiers of the speech data). The input/output unit 11 also receives the reading and accent of the text to be synthesized, a speaker identifier, a speech skill, and so on. In addition, the input/output unit 11 outputs the synthesized speech and other results generated by the control unit 13.

The storage unit 12 is implemented by a semiconductor memory element such as RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk.

The storage unit 12 stores, for example, a processing program for implementing the functions of the learning device 10 and the learning data used to train the speech synthesis model (speech data, utterance information of the speech data, speaker identifiers of the speech data, and so on).

When the learning device 10 acquires the speech skills of the speakers of the speech data included in the learning data, the speech skill of each speaker is added to the learning data in the storage unit 12. Furthermore, when the learning device 10 trains the speech synthesis model, the resulting model parameters are stored in the storage unit 12.

The control unit 13 is implemented using, for example, a CPU (Central Processing Unit), and executes the processing program stored in the storage unit 12. The control unit 13 thereby functions as the speech skill acquisition unit 131, acquisition unit 132, generation unit 133, vector representation conversion unit 134, learning unit 135, input reception unit 136, text analysis unit 137, speech feature generation unit 138, and synthesized speech generation unit 139 shown in FIG. 5.

[Speech skill acquisition unit]
The speech skill acquisition unit 131 acquires the speech skill of each speaker of the speech data included in the learning data (see FIG. 6). For example, when the speech skill acquisition unit 131 receives, via the input/output unit 11, a subjective evaluation of the speech skill of a speaker of the speech data from a user (evaluator), it assigns a speech skill to that speaker based on the evaluation result.

(1) For example, the evaluator evaluates the speech skills of the N speakers included in the speech data by the paired comparison method. In the paired comparison method, for every pair of speakers (${}_N\mathrm{C}_2$ pairs; for N = 10, ${}_{10}\mathrm{C}_2 = 45$ pairs), the evaluator judges which speaker has the higher speech skill.

(2) As the evaluation scale, for example, an AB test, in which the evaluator selects which of the two speakers in a pair has the higher speech skill, or scoring by comparison category rating (CCR) can be used. Example scores are: -2: the latter speaker has higher speech skill; -1: the latter speaker has slightly higher speech skill; 0: the two speakers have the same speech skill; 1: the former speaker has slightly higher speech skill; 2: the former speaker has higher speech skill.

(3) In addition, for example, multiple evaluators evaluate the same pair of speakers, and the speech skill acquisition unit 131 takes the mean of their evaluation results as the speech skill between those speakers. To ensure reliable evaluation, it is desirable to have 10 or more evaluators per pair of speakers.

(4) The speech skill acquisition unit 131 acquires the between-speaker scores obtained by the above method as speech skills. The speech skill acquisition unit 131 may also convert the between-speaker scores into a one-dimensional scale using, for example, Scheffé's paired comparison method. The speech skill of each speaker acquired by the speech skill acquisition unit 131 is stored in the storage unit 12.
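Steps (1) to (4) can be sketched in a few lines. This is a deliberately simplified stand-in and not Scheffé's paired comparison method: it averages the CCR scores over evaluators for each pair, averages each speaker's preference over all opponents, and min-max normalizes the result to the 0 to 1 range used for speech skill. The function name `pairwise_scale`, the `ratings` layout, and the sample scores are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean


def pairwise_scale(speakers, ratings):
    """Map averaged CCR pair scores to one value per speaker in [0, 1].

    ratings[(a, b)] is a list of evaluator scores in -2..2 for the pair
    (a, b); a positive score means the former speaker a is more skilled.
    """
    totals = {s: 0.0 for s in speakers}
    for a, b in combinations(speakers, 2):
        avg = mean(ratings[(a, b)])  # step (3): average over evaluators
        totals[a] += avg
        totals[b] -= avg
    # Average preference against the other N-1 speakers.
    n = len(speakers)
    raw = {s: totals[s] / (n - 1) for s in speakers}
    # Min-max normalization to [0, 1] (an assumption of this sketch).
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {s: 0.5 for s in speakers}
    return {s: (raw[s] - lo) / (hi - lo) for s in speakers}


speakers = ["A", "B", "C"]
ratings = {
    ("A", "B"): [-2, -1, -2],  # B judged more skilled than A
    ("A", "C"): [1, 0, 1],     # A judged slightly more skilled than C
    ("B", "C"): [2, 2, 1],     # B judged more skilled than C
}
skills = pairwise_scale(speakers, ratings)
```

With these made-up ratings, speaker B receives the highest value and speaker C the lowest, with A in between. For N speakers the loop visits N(N-1)/2 pairs, which is 45 for N = 10.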

[Acquisition unit]
The acquisition unit 132 acquires the voice information of a plurality of speakers that serves as learning data (speech data, utterance information of the speech data, and speaker identifiers of the speech data), together with the speech skill of each speaker acquired by the speech skill acquisition unit 131.

[Generation unit]
The generation unit 133 generates, for each piece of voice information of a single speaker, voice information in which the speech skill is converted to a lower level. For example, from the voice information acquired by the acquisition unit 132, the generation unit 133 generates speech data with lowered speech skill (pseudo speech data), utterance information (pseudo utterance information), and a pseudo speech skill (see FIG. 6).

As one example, from the speech data and utterance information of speaker B, whose speech skill is 1, the generation unit 133 generates pseudo speech data and pseudo utterance information converted to a speech skill of 0.8, to a speech skill of 0.5, and to a speech skill of 0.2.

The generation unit 133 then stores the generated pseudo speech data, pseudo utterance information, and pseudo speech skills in the storage unit 12.

For example, to generate the pseudo speech data, pseudo utterance information, and pseudo speech skill, the generation unit 133 performs either or both of (1) conversion of the speech parameters of the speech data included in the voice information and (2) conversion of the utterance information (phoneme segmentation information) included in the voice information.

[Conversion of pitch parameters]
An example of the processing procedure by which the generation unit 133 generates pseudo speech data and a pseudo speech skill by converting the speech parameters of the speech data is described.

The generation unit 133 generates pseudo speech data by, for example, changing the distribution of the pitch parameters within each utterance in the speech data of the learning data. As one example, the generation unit 133 generates a pseudo pitch parameter $F0'_{ij}$ using the mean $\mu_{ij}$ of the pitch parameters $F0_{ij}(t)$ ($t = 1, \dots, T$, where $T$ is the number of frames of the speech parameters) of the j-th utterance of speaker i in the speech data of the learning data. For example, the generation unit 133 uses Equation (1) below to convert each pitch parameter so that the standard deviation of the pitch parameters within the utterance becomes smaller.

$$F0'_{ij}(t) = \left(F0_{ij}(t) - \mu_{ij}\right) \times \alpha + \mu_{ij} \qquad (0 \le \alpha \le 1) \tag{1}$$

Here, $\alpha$ is the conversion weight. When $\alpha = 1$, the generation unit 133 does not convert the pitch parameters. When $\alpha = 0$, every pitch parameter becomes equal to the mean $\mu_{ij}$.

The generation unit 133 also obtains the pseudo speech skill $S'_i$ by converting the speech skill $S_i$ of speaker i according to Equation (2) below.

$$S'_i = \alpha \times S_i \tag{2}$$

Note that when the generation unit 133 converts the pitch parameters, it does not convert the utterance information, so the original utterance information is used as the pseudo utterance information.
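Equations (1) and (2) translate almost directly into code. The sketch below is a minimal illustration under stated assumptions: the function names `compress_pitch` and `pseudo_skill_pitch` are invented here, the F0 contour is a made-up list of per-frame values in Hz, and every frame is assumed to be voiced (how unvoiced frames are handled is not specified in the source).

```python
from statistics import mean, pstdev


def compress_pitch(f0, alpha):
    """Equation (1): shrink per-frame F0 values toward the utterance mean."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    mu = mean(f0)
    return [(x - mu) * alpha + mu for x in f0]


def pseudo_skill_pitch(skill, alpha):
    """Equation (2): S' = alpha * S."""
    return alpha * skill


f0 = [100.0, 120.0, 140.0, 120.0]  # illustrative F0 contour in Hz
pseudo_f0 = compress_pitch(f0, alpha=0.5)
pseudo_skill = pseudo_skill_pitch(1.0, alpha=0.5)
```

Because every deviation from the mean is multiplied by α, the utterance mean is preserved while the standard deviation is scaled by α, which is the flattening of intonation that the text associates with lower speech skill.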

[Conversion of utterance information (phoneme segmentation information)]
An example of the processing procedure by which the generation unit 133 generates pseudo utterance information and a pseudo speech skill by converting the utterance information (phoneme segmentation information) is described.

The generation unit 133 generates pseudo utterance information $d'_{ij}(p)$ by, for example, applying a conversion that stretches or shrinks the duration of the p-th phoneme segmentation information $d_{ij}(p)$ ($p = 1, \dots, P$, where $P$ is the number of phonemes in the phoneme segmentation information) of the j-th utterance of speaker i in the utterance information.

Here, rather than applying the above conversion to every phoneme of the j-th utterance of speaker i in the utterance information, the generation unit 133 applies it, for example, only to vowels or only to consonants, which makes the conversion nonlinear. The converted utterance information (pseudo utterance information) $d'_{ij}(p)$ is expressed by Equation (3) below.

$$d'_{ij}(p) = \beta \times d_{ij}(p) \qquad (0 \le \beta \le 2) \tag{3}$$

In Equation (3), $\beta$ is the conversion weight. When $\beta = 1$, the generation unit 133 does not convert the p-th phoneme. When $\beta = 0$, the generation unit 133 sets the duration of the p-th phoneme to 0; that is, it deletes the p-th phoneme.

Using the above $\beta$, the generation unit 133 also obtains the pseudo speech skill $S'_i$ by converting the speech skill $S_i$ of speaker i according to Equation (4) below.

$$S'_i = \left(1 - |1 - \beta|\right) \times S_i \tag{4}$$
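The duration conversion of Equations (3) and (4) can be sketched as follows. The function names, the vowel set `VOWELS`, and the sample phonemes and durations are assumptions for illustration; the source only says that the conversion is applied to vowels only or consonants only.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel inventory for illustration


def scale_durations(phonemes, durations, beta, targets=VOWELS):
    """Equation (3), applied only to phonemes in `targets`; others are unchanged."""
    if not 0.0 <= beta <= 2.0:
        raise ValueError("beta must be in [0, 2]")
    return [d * beta if p in targets else d for p, d in zip(phonemes, durations)]


def pseudo_skill_duration(skill, beta):
    """Equation (4): S' = (1 - |1 - beta|) * S."""
    return (1.0 - abs(1.0 - beta)) * skill


phonemes = ["k", "o", "N", "n", "i"]
durations = [0.05, 0.10, 0.10, 0.04, 0.12]  # seconds, illustrative
stretched = scale_durations(phonemes, durations, beta=1.5)
skill_after = pseudo_skill_duration(0.8, beta=1.5)
```

Equation (4) is symmetric around β = 1: stretching by β = 1.5 and shrinking by β = 0.5 both halve the speech skill, and the extremes β = 0 and β = 2 both give a pseudo speech skill of 0.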

In this way, the generation unit 133 converts the distribution of the pitch parameters in the speech data and stretches or shrinks the phoneme durations in the utterance information (phoneme segmentation information). It does so because the distribution of pitch parameters and the phoneme durations in speech strongly influence the evaluation of speech skill. By converting the pitch parameter distribution, phoneme durations, and so on, the generation unit 133 can generate unnatural speech that differs from natural speech (roughly, speech with low speech skill). The generation unit 133 can thus generate speech that has low speech skill yet is still uttered by the same speaker.

When generating voice information with lowered speech skill, the generation unit 133 preferably generates voice information at multiple speech skills from the voice information of a single speaker. For example, from the voice information of a speaker with a speech skill of 0.8, the generation unit 133 generates voice information of that speaker at a speech skill of 0.5 and at a speech skill of 0.2. Because the generation unit 133 generates voice information at multiple speech skills from the voice information of a single speaker, voice information at a variety of speech skills for the same speaker is added to the learning data. As a result, the learning device 10 can train a speech synthesis model capable of outputting the speech features of synthesized speech at a variety of speech skills for a specified speaker.
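If the pitch-based conversion is used, Equation (2) can be inverted to find the weight that yields a desired pseudo speech skill: α = S'/S. The sketch below works through the example in the text (speech skill 0.8 lowered to 0.5 and 0.2). The function name `alphas_for_targets` is hypothetical, and choosing α from a target skill in this way is an inference from Equation (2), not a procedure stated in the source.

```python
def alphas_for_targets(skill, targets):
    """Return alpha = S'/S (from Equation (2)) for each target pseudo skill S'."""
    if skill <= 0.0:
        raise ValueError("original speech skill must be positive")
    for t in targets:
        if not 0.0 <= t <= skill:
            raise ValueError("targets must lie in [0, skill]")
    return [t / skill for t in targets]


# Speaker with speech skill 0.8, lowered to 0.5 and 0.2 as in the text.
alphas = alphas_for_targets(0.8, [0.5, 0.2])
```

Here the weights come out to 0.625 and 0.25. Targets above the original speech skill are rejected because Equation (2) with 0 ≤ α ≤ 1 can only lower the skill.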

[Vector representation conversion unit]
The vector representation conversion unit 134 converts the utterance information, speaker identifier, and speech skill into a representation (numerical representation) that the speech synthesis model can use. For example, it converts the utterance information into a language vector, the speaker identifier into a speaker vector, and the speech skill into a speech skill vector.

For example, to convert utterance information into a language vector, the vector representation conversion unit 134 uses the technique described in Non-Patent Document 1 for converting utterance information into a numerical vector (language vector) that represents phonemes, accents, and so on.

To convert a speaker identifier into a speaker vector, the vector representation conversion unit 134 uses a one-hot representation. The dimensionality of this one-hot vector equals the number of speakers N in the speech data. In the speaker vector, the dimension corresponding to the speaker identifier has the value 1 and all other dimensions have the value 0.

Furthermore, to convert a speech skill into a speech skill vector, the vector representation conversion unit 134 may use the speech skill obtained by the speech skill acquisition unit 131 as is, or may apply some form of dimensionality reduction. For the dimensionality reduction, for example, the differences in speech skill between speakers may be turned into scale values using Scheffé's paired comparison method, or the result of principal component analysis (PCA) may be used.

For example, as shown in FIG. 6, the vector representation conversion unit 134 converts the speakers' speech skills and the pseudo speech skills included in the learning data into speech skill vectors. It also converts the utterance information and pseudo utterance information included in the learning data into language vectors, and converts the speaker identifiers included in the learning data into speaker vectors.
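The one-hot speaker vector and the assembly of a model input can be sketched as below. The function names `one_hot` and `model_input` are hypothetical, the language vector is a made-up placeholder, and concatenating the three vectors into a single input is an assumption of this sketch; the source does not state how the vectors are fed to the model.

```python
def one_hot(speaker_id, speaker_ids):
    """Return an N-dimensional one-hot speaker vector (N = number of speakers)."""
    if speaker_id not in speaker_ids:
        raise KeyError(speaker_id)
    return [1.0 if s == speaker_id else 0.0 for s in speaker_ids]


def model_input(language_vec, speaker_id, speaker_ids, skill):
    """Concatenate language, speaker, and speech skill vectors (assumed layout)."""
    return list(language_vec) + one_hot(speaker_id, speaker_ids) + [float(skill)]


speaker_ids = ["A", "B", "C"]
x = model_input([0.0, 1.0, 0.0, 0.3], "B", speaker_ids, 0.5)
```

Because the speaker vector and the speech skill vector occupy separate dimensions, the same speaker vector can be paired with different speech skill values, which is what the pseudo data provides during training.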

［学習部］
学習部１３５は、学習データを用いて音声合成モデルを学習する。例えば、学習部１３５は、学習データとして、複数の話者の音声情報と、複数の話者それぞれの発話スキルを低く変換した音声情報（疑似音声データ、疑似発話スキル、疑似発話情報）とを用いて音声合成モデルを学習する。 [Study Department]
The learning unit 135 learns a speech synthesis model using the learning data. For example, the learning unit 135 uses, as learning data, voice information of a plurality of speakers and voice information (pseudo voice data, pseudo speech skills, pseudo speech information) obtained by converting the speech skills of each of the plurality of speakers to a lower level. to learn a speech synthesis model.

例えば、学習部１３５は、ベクトル表現変換部１３４により変換された、学習データの発話情報ごとの言語ベクトル、話者ベクトルおよび発話スキルベクトルを取得する。また、学習部１３５は、各発話情報に対応する音声データの音声特徴量を取得する。そして、学習部１３５は、取得した各発話情報の言語ベクトル、話者ベクトルおよび発話スキルベクトルと、各発話情報に対応する音響特徴量とを用いて、言語ベクトル、話者ベクトルおよび発話スキルベクトルから音声特徴量を推定する音声合成モデルを学習する（図６参照）。 For example, the learning unit 135 obtains the language vector, speaker vector, and utterance skill vector for each utterance information of the learning data converted by the vector expression conversion unit 134. Further, the learning unit 135 acquires the audio feature amount of the audio data corresponding to each piece of utterance information. Then, the learning unit 135 uses the acquired language vector, speaker vector, and speech skill vector of each piece of speech information and the acoustic feature amount corresponding to each piece of speech information to calculate the language vector, speaker vector, and speech skill vector. A speech synthesis model for estimating speech features is learned (see FIG. 6).

音声合成モデルの学習アルゴリズムは、例えば、特許文献１、非特許文献１等に記載の学習アルゴリズムを用いる。学習部１３５は、学習後の音声合成モデルのパラメータを記憶部１２に記憶する。 As the learning algorithm for the speech synthesis model, for example, the learning algorithm described in Patent Document 1, Non-Patent Document 1, etc. is used. The learning unit 135 stores the parameters of the learned speech synthesis model in the storage unit 12.

The neural network used in the speech synthesis model is, for example, an ordinary multilayer perceptron (MLP), a recurrent neural network (RNN), an RNN with long short-term memory (RNN-LSTM), a convolutional neural network (CNN), or a neural network that combines these.

The speech synthesis model takes, as input data for generating synthesized speech, the identifier and the speaking skill of the target speaker. The learning unit 135 therefore trains the speech synthesis model while keeping speaker characteristics and speaking skill characteristics separate. As a result, the learning device 10 can train a speech synthesis model that generates synthesized speech of any speaking skill for any speaker.

[Input reception unit]
The input reception unit 136 receives, via the input/output unit 11, the speaker identifier, speaking skill, text, and the like that the user has specified for the synthesized speech to be generated.

[Text analysis unit]
The text analysis unit 137 analyzes the text received by the input reception unit 136. For example, the text analysis unit 137 analyzes the input text and generates information indicating its reading, accent, and the like.

[Speech feature generation unit]
The speech feature generation unit 138 uses the speech synthesis model trained by the learning unit 135 to generate, from the text to be synthesized, the speech features of synthesized speech for the speaker and speaking skill specified by the user.

For example, as shown in FIG. 7, the vector representation conversion unit 134 first converts the text analysis result generated by the text analysis unit 137 into a language vector. The vector representation conversion unit 134 also converts the speaker identifier received by the input reception unit 136 into a speaker vector, and the speaking skill received by the input reception unit 136 into a speaking skill vector.

The speech feature generation unit 138 then inputs the speaker vector, speaking skill vector, and language vector into the trained speech synthesis model and obtains the speech features of the synthesized speech output by the model. These speech features are, for example, pitch parameters used to generate the synthesized speech.

[Synthesized speech generation unit]
The synthesized speech generation unit 139 generates synthesized speech using the speech features generated by the speech feature generation unit 138. For example, the synthesized speech generation unit 139 uses the Maximum Likelihood Parameter Generation (MLPG) algorithm (see Non-Patent Document 2) to obtain a speech parameter sequence in which the speech features are smoothed in the time direction. The synthesized speech generation unit 139 then generates a speech waveform from this speech parameter sequence, using, for example, the technique described in Non-Patent Document 3. Finally, the synthesized speech generation unit 139 generates synthesized speech from the generated waveform and outputs it via the input/output unit 11.
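The time-direction smoothing performed by MLPG can be sketched in its common textbook form, which may differ in detail from Non-Patent Document 2. The sketch assumes that the model outputs, for each frame, a mean and a variance for one static parameter and for its first-order delta, with the delta defined as Δc_t = 0.5 (c_{t+1} − c_{t−1}) and clamped at the utterance edges. The smoothed trajectory c then solves (WᵀPW) c = WᵀPμ, where W stacks the static and delta windows and P holds the inverse variances. The function names and the choice of window are assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def mlpg(static_mean, delta_mean, static_var, delta_var):
    """Return the static trajectory that best explains static and delta means."""
    n = len(static_mean)
    rows, targets, precisions = [], [], []
    # Static window: one identity row per frame.
    for t in range(n):
        row = [0.0] * n
        row[t] = 1.0
        rows.append(row)
        targets.append(static_mean[t])
        precisions.append(1.0 / static_var[t])
    # Delta window: 0.5 * (c[t+1] - c[t-1]) with indices clamped at the edges.
    for t in range(n):
        row = [0.0] * n
        row[min(t + 1, n - 1)] += 0.5
        row[max(t - 1, 0)] -= 0.5
        rows.append(row)
        targets.append(delta_mean[t])
        precisions.append(1.0 / delta_var[t])
    # Normal equations: (W^T P W) c = W^T P mu.
    a = [[sum(p * r[i] * r[j] for r, p in zip(rows, precisions)) for j in range(n)]
         for i in range(n)]
    b = [sum(p * r[i] * mu for r, mu, p in zip(rows, targets, precisions))
         for i in range(n)]
    return solve(a, b)
```

A small delta variance makes the delta constraints dominate and yields a smoother trajectory, while a large delta variance leaves the static means nearly unchanged.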

[Example of processing procedure]
Next, an example of the processing procedure executed by the learning device 10 will be described. First, an example of the procedure by which the learning device 10 trains the speech synthesis model will be described with reference to FIG. 8.

The learning device 10 acquires speech information (speech data, utterance information, and speaker identifiers) for training the speech synthesis model (S1) and stores it in the storage unit 12. Next, the speaking skill acquisition unit 131 acquires the speaking skill for the speech data acquired in S1 (S2) and stores the acquired speaking skill in the storage unit 12.

After S2, the generation unit 133 generates, for each speaker's speech information, speech information in which the speaking skill is lowered (pseudo speech data, pseudo speaking skill, and pseudo utterance information) (S3). The acquisition unit 132 then acquires the lowered-skill speech information generated in S3 together with the speech information for training (S4).
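Step S3 can be sketched using the two operations named in the claims: changing the distribution of pitch parameters within an utterance, and expanding or contracting the duration of each phoneme. The concrete choices below are assumptions made for illustration: pitch is flattened by compressing per-frame log F0 toward its mean, durations are scaled by a random ratio, the pseudo speaking skill is obtained by subtracting a fixed `skill_drop`, and unvoiced frames are marked with `None`. The sketch does not resample the pitch contour to match the new durations.

```python
import random
from statistics import mean


def flatten_pitch(log_f0, scale):
    """Compress voiced log F0 values toward their mean; None marks unvoiced frames."""
    voiced = [v for v in log_f0 if v is not None]
    if not voiced:
        return list(log_f0)
    center = mean(voiced)
    return [None if v is None else center + scale * (v - center) for v in log_f0]


def perturb_durations(durations_ms, max_ratio, rng):
    """Stretch or shrink each phoneme duration by a random ratio."""
    return [d * rng.uniform(1.0 / max_ratio, max_ratio) for d in durations_ms]


def make_pseudo_record(record, scale=0.5, max_ratio=1.3, skill_drop=1.0, seed=0):
    """Build a lowered-skill copy of one utterance for the same speaker."""
    rng = random.Random(seed)
    return {
        "speaker_id": record["speaker_id"],          # speaker identity is kept
        "skill": record["skill"] - skill_drop,       # pseudo speaking skill
        "phonemes": list(record["phonemes"]),
        "durations_ms": perturb_durations(record["durations_ms"], max_ratio, rng),
        "log_f0": flatten_pitch(record["log_f0"], scale),
    }
```

Because the speaker identifier is carried over unchanged, the training data ends up containing several speaking skill levels for the same speaker.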

After S4, the vector representation conversion unit 134 converts the speech information acquired by the acquisition unit 132 in S4 into vectors whose values can be used by the speech synthesis model (S5). For example, for the lowered-skill speech information acquired in S4, the vector representation conversion unit 134 converts the speaking skill into a speaking skill vector, the utterance information into a language vector, and the speaker identifier into a speaker vector. It performs the same conversions for the speech information for training acquired in S4.

After S5, the learning unit 135 trains the speech synthesis model using the speech data included in the speech information for training and the speaking skill vectors, language vectors, and speaker vectors converted in S5 (S6).

By executing the above processing, the learning device 10 can learn a plurality of speaking skills for a single speaker when training the speech synthesis model.

Next, an example of the procedure for generating synthesized speech with the trained speech synthesis model will be described with reference to FIG. 9.

First, the input reception unit 136 of the learning device 10 receives from the user the speaker identifier, speaking skill, and text for the synthesized speech to be generated (S11). The text analysis unit 137 then analyzes the text input in S11, and the vector representation conversion unit 134 converts the information input in S11 into vectors (S12). For example, the vector representation conversion unit 134 converts the speaker identifier into a speaker vector, the speaking skill into a speaking skill vector, and the text analysis result into a language vector.

The speech feature generation unit 138 then generates the speech features of the synthesized speech using the speaker vector, speaking skill vector, and language vector converted in S12 and the trained speech synthesis model (S13). That is, the speech feature generation unit 138 generates the speech features of the synthesized speech by inputting the speaker vector, speaking skill vector, and language vector converted in S12 into the trained speech synthesis model.

The synthesized speech generation unit 139 then generates synthesized speech using the speech features generated in S13 and outputs it (S14).

By executing the above processing, the learning device 10 can output synthesized speech for the speaker and speaking skill specified by the user.
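The order of steps S11 to S14 can be summarized as a short pipeline sketch. The four callables passed in (`analyze_text`, `to_vectors`, `model`, `vocoder`) are hypothetical placeholders for the text analysis unit 137, the vector representation conversion unit 134, the trained speech synthesis model, and the synthesized speech generation unit 139; their signatures are assumptions, not interfaces defined in the description.

```python
def synthesize(text, speaker_id, skill, analyze_text, to_vectors, model, vocoder):
    """Run S12 to S14 for one request received in S11."""
    # S12: analyze the text, then convert all inputs into vectors.
    linguistic = analyze_text(text)
    language_vecs, speaker_vec, skill_vec = to_vectors(linguistic, speaker_id, skill)
    # S13: feed each language vector, with the fixed speaker and skill
    # vectors appended, into the trained model to get speech features.
    features = [model(lv + speaker_vec + skill_vec) for lv in language_vecs]
    # S14: turn the speech features into synthesized speech.
    return vocoder(features)
```

Since the speaker vector and the speaking skill vector are separate arguments, the same text can be synthesized for one speaker at several skill levels by changing only `skill`.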

[System configuration, etc.]
Each component of each illustrated unit is a functional concept and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of the devices is not limited to what is illustrated, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of each processing function performed by each device may be realized by a CPU and a program executed by the CPU, or may be realized as hardware using wired logic.

Among the processes described in the above embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings may be changed arbitrarily unless otherwise specified.

[Program]
The learning device 10 described above can be implemented by installing a program (learning program) on a desired computer as packaged software or online software. For example, by causing an information processing device to execute the program, the information processing device can be made to function as the learning device 10. The information processing devices referred to here include mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as terminals such as PDAs (Personal Digital Assistants).

FIG. 10 is a diagram showing an example of a computer that executes the learning program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process executed by the learning device 10 is implemented as the program module 1093, in which computer-executable code is written. The program module 1093 is stored in, for example, the hard disk drive 1031. For example, a program module 1093 for executing the same processing as the functional configuration of the learning device 10 is stored in the hard disk drive 1031. The hard disk drive 1031 may be replaced by an SSD (Solid State Drive).

The data used in the processing of the above embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1031. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1031 into the RAM 1012 as necessary and executes them.

The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1031. They may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read from that computer by the CPU 1020 via the network interface 1070.

10 Learning device
11 Input/output unit
12 Storage unit
13 Control unit
131 Speaking skill acquisition unit
132 Acquisition unit
133 Generation unit
134 Vector representation conversion unit
135 Learning unit
136 Input reception unit
137 Text analysis unit
138 Speech feature generation unit
139 Synthesized speech generation unit

Claims

1. A learning device comprising:
an acquisition unit that acquires speech information of a plurality of speakers and the speaking skill of each of the speakers;
a generation unit that generates, for each speaker's speech information, speech information in which the speaking skill is lowered; and
a learning unit that, using as learning data the speech information of the plurality of speakers, the speaking skill of each of the speakers, and the speech information in which the speaking skill is lowered, trains a speech synthesis model that outputs, for input text, speech features of synthesized speech of a specified speaker and speaking skill.

2. The learning device according to claim 1, wherein the speech information of the speaker includes at least one of pitch parameters and phoneme segmentation information of the speaker's utterance.

3. The learning device according to claim 2, wherein, when the speech information of the speaker includes a pitch parameter for each frame of an utterance, the generation unit generates the speech information in which the speaking skill is lowered by changing the distribution of the pitch parameters within the utterance.

4. The learning device according to claim 2, wherein, when the speech information of the speaker includes phoneme segmentation information of the speaker's utterance, the generation unit generates the speech information in which the speaking skill is lowered by expanding or contracting the duration of each phoneme constituting the utterance, as indicated in the phoneme segmentation information.

5. The learning device according to claim 1, further comprising a speech feature generation unit that uses the trained speech synthesis model to generate, for input text, speech features for generating synthesized speech of a speaker and speaking skill specified by a user.

6. The learning device according to claim 5, further comprising a synthesized speech generation unit that generates synthesized speech using the generated speech features.

7. A learning method executed by a learning device, comprising:
a step of acquiring speech information of a plurality of speakers and the speaking skill of each of the speakers;
a step of generating, for each speaker's speech information, speech information in which the speaking skill is lowered; and
a step of training, using as learning data the speech information of the plurality of speakers, the speaking skill of each of the speakers, and the speech information in which the speaking skill is lowered, a speech synthesis model that outputs, for input text, speech features of synthesized speech of a specified speaker and speaking skill.

8. A learning program for causing a computer to execute:
a step of acquiring speech information of a plurality of speakers and the speaking skill of each of the speakers;
a step of generating, for each speaker's speech information, speech information in which the speaking skill is lowered; and
a step of training, using as learning data the speech information of the plurality of speakers, the speaking skill of each of the speakers, and the speech information in which the speaking skill is lowered, a speech synthesis model that outputs, for input text, speech features of synthesized speech of a specified speaker and speaking skill.